Real-time conversational AI with natural speech input and output. STT→LLM→TTS pipelines optimized for sub-500ms latency.
Voice is natural. Humans prefer talking to typing. Voice agents enable hands-free interaction (driving, cooking), accessibility (blindness, dyslexia), and faster iteration than typing. Phone systems, smart speakers, and customer service bots all use voice. The barrier is latency and naturalness: delays above ~500ms feel unnatural.
- Customer service: phone support agent that answers questions and books appointments.
- Smart speakers: Alexa-like devices.
- Hands-free assistants: voice control in cars and homes.
- Accessibility: voice-first interfaces for vision-impaired users.
- Call centers: AI handling routine calls, escalating complex ones.
Three stages: speech-to-text (STT) converts audio → text. LLM processes text, generates response. Text-to-speech (TTS) converts response → audio. Parallelization and streaming minimize latency.
1. Audio capture: user speaks; mic captures at 16 kHz mono.
2. VAD (voice activity detection): detect when the user stops speaking (turn detection).
3. STT: convert audio to a transcript.
4. LLM: process the transcript, generate a response.
5. TTS streaming: convert the response to audio on the fly, streaming chunks as they're generated.
6. Audio playback: play synthesized audio while the rest is still being generated.
7. Conversation state: track context across turns.
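The loop above can be sketched end to end. This is a minimal sketch of the control flow only: the `fake_*` functions are placeholders for real providers (Deepgram for STT, a fast LLM, ElevenLabs for TTS), and a real implementation would stream rather than return complete values.

```python
# Minimal sketch of one STT -> LLM -> TTS turn. The fake_* functions are
# stand-ins for real provider calls; names and return shapes are illustrative.

def fake_stt(audio: bytes) -> str:
    """Placeholder STT: pretend the audio said something."""
    return "what time do you open tomorrow"

def fake_llm(transcript: str, history: list[dict]) -> str:
    """Placeholder LLM: canned reply; a real call would stream tokens."""
    history.append({"role": "user", "content": transcript})
    reply = "We open at 9am tomorrow."
    history.append({"role": "assistant", "content": reply})
    return reply

def fake_tts(text: str):
    """Placeholder TTS: yield audio chunks as they become available."""
    for word in text.split():
        yield word.encode()  # stand-in for a PCM/Opus chunk

def handle_turn(audio: bytes, history: list[dict]) -> list[bytes]:
    """One conversational turn, run after VAD signals end of speech."""
    transcript = fake_stt(audio)           # step 3: STT
    reply = fake_llm(transcript, history)  # step 4: LLM
    chunks = []
    for chunk in fake_tts(reply):          # step 5: streaming TTS
        chunks.append(chunk)               # step 6: would play back immediately
    return chunks

history: list[dict] = []                   # step 7: conversation state
audio_chunks = handle_turn(b"\x00" * 320, history)
```

In a real pipeline, `handle_turn` would be async and each stage would hand off partial results to the next instead of complete ones.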
| Stage | Latency budget | Notes |
|---|---|---|
| VAD (turn detection) | 50–100ms | Should be fast; can be local |
| STT (Whisper/Deepgram) | 100–300ms | Varies widely by provider and model |
| LLM inference | 100–200ms | Use fast models (Haiku, Gemini Flash) |
| TTS generation | 100–200ms | Stream to minimize wait before playback |
| Network/overhead | 50–100ms | Accumulates across calls |
Tradeoff: accuracy vs latency vs cost.
| Option | Latency | Accuracy | Cost | Best for |
|---|---|---|---|---|
| Deepgram | ~200ms | Very good | $0.0043/min | Real-time, budget-friendly |
| Whisper (local) | 5–20s | Excellent | Free | Batch, high accuracy |
| AssemblyAI | ~400ms | Excellent | Higher | Quality + real-time |
| Google Cloud STT | ~300ms | Good | $0.006/15s | Enterprise, multilingual |
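The per-unit prices in the table are easier to compare when normalized to cost per hour of audio. A quick calculation (using only the rates quoted above; actual pricing varies by plan and tier):

```python
# Normalize the table's STT prices to $/hour of audio.
deepgram_per_min = 0.0043   # $/min, from the table
google_per_15s = 0.006      # $/15s, from the table

deepgram_per_hour = deepgram_per_min * 60      # $0.258/hr
google_per_hour = google_per_15s * 4 * 60      # $1.44/hr
print(deepgram_per_hour, google_per_hour)
```

At these rates, an hour of Deepgram transcription costs roughly a fifth of an hour of Google Cloud STT, which is why it's listed as the budget-friendly real-time option.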
For voice, choose speed over size. Latency matters more than perfect accuracy.
- Claude Haiku: fast, good quality; ~50–100ms inference time. Recommended default.
- Gemini Flash: Google's fastest model; ~30–80ms. Good for latency-critical apps.
- OpenAI's lightweight model: ~100–150ms. Solid fallback.
- Self-hosted (local): ~200–500ms on CPU. Use for sensitive data.
TTS must be fast and natural. APIs are easier; self-hosted models give control.
Don't wait for the full TTS output; stream chunks as they become available. ElevenLabs and OpenAI both support streaming, which gets audio to users ~100ms sooner.
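The payoff is in time-to-first-audio, not total synthesis time. A sketch with a simulated TTS that produces one chunk every 60ms (the numbers are illustrative, not from any provider):

```python
# Illustrative comparison: blocking vs streaming TTS time-to-first-audio.
CHUNK_MS = 60    # assumed synthesis time per chunk
N_CHUNKS = 5

def blocking_tts():
    """Wait for the whole utterance; audio is ready only at the end."""
    audio = [b"chunk"] * N_CHUNKS
    return audio, CHUNK_MS * N_CHUNKS       # time-to-first-audio = total time

def streaming_tts():
    """Yield (chunk, elapsed_ms) as each chunk is synthesized."""
    for i in range(N_CHUNKS):
        yield b"chunk", CHUNK_MS * (i + 1)

_, blocking_ttfa = blocking_tts()
first_chunk, streaming_ttfa = next(streaming_tts())
print(blocking_ttfa, streaming_ttfa)  # 300 vs 60: playback starts 240 ms sooner
```

Total synthesis time is identical in both cases; streaming only changes when playback can begin, which is what the user perceives.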
| Provider | Latency (first chunk) | Streaming | Voice quality |
|---|---|---|---|
| ElevenLabs | ~150ms | Yes | Excellent |
| OpenAI TTS | ~200ms | Yes | Good |
| Google Cloud | ~150ms | Yes | Good |
| Glow-TTS (local) | ~50ms | N/A | Fair |
Getting all three stages fast requires parallelization and streaming.
- Streaming LLM responses: don't wait for the full response; send the first tokens to TTS within ~100ms.
- Streaming TTS: don't wait for the full audio; play the first chunk while generating the rest.
- Local VAD: turn detection should be local (e.g. Silero), not an API call.
- Prompt caching: cache system prompts and conversation history (where supported).
- Connection pooling: reuse API connections.
- Batching where possible: but not on the conversational path, where it adds latency.
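The first two points combine naturally: buffer the LLM's token stream and flush to TTS at sentence boundaries, so synthesis starts on the first sentence while the rest is still generating. A minimal sketch (the token stream and the boundary regex are assumptions; production code also needs to handle abbreviations, numbers, etc.):

```python
import re

def sentence_chunks(token_stream):
    """Regroup a stream of LLM tokens into sentence-sized TTS requests.

    Flushing at sentence boundaries lets TTS begin on the first sentence
    while the LLM is still generating the remainder of the response.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush when the buffer ends a sentence (naive boundary check).
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # trailing partial sentence
        yield buffer.strip()

# Fake token stream standing in for a streaming LLM response.
tokens = ["We open", " at 9am.", " Parking", " is free", " after 6."]
chunks = list(sentence_chunks(tokens))
print(chunks)  # ['We open at 9am.', 'Parking is free after 6.']
```

Each yielded chunk would be handed straight to a streaming TTS call, so the two streams overlap instead of running back to back.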
LiveKit is the infrastructure layer for real-time voice/video. Handles WebRTC connections, audio routing, and participant management. Lets you build voice agents without managing STUN servers, ICE candidates, or codec negotiation.
- WebRTC handling: peer-to-peer audio/video with fallback to TURN servers.
- Agent API: connect an AI agent to a room; the agent can listen and speak.
- Recording: built-in recording and transcription hooks.
- Multi-party: multiple agents/users in the same conversation.
- Managed or self-hosted: cloud or on-prem.
1. User joins room: browser/mobile connects via WebRTC.
2. Agent joins room: a programmatic agent connects and listens to the audio.
3. Audio flow: the user's microphone feeds the agent's input stream; the agent processes it through STT → LLM → TTS and plays the result back to the user.
4. Recording: optionally record the conversation for audit/training.
5. Disconnect: the agent leaves the room when done.
LiveKit provides SDKs for building agents (Python, Go). Define an agent with STT, LLM, TTS components. LiveKit handles the WebRTC plumbing.