APPLICATIONS & SYSTEMS

Voice Agents

Real-time conversational AI with natural speech input and output. STT→LLM→TTS pipelines optimized for sub-500ms latency.

STT → LLM → TTS: the architecture
<500ms TTFB: time to first byte
Turn detection: conversation management
Contents
  1. Why voice agents
  2. The pipeline
  3. Speech recognition
  4. LLM selection
  5. Text-to-speech
  6. Latency optimization
  7. LiveKit infrastructure
01 — Motivation

Why Voice Agents Matter

Voice is natural. Humans prefer talking to typing. Voice agents enable: hands-free interaction (driving, cooking), accessibility (blindness, dyslexia), and faster iteration than typing. Phone systems, smart speakers, and customer service bots all use voice. The barrier: latency and naturalness. Delay > 500ms feels unnatural.

Use Cases

Customer service: Phone support agent that answers questions, books appointments. Smart speakers: Alexa-like devices. Hands-free assistants: Voice control in cars, homes. Accessibility: Voice-first interfaces for vision-impaired users. Call centers: AI handling routine calls, escalating complex ones.

💡 Key insight: Voice agents are harder than text because of latency and naturalness constraints. Text LLM can think for 2s; voice can't — you need fast, coherent responses in <500ms.
02 — Architecture

The Voice Agent Pipeline

Three stages: speech-to-text (STT) converts audio → text. LLM processes text, generates response. Text-to-speech (TTS) converts response → audio. Parallelization and streaming minimize latency.

Full Pipeline

1. Audio capture: User speaks. Mic captures at 16 kHz mono.
2. VAD (Voice Activity Detection): Detect when the user stops speaking (turn detection).
3. STT: Convert audio to a transcript.
4. LLM: Process the transcript, generate a response.
5. TTS streaming: Convert the response to audio on the fly, streaming chunks as they're generated.
6. Audio playback: Play synthesized audio while generation continues.
7. Conversation state: Track context across turns.
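The turn-handling core of this pipeline can be sketched as three chained stages. Everything below is a stand-in: `transcribe`, `generate_reply`, and `synthesize` are stubbed placeholders, not a real SDK.

```python
# Minimal sketch of one conversational turn: STT -> LLM -> TTS.
# All three stage functions are stubs standing in for real providers.

def transcribe(audio: bytes) -> str:
    """STT stage: audio in, transcript out (stubbed)."""
    return "what's the weather?"

def generate_reply(history: list[dict], transcript: str) -> str:
    """LLM stage: append the user turn, return the assistant reply (stubbed)."""
    history.append({"role": "user", "content": transcript})
    reply = "It's sunny and 22 degrees."
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """TTS stage: text in, audio bytes out (stubbed)."""
    return text.encode("utf-8")

def handle_turn(audio: bytes, history: list[dict]) -> bytes:
    transcript = transcribe(audio)               # step 3: STT
    reply = generate_reply(history, transcript)  # step 4: LLM
    return synthesize(reply)                     # step 5: TTS (non-streaming here)

# step 7: conversation state carried across turns
history: list[dict] = [{"role": "system", "content": "You are a voice assistant."}]
audio_out = handle_turn(b"\x00" * 320, history)
print(len(history))  # 3 -- system + user + assistant turns tracked
```

A production loop replaces each stub with a streaming client and overlaps the stages, as the latency budget below requires.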

Latency Budget (Target: <500ms TTFB)

Stage | Latency budget | Notes
VAD (turn detection) | 50–100ms | Should be fast; run locally
STT (Whisper/Deepgram) | 100–300ms | Cloud streaming is fastest; local Whisper is accurate but slower
LLM inference | 100–200ms | Use fast models (Haiku, Gemini Flash)
TTS generation | 100–200ms | Stream to minimize wait before playback
Network/overhead | 50–100ms | Accumulates across calls
Streaming is critical: Don't wait for the full TTS output. Start playing audio as soon as the first chunk is ready; the user hears the response while the LLM is still generating. Support barge-in as well: let the user interrupt playback by speaking.
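Summing the budget table shows why overlap is mandatory rather than optional. The numbers below are the table's own per-stage bounds, in milliseconds.

```python
# Sanity-check the latency budget: best- and worst-case sequential totals
# vs. the <500ms TTFB target. (min_ms, max_ms) per stage, from the table.
budget = {
    "vad": (50, 100),
    "stt": (100, 300),
    "llm": (100, 200),
    "tts": (100, 200),
    "network": (50, 100),
}
best = sum(lo for lo, _ in budget.values())
worst = sum(hi for _, hi in budget.values())
print(best, worst)  # 400 900
```

Best case just fits; worst case nearly doubles the target. Streaming recovers the difference by starting TTS on the first LLM tokens instead of waiting for each stage to finish.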
03 — Speech Recognition

STT Options: Local vs Cloud

Tradeoff: accuracy vs latency vs cost.

STT Providers

Whisper (OpenAI) [Local]: Accurate multilingual model. Runs locally on CPU/GPU. ~10–20s for 1 min of audio.
Deepgram [Cloud]: Real-time streaming, ~200ms latency. Good accuracy. ~$0.0043/min.
AssemblyAI [Cloud]: Streaming and batch; real-time at ~400ms latency. Strong quality.
Faster-Whisper [Local]: Optimized Whisper implementation. ~2–5s for 1 min of audio (GPU).
Google Cloud STT [Cloud]: Streaming support, good multilingual coverage. ~$0.006 per 15s.
Silero VAD [Edge]: Lightweight VAD for turn detection. Works fully offline.

Comparison

Option | Latency | Accuracy | Cost | Best for
Deepgram | ~200ms | Very good | $0.0043/min | Real-time, budget-friendly
Whisper (local) | 5–20s | Excellent | Free | Batch, high accuracy
AssemblyAI | ~400ms | Excellent | Higher | Quality + real-time
Google Cloud STT | ~300ms | Good | $0.006/15s | Enterprise, multilingual
⚠️ Real-time streaming cost: Streaming transcription costs more than batch. For continuous voice agents, streaming is essential despite the premium. Budget roughly $0.01–0.02 per minute for the full pipeline (STT + TTS).
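A back-of-envelope check of that per-minute budget, using the prices quoted above. The agent's speaking rate (~750 characters of TTS output per minute) is an assumption for illustration, and LLM token costs are ignored.

```python
# Rough pipeline cost per conversation minute, from the quoted prices.
STT_PER_MIN = 0.0043      # Deepgram streaming, $/min
TTS_PER_1K_CHARS = 0.015  # ElevenLabs / OpenAI TTS, $/1k chars
CHARS_PER_MIN = 750       # assumed agent speaking rate (illustrative)

def cost_per_minute() -> float:
    stt = STT_PER_MIN
    tts = TTS_PER_1K_CHARS * CHARS_PER_MIN / 1000
    return stt + tts

print(cost_per_minute())  # ~0.0156 -- inside the ~$0.01-0.02/min budget
```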
04 — LLM Choice

LLM Selection for Voice

For voice, choose speed over size. Latency matters more than perfect accuracy.

Model Recommendations

1. Claude 3.5 Haiku (best balance): Fast, good quality. ~50–100ms inference time. Recommended default.
2. Gemini 2.0 Flash (fastest): Google's fastest model. ~30–80ms. Good for latency-critical apps.
3. GPT-4o mini (alternative): OpenAI's lightweight model. ~100–150ms. Solid fallback.
4. Llama 2/3 (local, privacy): Self-hosted. ~200–500ms on CPU. Use for sensitive data.

Streaming is key: Use streaming responses. Start sending TTS chunks before LLM finishes. Users hear response ~100ms earlier.
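The LLM-to-TTS handoff usually flushes at sentence boundaries: the first sentence goes to TTS while later tokens are still arriving. The sketch below is generic; `fake_llm_stream` stands in for any streaming chat API, and a real TTS call would replace the collection step.

```python
# Stream LLM tokens -> sentence chunks for TTS, instead of waiting for
# the full reply. `fake_llm_stream` is a stand-in for a streaming API.

def fake_llm_stream():
    for tok in ["Sure", ",", " it", " is", " sunny", ".", " High", " of", " 22", "."]:
        yield tok

def sentence_chunks(token_stream, terminators=".!?"):
    buf = ""
    for tok in token_stream:
        buf += tok
        if buf and buf[-1] in terminators:
            yield buf.strip()  # ship this sentence to TTS immediately
            buf = ""
    if buf.strip():
        yield buf.strip()      # flush any trailing partial sentence

chunks = list(sentence_chunks(fake_llm_stream()))
print(chunks)  # ['Sure, it is sunny.', 'High of 22.']
```

Sentence-level chunking trades a little latency for prosody: TTS engines sound more natural given a complete clause than a handful of raw tokens.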
05 — Text-to-Speech

TTS Options

TTS must be fast and natural. APIs are easier; self-hosted models give control.

TTS Providers

ElevenLabs [API]: Most natural voices. Streaming support. ~150–200ms latency. ~$0.015 per 1k chars.
OpenAI TTS [API]: Good quality, streaming. ~200–300ms. ~$0.015 per 1k chars.
Google Cloud TTS [API]: Multilingual, natural. ~$0.004 per 1k chars.
Glow-TTS [Local]: Fast synthesis, ~50–100ms. Lower quality than APIs.
XTTS-v2 [Local]: Voice cloning, ~200–400ms. Good quality locally.
Azure Speech [API]: Enterprise, multilingual. ~$0.004–0.008 per 1k chars.

Streaming is Essential

Don't wait for full TTS. Stream chunks as available. ElevenLabs and OpenAI support streaming. Gives users audio ~100ms faster.

Provider | Latency (first chunk) | Streaming | Voice quality
ElevenLabs | ~150ms | Yes | Excellent
OpenAI TTS | ~200ms | Yes | Good
Google Cloud | ~150ms | Yes | Good
Glow-TTS (local) | ~50ms | N/A | Fair
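The first-chunk numbers above matter because playback starts at the first chunk, not the last. A minimal sketch of that consumer loop, with `synth_stream` as a hypothetical stand-in for a streaming TTS client (the sleep simulates per-chunk synthesis time) and playback elided:

```python
# Streamed TTS playback: measure time to first audio chunk, not to the
# full utterance. `synth_stream` is a stand-in for a streaming TTS client.
import time

def synth_stream(text: str):
    for word in text.split():
        time.sleep(0.01)       # simulated per-chunk synthesis latency
        yield word.encode()

def speak(text: str) -> float:
    """Consume the stream; return seconds until the first chunk arrived."""
    start = time.monotonic()
    first_audio_at = None
    for chunk in synth_stream(text):
        if first_audio_at is None:
            first_audio_at = time.monotonic() - start  # TTFB for audio
        # play(chunk) would go here
    return first_audio_at

ttfb = speak("hello there how are you today")
print(ttfb < 0.05)  # True -- first audio well before the utterance finishes
```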
06 — Latency Optimization

Achieving Sub-500ms TTFB

Getting all three stages fast requires parallelization and streaming.

Optimization Techniques

Streaming LLM responses: Don't wait for the full response; send the first tokens to TTS within ~100ms.
Streaming TTS: Don't wait for the full audio; play the first chunk while the rest is generated.
Local VAD: Turn detection should run locally (e.g., Silero), not as an API call.
Prompt caching: Cache system prompts and conversation history where the provider supports it.
Connection pooling: Reuse API connections instead of reopening them each turn.
Batching: Useful for offline workloads, but avoid it on the conversational path; it adds latency.

Timing breakdown for an optimized pipeline. User says: "What's the weather?"

T=0ms: User stops speaking (VAD detects silence)
T=50ms: Audio sent to STT
T=200ms: STT returns "What's the weather?"
T=250ms: LLM starts generating the response
T=280ms: First LLM token arrives
T=300ms: First TTS chunk generated
T=330ms: First audio chunk plays (user begins hearing the reply)
T=600ms: Full response generated and played

First audio at 330ms; 600ms from the user stopping to hearing the complete response. With further optimization, ~400–500ms end-to-end is possible.
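The timeline arithmetic checks out as a sum of the gaps between consecutive events. Each value below is taken directly from the trace above, in milliseconds.

```python
# Gaps between consecutive events in the optimized-pipeline timeline (ms).
gaps = {
    "vad_silence_to_stt_send": 50,   # T=0 -> T=50
    "stt_processing": 150,           # T=50 -> T=200
    "stt_to_llm_start": 50,          # T=200 -> T=250
    "llm_first_token": 30,           # T=250 -> T=280
    "first_tts_chunk": 20,           # T=280 -> T=300
    "playback_start": 30,            # T=300 -> T=330
}
ttfa = sum(gaps.values())
print(ttfa)  # 330 -- first audio at T=330ms, under the 500ms target
```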
⚠️ Budget latency conservatively: Network jitter, retries, and model variability add buffer. Target <400ms to leave headroom.
07 — Infrastructure

LiveKit: WebRTC for Voice Agents

LiveKit is the infrastructure layer for real-time voice/video. Handles WebRTC connections, audio routing, and participant management. Lets you build voice agents without managing STUN servers, ICE candidates, or codec negotiation.

What LiveKit Provides

WebRTC handling: Peer-to-peer audio/video with fallback to TURN servers. Agent API: Connect an AI agent to a room; agent can listen and speak. Recording: Built-in recording, transcription hooks. Multi-party: Multiple agents/users in same conversation. Managed service or self-hosted: Cloud or on-prem.

LiveKit Voice Agent Pattern

1. User joins room: Browser/mobile connects via WebRTC. 2. Agent joins room: Programmatic agent connects, listens to audio. 3. Audio flow: User's microphone → agent's input stream. Agent processes → LLM → TTS → plays to user. 4. Recording: Optionally record conversation for audit/training. 5. Disconnect: Agent leaves room when done.
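Step 1 above requires a signed room token. LiveKit's server SDKs generate these for you (via an `AccessToken` helper), but the shape is worth seeing: a JWT whose claims carry the identity and room grant. The stdlib-only sketch below is illustrative; the claim names follow LiveKit's published grant shape but treat the details as assumptions, and use the official SDK in practice.

```python
# Illustrative LiveKit-style room token: a three-part HS256 JWT whose
# claims name the API key, participant identity, and room grant.
import base64
import hashlib
import hmac
import json

def make_token(api_key: str, api_secret: str, identity: str, room: str) -> str:
    def b64(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

    header = b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = b64(json.dumps({
        "iss": api_key,                             # which API key signed this
        "sub": identity,                            # participant identity
        "video": {"roomJoin": True, "room": room},  # the room grant itself
    }).encode())
    sig = hmac.new(api_secret.encode(), f"{header}.{claims}".encode(),
                   hashlib.sha256).digest()
    return f"{header}.{claims}.{b64(sig)}"

token = make_token("devkey", "devsecret", "user-1", "support-room")
print(token.count("."))  # 2 -- standard three-part JWT
```

Both the user's client and the agent present such a token when joining; the server validates the signature against the API secret before admitting them to the room.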

LiveKit Agents SDK

LiveKit provides SDKs for building agents (Python, Go). Define an agent with STT, LLM, TTS components. LiveKit handles the WebRTC plumbing.

LiveKit agent pseudocode:

from livekit.agents import VoiceAssistant

agent = VoiceAssistant(
    name="Support Bot",
    vad=Silero(),          # voice activity detection
    stt=Deepgram(),        # speech-to-text
    llm=ChatAnthropic(),   # language model
    tts=ElevenLabs(),      # text-to-speech
)

# Join a LiveKit room and listen for participants;
# the agent handles the conversation loop automatically.
agent.run(room_url, room_token)
Best practice: Use LiveKit if you need multi-party, recording, or managed infrastructure. Build custom if you're optimizing for ultra-low latency (edge deployment).
# Full real-time voice agent with LiveKit
# pip install livekit-agents livekit-plugins-openai livekit-plugins-silero

from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero


async def entrypoint(ctx: JobContext):
    initial_ctx = llm.ChatContext().append(
        role="system",
        text=(
            "You are a voice assistant for Acme customer support. "
            "Keep responses under 40 words. Never say 'certainly' or 'absolutely'."
        ),
    )

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),                    # voice activity detection
        stt=openai.STT(model="whisper-1"),        # speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),      # language model
        tts=openai.TTS(voice="nova", speed=1.1),  # text-to-speech
        chat_ctx=initial_ctx,
        allow_interruptions=True,
        interrupt_speech_duration=0.5,            # interrupt after 0.5s of user speech
        min_endpointing_delay=0.3,                # wait 0.3s of silence before the LLM call
    )

    assistant.start(ctx.room)
    await assistant.say("Hi, how can I help you today?", allow_interruptions=True)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

# Run:
# python agent.py start --url wss://your-livekit-server.livekit.cloud --api-key ... --api-secret ...