Real-time conversational AI with natural speech input and output. STT→LLM→TTS pipelines optimized for sub-500ms latency.
Voice is natural. Humans prefer talking to typing. Voice agents enable hands-free interaction (driving, cooking), accessibility (blindness, dyslexia), and faster iteration than typing. Phone systems, smart speakers, and customer service bots all use voice. The barrier is latency and naturalness: delays above ~500ms feel unnatural.
- Customer service: phone support agent that answers questions and books appointments.
- Smart speakers: Alexa-like devices.
- Hands-free assistants: voice control in cars and homes.
- Accessibility: voice-first interfaces for vision-impaired users.
- Call centers: AI handling routine calls, escalating complex ones.
Three stages: speech-to-text (STT) converts audio → text. LLM processes text, generates response. Text-to-speech (TTS) converts response → audio. Parallelization and streaming minimize latency.
1. Audio capture: user speaks; mic captures at 16 kHz mono.
2. VAD (voice activity detection): detect when the user stops speaking (turn detection).
3. STT: convert audio to a transcript.
4. LLM: process the transcript, generate a response.
5. TTS streaming: convert the response to audio on the fly, streaming chunks as they're generated.
6. Audio playback: play synthesized audio while the rest is still being generated.
7. Conversation state: track context across turns.
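The loop above can be sketched end to end. This is a minimal sketch of the control flow only: the `fake_*` functions are placeholders for real providers (Deepgram for STT, a fast LLM, ElevenLabs for TTS), and a real implementation would stream rather than return complete values.

```python
# Minimal sketch of one STT -> LLM -> TTS turn. The fake_* functions are
# stand-ins for real provider calls; names and return shapes are illustrative.

def fake_stt(audio: bytes) -> str:
    """Placeholder STT: pretend the audio said something."""
    return "what time do you open tomorrow"

def fake_llm(transcript: str, history: list[dict]) -> str:
    """Placeholder LLM: canned reply; a real call would stream tokens."""
    history.append({"role": "user", "content": transcript})
    reply = "We open at 9am tomorrow."
    history.append({"role": "assistant", "content": reply})
    return reply

def fake_tts(text: str):
    """Placeholder TTS: yield audio chunks as they become available."""
    for word in text.split():
        yield word.encode()  # stand-in for a PCM/Opus chunk

def handle_turn(audio: bytes, history: list[dict]) -> list[bytes]:
    """One conversational turn, run after VAD signals end of speech."""
    transcript = fake_stt(audio)           # step 3: STT
    reply = fake_llm(transcript, history)  # step 4: LLM
    chunks = []
    for chunk in fake_tts(reply):          # step 5: streaming TTS
        chunks.append(chunk)               # step 6: would play back immediately
    return chunks

history: list[dict] = []                   # step 7: conversation state
audio_chunks = handle_turn(b"\x00" * 320, history)
```

In a real pipeline, `handle_turn` would be async and each stage would hand off partial results to the next instead of complete ones.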
| Stage | Latency budget | Notes |
|---|---|---|
| VAD (turn detection) | 50–100ms | Should be fast; can be local |
| STT (Whisper/Deepgram) | 100–300ms | Varies widely by provider and model |
| LLM inference | 100–200ms | Use fast models (Haiku, Gemini Flash) |
| TTS generation | 100–200ms | Stream to minimize wait before playback |
| Network/overhead | 50–100ms | Accumulates across calls |
Tradeoff: accuracy vs latency vs cost.
| Option | Latency | Accuracy | Cost | Best for |
|---|---|---|---|---|
| Deepgram | ~200ms | Very good | $0.0043/min | Real-time, budget-friendly |
| Whisper (local) | 5–20s | Excellent | Free | Batch, high accuracy |
| AssemblyAI | ~400ms | Excellent | Higher | Quality + real-time |
| Google Cloud STT | ~300ms | Good | $0.006/15s | Enterprise, multilingual |
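The per-unit prices in the table are easier to compare when normalized to cost per hour of audio. A quick calculation (using only the rates quoted above; actual pricing varies by plan and tier):

```python
# Normalize the table's STT prices to $/hour of audio.
deepgram_per_min = 0.0043   # $/min, from the table
google_per_15s = 0.006      # $/15s, from the table

deepgram_per_hour = deepgram_per_min * 60      # $0.258/hr
google_per_hour = google_per_15s * 4 * 60      # $1.44/hr
print(deepgram_per_hour, google_per_hour)
```

At these rates, an hour of Deepgram transcription costs roughly a fifth of an hour of Google Cloud STT, which is why it's listed as the budget-friendly real-time option.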
For voice, choose speed over size. Latency matters more than perfect accuracy.
- Claude Haiku: fast, good quality; ~50–100ms inference time. Recommended default.
- Gemini Flash: Google's fastest model; ~30–80ms. Good for latency-critical apps.
- OpenAI's lightweight model: ~100–150ms. Solid fallback.
- Self-hosted (local): ~200–500ms on CPU. Use for sensitive data.
TTS must be fast and natural. APIs are easier; self-hosted models give control.
Don't wait for the full TTS output; stream chunks as they become available. ElevenLabs and OpenAI both support streaming, which gets audio to users ~100ms sooner.
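The payoff is in time-to-first-audio, not total synthesis time. A sketch with a simulated TTS that produces one chunk every 60ms (the numbers are illustrative, not from any provider):

```python
# Illustrative comparison: blocking vs streaming TTS time-to-first-audio.
CHUNK_MS = 60    # assumed synthesis time per chunk
N_CHUNKS = 5

def blocking_tts():
    """Wait for the whole utterance; audio is ready only at the end."""
    audio = [b"chunk"] * N_CHUNKS
    return audio, CHUNK_MS * N_CHUNKS       # time-to-first-audio = total time

def streaming_tts():
    """Yield (chunk, elapsed_ms) as each chunk is synthesized."""
    for i in range(N_CHUNKS):
        yield b"chunk", CHUNK_MS * (i + 1)

_, blocking_ttfa = blocking_tts()
first_chunk, streaming_ttfa = next(streaming_tts())
print(blocking_ttfa, streaming_ttfa)  # 300 vs 60: playback starts 240 ms sooner
```

Total synthesis time is identical in both cases; streaming only changes when playback can begin, which is what the user perceives.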
| Provider | Latency (first chunk) | Streaming | Voice quality |
|---|---|---|---|
| ElevenLabs | ~150ms | Yes | Excellent |
| OpenAI TTS | ~200ms | Yes | Good |
| Google Cloud | ~150ms | Yes | Good |
| Glow-TTS (local) | ~50ms | N/A | Fair |
Getting all three stages fast requires parallelization and streaming.
- Streaming LLM responses: don't wait for the full response; send the first tokens to TTS within ~100ms.
- Streaming TTS: don't wait for the full audio; play the first chunk while generating the rest.
- Local VAD: turn detection should be local (e.g. Silero), not an API call.
- Prompt caching: cache system prompts and conversation history (where supported).
- Connection pooling: reuse API connections.
- Batching where possible: but not on the conversational path, where it adds latency.
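The first two points combine naturally: buffer the LLM's token stream and flush to TTS at sentence boundaries, so synthesis starts on the first sentence while the rest is still generating. A minimal sketch (the token stream and the boundary regex are assumptions; production code also needs to handle abbreviations, numbers, etc.):

```python
import re

def sentence_chunks(token_stream):
    """Regroup a stream of LLM tokens into sentence-sized TTS requests.

    Flushing at sentence boundaries lets TTS begin on the first sentence
    while the LLM is still generating the remainder of the response.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush when the buffer ends a sentence (naive boundary check).
        if re.search(r"[.!?]\s*$", buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # trailing partial sentence
        yield buffer.strip()

# Fake token stream standing in for a streaming LLM response.
tokens = ["We open", " at 9am.", " Parking", " is free", " after 6."]
chunks = list(sentence_chunks(tokens))
print(chunks)  # ['We open at 9am.', 'Parking is free after 6.']
```

Each yielded chunk would be handed straight to a streaming TTS call, so the two streams overlap instead of running back to back.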
LiveKit is the infrastructure layer for real-time voice/video. Handles WebRTC connections, audio routing, and participant management. Lets you build voice agents without managing STUN servers, ICE candidates, or codec negotiation.
- WebRTC handling: peer-to-peer audio/video with fallback to TURN servers.
- Agent API: connect an AI agent to a room; the agent can listen and speak.
- Recording: built-in recording and transcription hooks.
- Multi-party: multiple agents/users in the same conversation.
- Managed or self-hosted: cloud or on-prem.
1. User joins room: browser/mobile connects via WebRTC.
2. Agent joins room: a programmatic agent connects and listens to the audio.
3. Audio flow: the user's microphone feeds the agent's input stream; the agent processes it through STT → LLM → TTS and plays the result back to the user.
4. Recording: optionally record the conversation for audit/training.
5. Disconnect: the agent leaves the room when done.
LiveKit provides SDKs for building agents (Python, Go). Define an agent with STT, LLM, TTS components. LiveKit handles the WebRTC plumbing.