Multimodal

TTS Models

Text-to-Speech synthesis models — from traditional neural TTS (FastSpeech, VITS) to modern zero-shot and voice-cloning systems (ElevenLabs, Kokoro, Parler-TTS).

Quality metric
MOS (Mean Opinion Score)
Latency (streaming)
300–800 ms TTFB
Key models
ElevenLabs, Kokoro, CSSTalker

Table of Contents

SECTION 01

TTS Architecture Evolution

Traditional TTS (Tacotron, 2017): sequence-to-sequence with attention, converts text to mel spectrograms, then uses a vocoder for audio. FastSpeech2 (2020): non-autoregressive, much faster inference, explicit duration/pitch/energy control. VITS (2021): end-to-end VAE+flow model, directly generates waveform, natural prosody, efficient. Modern zero-shot TTS (2023+): diffusion or LLM-based, clone any voice from 3–10 seconds of reference audio.

SECTION 02

Acoustic Models

The acoustic model converts text (phonemes or characters) to an intermediate representation (mel spectrogram or continuous features). Key design choices: autoregressive (better prosody, slower) vs non-autoregressive (faster, slightly less natural). Modern systems skip the intermediate representation entirely (end-to-end).
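Non-autoregressive acoustic models such as FastSpeech2 rely on an explicit duration predictor: each phoneme's feature vector is repeated for its predicted number of frames before the mel decoder runs. A minimal numpy sketch of that length-regulator step (the feature values and durations are dummies, not real model outputs):

```python
import numpy as np

def length_regulate(phoneme_feats, durations):
    """Expand per-phoneme features to frame rate by repeating each
    phoneme's feature vector for its predicted number of frames
    (the FastSpeech-style length regulator)."""
    return np.repeat(phoneme_feats, durations, axis=0)

# Three phonemes with 4-dim features, predicted to last 2, 3, and 1 frames
feats = np.arange(12, dtype=float).reshape(3, 4)
frames = length_regulate(feats, np.array([2, 3, 1]))
print(frames.shape)  # (6, 4): total frames = sum of durations
```

Because every output frame is known up front, the decoder can generate all frames in parallel, which is where the non-autoregressive speedup comes from.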

SECTION 03

Neural Vocoders

Vocoders convert acoustic model output to a waveform. HiFi-GAN: GAN-based, real-time on CPU, excellent quality. WaveGrad/DiffWave: diffusion-based, higher quality but slower. EnCodec (Meta): neural codec that compresses audio to discrete tokens, used as the audio representation in VALL-E and related LLM-based TTS.
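The codec idea behind EnCodec can be illustrated with a toy quantizer: map each audio frame vector to the index of its nearest codebook entry, so audio becomes a sequence of discrete tokens an LLM can model. This is a single-codebook sketch with a random codebook; real EnCodec uses trained residual vector quantization across several codebooks:

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry
    (single-stage stand-in for EnCodec's residual vector quantization)."""
    # Squared distance between every frame and every codebook vector
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # one discrete token per frame

def dequantize(tokens, codebook):
    """Reconstruct approximate frames from tokens by codebook lookup."""
    return codebook[tokens]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))   # 1024 entries, 8-dim frames
frames = rng.normal(size=(50, 8))       # 50 "audio" frames to encode
tokens = quantize_frames(frames, codebook)
recon = dequantize(tokens, codebook)
print(tokens.shape)  # (50,): one token per frame
```

In an LLM-based TTS like VALL-E, the language model predicts these token sequences and the codec decoder turns them back into a waveform.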

SECTION 04

Zero-Shot TTS

Modern zero-shot TTS clones any voice from a short reference clip without fine-tuning. XTTS v2 (Coqui): open-source, 17 languages, voice clone from 6s audio. ElevenLabs: best commercial quality, 32 languages. Parler-TTS: instruction-following TTS ('speak slowly with a British accent'). Kokoro-82M: tiny open model (82M params), high quality, runs on CPU.

SECTION 05

Open-Source Options

Kokoro — fast, high-quality, Apache 2.0 licence:

# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf
import numpy as np

pipeline = KPipeline(lang_code="a")  # 'a' = American English
generator = pipeline("Hello, this is Kokoro text to speech!", voice="af_heart")

# Each yielded chunk is a (graphemes, phonemes, audio) tuple
audio = np.concatenate([chunk[2] for chunk in generator])
sf.write("output.wav", audio, 24000)  # Kokoro outputs 24 kHz audio

# XTTS for voice cloning:
# from TTS.api import TTS
# tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
# tts.tts_to_file(text="Hello!", speaker_wav="ref.wav", language="en", file_path="out.wav")


SECTION 06

Evaluation Metrics

MOS (Mean Opinion Score): human listeners rate naturalness 1–5. WER (Word Error Rate): run ASR on generated audio, compare to original text. Character Error Rate (CER): finer-grained than WER. Speaker Similarity: cosine distance between speaker embeddings of reference and generated audio. For production: WER < 2% and speaker similarity > 0.85 are common thresholds.
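The automated checks can be computed directly: WER as word-level edit distance between the ASR transcript and the input text, and speaker similarity as cosine similarity of embeddings. A self-contained sketch, assuming the transcript and embeddings come from external ASR and speaker-encoder models:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref)

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    a, b = np.asarray(emb_a, float), np.asarray(emb_b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ASR transcript of the generated audio vs the original input text
print(wer("hello this is a test", "hello this is test"))  # 0.2 (1 deletion / 5 words)
```

Against the thresholds above, a clip would pass if its WER is below 0.02 and its speaker similarity exceeds 0.85.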

Real-Time TTS Systems

Interactive applications demand low perceived latency, so production TTS systems optimize time-to-first-audio rather than total synthesis time. Techniques include streaming vocoder inference, KV-cache reuse in autoregressive decoders, and chunked spectrogram generation. By breaking synthesis into streaming chunks, modern systems can return the first audio in roughly 50-80 ms, enabling real-time conversation without user-perceptible delay.

# Streaming TTS sketch: synthesize in token chunks and push each audio
# chunk to the client as soon as it is ready (model loading and
# tokenization are left as placeholders)
class StreamingTTSEngine:
    def __init__(self, model_path, vocoder_path):
        self.acoustic_model = self.load_model(model_path)  # placeholder loader
        self.vocoder = self.load_model(vocoder_path)
        self.chunk_size = 256  # tokens per synthesis chunk

    async def synthesize_streaming(self, text, chunk_callback):
        """Synthesize text and stream audio chunks via chunk_callback."""
        tokens = self.tokenize(text)

        for i in range(0, len(tokens), self.chunk_size):
            chunk_tokens = tokens[i:i + self.chunk_size]

            # Acoustic model: tokens -> mel-spectrogram
            mel = self.acoustic_model.infer(chunk_tokens)

            # Streaming vocoder: mel -> waveform chunk
            audio_chunk = self.vocoder.infer_streaming(mel)

            # Stream to the client immediately
            await chunk_callback(audio_chunk)

# Usage: forward chunks over an open websocket connection
async def tts_request_handler(text, websocket):
    engine = StreamingTTSEngine("models/acoustic", "models/vocoder")

    async def send_chunk(audio):
        await websocket.send(audio)

    await engine.synthesize_streaming(text, send_chunk)
Model                | Latency (ms) | Quality     | Language support
Tacotron2 + WaveGlow | 150-200      | High        | Limited
Glow-TTS + HiFi-GAN  | 80-120       | High        | Limited
FastSpeech2          | 50-80        | Medium-High | Limited
XTTS (Coqui)         | 200-300      | High        | 17 languages
# Voice cloning with minimal reference audio (illustrative API --
# method names are placeholders, not the exact XTTS interface)
class VoiceCloneController:
    def __init__(self, xtts_model):
        self.model = xtts_model

    def clone_voice(self, reference_audio_path, target_text):
        """Clone a voice from a 5-30 second reference clip."""
        # Extract the speaker embedding from the reference audio
        speaker_embedding = self.model.get_speaker_embedding(reference_audio_path)

        # Synthesize the target text with the cloned voice
        return self.model.synthesize(
            text=target_text,
            speaker_embedding=speaker_embedding,
            temperature=0.75,  # controls voice variation
        )

    def multi_speaker_synthesis(self, speakers_dict, script):
        """Synthesize a multi-speaker dialogue, one voice per speaker."""
        output_segments = []

        for speaker_name, reference_audio in speakers_dict.items():
            embedding = self.model.get_speaker_embedding(reference_audio)

            for line in script.get(speaker_name, []):
                audio = self.model.synthesize(line, speaker_embedding=embedding)
                output_segments.append(audio)

        # concatenate_audio: placeholder for joining segments into one track
        return self.concatenate_audio(output_segments)

Multi-Lingual and Accent Preservation

Modern TTS systems handle 50+ languages through multilingual acoustic models and language-aware grapheme-to-phoneme converters. Accent preservation requires fine-tuning on speaker-specific data or conditioning on speaker embeddings. Commercial systems achieve naturalness ratings around 4.2/5 MOS across diverse languages and accents.

TTS model selection depends on the use case's constraints. Real-time conversational AI needs latency under 100 ms, favouring lightweight models such as FastPitch or TacotronLite with parameter counts in the low millions. Offline batch synthesis for content creation can tolerate 200-500 ms, enabling higher-quality models such as Glow-TTS + HiFi-GAN (50M+ parameters) that produce more natural prosody. Multilingual systems face a further trade-off: end-to-end multilingual models (e.g. XTTS) give up roughly 5-10% MOS relative to language-specific models in exchange for operational simplicity.

Voice cloning quality improves with reference duration: 5-10 seconds gives basic quality, while 30-60 seconds enables accent preservation and full speaker-identity capture. Vocoder choice drives both quality and latency: neural vocoders (HiFi-GAN, UnivNet) deliver superior quality but need 50-200 ms, while parametric vocoders (Griffin-Lim) run in under 10 ms at a clear quality cost. Production systems use quantization (int8, fp16) to cut model size by 60-75% with under 2% MOS degradation. Multi-speaker dialogue additionally requires speaker-turn detection and voice-switching logic; implement speaker scheduling to avoid abrupt voice changes mid-sentence.
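The latency-versus-quality trade-off above can be encoded as a simple selection rule. A toy chooser whose candidate list restates approximate figures from this section (the quality ranks are illustrative, not a benchmark):

```python
# Toy model chooser using latency/quality figures quoted in this section.
CANDIDATES = [
    # (name, typical latency ms, quality rank: higher is better, multilingual)
    ("FastSpeech2",          65, 2, False),
    ("Glow-TTS + HiFi-GAN", 100, 3, False),
    ("Tacotron2 + WaveGlow", 175, 3, False),
    ("XTTS",                250, 3, True),
]

def pick_model(latency_budget_ms, need_multilingual=False):
    """Return the highest-quality candidate within the latency budget."""
    viable = [m for m in CANDIDATES
              if m[1] <= latency_budget_ms
              and (m[3] or not need_multilingual)]
    if not viable:
        return None
    return max(viable, key=lambda m: m[2])[0]

print(pick_model(100))                          # tight real-time budget
print(pick_model(500, need_multilingual=True))  # offline multilingual batch
```

Real deployments would extend the tuple with licensing, parameter count, and per-language quality, but the shape of the decision is the same.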

Production TTS systems handle extreme scale and diversity: synthesizing 10B+ characters daily across 50+ languages, maintaining <100 ms latency for interactive systems, and supporting custom voices from user-provided samples. Acoustic models map text → linguistic features → mel-spectrograms: Glow-TTS uses invertible flows for fast inference (~50 ms typical), FastSpeech2 parallelizes generation (~40 ms), and Tacotron2 achieves higher quality but runs slower (~150 ms). Vocoder selection again trades quality against latency: WaveGlow produces high-quality audio but needs ~200 ms, HiFi-GAN accepts a slight quality loss for ~30 ms inference, and the lightweight WaveRNN enables real time on mobile.

Streaming TTS divides synthesis into chunks: generate the mel-spectrogram in 256-token chunks while the vocoder processes earlier chunks in parallel, keeping latency around 200 ms for a full response. Multi-speaker TTS conditions the acoustic model on a speaker embedding computed from 15-30 seconds of reference audio. Long training runs (100M+ steps) demand careful learning-rate scheduling, batch-norm statistics tracking, and validation-metric selection. Evaluation combines MOS (1-5 scale), speaker similarity (cosine similarity of embeddings), pronunciation accuracy (phoneme-level error rate), and naturalness ratings. At scale, production systems maintain speaker pools of 50+ voices across 20+ languages plus user-supplied custom voices, and must handle edge cases such as code-switching, uncommon proper nouns, and emotional speech.
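The speaker-embedding step can be illustrated by pooling frame-level features into a single unit-norm conditioning vector. Real systems use a trained speaker-verification encoder; mean pooling here is only a stand-in, and the mel frames are random dummies:

```python
import numpy as np

def speaker_embedding(mel_frames):
    """Pool frame-level features into one unit-norm speaker vector.
    (Stand-in for a trained speaker-verification encoder.)"""
    v = np.asarray(mel_frames, float).mean(axis=0)  # temporal average pooling
    return v / np.linalg.norm(v)                    # unit-normalize

rng = np.random.default_rng(1)
ref = rng.normal(size=(300, 80)) + 3.0  # ~3 s of 80-bin mel frames (dummy data)
emb = speaker_embedding(ref)
print(emb.shape)  # (80,): one fixed-size vector regardless of clip length
```

The key property is that any length of reference audio collapses to a fixed-size vector, which is what lets the acoustic model be conditioned on an arbitrary voice.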

Voice cloning enables unlimited speaker diversity without collecting large per-speaker datasets. Zero-shot cloning extracts a speaker embedding from a 5-30 second reference clip and synthesizes arbitrary text in that voice. The embedding comes from an encoder trained on speaker-verification tasks: it maps a waveform to a vector that captures voice identity while discarding content. Few-shot cloning improves on this: with five or more reference utterances, the embedding becomes more robust. Training over 1000+ speakers yields a shared embedding space in which interpolation gives smooth transitions between voices, and similar encoders power voice conversion (speech-to-speech), which changes the speaker while preserving content.

Applications include personalization (a user's own voice), brand voices (a company spokesperson), and localization (keeping the original speaker across languages); cloning has enabled new business models such as personalized audiobooks and multilingual content. Quality scales with reference length: a 5-second clip gives roughly 3/5 quality, a 30-second clip around 4.5/5, and 3 minutes approaches 5/5. Remaining challenges: background noise in the reference hurts quality (signal-processing preprocessing helps), emotional prosody is hard to transfer while changing the voice, and embedding similarity is an imperfect proxy (output can sound off despite a high score). Production optimizations: cache embeddings for common voices, parallelize extraction, and stream extraction for long audio. Regulatory considerations: require consent for cloning, prevent impersonation, and enforce usage restrictions.
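The "cache embeddings for common voices" optimization can be sketched as a content-addressed cache in front of the extractor; `embed_fn` below is a placeholder for the real embedding model:

```python
import hashlib

class EmbeddingCache:
    """Cache speaker embeddings so repeated requests for the same
    reference clip skip the (expensive) embedding extraction."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # placeholder for the real extractor
        self._cache = {}
        self.misses = 0

    def get(self, audio_bytes):
        # Key on audio content, so identical clips hit regardless of filename
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self.embed_fn(audio_bytes)
        return self._cache[key]

# Usage with a dummy extractor standing in for the real model
cache = EmbeddingCache(embed_fn=lambda audio: [len(audio), 0.5])
cache.get(b"reference-clip")
cache.get(b"reference-clip")  # second call is served from cache
print(cache.misses)  # 1
```

A production version would add eviction (LRU) and persistence, but the hit path is the same: one extraction per distinct reference clip.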

Cross-lingual TTS generalizes to new languages with minimal data through transfer learning and multilingual models. Acoustic models trained on 50+ languages learn shared representations of phonemes, prosody, and speaker characteristics. Zero-shot synthesis in a language absent from training is possible given a phoneme transcription and a language ID; quality drops 10-20% without fine-tuning, but this lets a product launch in new markets before local speaker data exists. Few-shot fine-tuning closes most of the gap: 1-2 hours of native-speaker data and roughly 10 GPU-hours recover 90%+ of quality. Multilingual speaker embeddings also enable cross-language voice cloning: extract an embedding from an English sample and apply it to French text generation.

Code-switching (mixed-language utterances) remains hard: a sentence like "I speak english and français très bien" needs language markers per word or phrase. End-to-end multilingual models struggle here; hybrid approaches (a language detector routing to separate models) improve code-switching quality by roughly 30%. Accent adaptation fine-tunes on accented speaker data, such as Indian English or Brazilian Portuguese variants. Quality varies widely by language: Mandarin reaches about 4.8/5 MOS (its tonal system is well captured), English about 4.6/5, and low-resource languages 3.5-4.0/5, with language-specific challenges such as Arabic diacritics, Japanese mora-based timing, and Hindi consonant clusters. Production systems therefore maintain quality tiers: Tier 1 (high-resource: English, Mandarin, Spanish) at 4.5+ MOS, Tier 2 (medium-resource: French, German, Japanese) at 4.0+, Tier 3 (low-resource) at 3.5+. Transfer learning makes scaling to 100+ languages economical, versus collecting 1000+ hours per language.
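A hybrid code-switching pipeline first splits text into language-tagged runs, then routes each run to its per-language model. A toy segmenter using a lookup-based detector (real systems use a trained language identifier, not word lists):

```python
def tag_languages(words, lexicons):
    """Group consecutive words by detected language so each run can be
    routed to a language-specific TTS model (toy lookup detector)."""
    runs = []
    for word in words:
        # First lexicon containing the word wins; unknown words get "unknown"
        lang = next((l for l, lex in lexicons.items()
                     if word.lower().strip(".,!?") in lex), "unknown")
        if runs and runs[-1][0] == lang:
            runs[-1][1].append(word)   # extend the current same-language run
        else:
            runs.append((lang, [word]))  # start a new run
    return [(lang, " ".join(ws)) for lang, ws in runs]

lexicons = {
    "en": {"i", "speak", "english", "and", "very", "well"},
    "fr": {"français", "très", "bien"},
}
print(tag_languages("I speak english and français très bien".split(), lexicons))
```

Each tagged run would then be synthesized by the matching model and the audio segments concatenated, which is the hybrid approach the paragraph above credits with the code-switching quality gain.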

Model               | Latency      | Voice cloning        | License
ElevenLabs          | ~200 ms TTFB | Yes (3-second clone) | Commercial API
Coqui TTS (XTTS-v2) | ~500 ms      | Yes (6-second clone) | Open source (CPML)
Kokoro TTS          | ~100 ms      | No                   | Apache 2.0
OpenAI TTS          | ~300 ms      | No (preset voices)   | Commercial API