Whisper, Wav2Vec, MusicGen, and AudioLM — architectures and pipelines for speech, music, and audio generation
Automatic Speech Recognition (ASR): Audio → text. Transcription. Whisper is the standard. Works across languages, accents, noise levels.
Text-to-Speech (TTS): Text → audio. Synthesis. Focus on naturalness, speed, voice control (speaker, pitch, rate).
Speaker Diarisation: Who spoke when? Segment audio by speaker identity. Answers "which person said what."
Music Generation: Text/melody → audio. MusicGen, AudioLM. Conditional or unconditional synthesis.
Audio Q&A: Audio + question → answer. Multimodal understanding. Emerging with AudioGPT, Qwen-Audio.
Raw audio is continuous time-domain samples (44.1 kHz = 44,100 samples/sec). Transforms make it usable:
| Representation | Method | Pros | Cons |
|---|---|---|---|
| Mel spectrogram | STFT → log scale | Human perceptual scale | Lossy, fixed time resolution |
| Log-mel | Mel + log amplitude | Good dynamic range | Sensitive to noise floor |
| MFCC | Mel → DCT reduction | Compact, traditional | Information loss, outdated |
| Raw waveform | No transform | Full fidelity | High dimensionality, slow |
| Codec tokens | EnCodec quantisation | Compact, preserves info | Requires pre-trained codec |
Modern approach: Use mel spectrograms for ASR, codec tokens for generation (AudioLM, MusicGen), raw waveform for fine-grained synthesis.
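The mel pipeline in the table can be sketched end-to-end in plain NumPy. A minimal log-mel extractor, assuming Whisper-style parameters (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bands):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):                      # rising edge
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                      # falling edge
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal, window, FFT -> power, project onto mel bands, take log
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone at 16 kHz -> (frames, mel bands) matrix
sr = 16000
t = np.arange(sr) / sr
S = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)  # (98, 80): 98 frames x 80 mel bands
```

This is what "STFT → log scale" in the table amounts to; production code would use `librosa` or `torchaudio`, which also handle padding and normalisation.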
Whisper (OpenAI): Transformer encoder-decoder. Encoder processes mel spectrograms, decoder generates text tokens. Trained on 680k hours of multilingual audio. Robust to accents, background noise, technical language.
Architecture: Mel spectrogram → Conv downsampler → Transformer encoder → Decoder → text (the medium model uses 24 layers, 1024 hidden; large uses 32 layers, 1280). Decoding is autoregressive.
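A minimal transcription call, assuming the Hugging Face `transformers` package (not something the text specifies); `openai/whisper-tiny` is used only to keep the checkpoint small:

```python
import numpy as np
from transformers import pipeline

# Any Whisper checkpoint works here; tiny keeps the download small
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# File paths are accepted directly; raw arrays must state their sample rate.
# One second of silence stands in for a real recording.
audio = {"raw": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}
out = asr(audio)
print(out["text"])
```

The pipeline handles the mel-spectrogram conversion and autoregressive decoding internally.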
Wav2Vec 2.0 (Meta): Self-supervised learning on raw waveforms. No labels needed for pretraining — learns latent representations by contrastive learning. Then fine-tune on ASR (with labels) for excellent performance.
Base model (~95M params): ~4% error on clean English, ~10% on noisy. Larger models improve ~1–2% per doubling. Speed: base model ~10× realtime on CPU, 100× on GPU.
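Fine-tuned Wav2Vec 2.0 models emit per-frame character probabilities that are decoded with CTC: collapse repeated ids, then drop blanks. A toy greedy decoder (the three-letter vocabulary is made up for illustration):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard greedy CTC: collapse repeated ids, then drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# id 0 is the CTC blank; 1='c', 2='a', 3='t' in this toy vocabulary
vocab = {1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]   # per-frame argmax over ~20 ms frames
decoded = "".join(vocab[i] for i in ctc_greedy_decode(frames))
print(decoded)  # -> "cat"
```

Note how the blank separates genuine repeats: `[1, 0, 1]` decodes to two tokens, while `[1, 1]` collapses to one.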
| Model | Naturalness | Speed | Voice Control | Best For |
|---|---|---|---|---|
| SpeechT5 | High | Fast (mel) | Speaker embedding | General purpose, voice cloning |
| Bark | Very high | Slow (multi-stage autoregressive) | Multilingual, emotions | High quality, non-realtime |
| XTTS | High | Medium | Voice clone (6s sample) | Multilingual voice cloning |
| Parler TTS | Very high | Fast | Pitch, speed, emotion | Controllable speech |
Note: TTS models trade quality for speed. Bark's multi-stage autoregressive pipeline sounds best but is slow (10–30 s to render 10 s of audio). Lighter models (XTTS, Parler) are faster but slightly less natural.
SpeechT5, XTTS: Give 6–10 second sample of target voice, model learns speaker embedding, synthesises new text in that voice. Works across speakers with minimal data.
MusicGen (Meta): Transformer language model on codec tokens. Text description → music. "Upbeat piano jazz, 120 BPM" → audio. Also supports melody conditioning (hum → full arrangement).
AudioLM (Google): Language model on codec tokens. Unconditional generation from seed. Can continue partial audio. Very flexible.
Architecture: Audio → EnCodec (learned codec) → discrete tokens (integers, like text). Transformer predicts next token. Decode tokens → waveform. Inference is autoregressive — one token step per forward pass — so generation runs well below playback speed.
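The token budget is easy to estimate. Using MusicGen's published codec settings (32 kHz EnCodec at 50 frames/s with 4 residual codebooks):

```python
# Rough token-budget arithmetic for a codec language model
frame_rate = 50       # EnCodec frames per second (MusicGen setting)
n_codebooks = 4       # residual quantiser levels per frame
seconds = 10          # target clip length

tokens_per_second = frame_rate * n_codebooks
total_tokens = tokens_per_second * seconds
print(tokens_per_second, total_tokens)  # 200 tokens/s -> 2000 tokens for 10 s
```

This is why codec tokens are listed as "compact" above: 2,000 tokens describe what would be 320,000 raw samples at 32 kHz.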
MusicGen achieves reasonable diversity and structure (~8–30 second clips). Quality degrades with length — conditioning (text + melody) helps. Unconditional is lower quality but faster.
PyAnnote (Hervé Bredin): Speaker diarisation. Segments audio, labels each segment by speaker ID. Pipeline: voice activity detection → speaker embedding → clustering. No transcripts or speaker labels needed.
Use case: Meeting transcript. Input: audio. Output: "Speaker 1 [0:00–0:15]: ..., Speaker 2 [0:15–0:30]: ..." Enables structured meeting minutes.
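The clustering stage of that pipeline can be illustrated with a toy: assign each segment's embedding to an existing speaker if cosine similarity clears a threshold, else open a new speaker. Real pipelines use agglomerative clustering, and the threshold and embeddings here are made up:

```python
import numpy as np

def cluster_speakers(embeddings, threshold=0.7):
    """Toy online clustering over unit-normalised segment embeddings."""
    centroids, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]   # cosine similarity
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))    # reuse existing speaker
        else:
            centroids.append(e)                    # open a new speaker
            labels.append(len(centroids) - 1)
    return labels

# Two synthetic "speakers": noisy embeddings near two orthogonal directions
rng = np.random.default_rng(0)
a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
segs = [a + 0.05 * rng.normal(size=3) for _ in range(3)] + \
       [b + 0.05 * rng.normal(size=3) for _ in range(3)]
labels = cluster_speakers(segs)
print(labels)  # first three segments share one label, last three another
```

Attaching timestamps from the voice-activity stage to these labels yields exactly the "Speaker 1 [0:00–0:15]" output above.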
Qwen-Audio, AudioGPT: Multimodal LLMs that understand audio. "What instrument plays at 0:30?" → answer. Audio + text prompt → text answer.
Still early, but combined with ASR enables end-to-end audio understanding: transcribe → embed → question answering, all in one system.
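The transcribe-then-answer shape can be shown with a deliberately crude toy: Whisper would produce the transcript, and the "answering" step below is bare word overlap where a real system would call an LLM:

```python
def answer_from_transcript(transcript, question):
    """Toy answerer: return the transcript sentence sharing the most
    content words with the question. A real system replaces this with
    an LLM call over the transcript."""
    stop = {"what", "who", "when", "does", "did", "the", "a"}
    q = set(question.lower().split()) - stop
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & set(s.lower().split())))

transcript = "The piano enters at thirty seconds. The drums start immediately"
print(answer_from_transcript(transcript, "When does the piano enter"))
# -> "The piano enters at thirty seconds"
```

Crude as it is, the decomposition matches the text above: audio → transcript → text-side reasoning.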