Multimodal · Audio

Audio & Speech AI Models

Whisper, Wav2Vec, MusicGen, and AudioLM — architectures and pipelines for speech, music, and audio generation

Contents
  1. Task taxonomy
  2. Representations
  3. Speech recognition
  4. Text-to-speech
  5. Audio generation
  6. Diarisation & Q&A
  7. Tools & libraries
  8. References
01 — Landscape

Audio Task Taxonomy

Automatic Speech Recognition (ASR): Audio → text. Transcription. Whisper is the standard. Works across languages, accents, noise levels.

Text-to-Speech (TTS): Text → audio. Synthesis. Focus on naturalness, speed, voice control (speaker, pitch, rate).

Speaker Diarisation: Who spoke when? Segment audio by speaker identity. Answers "which person said what."

Music Generation: Text/melody → audio. MusicGen, AudioLM. Conditional or unconditional synthesis.

Audio Q&A: Audio + question → answer. Multimodal understanding. Emerging with AudioGPT, Qwen-Audio.

💡 Key distinction: ASR and TTS are inverses: ASR discards acoustic detail (audio → symbolic text), TTS restores it (text → detailed audio). Diarisation is alignment (who + when). Generation is synthesis, either conditioned on text or melody, or from scratch.
02 — Encodings

Audio Representations

Raw audio is continuous time-domain samples (44.1 kHz = 44,100 samples/sec). Transforms make it usable:

| Representation  | Method                | Pros                    | Cons                         |
|-----------------|-----------------------|-------------------------|------------------------------|
| Mel spectrogram | STFT → mel filterbank | Human perceptual scale  | Lossy, fixed time resolution |
| Log-mel         | Mel + log amplitude   | Good dynamic range      | Sensitive to noise floor     |
| MFCC            | Mel → DCT reduction   | Compact, traditional    | Information loss, dated      |
| Raw waveform    | No transform          | Full fidelity           | High dimensionality, slow    |
| Codec tokens    | EnCodec quantisation  | Compact, preserves info | Requires pre-trained codec   |

Modern approach: Use mel spectrograms for ASR, codec tokens for generation (AudioLM, MusicGen), raw waveform for fine-grained synthesis.
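The mel-spectrogram pipeline (frame → window → FFT → triangular mel filterbank → log) can be sketched in plain numpy. The parameters used here (16 kHz, 25 ms windows, 10 ms hop, 40 bands) are common ASR defaults, not tied to any particular model:

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Compute a log-mel spectrogram from a waveform (numpy sketch)."""
    # 1. Frame the signal and apply a Hann window.
    n_frames = 1 + (len(wave) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2. STFT magnitude -> power spectrum, shape (n_frames, n_fft//2 + 1).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank (linear below ~1 kHz, log-spaced above).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Apply the filterbank and take log (floored for stability).
    mel = power @ fbank.T
    return np.log(np.maximum(mel, 1e-10))  # shape: (n_frames, n_mels)

# 1 second of a 440 Hz tone at 16 kHz -> ~100 frames/sec, 40 mel bands
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 40)
```

Production code would use librosa or torchaudio, which add refinements (padding, normalisation, different filterbank conventions), but the shape of the computation is the same.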

03 — ASR

Speech Recognition

Whisper (OpenAI): Transformer encoder-decoder. Encoder processes mel spectrograms, decoder generates text tokens. Trained on 680k hours of multilingual audio. Robust to accents, background noise, technical language.

Architecture: Mel spectrogram → Conv downsampler → Transformer encoder → Transformer decoder → text. Decoding is autoregressive. Depth and width scale with model size (e.g. medium: 24 encoder and 24 decoder layers, 1024-d; large: 32 layers, 1280-d).

Wav2Vec 2.0 (Meta): Self-supervised learning on raw waveforms. No labels needed for pretraining — learns latent representations by contrastive learning. Then fine-tune on ASR (with labels) for excellent performance.
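Wav2Vec 2.0's pretraining objective can be sketched as an InfoNCE contrastive loss: for each masked frame, the model must pick the true quantised latent out of a set of distractors. A toy numpy version with cosine similarity and in-batch distractors (the real model adds a learned quantiser and a codebook-diversity loss):

```python
import numpy as np

def info_nce_loss(context, targets, temperature=0.1):
    """Contrastive loss over latent frames (wav2vec 2.0-style sketch).

    context: (T, D) transformer outputs at masked positions.
    targets: (T, D) quantised latents; row t is the positive for context[t],
             all other rows act as distractors.
    """
    # Cosine similarity between every context vector and every target.
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    q = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sim = (c @ q.T) / temperature                   # (T, T)
    # Softmax cross-entropy with the diagonal as the positive class.
    logits = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce_loss(z, z))                          # aligned: loss near zero
print(info_nce_loss(z, rng.normal(size=(8, 16))))   # random targets: high loss
```

Minimising this loss forces the context network to predict the content of masked audio, which is why the learned representations transfer so well to ASR after fine-tuning.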

Python: Whisper ASR

import whisper

model = whisper.load_model("base")  # tiny, base, small, medium, large
result = model.transcribe(
    "audio.mp3",
    language="en",
    task="transcribe",  # or "translate" (to English)
)
print(result["text"])  # full transcript
# result["segments"] = [{"text": "...", "start": 0, "end": 2.5}, ...]

Whisper Performance

Base model (74M params): roughly 4–5% word error rate on clean English, ~10% on noisy audio. Each step up in model size shaves off another 1–2%. Speed: the base model runs around 10× realtime on CPU and ~100× on GPU.
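The error figures quoted for ASR are word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

# One substitution ("the" -> "a") over six reference words:
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # -> 1/6 ≈ 0.167
```

Note that WER can exceed 100% when the hypothesis inserts many extra words, which is one reason hallucinated output (below) is so damaging to this metric.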

⚠️ Whisper quirk: it can hallucinate repeated phrases or entire invented sentences, especially on silence or non-speech audio. Trimming silence with voice activity detection before transcription reduces hallucination. For critical applications, validate the output or use a Wav2Vec model fine-tuned on domain-specific data.
04 — TTS

Text-to-Speech Models

| Model      | Naturalness | Speed                 | Voice control             | Best for                       |
|------------|-------------|-----------------------|---------------------------|--------------------------------|
| SpeechT5   | High        | Fast (mel + vocoder)  | Speaker embedding         | General purpose, voice cloning |
| Bark       | Very high   | Slow (autoregressive) | Multilingual, emotions    | High quality, non-realtime     |
| XTTS       | High        | Medium                | Voice clone (~6 s sample) | Multilingual voice cloning     |
| Parler-TTS | Very high   | Fast                  | Pitch, speed, emotion     | Controllable speech            |

Attention: TTS models trade quality for speed. Bark sounds best but is slow (10–30 s to synthesise 10 s of audio, since it generates audio tokens autoregressively); XTTS and Parler-TTS are faster but slightly less natural.

Voice Cloning

SpeechT5, XTTS: give a 6–10 second sample of the target voice; the model extracts a speaker embedding and synthesises new text in that voice. Works across speakers with minimal data.

05 — Generation

Audio & Music Generation

MusicGen (Meta): Transformer language model on codec tokens. Text description → music. "Upbeat piano jazz, 120 BPM" → audio. Also supports melody conditioning (hum → full arrangement).

AudioLM (Google): Language model on codec tokens. Unconditional generation from seed. Can continue partial audio. Very flexible.

Architecture: Audio → EnCodec (learned codec) → discrete tokens (integers, like text). A Transformer predicts the next token; decoding the tokens back through the codec yields the waveform. Inference is autoregressive, so generating a clip typically takes seconds to tens of seconds on a GPU.
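The quantisation step inside EnCodec is residual vector quantisation (RVQ): each codebook stage quantises the residual left over by the previous stage, so one latent frame becomes a short stack of integers. A toy numpy sketch with random (not learned) codebooks:

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual vector quantisation (toy sketch of the EnCodec idea).

    frames:    (T, D) latent frames to quantise.
    codebooks: list of (K, D) arrays; each stage quantises the residual
               left by the previous one. Returns (T, n_stages) int tokens.
    """
    residual = frames.copy()
    tokens = []
    for cb in codebooks:
        # Nearest codeword for each frame's current residual.
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = d.argmin(axis=1)
        tokens.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next stage
    return np.stack(tokens, axis=1), residual

def rvq_decode(tokens, codebooks):
    """Reconstruct frames by summing the selected codeword at every stage."""
    return sum(cb[tokens[:, s]] for s, cb in enumerate(codebooks))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]
tokens, _ = rvq_encode(frames, codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens.shape)  # (100, 4): four integer tokens per frame
```

In the real codec the codebooks are learned end-to-end with the encoder and decoder, so reconstruction is far more accurate; the sketch only shows why each audio frame ends up as a handful of token IDs a language model can predict.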

Python: MusicGen

from audiocraft.models import MusicGen

model = MusicGen.get_pretrained("facebook/musicgen-large")
model.set_generation_params(duration=8)  # seconds

descriptions = ["upbeat electronic dance", "classical piano, sad"]
wav = model.generate(descriptions)
# wav shape: (batch=2, channels=1, samples=32000 * 8) at 32 kHz

Quality & Control

MusicGen achieves reasonable diversity and structure (~8–30 second clips). Quality degrades with length — conditioning (text + melody) helps. Unconditional is lower quality but faster.

06 — Understanding

Diarisation & Audio Q&A

PyAnnote (Hervé Bredin): Speaker diarisation. Segments audio and labels each segment with an anonymous speaker ID. Pipeline: voice activity detection → speaker embeddings → clustering. No transcription or speaker enrolment needed.

Use case: Meeting transcript. Input: audio. Output: "Speaker 1 [0:00–0:15]: ..., Speaker 2 [0:15–0:30]: ..." Enables structured meeting minutes.
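Producing that output means aligning ASR segments with diarisation turns: assign each transcript segment to the speaker whose turn overlaps it most. A pure-Python sketch, assuming Whisper-style segment dicts and pyannote-style (start, end, speaker) turns; the data shown is toy input, not real pipeline output:

```python
def attribute_speakers(segments, turns):
    """Assign each transcript segment to the speaker whose diarisation
    turn overlaps it the most (sketch; both inputs use seconds).

    segments: [{"text": str, "start": float, "end": float}, ...]  (ASR)
    turns:    [(start, end, speaker_id), ...]                     (diarisation)
    """
    out = []
    for seg in segments:
        overlaps = {}
        for start, end, spk in turns:
            ov = min(seg["end"], end) - max(seg["start"], start)
            if ov > 0:  # accumulate overlap per speaker
                overlaps[spk] = overlaps.get(spk, 0.0) + ov
        speaker = max(overlaps, key=overlaps.get) if overlaps else "unknown"
        out.append((speaker, seg["start"], seg["end"], seg["text"]))
    return out

segments = [{"text": "Hello everyone.", "start": 0.0, "end": 2.0},
            {"text": "Thanks for joining.", "start": 2.0, "end": 4.5}]
turns = [(0.0, 2.1, "SPEAKER_00"), (2.1, 5.0, "SPEAKER_01")]
for spk, s, e, text in attribute_speakers(segments, turns):
    print(f"{spk} [{s:.1f}-{e:.1f}]: {text}")
```

Overlap-based voting like this is robust to the small boundary disagreements that Whisper and PyAnnote inevitably have, since a segment only needs a plurality of overlap to be attributed.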

Python: PyAnnote Diarisation

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="hf_...",  # gated model: requires a Hugging Face token
)
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}s {turn.end:.2f}s: {speaker}")

Audio Q&A (Emerging)

Qwen-Audio, AudioGPT: Multimodal LLMs that understand audio. "What instrument plays at 0:30?" → answer. Audio + text prompt → text answer.

Still early, but combined with ASR enables end-to-end audio understanding: transcribe → embed → question answering, all in one system.

💡 Audio is underexplored: While vision and text have massive multimodal models, audio-LLM systems are nascent. Opportunity: combine Whisper (ASR) + LLM (reasoning) + TTS (output) for voice assistants.
07 — Ecosystem

Tools & Libraries

ASR
Whisper
OpenAI's speech recognition. Multilingual, robust, easy to use.
ASR
faster-whisper
CTranslate2 reimplementation with quantisation. Up to ~4× faster than openai/whisper.
Speech
SpeechBrain
PyTorch toolkit. ASR, speaker ID, speech enhancement.
Diarisation
PyAnnote
Speaker diarisation. State-of-the-art, easy API.
TTS
Bark
Text-to-speech with emotion. High quality, HF compatible.
Music
MusicGen
Meta's music generation. Text-to-music with melody conditioning.
Audio
EnCodec
Neural codec. Compresses audio into compact discrete tokens (lossy).
Audio
Transformers
HuggingFace audio models. Whisper, MusicGen, etc.
08 — Further Reading

References
