Whisper, Wav2Vec, MusicGen, and AudioLM — architectures and pipelines for speech, music, and audio generation
Automatic Speech Recognition (ASR): Audio → text. Transcription. Whisper is the standard. Works across languages, accents, noise levels.
Text-to-Speech (TTS): Text → audio. Synthesis. Focus on naturalness, speed, voice control (speaker, pitch, rate).
Speaker Diarisation: Who spoke when? Segment audio by speaker identity. Answers "which person said what."
Music Generation: Text/melody → audio. MusicGen, AudioLM. Conditional or unconditional synthesis.
Audio Q&A: Audio + question → answer. Multimodal understanding. Emerging with AudioGPT, Qwen-Audio.
Raw audio is continuous time-domain samples (44.1 kHz = 44,100 samples/sec). Transforms make it usable:
| Representation | Method | Pros | Cons |
|---|---|---|---|
| Mel spectrogram | STFT → log scale | Human perceptual scale | Lossy, fixed time resolution |
| Log-mel | Mel + log amplitude | Good dynamic range | Sensitive to noise floor |
| MFCC | Mel → DCT reduction | Compact, traditional | Information loss, outdated |
| Raw waveform | No transform | Full fidelity | High dimensionality, slow |
| Codec tokens | EnCodec quantisation | Compact, preserves info | Requires pre-trained codec |
Modern approach: Use mel spectrograms for ASR, codec tokens for generation (AudioLM, MusicGen), raw waveform for fine-grained synthesis.
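The mel pipeline in the table can be sketched end-to-end in plain NumPy. A minimal log-mel extractor, assuming Whisper-style parameters (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bands):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):                      # rising edge
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):                      # falling edge
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal, window, FFT -> power, project onto mel bands, take log
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone at 16 kHz -> (frames, mel bands) matrix
sr = 16000
t = np.arange(sr) / sr
S = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)  # (98, 80): 98 frames x 80 mel bands
```

This is what "STFT → log scale" in the table amounts to; production code would use `librosa` or `torchaudio`, which also handle padding and normalisation.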
Whisper (OpenAI): Transformer encoder-decoder. Encoder processes mel spectrograms, decoder generates text tokens. Trained on 680k hours of multilingual audio. Robust to accents, background noise, technical language.
Architecture: Mel spectrogram → Conv downsampler → Transformer encoder → Decoder → text (the medium model uses 24 layers, 1024 hidden; large uses 32 layers, 1280). Decoding is autoregressive.
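A minimal transcription call, assuming the Hugging Face `transformers` package (not something the text specifies); `openai/whisper-tiny` is used only to keep the checkpoint small:

```python
import numpy as np
from transformers import pipeline

# Any Whisper checkpoint works here; tiny keeps the download small
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# File paths are accepted directly; raw arrays must state their sample rate.
# One second of silence stands in for a real recording.
audio = {"raw": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000}
out = asr(audio)
print(out["text"])
```

The pipeline handles the mel-spectrogram conversion and autoregressive decoding internally.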
Wav2Vec 2.0 (Meta): Self-supervised learning on raw waveforms. No labels needed for pretraining — learns latent representations by contrastive learning. Then fine-tune on ASR (with labels) for excellent performance.
Base model (~95M params): ~4% error on clean English, ~10% on noisy. Larger models improve ~1–2% per doubling. Speed: base model ~10× realtime on CPU, 100× on GPU.
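Fine-tuned Wav2Vec 2.0 models emit per-frame character probabilities that are decoded with CTC: collapse repeated ids, then drop blanks. A toy greedy decoder (the three-letter vocabulary is made up for illustration):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Standard greedy CTC: collapse repeated ids, then drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# id 0 is the CTC blank; 1='c', 2='a', 3='t' in this toy vocabulary
vocab = {1: "c", 2: "a", 3: "t"}
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3]   # per-frame argmax over ~20 ms frames
decoded = "".join(vocab[i] for i in ctc_greedy_decode(frames))
print(decoded)  # -> "cat"
```

Note how the blank separates genuine repeats: `[1, 0, 1]` decodes to two tokens, while `[1, 1]` collapses to one.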
| Model | Naturalness | Speed | Voice Control | Best For |
|---|---|---|---|---|
| SpeechT5 | High | Fast (mel) | Speaker embedding | General purpose, voice cloning |
| Bark | Very high | Slow (multi-stage autoregressive) | Multilingual, emotions | High quality, non-realtime |
| XTTS | High | Medium | Voice clone (6s sample) | Multilingual voice cloning |
| Parler TTS | Very high | Fast | Pitch, speed, emotion | Controllable speech |
Note: TTS models trade quality for speed. Bark's multi-stage autoregressive pipeline sounds best but is slow (10–30 s to render 10 s of audio). Lighter models (XTTS, Parler) are faster but slightly less natural.
SpeechT5, XTTS: Give 6–10 second sample of target voice, model learns speaker embedding, synthesises new text in that voice. Works across speakers with minimal data.
MusicGen (Meta): Transformer language model on codec tokens. Text description → music. "Upbeat piano jazz, 120 BPM" → audio. Also supports melody conditioning (hum → full arrangement).
AudioLM (Google): Language model on codec tokens. Unconditional generation from seed. Can continue partial audio. Very flexible.
Architecture: Audio → EnCodec (learned codec) → discrete tokens (integers, like text). Transformer predicts next token. Decode tokens → waveform. Inference is autoregressive — one token step per forward pass — so generation runs well below playback speed.
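The token budget is easy to estimate. Using MusicGen's published codec settings (32 kHz EnCodec at 50 frames/s with 4 residual codebooks):

```python
# Rough token-budget arithmetic for a codec language model
frame_rate = 50       # EnCodec frames per second (MusicGen setting)
n_codebooks = 4       # residual quantiser levels per frame
seconds = 10          # target clip length

tokens_per_second = frame_rate * n_codebooks
total_tokens = tokens_per_second * seconds
print(tokens_per_second, total_tokens)  # 200 tokens/s -> 2000 tokens for 10 s
```

This is why codec tokens are listed as "compact" above: 2,000 tokens describe what would be 320,000 raw samples at 32 kHz.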
MusicGen achieves reasonable diversity and structure (~8–30 second clips). Quality degrades with length — conditioning (text + melody) helps. Unconditional is lower quality but faster.
PyAnnote (Hervé Bredin): Speaker diarisation. Segments audio, labels each segment by speaker ID. Pipeline: voice activity detection → speaker embedding → clustering. No transcripts or speaker labels needed.
Use case: Meeting transcript. Input: audio. Output: "Speaker 1 [0:00–0:15]: ..., Speaker 2 [0:15–0:30]: ..." Enables structured meeting minutes.
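The clustering stage of that pipeline can be illustrated with a toy: assign each segment's embedding to an existing speaker if cosine similarity clears a threshold, else open a new speaker. Real pipelines use agglomerative clustering, and the threshold and embeddings here are made up:

```python
import numpy as np

def cluster_speakers(embeddings, threshold=0.7):
    """Toy online clustering over unit-normalised segment embeddings."""
    centroids, labels = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]   # cosine similarity
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))    # reuse existing speaker
        else:
            centroids.append(e)                    # open a new speaker
            labels.append(len(centroids) - 1)
    return labels

# Two synthetic "speakers": noisy embeddings near two orthogonal directions
rng = np.random.default_rng(0)
a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
segs = [a + 0.05 * rng.normal(size=3) for _ in range(3)] + \
       [b + 0.05 * rng.normal(size=3) for _ in range(3)]
labels = cluster_speakers(segs)
print(labels)  # first three segments share one label, last three another
```

Attaching timestamps from the voice-activity stage to these labels yields exactly the "Speaker 1 [0:00–0:15]" output above.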
Qwen-Audio, AudioGPT: Multimodal LLMs that understand audio. "What instrument plays at 0:30?" → answer. Audio + text prompt → text answer.
Still early, but combined with ASR enables end-to-end audio understanding: transcribe → embed → question answering, all in one system.
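The transcribe-then-answer shape can be shown with a deliberately crude toy: Whisper would produce the transcript, and the "answering" step below is bare word overlap where a real system would call an LLM:

```python
def answer_from_transcript(transcript, question):
    """Toy answerer: return the transcript sentence sharing the most
    content words with the question. A real system replaces this with
    an LLM call over the transcript."""
    stop = {"what", "who", "when", "does", "did", "the", "a"}
    q = set(question.lower().split()) - stop
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & set(s.lower().split())))

transcript = "The piano enters at thirty seconds. The drums start immediately"
print(answer_from_transcript(transcript, "When does the piano enter"))
# -> "The piano enters at thirty seconds"
```

Crude as it is, the decomposition matches the text above: audio → transcript → text-side reasoning.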