Whisper is OpenAI's open-source automatic speech recognition (ASR) model: an encoder-decoder transformer trained on a massive 680,000-hour dataset of multilingual internet audio spanning 96 languages besides English. It reaches near-human accuracy on English, emits segment-level timestamps, and handles a wide range of audio conditions (background noise, accents, technical vocabulary) without fine-tuning. Speaker diarization is not built in, but Whisper's timestamped output pairs well with external diarization tools.
The model comes in five sizes: tiny (39M params), base (74M), small (244M), medium (769M), and large-v3 (1.5B). Each successive size improves accuracy at the cost of inference speed and memory.
| Model | Params | VRAM | Speed (relative to large-v3) | WER (English) |
|---|---|---|---|---|
| tiny | 39M | ~1GB | 32× | ~5% |
| base | 74M | ~1GB | 16× | ~4% |
| small | 244M | ~2GB | 6× | ~3% |
| medium | 769M | ~5GB | 2× | ~2.5% |
| large-v3 | 1.5B | ~10GB | 1× | ~2% |
import whisper
# Load model (downloads automatically on first use)
model = whisper.load_model("large-v3")
# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])
# With language hint for better accuracy on non-English audio
result = model.transcribe("audio.mp3", language="fr", task="transcribe")
# Or translate to English
result = model.transcribe("audio.mp3", language="fr", task="translate")
# Get segment-level timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.1f}s ā {segment['end']:.1f}s] {segment['text']}")
# CLI usage
pip install openai-whisper
whisper audio.mp3 --model large-v3 --output_dir ./transcripts
whisper audio.mp3 --model medium --language French --task translate
The original openai-whisper package is PyTorch-based and not optimized for throughput. faster-whisper reimplements Whisper using CTranslate2, achieving up to 4× faster inference with roughly half the memory footprint.
from faster_whisper import WhisperModel
# Load model (downloads automatically)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.0%})")
for segment in segments:
print(f"[{segment.start:.2f}s ā {segment.end:.2f}s] {segment.text}")
# Batch transcription
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
for f in audio_files:
    segments, _ = model.transcribe(f, language="en", beam_size=1)  # beam_size=1 is fastest
    text = " ".join(s.text for s in segments)
    print(f"{f}: {text[:100]}...")
For CPU-only servers: compute_type="int8" gives a further 2× speedup with minimal quality loss.
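A minimal CPU setup, using the same faster-whisper API as above (the model size is chosen arbitrarily for illustration):

from faster_whisper import WhisperModel

# int8 weight quantization keeps memory low on CPU-only hosts
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
print(" ".join(s.text for s in segments))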
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
# Enable word-level timestamps
segments, info = model.transcribe(
    "interview.mp3",
    word_timestamps=True,
    beam_size=5,
)
# transcribe() returns a generator; materialize it so we can iterate twice below
segments = list(segments)
# Generate SRT subtitles (segment-level)
def format_time(seconds):
    # SRT timestamps use the HH:MM:SS,mmm format
    ms_total = int(seconds * 1000)
    h, rem = divmod(ms_total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("subtitles.srt", "w") as f:
    for i, segment in enumerate(segments, 1):
        f.write(f"{i}\n")
        f.write(f"{format_time(segment.start)} --> {format_time(segment.end)}\n")
        f.write(segment.text.strip() + "\n\n")  # blank line separates SRT entries
# Word-level data for karaoke / highlighting
for segment in segments:
    for word in segment.words:
        print(f"  '{word.word}': {word.start:.2f}s-{word.end:.2f}s (prob={word.probability:.2f})")
Word timestamps are useful for subtitle generation, audio search indexing, and finding specific moments in long recordings.
Whisper's multilingual capability comes from its diverse training data. The large-v3 model is near human-level on English (~2% WER), French, German, and Spanish, and performs well on 90+ other languages. Performance degrades on low-resource languages with limited training data.
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Auto-detect language
segments, info = model.transcribe("unknown_language.mp3")
print(f"Detected: {info.language} ({info.language_probability:.0%})")
# Transcribe code-switched audio (multiple languages in one recording)
# Whisper handles this naturally: just leave language=None for auto-detect
# Translate non-English to English
from whisper import load_model
model = load_model("large-v3")
result = model.transcribe("spanish_meeting.mp3", task="translate")
print(result["text"]) # English translation
Three common patterns for deploying Whisper in production:
Synchronous API: FastAPI endpoint that accepts audio files and returns transcript. Good for <60s audio and low-latency requirements.
Async queue: For long recordings (>5 min), use a job queue (Celery, Redis Queue). The client submits audio, gets a job ID, and polls for completion. Better for batch transcription pipelines; a minimal worker sketch follows the endpoint example below.
Streaming (WhisperLive): For real-time transcription, WhisperLive or whisper-streaming transcribe audio as it arrives, using Voice Activity Detection (VAD) to segment the stream. Latency is typically 1-3s.
import io

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
app = FastAPI()
model = WhisperModel("medium", device="cuda", compute_type="float16")
@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()
    # faster-whisper accepts file-like objects, so no temp file is needed
    segments, info = model.transcribe(io.BytesIO(audio_bytes))
    return {"text": " ".join(s.text for s in segments), "language": info.language}
Audio preprocessing: Whisper expects 16kHz mono audio. For other formats (MP3, MP4, M4A), it uses ffmpeg internally, so ensure ffmpeg is installed. For very noisy audio, pre-process with noise reduction (e.g. the noisereduce library) to improve accuracy.
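To normalize files up front instead of relying on the internal conversion, a standard ffmpeg invocation does the job (filenames are placeholders):

ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le output.wav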
Long audio chunking: Whisper processes 30-second chunks. For files longer than 30 minutes, faster-whisper handles chunking automatically. The original whisper library can have issues with very long files ā use faster-whisper for anything > 5 minutes.
Hallucination on silence: Whisper sometimes generates phantom text for silent or very quiet audio segments. Use VAD (Voice Activity Detection) to skip silent segments: model.transcribe(audio, vad_filter=True) in faster-whisper.
GPU memory: large-v3 needs ~10GB VRAM. For GPUs with less than 8GB, use medium (~5GB) or small (~2GB). On CPU, transcription is ~15-50× slower than real-time, so it is only practical for offline batch processing.
OpenAI Whisper's model family spans five sizes from tiny (39M parameters) to large-v3 (1.5B parameters), each offering different trade-offs between transcription accuracy, language support, and inference speed. Selecting the right size for a use case avoids both over-engineering (using large on short English clips) and under-engineering (using tiny on accented multilingual audio).
| Model | Parameters | English WER | Multilingual quality | Speed (relative to large-v3) |
|---|---|---|---|---|
| tiny | 39M | ~5.7% | Basic | ~32× |
| base | 74M | ~4.3% | Good | ~16× |
| small | 244M | ~3.4% | Very good | ~6× |
| medium | 769M | ~3.0% | Excellent | ~2× |
| large-v3 | 1550M | ~2.7% | Best | ~1× |
Faster-Whisper and WhisperX are optimized inference implementations that achieve 4-8× speed improvements over the original Whisper implementation through CTranslate2 backend optimization, batched audio processing, and word-level timestamp alignment. For production transcription pipelines processing hours of audio, these optimizations reduce infrastructure costs dramatically. WhisperX additionally provides accurate word-level timestamps by aligning Whisper's transcription with a forced alignment model, enabling precise subtitle generation and speaker diarization integration.
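A sketch of the WhisperX flow, following its documented quickstart (argument names can shift between releases, so treat this as indicative):

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# Batched transcription via the CTranslate2 backend
model = whisperx.load_model("large-v3", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Forced alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(result["segments"][0]["words"])  # each word carries start/end times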
Whisper's preprocessing pipeline converts raw audio to log-mel spectrograms (80 channels for most sizes, 128 for large-v3) computed over 30-second windows, regardless of the actual audio content in each window. For short audio clips under 30 seconds, zero-padding fills the remaining window. This fixed 30-second window design means that a 5-second utterance consumes the same compute as a full 30-second segment, an inefficiency that faster-whisper addresses by detecting voice activity and only processing segments that contain speech, skipping silent padding entirely.
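The reference openai-whisper package exposes this pipeline directly, which makes the fixed window easy to see:

import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")   # decode and resample to 16kHz mono
audio = whisper.pad_or_trim(audio)        # zero-pad or cut to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio)  # shape (80, 3000) for 80 mel bins
print(mel.shape)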
Whisper's language identification capability runs on the first 30-second window of audio and outputs a probability distribution over 99 supported languages. For multilingual transcription pipelines, enabling automatic language detection eliminates the need to route audio to language-specific models and allows the same Whisper model to handle mixed-language content. One caveat: the reference implementation detects the language once on that first window and keeps it for the whole file, but the model itself emits a language token per 30-second segment, so implementations that re-detect per segment can switch languages mid-transcript when speakers alternate between languages.
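The reference package exposes detection as a separate call that returns the full distribution:

import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Probability for every supported language token
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")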
Hallucination in Whisper transcription is a known failure mode that manifests as plausible-sounding but incorrect words inserted into otherwise silent or very quiet audio segments. This occurs because Whisper was trained on noisy internet data and learned to fill silence with contextually plausible phrases rather than producing empty transcripts. Voice activity detection (VAD) preprocessing, which drops non-speech segments before they reach the model (faster-whisper's vad_filter runs the Silero VAD model), significantly reduces hallucination by preventing the model from transcribing background noise and silence as speech.
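In faster-whisper this is a single flag; the tuning value below is illustrative rather than a recommendation:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "meeting.mp3",
    vad_filter=True,  # run Silero VAD before transcription
    vad_parameters={"min_silence_duration_ms": 500},  # treat short pauses as speech
)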
Batch transcription throughput scales almost linearly with batch size on GPU, making offline transcription pipelines dramatically faster than processing audio files sequentially. The limiting factor is GPU memory, which constrains how many audio segments can be processed simultaneously. For the medium Whisper model, batch sizes of 8ā16 on a single A10G GPU achieve near-full GPU utilization. Dynamic batching that groups audio segments of similar length minimizes padding waste and maximizes throughput for heterogeneous audio files.
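Recent faster-whisper releases ship a batched pipeline that implements this; the batch size here is a starting point to tune against available VRAM:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched = BatchedInferencePipeline(model=model)

# Segments within the file are grouped into GPU batches
segments, info = batched.transcribe("long_recording.mp3", batch_size=16)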
Temperature sampling in Whisper transcription controls how the model handles uncertain audio. At temperature 0, Whisper decodes greedily or with beam search, always preferring the most probable tokens, which is fast but occasionally produces confident wrong transcriptions. At higher temperatures, the model samples alternative transcriptions instead, which can recover from early wrong guesses. Whisper's default inference pipeline uses temperature fallback: starting at 0 and increasing temperature if the compression ratio or log probability score suggests the transcription may be unreliable.
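The fallback ladder and its quality thresholds are exposed on transcribe() in faster-whisper (openai-whisper has equivalents); the values shown are the library defaults:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.mp3",
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],  # try 0 first, escalate on failure
    compression_ratio_threshold=2.4,  # retry if output looks too repetitive
    log_prob_threshold=-1.0,          # retry if average logprob is too low
)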