Multimodal AI

AudioCraft

Meta's open-source audio generation suite — MusicGen, AudioGen, and EnCodec — enabling high-quality music and sound synthesis from text prompts using transformer-based models.

Model: MusicGen / AudioGen
Max duration: 30 s (base), longer with continuation
Codec: EnCodec @ 32 kHz
License: CC-BY-NC (research)

SECTION 01

What Is AudioCraft?

AudioCraft is Meta AI's unified framework for audio generation released in 2023. It bundles three components: MusicGen for music synthesis, AudioGen for general sound effects, and EnCodec as the shared neural audio codec that converts waveforms to discrete tokens and back. All models are transformer-based and conditioned on text descriptions.

SECTION 02

MusicGen

MusicGen generates music from text prompts (and optionally a melody reference). It uses a single-stage auto-regressive transformer over EnCodec tokens — no cascaded diffusion. Model sizes range from 300M to 3.3B parameters. The small model fits on a consumer GPU; the large model requires ~16 GB VRAM. Generation is typically 10–30 seconds at 32 kHz.

# Install and generate a short music clip
# pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=10)
wav = model.generate(['upbeat jazz piano, 120 BPM, live recording feel'])
audio_write('output', wav[0].cpu(), model.sample_rate, strategy='loudness')

SECTION 03

AudioGen

AudioGen is trained on environmental and sound-effect datasets rather than music. It synthesizes footsteps, rain, city ambience, alarms, or any described acoustic scene. The architecture mirrors MusicGen, but it uses an EnCodec checkpoint tuned for general audio (audiogen-medium operates at 16 kHz) rather than music-specific frequencies.

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)
wav = model.generate(['dog barking in an empty hallway'])
audio_write('dog_bark', wav[0].cpu(), model.sample_rate, strategy='clip')

SECTION 04

EnCodec

EnCodec is Meta's neural audio codec — it compresses waveforms into discrete codebook tokens at multiple bitrates (1.5 to 24 kbps). Compression is residual-quantization-based, enabling reconstruction at high quality. Both MusicGen and AudioGen use EnCodec tokens as their modeling vocabulary. EnCodec can also be used standalone for audio compression tasks.
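The residual-quantization idea behind EnCodec can be illustrated with a toy sketch (plain NumPy, not EnCodec's actual implementation): each stage quantizes the residual left over by the previous stage, so adding codebooks refines the reconstruction, which is how the codec trades bitrate for quality.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual vector quantization: each stage quantizes the residual
    left by the previous stage, so later codes refine the reconstruction."""
    codes, residual = [], x.copy()
    for cb in codebooks:
        # Pick the codebook entry closest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is simply the sum of the selected entry from each stage
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
dim, entries, stages = 4, 16, 4
# Include a zero entry per codebook so a stage can "pass"; then the
# reconstruction error never grows as more stages are added
codebooks = [np.vstack([np.zeros(dim), rng.normal(size=(entries, dim))])
             for _ in range(stages)]
x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
err = np.linalg.norm(x - rvq_decode(codes, codebooks))
```

Real codecs learn the codebooks during training; the point here is only the stage-wise residual structure that lets EnCodec drop later codebooks to hit lower bitrates.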

SECTION 05

Usage

AudioCraft is available on HuggingFace Hub under the facebook/ namespace. The audiocraft Python package wraps model loading, generation, and audio I/O. For production pipelines, models can be served with TorchServe or integrated into real-time applications via streaming generation (experimental). There is also a Gradio demo hosted on HF Spaces for no-code experimentation.

SECTION 06

Practical Limits

AudioCraft model weights are released under CC-BY-NC 4.0 (the code itself is MIT-licensed); commercial use of the weights requires a separate agreement with Meta. Generation is compute-intensive: the large MusicGen model takes roughly 10 seconds to generate 10 seconds of audio on an A100. Temporal coherence degrades beyond about 30 seconds; for longer pieces, use windowed continuation. The models are not real-time by default; for live applications, consider ONNX export or streaming decoding with smaller model variants.
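The windowed-continuation pattern can be sketched independently of the model. The stitching helper below is hypothetical; `fake_generate` stands in for MusicGen's `generate`/`generate_continuation`, which return the prompt followed by the newly generated audio.

```python
import numpy as np

def windowed_generate(generate_fn, total_secs, overlap_secs, sr):
    """Build a track longer than one window by repeatedly feeding the tail
    of the audio so far back in as the prompt for the next window."""
    full = generate_fn(prompt=None)  # first window, shape [T]
    while full.shape[-1] < total_secs * sr:
        prompt = full[-overlap_secs * sr:]
        window = generate_fn(prompt=prompt)      # returns prompt + new audio
        full = np.concatenate([full, window[prompt.shape[-1]:]])
    return full[: total_secs * sr]

# Stand-in for the model: returns the prompt followed by 30 s of "new" audio
SR = 100  # toy sample rate to keep the example fast
def fake_generate(prompt=None):
    head = prompt if prompt is not None else np.empty(0)
    return np.concatenate([head, np.random.default_rng(0).normal(size=30 * SR)])

track = windowed_generate(fake_generate, total_secs=90, overlap_secs=10, sr=SR)
```

With the real model, a longer overlap gives smoother transitions at the cost of generating more redundant audio per window.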

SECTION 07

MusicGen Model Architecture

MusicGen uses a three-stage approach: a text encoder (T5) converts prompts to embeddings, a frozen EnCodec model compresses audio into discrete codes, and a transformer-based language model generates those codes autoregressively. The discrete-token approach keeps training tractable on commodity GPUs while maintaining high fidelity.

from audiocraft.models import MusicGen
from audiocraft.data.audio_utils import convert_audio
import torchaudio

# Load the melody-conditioned variant (required for melody reference input)
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Text conditions: one clip is generated per description
descriptions = ['upbeat synth pop', 'chill lofi hip-hop']

# Load a reference melody and convert it to the model's rate and channels
waveform, sample_rate = torchaudio.load('melody.wav')
melody = convert_audio(waveform, sample_rate, model.sample_rate, model.audio_channels)

# Generate 30 s of music per description, conditioned on the melody's chroma
wav = model.generate_with_chroma(
    descriptions, melody[None].expand(len(descriptions), -1, -1), model.sample_rate)

Fine-Tuning on Custom Data

While the pre-trained models cover broad musical styles, fine-tuning on domain-specific data (e.g., game soundtracks or corporate jingles) improves stylistic coherence. AudioCraft ships training scripts and dataset-management utilities for this.

# Fine-tuning runs through AudioCraft's Dora-based training pipeline rather
# than a solver constructed directly in Python. After preparing a dataset
# manifest (see the repository's training docs), launch from the shell, e.g.:
#
#   dora run solver=musicgen/musicgen_base_32khz \
#       model/lm/model_scale=small \
#       continue_from=//pretrained/facebook/musicgen-small \
#       dset=audio/my_custom_dataset
#
# Training requires a GPU; the dataset name here is illustrative.

SECTION 08

Audio Quality & Inference Optimization

MusicGen generates at 32 kHz (written out as 16-bit PCM WAV). Resampling to 44.1 kHz for CD-style delivery is possible in post-processing, but it adds no spectral content beyond what the 32 kHz model produces. Inference time scales with sequence length; a 60-second track (built via continuation) takes roughly 3–5× longer than a 10-second clip on the same GPU because generation is autoregressive.

Model           VRAM (GB)   10 s clip (s)   60 s clip (s)   Best for
Small (300M)    2           3–4             18–25           Real-time web apps
Medium (1.5B)   5           5–7             30–50           Batch generation, balanced quality
Large (3.3B)    10          10–15           60–90           Highest quality, offline rendering

Practical Deployment Strategy: Deploy MusicGen as a long-running gRPC service behind a queue. Users submit generation requests, which are enqueued and processed by a pool of GPU workers. The Small model on a single V100 can generate roughly 50 unique 10-second clips per hour in batch mode. For cost-sensitive applications, use model quantization (via torchao or bitsandbytes) to reduce memory, though quality degradation is noticeable on longer sequences.

Consider caching generated audio indexed by prompt hash; identical requests within a 24-hour window can be served from cache at microsecond latency. Document output formats clearly: AudioCraft generates mono or stereo PCM WAV by default, but conversion to MP3 or OGG via ffmpeg is recommended before serving to browsers to reduce bandwidth.
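A minimal sketch of such a cache (class and method names are hypothetical): the key hashes the prompt together with the generation parameters, so requests for different durations don't collide.

```python
import hashlib
import json
import time

class AudioCache:
    """Tiny in-memory TTL cache keyed by a hash of (prompt, params)."""
    def __init__(self, ttl_secs=24 * 3600):
        self.ttl = ttl_secs
        self.store = {}  # key -> (expiry_timestamp, audio_bytes)

    @staticmethod
    def key(prompt, params):
        # Canonical JSON so {'a':1,'b':2} and {'b':2,'a':1} hash identically
        payload = json.dumps({'prompt': prompt, 'params': params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt, params):
        entry = self.store.get(self.key(prompt, params))
        if entry is None or entry[0] < time.time():
            return None  # miss or expired
        return entry[1]

    def put(self, prompt, params, audio_bytes):
        self.store[self.key(prompt, params)] = (time.time() + self.ttl, audio_bytes)

cache = AudioCache()
cache.put('rain on tin roof', {'duration': 10}, b'WAV...')
hit = cache.get('rain on tin roof', {'duration': 10})
miss = cache.get('rain on tin roof', {'duration': 30})  # different params -> miss
```

In production the same keying scheme works against Redis or a CDN, with the rendered MP3/OGG stored instead of raw bytes.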

Scaling AudioCraft Inference: MusicGen inference is CPU-bound initially (encoding the input description, tokenization) and then GPU-bound during generation. For a service handling concurrent requests, consider batch generation: collect multiple user requests, generate a batch of tracks, then dispatch the results. This amortizes encoding overhead and improves GPU utilization. Use a queue system (Redis + Celery) to coordinate batch submissions; users get results when ready, not necessarily in FIFO order. Set generation timeouts: if a 60-second track isn't generated within 2 minutes, return an error and retry later; this also prevents runaway generation jobs from consuming all GPU memory.
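The request-collection step can be sketched with the standard library alone (a toy stand-in for the Redis/Celery coordination described above; `drain_batch` is a hypothetical helper):

```python
from queue import Queue, Empty

def drain_batch(q, max_batch, first_timeout=0.05):
    """Collect up to max_batch pending requests: block briefly for the first
    item, then take whatever else is already waiting without blocking again."""
    batch = []
    try:
        batch.append(q.get(timeout=first_timeout))
        while len(batch) < max_batch:
            batch.append(q.get_nowait())
    except Empty:
        pass
    return batch

q = Queue()
for prompt in ['rain', 'jazz piano', 'footsteps', 'alarm', 'wind']:
    q.put(prompt)
batch = drain_batch(q, max_batch=4)  # ['rain', 'jazz piano', 'footsteps', 'alarm']
```

A GPU worker would loop on `drain_batch` and pass the whole list to `model.generate(batch)`, which accepts multiple descriptions in one call.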

For cost optimization in the cloud, use spot instances for batch jobs and on-demand instances for real-time requests. MusicGen's long generation time makes spot preemption less painful than for low-latency services; just restart the job on a new instance. Implement request deduplication: if two users request identical prompts within 5 minutes, return the cached audio instead of re-generating. Monitor generation quality metrics (spectral flatness, harmonic coherence) via automated tests; if quality drops, investigate inference parameters or model updates.

Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.

For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.

Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.

The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.

Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.

Above all, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.

Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.