Weights & Biases (W&B) is the leading MLOps platform for experiment tracking, model versioning, dataset management, and LLM evaluation. Log metrics, compare runs, track prompts, and monitor production models from a single dashboard.
W&B organises work around three primitives: runs (a single training job or experiment), projects (a collection of related runs), and artifacts (versioned files: datasets, models, evaluation results). Every run has a unique ID, logs metrics over time, and can be compared visually in the W&B dashboard.
For LLM work, W&B adds Weave — a tracing and evaluation framework purpose-built for LLM pipelines. Weave tracks prompts, completions, token counts, latency, and evaluation results at the call level, not just the run level. This is the right abstraction for debugging LLM chains and evaluating model outputs.
```bash
pip install wandb weave
wandb login  # Enter your API key from wandb.ai/authorize
```
```python
import wandb
import torch
from torch import nn

# Initialise a run
run = wandb.init(
    project="llm-finetuning",
    name="llama3-8b-qlora-v1",
    config={
        "model": "meta-llama/Meta-Llama-3-8B",
        "lora_r": 16,
        "lora_alpha": 32,
        "learning_rate": 2e-4,
        "batch_size": 8,
        "epochs": 3,
    },
)

# Log metrics during training
for step in range(1000):
    loss = train_step()
    if step % 10 == 0:
        wandb.log({
            "train/loss": loss,
            "train/learning_rate": scheduler.get_last_lr()[0],
            "train/grad_norm": compute_grad_norm(model),
        }, step=step)
    if step % 100 == 0:
        val_loss, val_acc = evaluate()
        wandb.log({"val/loss": val_loss, "val/accuracy": val_acc}, step=step)

# Finish the run
run.finish()
```
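The loop above calls a compute_grad_norm helper that is not defined in the snippet. One minimal sketch, assuming a standard PyTorch-style module: compute the global L2 norm over all parameter gradients, skipping parameters that have no gradient (e.g. frozen base weights in a LoRA setup).

```python
import math

def compute_grad_norm(model) -> float:
    # Global L2 norm over all parameter gradients; parameters without
    # gradients (frozen weights) are skipped
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += float(p.grad.norm(2)) ** 2
    return math.sqrt(total)
```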
The Hugging Face Trainer integrates with W&B natively: set report_to="wandb" in TrainingArguments.
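The integration is also steered by environment variables; a minimal sketch with illustrative values — WANDB_PROJECT names the target project, and WANDB_LOG_MODEL asks the Trainer to upload checkpoints as artifacts:

```python
import os

# Illustrative values; set these before constructing the Trainer
os.environ["WANDB_PROJECT"] = "llm-finetuning"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"  # upload each checkpoint as an artifact
```

With these set, passing report_to="wandb" in TrainingArguments is the only code change needed.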
```python
import weave
import openai

weave.init("my-llm-project")  # Creates a W&B project for LLM traces

@weave.op()
def generate_response(prompt: str, model: str = "gpt-4o-mini") -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

@weave.op()
def rag_pipeline(question: str) -> str:
    # Weave traces the entire call tree
    context = retrieve_context(question)  # also traced if decorated
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return generate_response(prompt)

# Every call is automatically logged: input, output, latency, token usage
answer = rag_pipeline("What is retrieval-augmented generation?")
# View traces at: https://wandb.ai/{entity}/{project}/weave
```
Weave automatically captures nested call trees, making it easy to trace exactly which sub-call caused a quality issue.
```python
import wandb

run = wandb.init(project="llm-finetuning")

# Log a dataset as an artifact
dataset_artifact = wandb.Artifact("training-data-v2", type="dataset")
dataset_artifact.add_file("data/train.jsonl")
dataset_artifact.add_file("data/val.jsonl")
run.log_artifact(dataset_artifact)

# Log a fine-tuned model
model_artifact = wandb.Artifact(
    "llama3-8b-finetuned", type="model",
    metadata={"val_loss": 0.42, "lora_r": 16},
)
model_artifact.add_dir("./output/checkpoint-final")
run.log_artifact(model_artifact)

# Later: load a specific version
api = wandb.Api()
artifact = api.artifact("my-project/llama3-8b-finetuned:v3")
artifact.download("./loaded_model")
```
Artifacts create a lineage graph: you can trace which dataset version was used for each model version, enabling reproducible experiments and rollback.
```python
import wandb

# Sweep configuration (can also be written in YAML, e.g. sweep_config.yaml)
sweep_config = {
    "method": "bayes",  # or "grid", "random"
    "metric": {"goal": "minimize", "name": "val/loss"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "lora_r": {"values": [8, 16, 32, 64]},
        "lora_alpha": {"values": [16, 32, 64]},
        "batch_size": {"values": [4, 8, 16]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="llm-finetuning")

def train():
    with wandb.init() as run:
        config = run.config
        model = setup_model(lora_r=config.lora_r, lora_alpha=config.lora_alpha)
        trainer = train_model(model, lr=config.learning_rate, batch_size=config.batch_size)
        val_loss = evaluate(model)
        wandb.log({"val/loss": val_loss})

# Launch 20 runs with Bayesian optimisation
wandb.agent(sweep_id, function=train, count=20)
```
```python
import wandb
import pandas as pd

run = wandb.init(project="llm-evaluation")

# Log model predictions as a W&B Table
predictions = [
    {"question": question, "expected": expected,
     "predicted": predicted, "score": evaluate_score(expected, predicted)}
    for question, expected, predicted in eval_data
]
df = pd.DataFrame(predictions)

# Create a W&B Table for interactive inspection in the UI
table = wandb.Table(dataframe=df)
run.log({"eval/predictions": table})
# The W&B UI lets you filter, sort, and compare predictions interactively
# Great for finding systematic failure modes

run.finish()
```
W&B Tables are particularly useful for LLM evaluation: log 100–500 examples with their scores, then use the UI to filter for low-scoring examples to find patterns.
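The same filter-for-failures workflow can be rehearsed locally before logging. A sketch with hypothetical rows and an assumed 0.5 score threshold, mirroring what the Tables UI does interactively:

```python
import pandas as pd

# Hypothetical evaluation rows, shaped like the table logged above
rows = [
    {"question": "What is RAG?", "category": "definitions", "score": 0.92},
    {"question": "Summarise doc X", "category": "summaries", "score": 0.31},
    {"question": "What is LoRA?", "category": "definitions", "score": 0.88},
    {"question": "Summarise doc Y", "category": "summaries", "score": 0.35},
]
df = pd.DataFrame(rows)

# The same filter you would apply in the W&B Tables UI: low scores only
failures = df[df["score"] < 0.5]

# Group to surface systematic failure modes (here: summaries underperform)
by_category = failures.groupby("category").size()
```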
API key storage: wandb login stores your key in ~/.netrc. In containerised environments, use the WANDB_API_KEY environment variable instead.
Offline mode: If the training server has no internet access, use WANDB_MODE=offline. Runs are cached locally and can be synced later with wandb sync.
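Both of the tips above come down to plain environment variables, so they can be set in code, a Dockerfile, or the container spec; the values below are placeholders:

```python
import os

# Containerised auth: avoid `wandb login` and ~/.netrc entirely
os.environ["WANDB_API_KEY"] = "YOUR_KEY_HERE"  # placeholder; inject from a secrets manager

# Air-gapped training: cache runs locally, then `wandb sync` from a connected machine
os.environ["WANDB_MODE"] = "offline"
```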
Step alignment: If you log some metrics every step and others every N steps, ensure you always pass the same step argument to wandb.log() to avoid misaligned charts.
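One way to enforce this is to buffer metrics by step and emit a single log call per step. A hypothetical helper (not a W&B API) sketching the idea:

```python
from collections import defaultdict

class StepLogger:
    """Buffers metrics per step so each step produces exactly one
    log call, keeping differently-sampled metrics aligned."""

    def __init__(self, log_fn):
        self.log_fn = log_fn  # e.g. wandb.log in a real run
        self._buffer = defaultdict(dict)

    def add(self, metrics: dict, step: int) -> None:
        # Merge metrics logged at different frequencies under one step
        self._buffer[step].update(metrics)

    def flush(self) -> None:
        # Emit one log call per step, in step order
        for step in sorted(self._buffer):
            self.log_fn(self._buffer[step], step=step)
        self._buffer.clear()
```

In a real run you would construct it as StepLogger(wandb.log), call add() wherever metrics become available, and flush() periodically.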
Large artifact uploads: For multi-GB model checkpoints, use artifact.add_reference("s3://bucket/path") to store a reference instead of uploading the file itself; pushing large files to W&B servers is slow and counts against your storage quota.
Weights & Biases (W&B) provides experiment tracking, model evaluation, and dataset versioning tailored for machine learning workflows. Its LLM-specific features — prompt versioning, trace logging, and the Weave evaluation framework — make it a comprehensive platform for managing the full LLM development cycle from fine-tuning to production monitoring.
| Feature | Purpose | LLM Use Case |
|---|---|---|
| Runs | Experiment tracking | Fine-tuning hyperparameter comparison |
| Artifacts | Dataset/model versioning | Training dataset provenance |
| Tables | Data visualization | Sample-level output comparison |
| Sweeps | Hyperparameter search | LoRA rank, learning rate optimization |
| Weave | LLM tracing + evaluation | RAG pipeline observability |
W&B Weave provides LLM-specific observability built on top of the core W&B infrastructure. Weave traces capture the full call tree of an LLM application — every model call, tool invocation, and retrieval operation — stored in a structured format that enables filtering by input characteristics, latency, token count, and quality scores. Unlike general-purpose logging solutions, Weave understands LLM-native concepts like token counts, model names, and conversation structure, providing purpose-built analytics for AI application debugging.
Fine-tuning experiment tracking in W&B logs training and validation loss curves, learning rate schedules, and token-level metrics for each run. Comparing runs side by side in the W&B dashboard reveals which hyperparameter combinations converge faster, overfit less, or achieve better benchmark scores. The artifact versioning system links each fine-tuned checkpoint to the exact training dataset version and configuration used, providing complete provenance that is essential for reproducing results and auditing model development decisions in regulated industries.
W&B Artifacts provide content-addressed storage for datasets, model checkpoints, and evaluation results, tracking the exact version of each artifact consumed and produced by each training run. A run that trains on dataset v3 and produces checkpoint v7 creates a versioned lineage graph showing which data produced which models. This lineage tracking is essential for debugging quality regressions: when a model starts performing worse, the artifact graph makes it straightforward to identify whether the regression correlates with a dataset change, a hyperparameter change, or a code change.
W&B Sweeps implements Bayesian hyperparameter optimization that learns from previous run results to focus the search on promising regions of the hyperparameter space. Compared to random search, Bayesian optimization typically finds good hyperparameter configurations in 30–50% fewer runs, significantly reducing the compute cost of hyperparameter tuning for expensive fine-tuning jobs. The sweep controller manages parallel run allocation, early stopping of unpromising runs, and result aggregation, enabling distributed hyperparameter search with minimal infrastructure setup.
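The early stopping mentioned above is opt-in: it is configured via an early_terminate block in the sweep definition. A sketch using W&B's Hyperband-based early termination (parameter values are illustrative):

```python
# Sweep config with Hyperband early termination of unpromising runs
sweep_config = {
    "method": "bayes",
    "metric": {"goal": "minimize", "name": "val/loss"},
    # Runs whose val/loss lags the bracket are stopped early
    "early_terminate": {"type": "hyperband", "min_iter": 100},
    "parameters": {
        "learning_rate": {"values": [1e-5, 1e-4, 1e-3]},
    },
}
```

Pass this dict to wandb.sweep() exactly as in the earlier example; the sweep controller applies the termination policy automatically.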
Custom W&B charts and reports enable sharing LLM evaluation results with non-technical stakeholders in accessible formats. A W&B report can embed interactive charts of quality metric trends, sample output comparisons, and statistical significance tests alongside narrative explanation. Reports are shareable via URL and update automatically when new evaluation runs complete, creating a living document of model quality that tracks progress over time rather than static snapshots that immediately become outdated as development continues.
W&B integration with popular training frameworks reduces instrumentation overhead to near-zero. The Hugging Face Trainer automatically logs metrics to W&B when the WANDB_PROJECT environment variable is set, requiring no additional code. PyTorch Lightning's WandbLogger similarly provides automatic metric logging with a one-line configuration change. These framework-level integrations mean most training pipelines can adopt W&B tracking without modifying training loops, reducing adoption friction for teams adding observability to existing codebases.
W&B's model registry provides a centralized store for promoting experiment checkpoints to named model versions with semantic versioning. Linking the model registry entry to the training run artifact creates a complete audit trail from registered model back to training data, hyperparameters, and evaluation metrics. This traceability supports compliance requirements and simplifies debugging when a deployed model behaves unexpectedly, because the full training provenance is accessible from the registry entry without searching through run history.