Open-source MLOps platform for experiment tracking, model registry, and deployment. Track LLM experiments with prompt versions, hyperparameters, and evaluation metrics. A de facto standard for reproducible ML pipelines.
MLflow is an open-source MLOps platform originally from Databricks. It has four main components:

- Tracking: log parameters, metrics, artifacts, and code versions for each experiment run
- Projects: package ML code for reproducibility
- Models: a standard format for model packaging and deployment
- Registry: model versioning and lifecycle management (staging, production, archived)
For LLM workflows, MLflow 2.8+ added LLM evaluation support, and recent releases (2.14+) added native LLM tracing (like LangSmith, but open-source) along with the mlflow.openai / mlflow.langchain autologgers.
```bash
pip install mlflow openai
```
```python
import mlflow
import openai

mlflow.set_experiment("llm-prompt-experiments")

def evaluate_prompt(system_prompt: str, user_prompt: str, model: str = "gpt-4o-mini") -> dict:
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params({
            "model": model,
            "system_prompt_version": "v3",
            "temperature": 0.7,
            "max_tokens": 512,
        })
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=512,
            temperature=0.7,
        )
        output = resp.choices[0].message.content
        # Log metrics (cost formula assumes gpt-4o-mini pricing:
        # $0.15 per 1M input tokens, $0.60 per 1M output tokens)
        mlflow.log_metrics({
            "prompt_tokens": resp.usage.prompt_tokens,
            "completion_tokens": resp.usage.completion_tokens,
            "total_cost_usd": (resp.usage.prompt_tokens * 0.00015 +
                               resp.usage.completion_tokens * 0.0006) / 1000,
        })
        # Log artifacts
        mlflow.log_text(system_prompt, "system_prompt.txt")
        mlflow.log_text(user_prompt, "user_prompt.txt")
        mlflow.log_text(output, "model_output.txt")
        return {"output": output, "run_id": mlflow.active_run().info.run_id}

result = evaluate_prompt(
    system_prompt="You are a concise technical writer.",
    user_prompt="Explain transformers in 2 sentences.",
)
print(result["output"])
```
```python
import mlflow

# Log a full prompt-response dataset as an artifact
def log_evaluation_dataset(results: list[dict], run_id: str):
    with mlflow.start_run(run_id=run_id):
        # Log as JSON artifact
        mlflow.log_dict(results, "evaluation_results.json")
        # Log aggregate metrics
        scores = [r.get("score", 0) for r in results]
        mlflow.log_metrics({
            "mean_score": sum(scores) / len(scores),
            "pass_rate": sum(1 for s in scores if s >= 7) / len(scores),
            "n_evaluated": len(results),
        })

# Auto-logging for OpenAI (tracing support landed in MLflow 2.14)
mlflow.openai.autolog()
# Now all openai.chat.completions.create() calls are automatically traced
```
```python
import mlflow

# Enable tracing for LangChain
mlflow.langchain.autolog()

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

with mlflow.start_run():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("user", "{question}"),
    ])
    chain = prompt | llm
    # This call is automatically traced: inputs, outputs, latency logged
    response = chain.invoke({"question": "What is MLflow?"})
    print(response.content)

# View traces in MLflow UI: mlflow ui --port 5000
# Navigate to http://localhost:5000 → Experiments → select run → Traces tab
```
```python
import mlflow
import mlflow.pyfunc

# A simple pyfunc wrapper around your LLM endpoint
# (defined at module level so it serializes cleanly)
class LLMWrapper(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        import openai
        client = openai.OpenAI()
        prompts = model_input["prompt"].tolist()
        return [
            client.chat.completions.create(
                model="ft:gpt-4o-mini:...",  # your fine-tuned model ID
                messages=[{"role": "user", "content": p}],
                max_tokens=256,
            ).choices[0].message.content
            for p in prompts
        ]

# Register a model after training/evaluation
with mlflow.start_run() as run:
    # Log your fine-tuned model
    mlflow.log_params({"base_model": "gpt-4o-mini", "ft_steps": 1000})
    mlflow.log_metrics({"eval_score": 8.3})
    mlflow.pyfunc.log_model("llm_model", python_model=LLMWrapper())

# Register in model registry
model_uri = f"runs:/{run.info.run_id}/llm_model"
mlflow.register_model(model_uri, "MyLLMApp")
# Transition to production: client.transition_model_version_stage(...)
# (recent MLflow prefers aliases: client.set_registered_model_alias(...))
```
MLflow can serve registered models as REST APIs:
```python
# CLI: serve a model from the registry
# mlflow models serve -m "models:/MyLLMApp/Production" -p 5001 --env-manager local

# Or programmatically:
import subprocess
subprocess.Popen([
    "mlflow", "models", "serve",
    "-m", "models:/MyLLMApp/1",
    "-p", "5001",
    "--env-manager", "local",  # --no-conda is removed in MLflow 2.x
])

# Client call:
import requests, json
response = requests.post(
    "http://localhost:5001/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"dataframe_records": [{"prompt": "Explain MLflow"}]}),
)
print(response.json())
```
To create child runs, call mlflow.start_run(nested=True) inside an outer run context. If you forget nested=True, MLflow creates a sibling run instead of a child, which can be confusing in the UI.

MLflow provides a suite of components that address different stages of the LLM development lifecycle. Not all components are equally relevant for every workflow: teams building RAG pipelines use tracking and evaluation heavily but may have minimal use for the model registry; teams deploying fine-tuned models benefit from the full registry and serving stack. Understanding which components address which pain points guides adoption decisions.
| Component | Purpose | LLM-specific use case |
|---|---|---|
| Tracking | Log metrics, params, artifacts per run | Prompt versions, eval scores, token costs |
| Evaluation | Compute judge metrics against datasets | RAG Triad, toxicity, faithfulness |
| Model Registry | Version and stage-manage models | Promote fine-tuned checkpoints to production |
| Tracing | Distributed traces for LLM pipelines | Latency attribution, chain debugging |
| Serving | REST endpoint from registered model | Serve fine-tuned models via MLServer |
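The nested-run caveat above matters most during evaluation, where a parent run aggregates many per-example child runs. A minimal sketch; run names and scores are illustrative, and mlflow is imported lazily so the pure aggregation helper runs without it:

```python
def mean_score(examples: list[tuple[str, float]]) -> float:
    """Aggregate per-example judge scores (pure helper, testable offline)."""
    return sum(s for _, s in examples) / len(examples)

def log_eval_batch(examples, experiment: str = "eval-regression"):
    import mlflow  # lazy import: only needed when actually logging
    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name="eval-batch"):
        for example_id, score in examples:
            # nested=True makes this a child of "eval-batch"; without it,
            # MLflow would create an unrelated sibling run instead.
            with mlflow.start_run(run_name=example_id, nested=True):
                mlflow.log_metric("judge_score", score)
        mlflow.log_metric("mean_score", mean_score(examples))
```

In the UI, the children collapse under the parent run, so one evaluation batch reads as one row.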
MLflow Evaluation is the component with the highest return on investment for LLM teams. The mlflow.evaluate() function runs a suite of LLM-as-judge metrics against a dataset of model outputs and logs all results as a named run, enabling side-by-side comparison of metric scores across model versions in the MLflow UI. The tight integration between evaluation runs and the experiment tracking system means quality regressions are immediately visible alongside the parameter changes that caused them.
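A sketch of that loop over a static dataset, assuming MLflow 2.8+ and pandas (both imported lazily here); the rows and run name are illustrative, and LLM-judge metrics such as those in mlflow.metrics.genai can be added via the extra_metrics argument:

```python
# Static eval dataset: model outputs plus ground truth (illustrative).
eval_rows = [
    {"inputs": "What is MLflow?",
     "predictions": "An open-source MLOps platform.",
     "ground_truth": "MLflow is an open-source platform for the ML lifecycle."},
    {"inputs": "What does the registry do?",
     "predictions": "It versions and stage-manages models.",
     "ground_truth": "The registry versions models and manages their lifecycle."},
]

def run_eval(rows):
    import mlflow   # lazy imports: only needed when actually evaluating
    import pandas as pd
    with mlflow.start_run(run_name="eval-gpt-4o-mini"):
        results = mlflow.evaluate(
            data=pd.DataFrame(rows),
            predictions="predictions",
            targets="ground_truth",
            model_type="question-answering",  # adds the QA default metrics
        )
        # Aggregate scores; also logged to the run for UI comparison
        return results.metrics
```

Each invocation produces a named run, so two model versions evaluated on the same rows line up side by side in the comparison view.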
MLflow experiment organization for LLM projects benefits from a clear naming convention that distinguishes prompt engineering experiments from model training experiments. Separate experiment namespaces for "prompt-dev", "eval-regression", and "fine-tuning" prevent conflicting parameter names and metric definitions from appearing in the same experiment view. Within each experiment, tagging runs with the model version, dataset version, and feature branch enables rapid filtering in the comparison UI. The MLflow search API supports SQL-like queries on tags and parameters, enabling programmatic identification of the best-performing run within a parameter subspace.
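The search grammar accepts SQL-like comparisons with metrics., params., and tags. prefixes. A small sketch; the tag and metric names are illustrative, and mlflow is imported lazily so the filter helper runs standalone:

```python
def best_run_filter(model_version: str, min_score: float) -> str:
    """Compose an MLflow search filter string (SQL-like grammar)."""
    return (f"tags.model_version = '{model_version}' "
            f"and metrics.mean_score > {min_score}")

def find_best_run(experiment: str, flt: str):
    import mlflow  # lazy import: only needed when querying a tracking store
    runs = mlflow.search_runs(          # returns a pandas DataFrame of runs
        experiment_names=[experiment],
        filter_string=flt,
        order_by=["metrics.mean_score DESC"],
        max_results=1,
    )
    return None if runs.empty else runs.iloc[0]["run_id"]

# e.g. find_best_run("prompt-dev", best_run_filter("v3", 8.0))
```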
MLflow's LLM evaluation integration with judge models logs all judge scores, judge reasoning, and pass/fail verdicts as artifacts associated with the evaluation run. For RAG pipelines, the standard evaluation workflow calls mlflow.evaluate() with the "question-answering" model type, automatically computing answer correctness, relevance, and faithfulness using a configurable judge model. The evaluated results are written to a Pandas DataFrame that is logged as an artifact, enabling downstream analysis of which query types or difficulty categories drive score variance across model versions.
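The per-category drill-down on that logged table needs nothing beyond a group-by; a stdlib sketch over the artifact's rows (field names are illustrative):

```python
from collections import defaultdict

def score_by_category(rows: list[dict]) -> dict[str, float]:
    """Mean judge score per query category, from the logged eval rows."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["score"])
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}

rows = [
    {"category": "multi-hop", "score": 6.0},
    {"category": "multi-hop", "score": 7.0},
    {"category": "lookup", "score": 9.0},
]
print(score_by_category(rows))  # {'multi-hop': 6.5, 'lookup': 9.0}
```

The same computation is one groupby().mean() in pandas once the DataFrame artifact is loaded.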
MLflow Projects provides reproducible experiment packaging that bundles code, dependencies, and entry points into a portable format. An MLproject file specifies the conda environment or Docker image, the entry points, and their parameter signatures, enabling any team member to reproduce an experiment with a single mlflow run command. For LLM fine-tuning workflows, packaging the training script as an MLflow Project enables one-click reproduction of any registered model's training run from the model registry's source run link, closing the audit trail from deployed model artifact back to reproducible training code.
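A minimal MLproject file for such a fine-tuning entry point might look like this; file names and parameters are illustrative:

```yaml
# MLproject (YAML) at the repo root
name: llm-finetune

python_env: python_env.yaml   # or conda_env / docker_env

entry_points:
  main:
    parameters:
      base_model: {type: string, default: "gpt-4o-mini"}
      learning_rate: {type: float, default: 0.00001}
      train_file: {type: string, default: "data/train.jsonl"}
    command: "python train.py --base-model {base_model} --lr {learning_rate} --train-file {train_file}"
```

A teammate then reproduces a run with, e.g., mlflow run . -P learning_rate=0.00002.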
MLflow's integration with popular training frameworks enables automatic metric logging without custom instrumentation. The mlflow.autolog() call before starting a training run activates framework-specific autologgers for PyTorch Lightning, Hugging Face Transformers, XGBoost, and others, capturing training loss, validation metrics, and hyperparameters automatically. For LLM fine-tuning with the Hugging Face Trainer, autologging captures train/eval loss curves, learning rate schedules, and training duration, providing a complete run record with a single additional line of code and no custom logging logic.
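A sketch of that setup, assuming MLflow's autologging picks up the Hugging Face Trainer as described; the model, arguments, and function name are illustrative, and the heavy imports are deferred so the sketch loads without them:

```python
def finetune(train_dataset, model_name: str = "distilgpt2"):
    import mlflow  # lazy imports: only needed when actually training
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    mlflow.autolog()  # the single extra line: activates framework autologgers

    model = AutoModelForCausalLM.from_pretrained(model_name)
    args = TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                             logging_steps=10)
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()  # loss curves, LR schedule, and params land in the run
    return trainer
```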
MLflow's comparison view enables side-by-side evaluation of multiple runs with parallel coordinate plots that reveal correlations between hyperparameters and metrics. For LLM prompt optimization experiments, comparing prompt versions by their eval scores across multiple metrics in the parallel coordinates view quickly identifies which prompt changes improve one metric while degrading another — a common pattern when optimizing for precision versus recall tradeoffs in retrieval quality or verbosity versus accuracy tradeoffs in generation quality.
MLflow's artifact logging API supports arbitrary file formats, making it possible to attach rendered evaluation reports, confusion matrices, and annotated output samples to experiment runs. For LLM evaluation pipelines, logging the full evaluation dataframe as a CSV artifact alongside aggregate metric scores enables post-hoc analysis of which query categories or difficulty levels drove overall metric changes. When a model regression is detected, drilling into the logged artifact for the regressed run versus the previous run reveals the specific examples where quality dropped, accelerating root cause analysis of the regression.
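A stdlib sketch of that drill-down, assuming both runs logged rows keyed by a stable example id with a judge score (field names are illustrative; the artifacts themselves can be fetched with mlflow.artifacts.download_artifacts):

```python
def regressed_examples(prev_rows, curr_rows, min_drop: float = 1.0):
    """Examples whose judge score dropped by at least min_drop between runs."""
    prev = {r["id"]: r["score"] for r in prev_rows}
    return [
        {"id": r["id"], "before": prev[r["id"]], "after": r["score"]}
        for r in curr_rows
        if r["id"] in prev and prev[r["id"]] - r["score"] >= min_drop
    ]

prev_rows = [{"id": "q1", "score": 8.0}, {"id": "q2", "score": 9.0}]
curr_rows = [{"id": "q1", "score": 8.5}, {"id": "q2", "score": 6.0}]
print(regressed_examples(prev_rows, curr_rows))
# [{'id': 'q2', 'before': 9.0, 'after': 6.0}]
```

Reading the logged judge reasoning for just those ids usually pinpoints the regression faster than scanning aggregate metrics.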