Open-source MLOps platform for experiment tracking, model registry, and deployment. Track LLM experiments with prompt versions, hyperparameters, and evaluation metrics. A de facto standard for reproducible ML pipelines.
MLflow is an open-source MLOps platform originally from Databricks. It has four main components:

- Tracking: log parameters, metrics, artifacts, and code versions for each experiment run
- Projects: package ML code for reproducibility
- Models: a standard format for model packaging and deployment
- Registry: model versioning and lifecycle management (staging, production, archived)
For LLM workflows, MLflow 2.8+ added LLM evaluation support, and recent releases (2.14+) added native LLM tracing (like LangSmith, but open-source) along with the mlflow.openai / mlflow.langchain autologgers.
```bash
pip install mlflow openai
```
```python
import mlflow
import openai

mlflow.set_experiment("llm-prompt-experiments")

def evaluate_prompt(system_prompt: str, user_prompt: str, model: str = "gpt-4o-mini") -> dict:
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params({
            "model": model,
            "system_prompt_version": "v3",
            "temperature": 0.7,
            "max_tokens": 512,
        })
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=512,
            temperature=0.7,
        )
        output = resp.choices[0].message.content
        # Log metrics (cost formula assumes gpt-4o-mini pricing:
        # $0.15 per 1M input tokens, $0.60 per 1M output tokens)
        mlflow.log_metrics({
            "prompt_tokens": resp.usage.prompt_tokens,
            "completion_tokens": resp.usage.completion_tokens,
            "total_cost_usd": (resp.usage.prompt_tokens * 0.00015 +
                               resp.usage.completion_tokens * 0.0006) / 1000,
        })
        # Log artifacts
        mlflow.log_text(system_prompt, "system_prompt.txt")
        mlflow.log_text(user_prompt, "user_prompt.txt")
        mlflow.log_text(output, "model_output.txt")
        return {"output": output, "run_id": mlflow.active_run().info.run_id}

result = evaluate_prompt(
    system_prompt="You are a concise technical writer.",
    user_prompt="Explain transformers in 2 sentences.",
)
print(result["output"])
```
```python
import mlflow

# Log a full prompt-response dataset as an artifact
def log_evaluation_dataset(results: list[dict], run_id: str):
    with mlflow.start_run(run_id=run_id):
        # Log as JSON artifact
        mlflow.log_dict(results, "evaluation_results.json")
        # Log aggregate metrics
        scores = [r.get("score", 0) for r in results]
        mlflow.log_metrics({
            "mean_score": sum(scores) / len(scores),
            "pass_rate": sum(1 for s in scores if s >= 7) / len(scores),
            "n_evaluated": len(results),
        })

# Auto-logging for OpenAI (tracing support landed in MLflow 2.14)
mlflow.openai.autolog()
# Now all openai.chat.completions.create() calls are automatically traced
```
```python
import mlflow

# Enable tracing for LangChain
mlflow.langchain.autolog()

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

with mlflow.start_run():
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant."),
        ("user", "{question}"),
    ])
    chain = prompt | llm
    # This call is automatically traced: inputs, outputs, latency logged
    response = chain.invoke({"question": "What is MLflow?"})
    print(response.content)

# View traces in MLflow UI: mlflow ui --port 5000
# Navigate to http://localhost:5000 → Experiments → select run → Traces tab
```
```python
import mlflow
import mlflow.pyfunc

# A simple pyfunc wrapper around your LLM endpoint
# (defined at module level so it serializes cleanly)
class LLMWrapper(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        import openai
        client = openai.OpenAI()
        prompts = model_input["prompt"].tolist()
        return [
            client.chat.completions.create(
                model="ft:gpt-4o-mini:...",  # your fine-tuned model ID
                messages=[{"role": "user", "content": p}],
                max_tokens=256,
            ).choices[0].message.content
            for p in prompts
        ]

# Register a model after training/evaluation
with mlflow.start_run() as run:
    # Log your fine-tuned model
    mlflow.log_params({"base_model": "gpt-4o-mini", "ft_steps": 1000})
    mlflow.log_metrics({"eval_score": 8.3})
    mlflow.pyfunc.log_model("llm_model", python_model=LLMWrapper())

# Register in model registry
model_uri = f"runs:/{run.info.run_id}/llm_model"
mlflow.register_model(model_uri, "MyLLMApp")
# Transition to production: client.transition_model_version_stage(...)
# (recent MLflow prefers aliases: client.set_registered_model_alias(...))
```
MLflow can serve registered models as REST APIs:
```python
# CLI: serve a model from the registry
# mlflow models serve -m "models:/MyLLMApp/Production" -p 5001 --env-manager local

# Or programmatically:
import subprocess
subprocess.Popen([
    "mlflow", "models", "serve",
    "-m", "models:/MyLLMApp/1",
    "-p", "5001",
    "--env-manager", "local",  # --no-conda is removed in MLflow 2.x
])

# Client call:
import requests, json
response = requests.post(
    "http://localhost:5001/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"dataframe_records": [{"prompt": "Explain MLflow"}]}),
)
print(response.json())
```
To create child runs, call mlflow.start_run(nested=True) inside an outer run context. If you forget nested=True, MLflow creates a sibling run instead of a child, which can be confusing in the UI.

MLflow provides a suite of components that address different stages of the LLM development lifecycle. Not all components are equally relevant for every workflow: teams building RAG pipelines use tracking and evaluation heavily but may have minimal use for the model registry; teams deploying fine-tuned models benefit from the full registry and serving stack. Understanding which components address which pain points guides adoption decisions.
| Component | Purpose | LLM-specific use case |
|---|---|---|
| Tracking | Log metrics, params, artifacts per run | Prompt versions, eval scores, token costs |
| Evaluation | Compute judge metrics against datasets | RAG Triad, toxicity, faithfulness |
| Model Registry | Version and stage-manage models | Promote fine-tuned checkpoints to production |
| Tracing | Distributed traces for LLM pipelines | Latency attribution, chain debugging |
| Serving | REST endpoint from registered model | Serve fine-tuned models via MLServer |
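The nested-run caveat above matters most during evaluation, where a parent run aggregates many per-example child runs. A minimal sketch; run names and scores are illustrative, and mlflow is imported lazily so the pure aggregation helper runs without it:

```python
def mean_score(examples: list[tuple[str, float]]) -> float:
    """Aggregate per-example judge scores (pure helper, testable offline)."""
    return sum(s for _, s in examples) / len(examples)

def log_eval_batch(examples, experiment: str = "eval-regression"):
    import mlflow  # lazy import: only needed when actually logging
    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name="eval-batch"):
        for example_id, score in examples:
            # nested=True makes this a child of "eval-batch"; without it,
            # MLflow would create an unrelated sibling run instead.
            with mlflow.start_run(run_name=example_id, nested=True):
                mlflow.log_metric("judge_score", score)
        mlflow.log_metric("mean_score", mean_score(examples))
```

In the UI, the children collapse under the parent run, so one evaluation batch reads as one row.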
MLflow Evaluation is the component with the highest return on investment for LLM teams. The mlflow.evaluate() function runs a suite of LLM-as-judge metrics against a dataset of model outputs and logs all results as a named run, enabling side-by-side comparison of metric scores across model versions in the MLflow UI. The tight integration between evaluation runs and the experiment tracking system means quality regressions are immediately visible alongside the parameter changes that caused them.
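A sketch of that loop over a static dataset, assuming MLflow 2.8+ and pandas (both imported lazily here); the rows and run name are illustrative, and LLM-judge metrics such as those in mlflow.metrics.genai can be added via the extra_metrics argument:

```python
# Static eval dataset: model outputs plus ground truth (illustrative).
eval_rows = [
    {"inputs": "What is MLflow?",
     "predictions": "An open-source MLOps platform.",
     "ground_truth": "MLflow is an open-source platform for the ML lifecycle."},
    {"inputs": "What does the registry do?",
     "predictions": "It versions and stage-manages models.",
     "ground_truth": "The registry versions models and manages their lifecycle."},
]

def run_eval(rows):
    import mlflow   # lazy imports: only needed when actually evaluating
    import pandas as pd
    with mlflow.start_run(run_name="eval-gpt-4o-mini"):
        results = mlflow.evaluate(
            data=pd.DataFrame(rows),
            predictions="predictions",
            targets="ground_truth",
            model_type="question-answering",  # adds the QA default metrics
        )
        # Aggregate scores; also logged to the run for UI comparison
        return results.metrics
```

Each invocation produces a named run, so two model versions evaluated on the same rows line up side by side in the comparison view.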
MLflow experiment organization for LLM projects benefits from a clear naming convention that distinguishes prompt engineering experiments from model training experiments. Separate experiment namespaces for "prompt-dev", "eval-regression", and "fine-tuning" prevent conflicting parameter names and metric definitions from appearing in the same experiment view. Within each experiment, tagging runs with the model version, dataset version, and feature branch enables rapid filtering in the comparison UI. The MLflow search API supports SQL-like queries on tags and parameters, enabling programmatic identification of the best-performing run within a parameter subspace.
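The search grammar accepts SQL-like comparisons with metrics., params., and tags. prefixes. A small sketch; the tag and metric names are illustrative, and mlflow is imported lazily so the filter helper runs standalone:

```python
def best_run_filter(model_version: str, min_score: float) -> str:
    """Compose an MLflow search filter string (SQL-like grammar)."""
    return (f"tags.model_version = '{model_version}' "
            f"and metrics.mean_score > {min_score}")

def find_best_run(experiment: str, flt: str):
    import mlflow  # lazy import: only needed when querying a tracking store
    runs = mlflow.search_runs(          # returns a pandas DataFrame of runs
        experiment_names=[experiment],
        filter_string=flt,
        order_by=["metrics.mean_score DESC"],
        max_results=1,
    )
    return None if runs.empty else runs.iloc[0]["run_id"]

# e.g. find_best_run("prompt-dev", best_run_filter("v3", 8.0))
```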
MLflow's LLM evaluation integration with judge models logs all judge scores, judge reasoning, and pass/fail verdicts as artifacts associated with the evaluation run. For RAG pipelines, the standard evaluation workflow calls mlflow.evaluate() with the "question-answering" model type, automatically computing answer correctness, relevance, and faithfulness using a configurable judge model. The evaluated results are written to a Pandas DataFrame that is logged as an artifact, enabling downstream analysis of which query types or difficulty categories drive score variance across model versions.
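The per-category drill-down on that logged table needs nothing beyond a group-by; a stdlib sketch over the artifact's rows (field names are illustrative):

```python
from collections import defaultdict

def score_by_category(rows: list[dict]) -> dict[str, float]:
    """Mean judge score per query category, from the logged eval rows."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["score"])
    return {cat: sum(s) / len(s) for cat, s in buckets.items()}

rows = [
    {"category": "multi-hop", "score": 6.0},
    {"category": "multi-hop", "score": 7.0},
    {"category": "lookup", "score": 9.0},
]
print(score_by_category(rows))  # {'multi-hop': 6.5, 'lookup': 9.0}
```

The same computation is one groupby().mean() in pandas once the DataFrame artifact is loaded.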
MLflow Projects provides reproducible experiment packaging that bundles code, dependencies, and entry points into a portable format. An MLproject file specifies the conda environment or Docker image, the entry points, and their parameter signatures, enabling any team member to reproduce an experiment with a single mlflow run command. For LLM fine-tuning workflows, packaging the training script as an MLflow Project enables one-click reproduction of any registered model's training run from the model registry's source run link, closing the audit trail from deployed model artifact back to reproducible training code.
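A minimal MLproject file for such a fine-tuning entry point might look like this; file names and parameters are illustrative:

```yaml
# MLproject (YAML) at the repo root
name: llm-finetune

python_env: python_env.yaml   # or conda_env / docker_env

entry_points:
  main:
    parameters:
      base_model: {type: string, default: "gpt-4o-mini"}
      learning_rate: {type: float, default: 0.00001}
      train_file: {type: string, default: "data/train.jsonl"}
    command: "python train.py --base-model {base_model} --lr {learning_rate} --train-file {train_file}"
```

A teammate then reproduces a run with, e.g., mlflow run . -P learning_rate=0.00002.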
MLflow's integration with popular training frameworks enables automatic metric logging without custom instrumentation. The mlflow.autolog() call before starting a training run activates framework-specific autologgers for PyTorch Lightning, Hugging Face Transformers, XGBoost, and others, capturing training loss, validation metrics, and hyperparameters automatically. For LLM fine-tuning with the Hugging Face Trainer, autologging captures train/eval loss curves, learning rate schedules, and training duration, providing a complete run record with a single additional line of code and no custom logging logic.
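A sketch of that setup, assuming MLflow's autologging picks up the Hugging Face Trainer as described; the model, arguments, and function name are illustrative, and the heavy imports are deferred so the sketch loads without them:

```python
def finetune(train_dataset, model_name: str = "distilgpt2"):
    import mlflow  # lazy imports: only needed when actually training
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    mlflow.autolog()  # the single extra line: activates framework autologgers

    model = AutoModelForCausalLM.from_pretrained(model_name)
    args = TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                             logging_steps=10)
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()  # loss curves, LR schedule, and params land in the run
    return trainer
```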
MLflow's comparison view enables side-by-side evaluation of multiple runs with parallel coordinate plots that reveal correlations between hyperparameters and metrics. For LLM prompt optimization experiments, comparing prompt versions by their eval scores across multiple metrics in the parallel coordinates view quickly identifies which prompt changes improve one metric while degrading another — a common pattern when optimizing for precision versus recall tradeoffs in retrieval quality or verbosity versus accuracy tradeoffs in generation quality.
MLflow's artifact logging API supports arbitrary file formats, making it possible to attach rendered evaluation reports, confusion matrices, and annotated output samples to experiment runs. For LLM evaluation pipelines, logging the full evaluation dataframe as a CSV artifact alongside aggregate metric scores enables post-hoc analysis of which query categories or difficulty levels drove overall metric changes. When a model regression is detected, drilling into the logged artifact for the regressed run versus the previous run reveals the specific examples where quality dropped, accelerating root cause analysis of the regression.
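A stdlib sketch of that drill-down, assuming both runs logged rows keyed by a stable example id with a judge score (field names are illustrative; the artifacts themselves can be fetched with mlflow.artifacts.download_artifacts):

```python
def regressed_examples(prev_rows, curr_rows, min_drop: float = 1.0):
    """Examples whose judge score dropped by at least min_drop between runs."""
    prev = {r["id"]: r["score"] for r in prev_rows}
    return [
        {"id": r["id"], "before": prev[r["id"]], "after": r["score"]}
        for r in curr_rows
        if r["id"] in prev and prev[r["id"]] - r["score"] >= min_drop
    ]

prev_rows = [{"id": "q1", "score": 8.0}, {"id": "q2", "score": 9.0}]
curr_rows = [{"id": "q1", "score": 8.5}, {"id": "q2", "score": 6.0}]
print(regressed_examples(prev_rows, curr_rows))
# [{'id': 'q2', 'before': 9.0, 'after': 6.0}]
```

Reading the logged judge reasoning for just those ids usually pinpoints the regression faster than scanning aggregate metrics.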