MLOps

HuggingFace Hub

Model and dataset registry with 500K+ public models, Git-LFS versioning, and programmatic access via huggingface_hub. The central distribution platform for open-source LLMs.

500K+ models
Public registry
Git-LFS
Large file versioning
Private repos
Gated models supported

SECTION 01

HuggingFace Hub overview

HuggingFace Hub is the central distribution platform for open-source AI models, datasets, and Spaces (demo apps). It hosts 500K+ models and 100K+ datasets, all versioned with Git and Git-LFS. Models can be loaded directly into transformers with a model ID string. The huggingface_hub Python library provides programmatic access: downloading, uploading, searching, and managing repositories. Public and private repos are free to create; enterprise features such as SSO, audit logs, and fine-grained access control require a paid plan.
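
As a quick sketch of that programmatic access (using the public gpt2 repo as a stand-in, and a helper name of my own choosing), repo metadata can be inspected without downloading any weights:

```python
from huggingface_hub import HfApi

def describe_repo(repo_id: str) -> dict:
    """Fetch repo metadata from the Hub without downloading any weights."""
    info = HfApi().model_info(repo_id)  # anonymous access is fine for public repos
    return {
        "sha": info.sha,                # latest commit on the default branch
        "task": info.pipeline_tag,      # e.g. "text-generation"
        "files": [s.rfilename for s in info.siblings],
    }

# describe_repo("gpt2") returns the commit SHA, task tag, and full file list
```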

SECTION 02

Downloading models and datasets

from huggingface_hub import snapshot_download, hf_hub_download
import os

# Set your token for gated models (Llama 3, Gemma, etc.)
os.environ["HF_TOKEN"] = "hf_..."  # or run: huggingface-cli login

# Download an entire model repo to the local cache
local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    ignore_patterns=["*.gguf", "original/*"],  # skip large alternate formats
)
# Cached under ~/.cache/huggingface/hub by default; pass cache_dir= to override
print(f"Downloaded to: {local_dir}")

# Download a single file
config_path = hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    filename="config.json",
)

# Load with transformers (handles caching automatically)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# First call downloads; subsequent calls load from cache
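
Production jobs should pin an exact commit rather than tracking main, which can change underneath them. A minimal sketch (the helper name is mine): resolve the branch to a SHA first, then download that exact snapshot.

```python
from huggingface_hub import HfApi, snapshot_download

def download_pinned(repo_id: str, revision: str = "main") -> tuple:
    """Resolve a branch/tag to a commit SHA, then download that exact snapshot."""
    sha = HfApi().model_info(repo_id, revision=revision).sha
    path = snapshot_download(repo_id=repo_id, revision=sha)
    return path, sha  # log the SHA so the run is reproducible

# path, sha = download_pinned("meta-llama/Meta-Llama-3-8B-Instruct")
```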

SECTION 03

Uploading your model

from huggingface_hub import HfApi, create_repo
import os

api = HfApi(token=os.environ["HF_TOKEN"])

# Create a new repository
repo_url = create_repo(
    repo_id="your-username/my-finetuned-llama3",
    private=True,         # private until ready to share
    exist_ok=True,
)

# Upload an entire directory
api.upload_folder(
    folder_path="./fine-tuned-llama3/merged",
    repo_id="your-username/my-finetuned-llama3",
    repo_type="model",
    commit_message="Upload fine-tuned Llama 3 8B",
    ignore_patterns=["*.tmp", "__pycache__/*"],
)

# Or use transformers' push_to_hub directly
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-llama3/merged")
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-llama3/merged")
model.push_to_hub("your-username/my-finetuned-llama3")
tokenizer.push_to_hub("your-username/my-finetuned-llama3")
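
After uploading, it can help to tag the revision so deployments pin a human-readable version instead of a commit SHA. A sketch using HfApi.create_tag (tag name and message are illustrative):

```python
import os
from huggingface_hub import HfApi

def tag_release(repo_id: str, tag: str, message: str) -> None:
    """Tag the current main revision so deployments can pin e.g. 'v1.0'."""
    api = HfApi(token=os.environ["HF_TOKEN"])
    api.create_tag(repo_id=repo_id, tag=tag, tag_message=message)

# tag_release("your-username/my-finetuned-llama3", "v1.0", "first eval-passing fine-tune")
# Later: snapshot_download(repo_id, revision="v1.0")
```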

SECTION 04

Model cards and metadata

from huggingface_hub import ModelCard, ModelCardData

# Create a model card programmatically
card_data = ModelCardData(
    language=["en"],
    license="llama3",
    base_model="meta-llama/Llama-3-8B-Instruct",
    tags=["llama", "fine-tuned", "qlora"],
    datasets=["my-org/my-dataset"],
    metrics=["accuracy"],  # metric ids; structured scores belong in eval_results
)
card = ModelCard(
    content=(
        "---\n" + card_data.to_yaml() + "\n---\n\n"
        "# My Fine-tuned Llama 3 8B\n\n"
        "Fine-tuned on domain-specific data using QLoRA.\n\n"
        "## Training details\n\n"
        "- Base: meta-llama/Meta-Llama-3-8B-Instruct\n"
        "- Method: QLoRA r=16\n"
        "- Dataset: 10k instruction pairs\n"
    ),
)
card.push_to_hub("your-username/my-finetuned-llama3")
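
An existing card can also be loaded, edited, and pushed back rather than rebuilt from scratch; ModelCard.load pulls the README and .text exposes the body below the YAML header. A sketch (the helper name and section text are mine):

```python
from huggingface_hub import ModelCard

def append_eval_section(repo_id: str, note: str) -> None:
    """Load the existing card, append an evaluation note, and push the update."""
    card = ModelCard.load(repo_id)              # fetches README.md from the Hub
    card.text += f"\n## Evaluation\n{note}\n"   # .text is the markdown body
    card.push_to_hub(repo_id)

# append_eval_section("your-username/my-finetuned-llama3", "Accuracy 0.87 on my-eval.")
```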

SECTION 05

Private and gated models

import os
from huggingface_hub import login

# Login (stores token in ~/.cache/huggingface/token)
login(token=os.environ["HF_TOKEN"])

# For gated models (Llama 3, Gemma): first accept the license on the model page
# then your token grants download access
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=os.environ["HF_TOKEN"],  # explicit token if not logged in
)

# Make your own model gated (users must request access and accept your terms):
from huggingface_hub import HfApi
api = HfApi()
api.update_repo_settings(
    repo_id="your-org/proprietary-model",
    gated="auto",  # "auto" grants access on acceptance; "manual" queues requests for review
)
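
Before kicking off a multi-gigabyte download, it can be worth checking whether the current token actually has access; recent huggingface_hub versions expose HfApi.auth_check for exactly this, and the gated/not-found errors are importable. A sketch (the helper name is mine):

```python
import os
from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

def can_access(repo_id: str) -> bool:
    """True if the current token can read repo_id (license accepted, repo visible)."""
    api = HfApi(token=os.environ.get("HF_TOKEN"))
    try:
        api.auth_check(repo_id)  # raises if gated-without-acceptance or missing
        return True
    except (GatedRepoError, RepositoryNotFoundError):
        return False

# can_access("meta-llama/Meta-Llama-3-8B-Instruct")  # False until the license is accepted
```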

SECTION 06

Using Hub in CI/CD pipelines

import os
from huggingface_hub import HfApi

def push_to_hub_on_eval_pass(model_path: str, eval_score: float, threshold: float = 0.85):
    if eval_score < threshold:
        print(f"Eval {eval_score:.2f} below threshold {threshold} — skipping push")
        return

    api = HfApi(token=os.environ["HF_TOKEN"])
    api.upload_folder(
        folder_path=model_path,
        repo_id="my-org/production-model",
        commit_message=f"Auto-push: eval_score={eval_score:.3f}",
        create_pr=True,  # create a PR rather than pushing directly to main
    )
    print(f"Pushed to Hub with eval score {eval_score:.2f}")

# In your CI pipeline (GitHub Actions, etc):
# HF_TOKEN: ${{ secrets.HF_TOKEN }}
# python -c "from train import evaluate; score = evaluate(); push_to_hub_on_eval_pass('./model', score)"
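
Because create_pr=True opens a Hub pull request, a staging job can evaluate exactly what the PR contains before anyone merges it: upload_folder returns a CommitInfo whose pr_revision (e.g. "refs/pr/1") is downloadable like any other revision. A sketch with helper names of my own choosing:

```python
import os
from huggingface_hub import HfApi, snapshot_download

def push_as_pr(folder: str, repo_id: str) -> str:
    """Upload to a Hub PR and return its revision ref for staging evals."""
    api = HfApi(token=os.environ["HF_TOKEN"])
    commit = api.upload_folder(folder_path=folder, repo_id=repo_id, create_pr=True)
    return commit.pr_revision  # e.g. "refs/pr/4"

def fetch_pr_snapshot(repo_id: str, pr_revision: str) -> str:
    """Download the exact contents of a Hub PR for evaluation before merge."""
    return snapshot_download(repo_id, revision=pr_revision)
```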

SECTION 07

Gotchas

Hub model discovery and evaluation

The Hugging Face Hub hosts over 500,000 models with metadata covering task type, training data, evaluation scores, and license. Its filtering and sorting make discovery efficient: filtering by task (text-generation, token-classification, translation), language, library (transformers, diffusers), and license (apache-2.0, mit, llama3) quickly narrows the field to candidate models. Leaderboard Spaces such as the Open LLM Leaderboard aggregate benchmark scores, enabling direct quality comparison before downloading anything.

Resource type   Hub URL pattern                     Content
Model           huggingface.co/org/model-name       Weights, config, tokenizer, model card
Dataset         huggingface.co/datasets/org/name    Data files, dataset card, viewer
Space           huggingface.co/spaces/org/name      Demo app (Gradio/Streamlit)
Collection      huggingface.co/collections/...      Curated model/dataset groups

Client-side caching behavior affects both storage efficiency and download performance. By default, huggingface_hub downloads models to ~/.cache/huggingface/hub/ and stores each revision in a content-addressed layout that avoids duplicating identical files across model versions. Setting the HF_HOME environment variable redirects the cache to a custom path, which is necessary when the default home directory lacks space for large model weights. The cache can be shared across containers or VMs by mounting the same cache directory, avoiding repeated downloads of the same model in distributed serving environments.
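
One way to verify where the client will cache, as a minimal sketch (the mount point is hypothetical): HF_HOME must be set before the library is imported, because the cache path is resolved at import time.

```python
import os
os.environ["HF_HOME"] = "/mnt/shared/hf-cache"  # hypothetical shared volume; set BEFORE import

from huggingface_hub import constants

# The resolved cache dir is HF_HOME/hub; typically /mnt/shared/hf-cache/hub here
print(constants.HF_HUB_CACHE)
```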

Model Discovery, Hub Search, and Filtering Strategies

Hugging Face Hub's search interface (huggingface.co/models) filters 500K+ models by task (text-generation, image-classification, sentiment-analysis), framework (pytorch, tensorflow), language, and license. Programmatic discovery via huggingface_hub.list_models() supports the same facets: list_models(task="text-generation", library="transformers", sort="downloads", direction=-1) returns the most popular models first, useful for finding well-maintained alternatives to obscure checkpoints. Hub metadata (download counts, last update date, model card completeness) indicates active maintenance; models with <100 downloads or no update in six months are likely abandoned. Full-text search (search="legal document classification") finds domain-specific models; combining search results with license filters (apache-2.0, mit) identifies permissive models suitable for commercial deployment. Leaderboards (e.g., the Open LLM Leaderboard) rank models by benchmark performance, surfacing SOTA checkpoints; LMSYS Chatbot Arena provides crowdsourced rankings of instruction-tuned models, often more reflective of real-world performance than standardized benchmarks.
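
The full-text-plus-facets pattern above can be sketched as a small helper (name and query are illustrative):

```python
from huggingface_hub import list_models

def find_candidates(query: str, task: str, limit: int = 10) -> list:
    """Full-text search combined with a task facet, most-downloaded first."""
    return [
        m.id
        for m in list_models(search=query, task=task,
                             sort="downloads", direction=-1, limit=limit)
    ]

# find_candidates("legal", "text-classification") returns the top matching model ids
```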

Hub CLI, Model Cards, and Documentation Standards

The huggingface-cli command-line tool provides efficient model management: huggingface-cli download meta-llama/Llama-2-7b-hf --include "*.md" downloads just the model card and README for quick review, and huggingface-cli scan-cache summarizes which models and revisions are already on disk. Full repo metadata (architecture, license, file listing) is available programmatically through HfApi().model_info(). Model cards (standardized markdown files in each repo) document intended use, training procedures, known limitations, and benchmark results: required reading before deployment. Well-maintained models ship SafeTensors weights, a detailed README with code examples, and active community discussions; poor-quality repos lack these signals. Every repo is backed by Git LFS (Large File Storage), so plain Git operations stay cheap: GIT_LFS_SKIP_SMUDGE=1 git clone fetches a repo's history and small files with weights left as lightweight LFS pointers, avoiding multi-gigabyte downloads in CI/CD pipelines that only need configs and cards.
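
To gauge how large a download would be before committing to it, per-file sizes are available via model_info with files_metadata=True. A sketch (the helper name is mine; gpt2 is a stand-in public repo):

```python
from huggingface_hub import HfApi

def repo_file_sizes(repo_id: str) -> dict:
    """Map every file in a repo to its size in bytes, without downloading anything."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    return {s.rfilename: s.size for s in info.siblings}

# sizes = repo_file_sizes("gpt2"); total = sum(v for v in sizes.values() if v)
```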

Caching Strategies, Local Storage, and Offline Workflows

Hugging Face models are cached in ~/.cache/huggingface/hub by default; setting HF_HOME=/mnt/fast_ssd redirects the cache to faster storage during iterative development. For production inference, pre-populating local storage eliminates first-request latency: python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-2-7b-hf')" warms the cache so the serving process never lazy-loads over the network. Offline mode (export HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1) uses only cached models, critical for air-gapped deployments; missing models raise informative errors instead of attempting network access. For teams, huggingface_hub.HfApi(token="hf_xxx", endpoint="https://hub-internal.company.com") targets a private Hub instance (Hugging Face Enterprise) for access control and audit logging. Cache statistics are queryable: python -c "from huggingface_hub import scan_cache_dir; info = scan_cache_dir(); print(f'Total cache: {info.size_on_disk_str}')" shows cache size and cached revisions; periodic pruning (removing revisions not accessed in 30 days) conserves disk space in shared cluster environments where dozens of models are cached.
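
The pruning policy mentioned above can be sketched with scan_cache_dir and its delete_revisions helper; the 30-day threshold and function name are illustrative, and dry_run defaults to True so nothing is deleted by accident:

```python
import time
from huggingface_hub import scan_cache_dir

def prune_stale(days: int = 30, dry_run: bool = True) -> None:
    """Delete cached revisions not modified in `days` days (illustrative policy)."""
    info = scan_cache_dir()
    cutoff = time.time() - days * 24 * 3600
    stale = [
        rev.commit_hash
        for repo in info.repos
        for rev in repo.revisions
        if rev.last_modified and rev.last_modified < cutoff
    ]
    if not stale:
        print("Nothing to prune")
        return
    strategy = info.delete_revisions(*stale)  # plan first, then execute
    print(f"Would free {strategy.expected_freed_size_str}")
    if not dry_run:
        strategy.execute()
```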