
HF Datasets

Hugging Face Datasets library — Arrow-backed, streaming-capable dataset management for training on corpora that don't fit in RAM.



SECTION 01

Why HF Datasets?

Training datasets for LLMs can be hundreds of GB. You can't load that into RAM. The HF Datasets library solves this with memory-mapped Apache Arrow files — you access data lazily without loading everything.

SECTION 02

Loading & Exploring

```python
from datasets import load_dataset

# From Hugging Face Hub
ds = load_dataset("tatsu-lab/alpaca")              # Full dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k")  # Chat dataset

# Specific splits
train = load_dataset("tatsu-lab/alpaca", split="train")

# Or keep as DatasetDict
ds = load_dataset("tatsu-lab/alpaca")  # {"train": Dataset, "test": Dataset}

# From local files
ds = load_dataset("json", data_files="data.jsonl")
ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
ds = load_dataset("parquet", data_files="data.parquet")

# Inspect
print(ds)                        # DatasetDict overview
print(ds["train"])               # Dataset: num_rows, features
print(ds["train"][0])            # First example (dict)
print(ds["train"].features)      # Schema: column names and types
print(ds["train"].column_names)
```
SECTION 03

Filtering & Mapping

```python
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")

# Filter rows
ds = ds.filter(lambda x: len(x["instruction"]) > 20)  # Keep long instructions
ds = ds.filter(lambda x: x["output"] != "",           # Remove empty outputs
               num_proc=4)                            # Parallel filtering

# Map: transform each example
def format_prompt(example):
    prompt = f"### Instruction:\n{example['instruction']}"
    if example["input"]:
        prompt += f"\n\n### Input:\n{example['input']}"
    prompt += f"\n\n### Response:\n{example['output']}"
    return {"prompt": prompt}

ds = ds.map(format_prompt, num_proc=4)

# Batched map: more efficient for tokenization
def tokenize_batch(batch):
    return tokenizer(batch["prompt"], truncation=True, max_length=512)

ds = ds.map(tokenize_batch, batched=True, batch_size=1000, num_proc=4)

# Remove old columns
ds = ds.remove_columns(["instruction", "input", "output"])

# Select subset
subset = ds.select(range(1000))  # First 1000 examples
```
SECTION 04

Streaming Mode

```python
from datasets import load_dataset

# Streaming: never download the full dataset
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
# Returns IterableDataset instead of Dataset

# Iterate
for example in ds:
    text = example["text"]
    # process...

# Take the first N examples
examples = list(ds.take(1000))

# Shuffle (buffer-based: shuffles a window, not globally)
ds_shuffled = ds.shuffle(buffer_size=10000, seed=42)

# Map and filter work the same way
ds_processed = ds.map(lambda x: {"length": len(x["text"])})
ds_filtered = ds_processed.filter(lambda x: x["length"] > 100)

# With DataLoader: an IterableDataset plugs in directly
from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=8, num_workers=2)
```
When to use streaming: Any dataset > 10GB. Train on The Pile, RedPajama, or FineWeb without downloading 100GB first. Essential for pre-training.
SECTION 05

Tokenizing & Formatting

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("tatsu-lab/alpaca", split="train")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# Tokenize with padding/truncation
# (assumes a "prompt" column, e.g. built as in Section 03)
def tokenize(batch):
    return tokenizer(
        batch["prompt"],
        truncation=True,
        max_length=512,
        padding="max_length",  # Pad to fixed length for DataLoader
        return_tensors=None,   # Return lists (better for .map caching)
    )

ds = ds.map(tokenize, batched=True, batch_size=1000, num_proc=4)

# Create labels = input_ids (for causal LM: predict next token)
def add_labels(example):
    example["labels"] = example["input_ids"].copy()  # still lists here, not tensors
    return example

ds = ds.map(add_labels)

# Set format for PyTorch (after the labels column exists)
ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Use with DataLoader
from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=8, shuffle=True, num_workers=4)
batch = next(iter(loader))
print(batch["input_ids"].shape)  # (8, 512)
```
SECTION 06

Custom & Hub Datasets

```python
from datasets import Dataset, DatasetDict, load_dataset, load_from_disk
import pandas as pd

# Create from Python dict
data = {
    "instruction": ["Summarize this text.", "Translate to French."],
    "output": ["This is a summary.", "C'est une traduction."],
}
ds = Dataset.from_dict(data)

# Create from pandas DataFrame
df = pd.read_csv("my_data.csv")
ds = Dataset.from_pandas(df, preserve_index=False)

# Save to disk (Arrow format: fast re-loading)
ds.save_to_disk("/path/to/my_dataset")
ds = load_from_disk("/path/to/my_dataset")  # note: load_from_disk, not load_dataset

# Push to Hugging Face Hub
ds.push_to_hub("your-username/your-dataset-name", private=True)

# Load your private dataset
ds = load_dataset("your-username/your-dataset-name")

# DatasetDict for multiple splits
dataset = DatasetDict({
    "train": Dataset.from_dict(train_data),
    "validation": Dataset.from_dict(val_data),
    "test": Dataset.from_dict(test_data),
})
dataset.push_to_hub("your-username/your-dataset")
```
Caching tip: HF Datasets caches processed datasets under ~/.cache/huggingface/. Cache entries are keyed by a fingerprint of the dataset and the function, but fingerprinting can miss changes hidden in closures or globals; if a changed map function isn't picked up, pass load_from_cache_file=False or clear the cache.

Dataset processing pipelines for LLM training

Hugging Face Datasets' map() function is the primary tool for transforming raw datasets into model-ready training data. It applies a function to every example in the dataset in parallel across CPU cores, with the processed result cached to disk so subsequent map() calls on the same dataset load from cache rather than reprocessing. The batched=True parameter processes examples in groups, enabling batch operations like tokenization that are significantly faster when processing multiple texts simultaneously than processing one at a time.

Streaming and memory-efficient data loading

Streaming mode enables training on datasets larger than available RAM by loading and processing examples on demand rather than downloading the full dataset. This is essential for pretraining on web-scale datasets like RedPajama or The Pile that are terabytes in size. Streaming mode uses Python generators under the hood, producing one batch at a time rather than materializing the full dataset in memory. The trade-off is that streaming datasets do not support random access, and shuffling is approximate: only a configurable buffer of examples is shuffled at a time, never the full dataset.

| Dataset mode | Memory usage | Shuffling | Best for |
| --- | --- | --- | --- |
| In-memory (`keep_in_memory=True`) | Full dataset | Perfect | <10GB datasets |
| Memory-mapped Arrow (default) | OS page cache | Good | 10GB–100GB |
| Streaming | Shuffle buffer only | Buffer-limited | >100GB, web-scale |

Streaming and memory efficiency

Hugging Face Datasets supports streaming mode (load_dataset(..., streaming=True), or Dataset.to_iterable_dataset() on an already-loaded dataset) that downloads and processes data on demand without materializing the full dataset in memory or on disk. When training on large corpora (100GB+), streaming supports continuous training: each pass reads from the original source, applies preprocessing lazily, and serves batches. Streaming is essential for datasets larger than available disk (LAION-400M scale) or when iterating rapidly during development. However, streaming introduces latency: each sample access may download a chunk from remote storage. Production systems optimize by prefetching (downloading the next batch while the GPU processes the current one), local caching (keeping frequently accessed shards on disk), and batched requests (fetching many samples per remote call). Because IterableDataset mirrors most of the Dataset API (map, filter, shuffle, take), teams can develop locally on a downloaded subset and switch to streaming in production with minimal code changes; note, however, that the two classes are not identical, as IterableDataset supports neither random access nor len().

Dataset versioning and reproducibility

Hub dataset repositories are git repositories (with large files stored via git-LFS), enabling version control and reproducibility. When a model training run uses dataset v2.1, you can check out exactly that version for reproduction or debugging. Each version is identified by a git commit hash or tag, creating an immutable reference point, and load_dataset accepts a revision argument to pin one. This solves a common problem: training runs become irreproducible when data changes subtly (rows deleted, columns renamed, preprocessing logic altered). Every push_to_hub creates a new commit, giving a full audit trail. For production ML pipelines, this enables compliance (proving what data trained a model), debugging (comparing training runs across dataset versions), and governance (tracking when sensitive data was removed). Teams building data-flywheel systems can leverage versioning to manage incremental updates: add 1,000 new labeled examples, tag version 2.2, retrain models, compare performance, and roll back if quality degrades.

Custom preprocessing and data transformations

Hugging Face Datasets provides a map() function for applying custom transformations: dataset.map(tokenize_fn, batched=True). This supports parallelizing preprocessing across CPU cores (num_proc), caching intermediate results (recomputing only when the input data or the function changes), and chaining transformations (tokenization, truncation, numeric conversion) into a pipeline. Caching is powerful: if preprocessing is expensive (loading a 100MB vocab, training a BPE tokenizer), the first call computes and caches results; subsequent runs load from cache, reducing development iteration time from minutes to seconds. For fine-tuning workflows, caching enables experimentation: tokenize the corpus once, then try multiple model architectures or hyperparameter configurations without reprocessing. Production deployments can also checkpoint intermediate datasets: save the dataset after deduplication, then again after filtering, enabling teams to validate each step independently and debug failures at the right stage.