Training · Data

Data Preparation for Fine-Tuning

Format conversion, quality filtering, deduplication, and dataset mixing — getting training data production-ready.

5 Pipeline steps
5 Format standards
MinHash Near-deduplication
Contents
  1. The data prep pipeline
  2. Format standards
  3. Format conversion
  4. Quality filtering
  5. Deduplication
  6. Dataset mixing
  7. Versioning
  8. Tools
01 — Overview

The Data Prep Pipeline

Getting raw data into shape for fine-tuning is a series of transformations. Each step removes noise and improves signal. The order matters: collect and normalize first, then convert and filter, and deduplicate before splitting. Skipping or reordering steps leads to wasted compute.

The Five Steps

1. Collect & normalize: Gather data from all sources. Normalize whitespace, line breaks, and encoding. Fix obvious corruption.
2. Format conversion: Convert all data to a single canonical format (Alpaca, ShareGPT, or ChatML). Handle edge cases and validation.
3. Quality filtering: Remove short, toxic, low-quality, or non-language data. Use heuristic and learned signals.
4. Deduplication: Remove near-duplicate examples using exact match, MinHash, or embedding-based methods. The most critical step.
5. Split & version: Create train/val splits. Version the dataset. Document all transformations and their parameters.
💡 Quality > Quantity: 10k high-quality examples beat 100k noisy ones. Every step removes noise; time invested here buys a smaller, cleaner dataset.
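As a sketch of how the steps chain together, here is a minimal end-to-end driver. The helper functions are simplified stand-ins, not the real pipeline; fuller versions of conversion, filtering, and deduplication appear in the sections below, and the split fraction and seed are illustrative.

```python
import random

# Simplified stand-ins for each step; fuller versions appear later in this guide.
def normalize(record):
    # Step 1: collapse whitespace and stray line breaks
    return {k: " ".join(str(v).split()) for k, v in record.items()}

def to_canonical(record):
    # Step 2: map everything onto one schema
    return {"instruction": record.get("instruction", ""),
            "output": record.get("output", "")}

def passes_filters(record):
    # Step 3: cheap length heuristic (see Quality Filtering below)
    return len(record["output"].split()) >= 20

def dedupe_exact(records):
    # Step 4: exact-match dedup (MinHash handles near-duplicates, see below)
    seen, unique = set(), []
    for r in records:
        if r["output"] not in seen:
            seen.add(r["output"])
            unique.append(r)
    return unique

def prepare(records, val_frac=0.05, seed=42):
    rows = [to_canonical(normalize(r)) for r in records]
    rows = [r for r in rows if passes_filters(r)]
    rows = dedupe_exact(rows)
    random.Random(seed).shuffle(rows)  # Step 5: reproducible split
    cut = int(len(rows) * (1 - val_frac))
    return rows[:cut], rows[cut:]
```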
02 — Standards

Format Standards Comparison

Different frameworks and datasets use different JSON structures. Standardizing early prevents downstream issues. Choose one canonical format and convert everything to it.

| Format | Structure | Multi-turn | System prompt | Tool use |
|---|---|---|---|---|
| Alpaca | `{"instruction", "input", "output"}` | No | No | No |
| ShareGPT | `{"conversations": [{"from", "value"}]}` | Yes | Implicit | Limited |
| ChatML | `<\|im_start\|>role\nmsg<\|im_end\|>` | Yes | Yes | Yes |
| OpenAI | `{"messages": [{"role", "content"}]}` | Yes | Yes | Yes (function calls) |
| JSONL custom | Your schema | Yes | Yes | Yes |

Recommendation: For production, use ChatML or OpenAI format. Both support multi-turn conversations, system prompts, and tool use. For simple instruction-following, Alpaca is minimal but limiting.
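To make the comparison concrete, here is the same training example rendered as Python dicts in the Alpaca and OpenAI schemas (the content is invented for illustration):

```python
# The same training example in two of the schemas above.
alpaca_record = {
    "instruction": "Summarize the passage.",
    "input": "Large language models are trained on...",
    "output": "LLMs learn statistical patterns from large text corpora.",
}

openai_record = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",
         "content": "Summarize the passage.\nLarge language models are trained on..."},
        {"role": "assistant",
         "content": "LLMs learn statistical patterns from large text corpora."},
    ]
}
```

Note how Alpaca's separate "input" field folds into the user message; that asymmetry is exactly what conversion code has to handle.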

03 — Transformation

Format Conversion

Converting between formats is the most common data prep task. Write validators for both source and target formats, handle edge cases (missing fields, malformed JSON), and log conversions that fail.

Alpaca to ChatML Example

```python
def alpaca_to_chatml(alpaca_record):
    """Convert Alpaca format to ChatML."""
    instruction = alpaca_record.get("instruction", "")
    input_text = alpaca_record.get("input", "")
    output = alpaca_record.get("output", "")

    # Alpaca's optional "input" field is appended to the instruction
    prompt = instruction
    if input_text:
        prompt += f"\n{input_text}"

    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": output},
    ]
    return format_chatml(messages)


def format_chatml(messages):
    """Format messages as a ChatML string."""
    chatml = ""
    for msg in messages:
        chatml += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return chatml.strip()


# Example
alpaca = {"instruction": "Explain AI", "input": "", "output": "AI is..."}
print(alpaca_to_chatml(alpaca))
```
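To act on the validation advice above, here is a minimal sketch of a source-side validator that logs and skips failing records. It reuses alpaca_to_chatml from the previous block; the required-key set and logger name are illustrative assumptions, not a fixed schema.

```python
import json
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("convert")

REQUIRED_ALPACA_KEYS = {"instruction", "output"}  # assumed minimum

def validate_alpaca(record):
    """Return None if the source record is usable, else a failure reason."""
    if not isinstance(record, dict):
        return "not_a_dict"
    missing = REQUIRED_ALPACA_KEYS - record.keys()
    if missing:
        return f"missing_fields:{sorted(missing)}"
    if not str(record["output"]).strip():
        return "empty_output"
    return None

def convert_file(path):
    """Convert a JSONL file, logging and skipping records that fail."""
    converted = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                log.warning("line %d: malformed JSON", lineno)
                continue
            reason = validate_alpaca(record)
            if reason:
                log.warning("line %d: %s", lineno, reason)
                continue
            converted.append(alpaca_to_chatml(record))
    return converted
```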


04 — Noise Removal

Quality Filtering

Raw data contains noise: very short examples, spam, toxic content, non-language sequences, and examples outside your domain. Quality filters remove these before expensive training.

Common Filters

| Filter | Signal | Threshold example | Notes |
|---|---|---|---|
| Length | Token count | output > 20 tokens | Remove very short responses; set lower for instruction data |
| Perplexity | Language-model log-prob | ppl < 50 (tuned model) | High perplexity indicates low-quality or non-language text |
| Language | Detected language | lang == 'en' | Use langdetect; allow multilingual if intended |
| Toxicity | Classifier score | toxicity < 0.5 | Use Perspective API or a Hugging Face model |
| Repetition | Repeated substrings | max_repetition_ratio < 0.3 | Filters repetitive spam and corrupted text |

Filtering in Python

```python
from langdetect import detect, LangDetectException

def quality_filter(record, min_tokens=20, max_repetition=0.3):
    """Apply quality filters to a record. Returns (keep, reason)."""
    output = record.get("output", "")

    # Length check (whitespace tokens as a cheap proxy for real tokens)
    if len(output.split()) < min_tokens:
        return False, "too_short"

    # Language check
    try:
        if detect(output) != "en":
            return False, "non_english"
    except LangDetectException:
        return False, "lang_error"

    # Repetition check
    if check_repetition(output) > max_repetition:
        return False, "too_repetitive"

    return True, "pass"

def check_repetition(text):
    """Ratio of repeated 4-grams."""
    words = text.split()
    if len(words) < 4:
        return 0.0
    grams = [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]
    return 1 - len(set(grams)) / len(grams)

# Filter the dataset
filtered = [r for r in data if quality_filter(r)[0]]
print(f"Kept {len(filtered)}/{len(data)} records")
```
💡 Tune thresholds on a validation set. Filtering too aggressively loses signal; filtering too leniently keeps noise. Monitor validation loss during training; if it stays high, increase filter strictness.
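The snippet above omits the perplexity filter from the table. Here is a sketch of one, scoring text with GPT-2 via Hugging Face transformers; the model choice and the ppl < 50 cutoff are assumptions to tune for your data and scoring model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is an arbitrary small scoring model; any causal LM works.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text, max_tokens=512):
    """Perplexity of text under the scoring model (lower = more fluent)."""
    ids = tok(text, return_tensors="pt",
              truncation=True, max_length=max_tokens).input_ids
    if ids.shape[1] < 2:            # too short to score
        return float("inf")
    loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

# Keep only records below the cutoff (assumed threshold; tune per model)
data = [r for r in data if perplexity(r.get("output", "")) < 50]
```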
05 — Critical Step

Deduplication Strategies

Duplicate examples in training data inflate apparent dataset size without adding signal. Duplicates also hurt generalization and can leak into validation splits. Deduplication is the highest-ROI data prep step.

Three Deduplication Methods

| Method | Detects | Speed | Memory | Best for |
|---|---|---|---|---|
| Exact hash | Byte-identical duplicates | Fast | Low | Removing copy-paste duplicates |
| MinHash LSH | Near-duplicates (85%+ similar) | Fast | Medium | Production deduplication |
| Embedding-based | Semantic duplicates | Slow | High | Finding conceptually similar examples |
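MinHash is covered next; for the embedding-based route, here is a sketch using sentence-transformers. The model name and the 0.95 cutoff are assumptions, and the O(n²) scan only suits small datasets (use an ANN index such as FAISS at scale).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice is an assumption; any sentence-embedding model works.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_dedupe(texts, threshold=0.95):
    """Keep the first of each cluster of semantically similar texts."""
    emb = model.encode(texts, normalize_embeddings=True)  # unit-norm vectors
    kept_idx, kept_emb = [], []
    for i, e in enumerate(emb):
        # Cosine similarity reduces to a dot product on unit vectors
        if kept_emb and np.max(np.stack(kept_emb) @ e) >= threshold:
            continue  # too close to something already kept
        kept_idx.append(i)
        kept_emb.append(e)
    return kept_idx

texts = [r.get("output", "") for r in data]
data = [data[i] for i in semantic_dedupe(texts)]
```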

MinHash for Near-Deduplication

MinHash combined with locality-sensitive hashing (LSH) is the standard for production deduplication. It is fast, scalable, and detects near-duplicates without full pairwise comparison.

```python
from datasketch import MinHash, MinHashLSH

def create_minhash(text, num_perm=128):
    """Create a MinHash signature for text."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode())
    return m

def deduplicate_dataset(records, threshold=0.9):
    """Deduplicate using MinHash LSH, keeping the first occurrence."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_records = []
    duplicate_ids = set()

    for i, record in enumerate(records):
        mh = create_minhash(record.get("output", ""))

        # Only kept records are inserted, so any hit means an earlier
        # near-duplicate already exists
        if lsh.query(mh):
            duplicate_ids.add(i)
        else:
            lsh.insert(str(i), mh)
            unique_records.append(record)

    print(f"Removed {len(duplicate_ids)} "
          f"({len(duplicate_ids) / len(records) * 100:.1f}%) "
          f"near-duplicates")
    return unique_records

# Apply deduplication
deduped = deduplicate_dataset(data, threshold=0.85)
print(f"Dataset: {len(data)} -> {len(deduped)}")
```
⚠️ Deduplication is lossy. You will lose valid examples if thresholds are too strict. Expect 3–5% duplicate removal for most datasets; more than 10% suggests data quality issues upstream.
06 — Composition

Dataset Mixing & Sampling

Most fine-tuning datasets combine multiple sources: instruction data, dialogue, domain knowledge, code, etc. Mixing ratios matter. Training on a 90:10 instruction-to-dialogue split produces different behavior than 50:50.

Mixing Strategy

| Mix type | Typical ratio | Effect | When to use |
|---|---|---|---|
| Instruction-heavy | 70% instruction, 30% other | Strong instruction following, less conversational | Task-oriented assistants |
| Balanced | 50% instruction, 25% dialogue, 25% domain | Versatile; handles various interaction patterns | General-purpose assistants |
| Dialogue-heavy | 70% multi-turn dialogue, 30% instruction | Natural conversation, context awareness | Chatbots, conversational agents |
| Domain-specific | 50% domain instruction, 50% general | Domain expertise + general capability | Expert systems, specialized models |

Temperature Sampling

When combining multiple datasets, use temperature-based sampling to balance representation. This prevents larger datasets from dominating.

```python
import random

def balance_dataset_mix(datasets, temperature=2.0, total_examples=10_000, seed=42):
    """Mix datasets using temperature sampling.

    Sampling probability ∝ (size / total) ** (1 / temperature).
    T = 1 keeps the natural proportions; T > 1 flattens the mix toward
    uniform, which stops the largest dataset from dominating.
    """
    sizes = {name: len(ds) for name, ds in datasets.items()}
    total_size = sum(sizes.values())

    # Compute normalized sampling probabilities
    weights = {name: (size / total_size) ** (1.0 / temperature)
               for name, size in sizes.items()}
    norm = sum(weights.values())
    probs = {name: w / norm for name, w in weights.items()}

    # Draw each source's share of the final mix without replacement
    rng = random.Random(seed)
    mixed = []
    for name, dataset in datasets.items():
        count = min(int(probs[name] * total_examples), len(dataset))
        mixed.extend(rng.sample(dataset, count))
    rng.shuffle(mixed)
    return mixed

# Example: with T = 2, dialogue's share of the mix drops from ~77% to ~58%.
datasets = {
    "instruction": instructions,   # 50k examples
    "dialogue": dialogues,         # 200k examples
    "domain": domain_data,         # 10k examples
}
mixed = balance_dataset_mix(datasets, temperature=2.0)
```
07 — Reproducibility

Versioning & Reproducibility

Data is code. Version datasets like you version models. Track all transformations, parameters, and random seeds. This enables reproducibility, debugging, and understanding model behavior.
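One way to make that tracking concrete is a manifest written alongside each dataset release. This is a minimal sketch with an assumed schema (adapt it to DVC or whatever registry you use); it reuses the deduped list from the deduplication example.

```python
import hashlib
import json
import time

def write_manifest(path, records, params, version="1.2"):
    """Record dataset version, size, content hash, and pipeline parameters."""
    blob = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    manifest = {
        "version": version,
        "created": time.strftime("%Y-%m-%d"),
        "num_examples": len(records),
        "sha256": hashlib.sha256(blob.encode()).hexdigest(),  # content fingerprint
        "params": params,  # filter thresholds, dedup threshold, seeds, ...
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

write_manifest(
    "manifest.json",
    deduped,
    {"min_tokens": 20, "dedup_threshold": 0.85, "seed": 42},
)
```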


Dataset Card Example

```markdown
---
dataset_info:
  features:
    - name: instruction
      dtype: string
    - name: output
      dtype: string
  splits:
    - name: train
      num_examples: 45000
    - name: validation
      num_examples: 5000
---

# Fine-Tune Dataset v1.2

## Description

High-quality instruction-following dataset combining open-source data
with quality filters applied.

## Processing

1. Collected from 5 sources (Alpaca, ORCA, Open-Platypus, etc.)
2. Converted to Alpaca format
3. Quality filters: length > 20 tokens, toxicity < 0.5, English only
4. Deduplication: MinHash @ 0.85 threshold
5. Final: 50k examples

## Known Issues

- No code examples (filtered out; add if needed)
- Biased toward English instructions
- Limited non-English data

## License

CC-BY-4.0 (mixed sources)
```
08 — Ecosystem

Data Prep Tools

- Hugging Face Datasets (format conversion): standard library for loading, processing, and sharing datasets, with built-in format conversion and filtering.
- DataTrove (data engineering): Hugging Face's ETL pipeline library; streaming processing, filtering, and deduplication at scale. Used to build FineWeb.
- Dolma (pretraining corpus): AllenAI's large-scale document preparation toolkit; deduplication, language detection, filtering.
- text-dedup (deduplication): fast, scalable deduplication using MinHash; handles billions of documents.
- fastText (language detection): fast language identification and general text classification; runs on CPU.
- Argilla (annotation): open-source data labeling with quality filtering, annotation workflows, and feedback loops.
- DVC, Data Version Control (versioning): version large datasets like code; tracks data pipelines, reproducibility, and lineage.
- LM-Datasets (dataset collection): curated open-source datasets for language modeling and fine-tuning, ready to use with Hugging Face.