01 — Overview
The Data Prep Pipeline
Getting raw data into shape for fine-tuning is a series of transformations. Each step removes noise and improves signal. The order matters: collect first, then convert and filter, deduplicate after filtering, and split last. Skipping steps or reordering them wastes compute.
The Five Steps
1. Collect & normalize: Gather data from all sources. Normalize whitespace, line breaks, and encoding. Fix obvious corruption.
2. Format conversion: Convert all data to a single canonical format (Alpaca, ShareGPT, or ChatML). Handle edge cases and validation.
3. Quality filtering: Remove short, toxic, low-quality, or non-language data. Use heuristic and learned signals.
4. Deduplication: Remove near-duplicate examples using exact match, MinHash, or embedding-based methods. The most critical step.
5. Split & version: Create train/val splits. Version the dataset. Document all transformations and their parameters.
💡
Quality > Quantity: 10k high-quality examples beat 100k noisy ones. Every step removes noise; invest time here to compress your dataset.
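The ordering above can be sketched end-to-end on toy records. All function names here are illustrative placeholders for the stages described in later sections, not a real library:

```python
# Minimal sketch of the five pipeline steps on toy data.
# Each stage takes records and returns transformed records.

def collect_and_normalize(sources):
    # Step 1: flatten all sources, normalize whitespace.
    return [" ".join(text.split()) for src in sources for text in src]

def to_canonical_format(texts):
    # Step 2: wrap each text in one canonical schema (Alpaca-style).
    return [{"instruction": "", "output": t} for t in texts]

def quality_filter(records, min_words=3):
    # Step 3: drop very short outputs (a stand-in for real filters).
    return [r for r in records if len(r["output"].split()) >= min_words]

def deduplicate(records):
    # Step 4: exact-match dedup, keeping first occurrences.
    seen, unique = set(), []
    for r in records:
        if r["output"] not in seen:
            seen.add(r["output"])
            unique.append(r)
    return unique

def split_dataset(records, val_fraction=0.2):
    # Step 5: deterministic train/val split.
    n_val = int(len(records) * val_fraction)
    return records[n_val:], records[:n_val]

sources = [["hello   world foo", "hi"], ["hello world foo", "a b c d"]]
train, val = split_dataset(deduplicate(quality_filter(
    to_canonical_format(collect_and_normalize(sources)))))
```

Each real stage is far more involved, but the data flow, and the reason ordering matters (filtering before deduplication means you never dedupe records you were going to drop anyway), is exactly this chain.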
04 — Noise Removal
Quality Filtering
Raw data contains noise: very short examples, spam, toxic content, non-language sequences, and examples outside your domain. Quality filters remove these before expensive training.
Common Filters
| Filter | Signal | Threshold example | Notes |
|---|---|---|---|
| Length | Token count | output > 20 tokens | Remove very short responses; set lower for instruction data |
| Perplexity | Language model log-prob | ppl < 50 (tuned model) | High perplexity indicates low-quality or non-language text |
| Language | Detected language | lang == 'en' | Use langdetect; allow multiple languages if intended |
| Toxicity | Classifier score | toxicity < 0.5 | Use Perspective API or a Hugging Face model |
| Repetition | Repeated substrings | max_repetition_ratio < 0.3 | Filters repetitive spam and corrupted text |
Filtering in Python
```python
import langdetect

def quality_filter(record, min_tokens=20):
    """Apply quality filters to a record. Returns (keep, reason)."""
    output = record.get("output", "")

    # Length check
    if len(output.split()) < min_tokens:
        return False, "too_short"

    # Language check
    try:
        if langdetect.detect(output) != 'en':
            return False, "non_english"
    except langdetect.LangDetectException:
        return False, "lang_error"

    # Repetition check
    if check_repetition(output) > 0.3:
        return False, "too_repetitive"

    # (A perplexity filter, as in the table above, would require
    # scoring with a language model; omitted here for brevity.)
    return True, "pass"

def check_repetition(text):
    """Ratio of repeated 4-grams."""
    words = text.split()
    if len(words) < 4:
        return 0.0
    grams = [' '.join(words[i:i + 4]) for i in range(len(words) - 3)]
    return 1 - len(set(grams)) / len(grams)

# Filter dataset
filtered = [r for r in data if quality_filter(r)[0]]
print(f"Kept {len(filtered)}/{len(data)} records")
```
💡
Tune thresholds on a validation set. Filtering too aggressively loses signal; filtering too leniently keeps noise. Monitor validation loss during training; if it stays high, increase filter strictness.
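When tuning thresholds, it helps to tally why records were rejected rather than only counting survivors. A small sketch, assuming a filter function that returns a `(keep, reason)` pair like the one above:

```python
from collections import Counter

def audit_filters(records, filter_fn):
    """Tally rejection reasons to guide threshold tuning.

    filter_fn(record) is assumed to return (keep, reason); the
    reason strings (e.g. "too_short", "pass") are counted so you
    can see which filter is doing the most work."""
    reasons = Counter(filter_fn(r)[1] for r in records)
    for reason, count in reasons.most_common():
        print(f"{reason}: {count} ({count / len(records):.1%})")
    return reasons
```

If one reason dominates the rejections, loosen that single threshold first instead of adjusting everything at once.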
05 — Critical Step
Deduplication Strategies
Duplicate examples in training data inflate apparent dataset size without adding signal. Duplicates also hurt generalization and can leak into validation splits. Deduplication is the highest-ROI data prep step.
Three Deduplication Methods
| Method | Detects | Speed | Memory | Best for |
|---|---|---|---|---|
| Exact hash | Byte-identical copies | Fast | Low | Removing copy-paste duplicates |
| MinHash LSH | Near-duplicates (85%+ similar) | Fast | Medium | Production deduplication |
| Embedding-based | Semantic duplicates | Slow | High | Finding conceptually similar examples |
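The exact-hash row of the table takes only a few lines with Python's standard hashlib; a minimal sketch:

```python
import hashlib

def exact_dedup(records, key="output"):
    """Drop byte-identical records, keeping the first occurrence.

    Hashing the text (rather than storing it) keeps memory low
    even for large datasets."""
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(
            record.get(key, "").encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Run exact-hash dedup first: it is cheap, removes the easy copy-paste duplicates, and shrinks the input to the near-duplicate methods below.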
MinHash for Near-Deduplication
MinHash combined with Locality-Sensitive Hashing (LSH) is the standard for production deduplication. It is fast, scalable, and detects near-duplicates without comparing every pair of records.
```python
from datasketch import MinHash, MinHashLSH

def create_minhash(text, num_perm=128):
    """Create a MinHash signature from a text's tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

def deduplicate_dataset(records, threshold=0.9):
    """Deduplicate using MinHash LSH, keeping first occurrences."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_records = []
    duplicate_ids = set()
    for i, record in enumerate(records):
        mh = create_minhash(record.get("output", ""))
        # Query the index: any hit means a kept record is already
        # a near-duplicate of this one.
        if lsh.query(mh):
            duplicate_ids.add(i)
        else:
            lsh.insert(str(i), mh)
            unique_records.append(record)
    print(f"Removed {len(duplicate_ids)} "
          f"({len(duplicate_ids) / len(records) * 100:.1f}%) "
          f"near-duplicates")
    return unique_records

# Apply deduplication
deduped = deduplicate_dataset(data, threshold=0.85)
print(f"Dataset: {len(data)} -> {len(deduped)}")
```
⚠️
Deduplication is lossy. You will lose valid examples if thresholds are too strict. Aim for 3–5% duplicate removal for most datasets. More than 10% suggests data quality issues upstream.
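The embedding-based method from the table can be sketched as a greedy pass over precomputed vectors; the embedding model itself (e.g. a sentence-transformer) is assumed and not shown:

```python
import numpy as np

def embedding_dedup(embeddings, threshold=0.95):
    """Greedy semantic dedup over precomputed embeddings.

    Keep an item only if its cosine similarity to every
    already-kept item is below the threshold. O(n^2) in the
    worst case, so suitable for small datasets; larger sets
    need an approximate-nearest-neighbor index."""
    vecs = np.asarray(embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    kept = []
    for i, v in enumerate(vecs):
        if all(v @ vecs[j] < threshold for j in kept):
            kept.append(i)
    return kept  # indices of records to keep
```

The threshold plays the same role as the MinHash threshold above: higher values keep more data but let more semantic duplicates through.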
06 — Composition
Dataset Mixing & Sampling
Most fine-tuning datasets combine multiple sources: instruction data, dialogue, domain knowledge, code, etc. Mixing ratios matter. Training on a 90:10 instruction-to-dialogue split produces different behavior than 50:50.
Mixing Strategy
| Mix type | Typical ratio | Effect | When to use |
|---|---|---|---|
| Instruction-heavy | 70% instruction, 30% other | Strong instruction following, less conversational | Task-oriented assistants |
| Balanced | 50% instruction, 25% dialogue, 25% domain | Versatile; handles varied interaction patterns | General-purpose assistants |
| Dialogue-heavy | 70% multi-turn dialogue, 30% instruction | Natural conversation, context awareness | Chatbots, conversational agents |
| Domain-specific | 50% domain instruction, 50% general | Domain expertise plus general capability | Expert systems, specialized models |
Temperature Sampling
When combining multiple datasets, use temperature-based sampling to balance representation. This prevents larger datasets from dominating.
```python
import random

def balance_dataset_mix(datasets, temperature=1.0, total_samples=10_000):
    """Mix datasets using temperature sampling.

    Each dataset's sampling weight is size^(1/T). T = 1 samples
    in proportion to size; T > 1 flattens the distribution, so
    large datasets dominate less and small ones are better
    represented."""
    # Compute unnormalized weights: weight ∝ size^(1/T)
    weights = {name: len(ds) ** (1.0 / temperature)
               for name, ds in datasets.items()}
    total = sum(weights.values())

    # Draw from each dataset in proportion to its normalized weight
    mixed = []
    for name, dataset in datasets.items():
        sample_count = int(weights[name] / total * total_samples)
        mixed.extend(random.sample(dataset,
                                   min(sample_count, len(dataset))))
    return mixed

# Example
datasets = {
    "instruction": instructions,  # 50k examples
    "dialogue": dialogues,        # 200k examples
    "domain": domain_data,        # 10k examples
}
# T = 2 keeps the 200k dialogue set from dominating the mix
mixed = balance_dataset_mix(datasets, temperature=2.0)
```
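To see what the size^(1/T) exponent actually does, compare the normalized weights for the three example dataset sizes (50k / 200k / 10k) at two temperatures. This is plain arithmetic, independent of any sampling code:

```python
sizes = {"instruction": 50_000, "dialogue": 200_000, "domain": 10_000}

for T in (1.0, 2.0):
    weights = {k: v ** (1.0 / T) for k, v in sizes.items()}
    total = sum(weights.values())
    probs = {k: round(w / total, 3) for k, w in weights.items()}
    print(f"T={T}: {probs}")
# At T = 1, dialogue takes ~77% of the mix; at T = 2 it drops
# to ~58%, while domain rises from ~4% to ~13%.
```

Raising the temperature never changes which dataset is largest; it only compresses the gaps between their shares.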
07 — Reproducibility
Versioning & Reproducibility
Data is code. Version datasets like you version models. Track all transformations, parameters, and random seeds. This enables reproducibility, debugging, and understanding model behavior.
Best Practices
- Push to Hugging Face: Use push_to_hub() to version datasets. Includes git-style version history.
- Dataset cards: Document source, license, filtering steps, and known issues in the README.
- Manifest files: Store split info (train/val/test record IDs) so you can recreate exact splits later.
- DVC for large files: If datasets are >5 GB, use Data Version Control for efficient storage.
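A manifest can be a plain JSON file recording split membership and the parameters that produced it. A minimal sketch; the field names and file name here are illustrative, not a standard format:

```python
import json

def write_manifest(path, splits, seed, version):
    """Record split membership and parameters so the exact
    train/val/test splits can be reconstructed later."""
    manifest = {
        "version": version,
        "seed": seed,
        # Sort IDs so the file diffs cleanly under version control
        "splits": {name: sorted(ids) for name, ids in splits.items()},
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

write_manifest("manifest_v1.2.json",
               splits={"train": [0, 2, 3], "validation": [1]},
               seed=42, version="1.2")
```

Commit the manifest alongside the dataset card; together they make any past training run reproducible from the raw data.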
Dataset Card Example
```markdown
---
dataset_info:
  features:
    - name: instruction
      dtype: string
    - name: output
      dtype: string
  splits:
    - name: train
      num_examples: 45000
    - name: validation
      num_examples: 5000
---

# Fine-Tune Dataset v1.2

## Description
High-quality instruction-following dataset combining
open-source data with quality filters applied.

## Processing
1. Collected from 5 sources (Alpaca, ORCA, Open-Platypus, etc.)
2. Converted to Alpaca format
3. Quality filters: length > 20 tokens, toxicity < 0.5, English only
4. Deduplication: MinHash @ 0.85 threshold
5. Final: 50k examples

## Known Issues
- No code examples (filtered out; add if needed)
- Biased toward English instructions
- Limited non-English data

## License
CC-BY-4.0 (mixed sources)
```
09 — Further Reading
References
Academic Papers
- Kvratinsky, S. & Cafarella, M. (2023). Text Deduplication for Large Language Models. arXiv:2303.07133.
- Lhoest, Q. et al. (2021). Datasets: A Community Library for Natural Language Processing. EMNLP 2021 (System Demonstrations).
- Soldaini, L. et al. (2024). Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv:2402.00159.
Documentation & Guides
Practitioner Writing