Data Engineering

Active Learning

A data collection strategy that selects the most informative examples for labelling — commonly reported to reduce annotation cost by 50–80% while matching the performance of models trained on much larger randomly sampled sets.

Annotation savings
50–80%
Strategies
uncertainty/diversity/query-by-committee
Frameworks
modAL, small-text, Argilla

SECTION 01

What Is Active Learning?

A model trained on 1,000 randomly selected examples performs worse than one trained on 500 intelligently selected examples. Active learning identifies which unlabelled examples would be most informative for the model to learn from — typically the examples the model is currently most uncertain about. By labelling only these examples, you get maximum performance improvement per annotation dollar.

SECTION 02

Uncertainty Sampling

The simplest strategy: label examples where the current model is least confident. For classifiers, select examples where the probability of the most likely class is lowest. For sequence models, select examples with the highest token-level entropy.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sample(model: LogisticRegression, X_unlabelled: np.ndarray,
                       n_samples: int = 100) -> list[int]:
    """Least-confident sampling: pick examples with the lowest top-class probability."""
    probs = model.predict_proba(X_unlabelled)
    max_probs = np.max(probs, axis=1)
    # Sort ascending: lowest confidence first (min possible max-prob is 1/n_classes)
    return np.argsort(max_probs)[:n_samples].tolist()

def margin_sampling(model, X_unlabelled, n_samples=100):
    """Margin sampling: pick examples with the smallest gap between the top two classes."""
    probs = model.predict_proba(X_unlabelled)
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]
    # Margin = difference between top-2 class probabilities; small margin = uncertain
    margins = sorted_probs[:, 0] - sorted_probs[:, 1]
    return np.argsort(margins)[:n_samples].tolist()

SECTION 03

Diversity Sampling

Uncertainty sampling can create redundant batches — if 100 uncertain examples are all very similar, labelling them adds little new information. Diversity sampling selects examples spread across the feature space: cluster the unlabelled pool (k-means), then sample from each cluster. CoreSet selection finds the most representative subset geometrically.
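The cluster-then-sample recipe above can be sketched with scikit-learn's KMeans. This is a minimal sketch, assuming a numeric feature matrix; picking the member nearest each centroid is one simple CoreSet-style heuristic, not the only option:

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(X_unlabelled: np.ndarray, n_samples: int = 100,
                     random_state: int = 0) -> list[int]:
    """Cluster the unlabelled pool, then take the point nearest each centroid."""
    n_clusters = min(n_samples, len(X_unlabelled))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X_unlabelled)
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Within each cluster, keep the member closest to the centroid
        dists = np.linalg.norm(X_unlabelled[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return selected
```

Because each pick comes from a different cluster, the returned indices are guaranteed distinct and spread across the feature space.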

SECTION 04

Query by Committee

Train multiple models with different initialisations or hyperparameters. Label examples where the committee disagrees most (measured by vote entropy or KL divergence between models' predictions). More robust than single-model uncertainty — disagreement is a stronger signal than low confidence from one model.
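Vote entropy can be computed from any list of fitted classifiers that expose a `predict` method. A minimal sketch (`committee_sample` and its signature are illustrative, not a fixed API):

```python
import numpy as np

def committee_sample(models, X_unlabelled: np.ndarray,
                     n_samples: int = 100) -> list[int]:
    """Select examples where committee members' hard votes disagree most (vote entropy)."""
    # Shape (n_models, n_examples): one row of predictions per committee member
    votes = np.stack([m.predict(X_unlabelled) for m in models])
    classes = np.unique(votes)
    entropies = []
    for col in votes.T:
        # Fraction of the committee voting for each class on this example
        fractions = np.array([(col == c).mean() for c in classes])
        fractions = fractions[fractions > 0]
        entropies.append(-np.sum(fractions * np.log(fractions)))
    # Highest disagreement first
    return np.argsort(entropies)[-n_samples:][::-1].tolist()
```

Unanimous examples get entropy 0 and are never selected ahead of contested ones.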

SECTION 05

Active Learning Loop

The loop: train → select uncertain examples → label → retrain.

class ActiveLearningLoop:
    def __init__(self, model, X_labelled, y_labelled, X_unlabelled):
        self.model = model
        self.X_l, self.y_l = list(X_labelled), list(y_labelled)
        self.X_u = list(X_unlabelled)

    def run(self, n_iterations: int, batch_size: int, annotator_fn):
        for iteration in range(n_iterations):
            # Train on current labelled set
            self.model.fit(np.array(self.X_l), np.array(self.y_l))
            # Select most uncertain examples
            indices = uncertainty_sample(self.model, np.array(self.X_u), batch_size)
            # Annotate selected examples
            selected_X = [self.X_u[i] for i in indices]
            new_labels = annotator_fn(selected_X)
            # Move from unlabelled to labelled pool. Pair each index with its
            # label BEFORE sorting, so labels stay aligned with their examples
            # while popping from the end keeps the remaining indices valid.
            pairs = sorted(zip(indices, new_labels), key=lambda p: p[0], reverse=True)
            for i, label in pairs:
                self.X_l.append(self.X_u.pop(i))
                self.y_l.append(label)
            print(f"Iteration {iteration+1}: {len(self.X_l)} labelled examples")

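An end-to-end run of such a loop can be sketched on a synthetic dataset, with the true labels standing in for a human annotator. Everything here (`run_loop`, the dataset and batch sizes) is illustrative, and the selection step is re-implemented inline so the sketch is self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

def run_loop(n_iterations: int = 3, batch_size: int = 20, seed: int = 0) -> int:
    """Seed with 50 random labels, then run least-confident selection rounds."""
    X, y = make_classification(n_samples=500, n_features=10, random_state=seed)
    rng = np.random.default_rng(seed)
    labelled = list(rng.choice(len(X), size=50, replace=False))
    pool = [i for i in range(len(X)) if i not in set(labelled)]
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_iterations):
        model.fit(X[labelled], y[labelled])
        # Least-confident selection from the remaining pool
        max_probs = model.predict_proba(X[pool]).max(axis=1)
        picked = np.argsort(max_probs)[:batch_size]
        # "Annotate" by looking up the true label; pop from the end
        # so earlier pool indices stay valid
        for j in sorted(picked, reverse=True):
            labelled.append(pool.pop(j))
    return len(labelled)
```

After three rounds of 20, the labelled set grows from 50 to 110 examples.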
SECTION 06

When to Use

Active learning is most valuable when: annotation is expensive (expert labellers), the unlabelled pool is large (10K+), and you have budget for 1,000–10,000 labels total. Avoid when: labelling is cheap and fast (crowd annotation <$0.01/sample), you need full coverage (compliance requirements), or the feature space is small enough that random sampling covers it well.

SECTION 07

Uncertainty Sampling Strategies

Active learning works best when you can quantify uncertainty. Entropy-based selection (calculating prediction probability distribution entropy) is computationally cheap and works well for classification. Margin sampling — selecting instances where the model is least confident about its top prediction — is even faster. For regression or more complex tasks, deeper uncertainty estimates like Bayesian active learning by disagreement (BALD) or ensemble methods provide better signal but require training multiple models in parallel.

Strategy            | Uncertainty Type       | Computational Cost | Best For
--------------------|------------------------|--------------------|-------------------
Entropy             | Prediction confidence  | Low                | Classification
Margin Sampling     | Margin between top 2   | Low                | Binary/multi-class
Query-by-Committee  | Ensemble disagreement  | High               | Deep uncertainty
BALD                | Joint entropy          | Medium             | High-stakes tasks

import numpy as np

def uncertainty_sampling(model, unlabeled_data, n_samples=100):
    """Select samples with highest prediction entropy."""
    predictions = model.predict_proba(unlabeled_data)
    # Small epsilon avoids log(0) on fully confident predictions
    entropy = -np.sum(predictions * np.log(predictions + 1e-10), axis=1)
    top_idx = np.argsort(entropy)[-n_samples:]
    return unlabeled_data[top_idx], entropy[top_idx]

SECTION 08

Cost-Benefit Analysis

The key trade-off in active learning is labeling cost vs. model improvement. Budget-constrained scenarios require a cost-benefit calculation: Is the improvement from 100 carefully selected examples worth more than from 500 random examples? Empirically, reported reductions in labeling requirements range from roughly 30% to 80% depending on the task, but active learning introduces overhead in training, inference, and selection. Large datasets where random sampling is cheap often outperform active learning despite worse per-example efficiency. For domain-specific tasks with scarce labels (medical imaging, specialized NLP), active learning becomes essential.

The active learning paradox is that uncertainty often doesn't correlate with usefulness. A model might be uncertain about an edge case that teaches nothing new, yet confident about a common example whose label would actually improve the decision boundary. Pool-based sampling (rank unlabeled data by uncertainty and label the top-k) is the simplest and most popular approach, but it has blind spots. Query synthesis (generate new examples designed to be maximally informative) is more principled but requires careful design—naive synthesis generates unrealistic examples. Diversity-based sampling balances exploration (uncertain examples) with coverage (diverse examples from different regions of the data space). In practice, a hybrid combining entropy (uncertainty) and diversity (embedding-based clustering) often outperforms either alone.
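One way to realize such a hybrid: build an entropy shortlist, then cluster the shortlist and keep one pick per cluster. A minimal sketch; `shortlist_factor` and the highest-entropy-per-cluster rule are illustrative knobs, not a prescribed recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

def hybrid_sample(probs: np.ndarray, X: np.ndarray, n_samples: int = 10,
                  shortlist_factor: int = 5, random_state: int = 0) -> list[int]:
    """Entropy shortlist, then one pick per k-means cluster for diversity."""
    # Uncertainty score: prediction entropy (epsilon avoids log(0))
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    shortlist = np.argsort(entropy)[-n_samples * shortlist_factor:]
    # Cluster only the uncertain candidates, not the whole pool
    km = KMeans(n_clusters=n_samples, n_init=10, random_state=random_state)
    labels = km.fit_predict(X[shortlist])
    picks = []
    for c in range(n_samples):
        members = shortlist[labels == c]
        # Within each cluster, keep the highest-entropy member
        picks.append(int(members[np.argmax(entropy[members])]))
    return picks
```

The batch stays uncertain (everything comes from the shortlist) while the clustering step prevents the redundancy problem described above.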

The cold-start problem affects active learning: early rounds have few labeled examples, so model uncertainty estimates are unreliable. Your first 50 labels should be sampled strategically—randomly or via stratified sampling to cover the input space, not by uncertainty (the model hasn't learned enough). Some systems use a hybrid: random sampling for the first 10%, then switch to active learning. Another approach is transfer learning—initialize on a related task, then apply active learning on your task. This gives you better uncertainty estimates from day one. The speed of improvement also matters: active learning shines when labeling is expensive (medical imaging) or feedback loops are tight (recommender systems). For cheap labeling (crowdsourcing $0.05 per label), the overhead of active learning might outweigh the savings.
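The random-then-active switch can be expressed as a small scheduling rule. A sketch; the 10% warmup threshold is the illustrative figure from above, and `max_probs` is assumed to be the model's top-class probability per pool example:

```python
import numpy as np

def select_batch(round_idx: int, n_labelled: int, budget: int,
                 max_probs: np.ndarray, batch_size: int,
                 warmup_fraction: float = 0.1, seed: int = 0) -> np.ndarray:
    """Random sampling until warmup_fraction of the budget is spent,
    then least-confident selection."""
    if n_labelled < warmup_fraction * budget:
        # Cold start: uncertainty estimates are unreliable, sample at random
        rng = np.random.default_rng(seed + round_idx)
        return rng.choice(len(max_probs), size=batch_size, replace=False)
    # Warmed up: lowest top-class probability first
    return np.argsort(max_probs)[:batch_size]
```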

Implementation details impact performance significantly. The choice of baseline model (logistic regression vs. neural network) changes optimal sampling. Neural networks are overconfident on out-of-distribution examples, so entropy-based uncertainty is biased toward OOD points. Bayesian approaches (MC dropout, ensemble methods) provide better calibrated uncertainties. Batch selection (label 50 examples at once) requires different strategies than single-example selection—greedy batch selection can lead to redundancy. Recent work on learning to select (meta-learning what kinds of examples are informative) shows promise but requires careful tuning. For production systems, simplicity often wins: entropy-based sampling with periodic retraining is fast, interpretable, and effective for most tasks.

Beyond single-model uncertainty, ensemble-based methods provide complementary signals. Query-by-Committee uses multiple models (or model checkpoints) and selects examples where they disagree most. This identifies examples the ensemble is uncertain about. Bayesian approaches use dropout (Monte Carlo dropout) to approximate Bayesian inference: run the model multiple times with dropout enabled, compute prediction variance, select high-variance examples. These are more expensive than single-model entropy (you need multiple forward passes) but often more informative. Empirically, ensemble disagreement correlates better with actual model errors than single-model entropy. The downside: you're training multiple models or maintaining checkpoints, which adds complexity. For budget-constrained scenarios, single-model entropy with careful calibration often outperforms expensive ensemble methods.
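Monte Carlo dropout needs a deep-learning framework, but the same variance signal can be sketched with a bootstrap ensemble of scikit-learn classifiers — an illustrative stand-in for MC dropout, not the technique itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_variance_sample(X_train, y_train, X_pool, n_models: int = 5,
                             n_samples: int = 10, seed: int = 0) -> list[int]:
    """Train on bootstrap resamples; select pool points with highest
    predictive variance across the ensemble (binary classification)."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_models):
        # Bootstrap resample; assumes both classes appear in every resample
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        m = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        probs.append(m.predict_proba(X_pool)[:, 1])  # P(class 1) per member
    # Disagreement signal: variance of P(class 1) across ensemble members
    variance = np.var(np.stack(probs), axis=0)
    return np.argsort(variance)[-n_samples:][::-1].tolist()
```

High-variance points are exactly those where the members' decision boundaries diverge, which is the disagreement signal the paragraph describes.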

Another dimension: expected model change (how much would labeling this example improve the model?). Gradient-based active learning computes loss gradients for unlabeled examples and selects those with largest magnitude gradients. The intuition: examples with large gradients would significantly update model parameters if labeled and added to the training set. This is computationally expensive (gradients for all unlabeled examples) but provides a principled notion of informativeness. In practice, approximate methods (compute gradients for a subset, score the rest via k-NN) reduce compute. Learning loss prediction (meta-learn to predict which examples are hard) is an emerging approach: train an auxiliary network to predict whether the model will make errors on unlabeled examples. This bypasses the need to compute loss on unlabeled data. These advanced methods require careful implementation and tuning; for many practitioners, simple entropy-based sampling provides 80% of the value with 20% of the complexity.
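For binary logistic regression, expected gradient length has a closed form that avoids per-example backprop entirely: the gradient for one example is (p − y)·x, so taking the expectation over the unknown label y gives p·(1−p)·‖x‖ + (1−p)·p·‖x‖ = 2p(1−p)·‖x‖. A minimal sketch (the function name and signature are illustrative):

```python
import numpy as np

def expected_gradient_length(probs_pos: np.ndarray, X: np.ndarray,
                             n_samples: int = 10) -> list[int]:
    """EGL for binary logistic regression: E_y‖(p − y)x‖ = 2p(1 − p)‖x‖."""
    egl = 2 * probs_pos * (1 - probs_pos) * np.linalg.norm(X, axis=1)
    # Largest expected update first
    return np.argsort(egl)[-n_samples:][::-1].tolist()
```

Note the p(1−p) factor is maximized at p = 0.5, so for equal-norm features this reduces to uncertainty sampling — the gradient view adds information only when feature magnitudes differ.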

Real-world active learning often breaks down due to data distribution shift. Your labeled training set comes from early, high-uncertainty examples; your test set comes from a different distribution. This causes active learning to overfit to the uncertainty sampling strategy. Mitigation: include some random examples in your labeling queue (exploratory sampling) to reduce distribution shift. A common ratio: 70% uncertainty-based, 30% random. This biases toward uncertainty but maintains diversity. Another approach: stratified active learning (ensure each labeled batch covers the full input distribution), or clustering-based strategies (label examples from different clusters). Real-world pipelines often use hybrid heuristics: uncertainty + recency (prioritize recently arrived examples) + diversity. The best strategy is task-dependent; if you have compute, A/B test different selection strategies and pick the winner empirically.
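The 70/30 mix described above fits in a few lines. A sketch, again assuming `max_probs` holds the model's top-class probability per pool example:

```python
import numpy as np

def mixed_batch(max_probs: np.ndarray, batch_size: int,
                uncertainty_fraction: float = 0.7, seed: int = 0) -> list[int]:
    """Fill a batch with ~70% least-confident picks and ~30% random exploration."""
    n_uncertain = int(round(batch_size * uncertainty_fraction))
    # Exploitation: lowest top-class probability first
    uncertain = np.argsort(max_probs)[:n_uncertain]
    # Exploration: uniform draw from everything not already picked
    rng = np.random.default_rng(seed)
    remaining = np.setdiff1d(np.arange(len(max_probs)), uncertain)
    random_picks = rng.choice(remaining, size=batch_size - n_uncertain, replace=False)
    return [int(i) for i in uncertain] + [int(i) for i in random_picks]
```

The random slice acts as the exploratory sampling that keeps the labeled set from drifting toward the uncertainty strategy's blind spots.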