Encoder-only models are bidirectional transformers that read the entire input at once: every token attends to every other token, with no masking. They are optimal for classification, NER, sentence embeddings, and other tasks that require understanding rather than generation.
The key architectural difference is masking. In a decoder-only model (GPT, Llama), causal masking ensures token i can attend only to itself and earlier tokens — necessary for autoregressive generation (you can't look at future tokens when predicting them). In an encoder-only model (BERT, RoBERTa), there is no causal mask: every token attends to every other token in both directions simultaneously.
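As a minimal sketch in plain PyTorch (not tied to any particular model), the two regimes differ only in which attention scores are blanked out before the softmax:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Decoder-style: position i may attend only to positions <= i (lower triangle).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Encoder-style: every position attends to every position.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

scores = torch.randn(4, 4)  # raw attention scores for a 4-token sequence
causal = scores.masked_fill(~causal_mask(4), float("-inf")).softmax(dim=-1)
full = scores.masked_fill(~bidirectional_mask(4), float("-inf")).softmax(dim=-1)

print(causal[0])  # token 0 attends only to itself: [1., 0., 0., 0.]
print(full[0])    # token 0 attends to all four positions
```

In the causal case the upper triangle of the attention matrix is exactly zero; in the bidirectional case every entry is positive.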
This bidirectionality gives encoder models richer representations of each token — they encode context from both sides. "Bank" in "river bank" vs "bank account" will have very different representations in BERT because the model sees the full sentence context when computing each token's representation.
The trade-off: encoder-only models can't generate text autoregressively. They're encoders — they transform an input sequence into contextualised representations. Those representations are fed to task-specific heads for classification, span extraction, or similarity scoring.
BERT-base: 12 transformer layers, 12 attention heads, d_model=768, 110M parameters. BERT-large: 24 layers, 16 heads, d_model=1024, 340M parameters.
Special tokens: [CLS] prepended to every input (its final representation used for sentence-level classification), [SEP] separating sentence pairs, [MASK] for masked language model pre-training.
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "The capital of France is Paris."
inputs = tokenizer(text, return_tensors="pt")
print(inputs.keys())          # input_ids, attention_mask, token_type_ids
print(inputs["input_ids"])    # [101, 1996, 3007, ..., 102] (101=[CLS], 102=[SEP])

with torch.no_grad():
    outputs = model(**inputs)

# Two main outputs:
last_hidden = outputs.last_hidden_state  # (1, seq_len, 768) — all token representations
pooler_out = outputs.pooler_output       # (1, 768) — [CLS] token through a linear+tanh
```
Masked Language Modelling (MLM): 15% of tokens are randomly selected. Of these: 80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged. The model must predict the original token for masked positions. This forces the model to use bidirectional context to reconstruct masked tokens.
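The 80/10/10 corruption scheme can be sketched in plain PyTorch. This mirrors the standard recipe; `mask_tokens` and its arguments are illustrative names, not a library API:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15) -> tuple[torch.Tensor, torch.Tensor]:
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100  # positions the loss should ignore

    # 80% of selected positions -> [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% -> a random token (net 10% of selected)
    randomised = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)
    input_ids[randomised] = torch.randint(vocab_size, input_ids.shape)[randomised]

    # The last 10% of selected positions are left unchanged.
    return input_ids, labels
```

A real implementation would also exclude special tokens ([CLS], [SEP], padding) from selection; that bookkeeping is omitted here for clarity.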
```python
from transformers import BertForMaskedLM, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Mask a word and predict it
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0][1]

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

top5 = logits[0, mask_idx].topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"  {tokenizer.decode([idx])}: {score:.2f}")
# paris: 14.3, lyon: 10.1, london: 9.8, ...
```
Next Sentence Prediction (NSP): predict whether sentence B follows sentence A. Later research (RoBERTa, 2019) showed NSP hurts more than it helps — it was dropped in most subsequent models. RoBERTa trains only with MLM, longer sequences, and more data.
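For reference, sentence pairs are packed into a single sequence with segment (token type) ids distinguishing the two sentences. A sketch of that packing, using BERT's special token ids and made-up word ids:

```python
CLS, SEP = 101, 102  # BERT's [CLS] and [SEP] token ids

def pack_pair(a_ids: list[int], b_ids: list[int]) -> tuple[list[int], list[int]]:
    # [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [CLS] + a_ids + [SEP] + b_ids + [SEP]
    # Segment 0 covers "[CLS] A [SEP]"; segment 1 covers "B [SEP]"
    token_type_ids = [0] * (len(a_ids) + 2) + [1] * (len(b_ids) + 1)
    return input_ids, token_type_ids

ids, types = pack_pair([1996, 3007], [2003, 3000])
print(ids)    # [101, 1996, 3007, 102, 2003, 3000, 102]
print(types)  # [0, 0, 0, 0, 1, 1, 1]
```

Calling `tokenizer(sentence_a, sentence_b)` in transformers produces the same layout automatically; the `token_type_ids` key in earlier examples is exactly this segment vector.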
```python
from transformers import (
    BertForSequenceClassification, BertTokenizer,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import numpy as np

# Load BERT with a classification head (linear layer on top of the [CLS] token)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary classification
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load and tokenise dataset
dataset = load_dataset("imdb")

def tokenise(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

dataset = dataset.map(tokenise, batched=True)
dataset = dataset.rename_column("label", "labels")

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": (predictions == eval_pred.label_ids).mean()}

# Training
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,  # key: small lr for fine-tuning
    warmup_ratio=0.1,
)
trainer = Trainer(
    model=model, args=args,
    train_dataset=dataset["train"], eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```
```python
import torch
from transformers import AutoTokenizer, AutoModel

# For embeddings, use a dedicated sentence transformer or mean-pool BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def get_embeddings(texts: list[str], batch_size: int = 32) -> torch.Tensor:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean pooling over non-padding tokens (better than CLS for similarity)
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
        all_embeddings.append(embeddings)
    return torch.cat(all_embeddings, dim=0)

texts = ["Paris is the capital of France.", "France's capital city is Paris."]
embs = get_embeddings(texts)
cos_sim = torch.nn.functional.cosine_similarity(embs[0:1], embs[1:2])
print(f"Similarity: {cos_sim.item():.3f}")  # ~0.97

# For production embeddings, prefer sentence-transformers/all-MiniLM-L6-v2:
# trained specifically for semantic similarity — much better than raw BERT
```
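To see why the attention-mask weighting matters, here is the same pooling arithmetic on toy tensors (no model needed): padded positions contribute nothing to the mean.

```python
import torch

# Toy "token embeddings": batch of 2 sequences, 3 tokens each, hidden size 2.
hidden = torch.tensor([[[1., 1.], [3., 3.], [5., 5.]],
                       [[2., 4.], [4., 8.], [9., 9.]]])
# The second sequence ends with one padding token (mask = 0).
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]]).unsqueeze(-1).float()

pooled = (hidden * mask).sum(1) / mask.sum(1)
print(pooled)
# Row 0: mean of all 3 tokens  -> [3., 3.]
# Row 1: mean of first 2 only  -> [3., 6.] (the [9., 9.] pad vector is ignored)
```

A naive `hidden.mean(1)` would let the padding vectors pull the second embedding toward whatever the model happens to emit at pad positions.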
Choose encoder-only when your task is understanding, not generation:
Use encoder-only for: text classification (sentiment, intent, topic), named entity recognition, question answering (extracting spans), semantic similarity / duplicate detection, dense retrieval (bi-encoder architecture), cross-lingual NLP (mBERT, XLM-R).
Use decoder-only for: text generation, chatbots, summarisation, translation (with large enough models), tasks where the output is a full sequence rather than a label or span.
Use encoder-decoder for: seq2seq tasks where you need to generate a structured output from an input — translation, abstractive summarisation, code generation from a spec (T5, BART, FLAN-T5).
For RAG pipelines, encoder-only models (especially fine-tuned bi-encoders like E5, BGE) produce significantly better embeddings than using an LLM's hidden states, at a fraction of the cost.
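The bi-encoder retrieval step itself reduces to a matrix product over L2-normalised vectors. A sketch with stand-in embeddings (in practice these would come from a model like BGE or E5; `top_k_docs` is an illustrative name):

```python
import torch

def top_k_docs(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 3):
    # Normalise so the dot product equals cosine similarity.
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    d = torch.nn.functional.normalize(doc_embs, dim=-1)
    scores = d @ q  # (num_docs,)
    return scores.topk(min(k, len(d)))

# Stand-in 4-dim embeddings for three documents and one query.
docs = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
query = torch.tensor([1.0, 0.05, 0.0, 0.0])
scores, indices = top_k_docs(query, docs, k=2)
print(indices)  # docs 0 and 1 are closest to the query
```

At production scale, the brute-force `d @ q` is replaced with an approximate nearest neighbour index (FAISS, HNSW), but the scoring function is the same.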
BERT has a hard 512-token limit. The learned positional embeddings have exactly 512 positions. Sequences longer than 512 must be truncated, chunked, or handled with a sliding window approach. For long documents, use Longformer, BigBird (sparse attention), or a decoder-only model with large context.
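A sliding-window chunker over token ids might look like this (the window and stride values are illustrative; 510 leaves room for [CLS] and [SEP] within the 512 limit):

```python
def sliding_window_chunks(token_ids: list[int], window: int = 510,
                          stride: int = 384) -> list[list[int]]:
    # Overlapping strides (stride < window) preserve context that would
    # otherwise be cut off at chunk boundaries.
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

ids = list(range(1200))  # a "document" of 1200 token ids
chunks = sliding_window_chunks(ids)
print([len(c) for c in chunks])  # [510, 510, 432] — overlapping windows
```

Each chunk is then encoded separately, with per-chunk predictions aggregated (e.g. max or mean over chunk scores) for a document-level result.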
Fine-tuning with too high a learning rate causes catastrophic forgetting. The BERT weights encode rich pre-trained representations. A learning rate above 5e-5 often causes the model to forget its pre-training and fit only the fine-tuning data, degrading generalisation. Use 1e-5 to 5e-5 with warmup.
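The conventional schedule — linear warmup to the peak rate, then linear decay to zero — can be written directly. This is a sketch; `Trainer` applies an equivalent schedule internally via `warmup_ratio`:

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_ratio: float = 0.1) -> float:
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0 over the remaining steps.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)

print(lr_at_step(50, 1000))    # mid-warmup: half of peak_lr
print(lr_at_step(100, 1000))   # end of warmup: peak_lr
print(lr_at_step(1000, 1000))  # end of training: 0.0
```

The warmup phase matters because the randomly initialised classification head produces large, noisy gradients early on; ramping up slowly prevents those gradients from disrupting the pre-trained encoder weights.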
Pooler output (CLS + linear) is not always the best representation. The pooler_output BERT returns is designed for NSP classification, not general sentence similarity. For semantic similarity tasks, mean pooling of all token representations consistently outperforms using pooler_output directly.
| Model | Parameters | Strengths | Best Task |
|---|---|---|---|
| BERT-base | 110M | General NLP baseline | Text classification, NER |
| RoBERTa-large | 355M | Better pretraining, more robust | Classification, sentiment |
| DeBERTa-v3-large | 304M | Disentangled attention, best accuracy | NLI, extractive QA |
| all-MiniLM-L6-v2 | 22M | Very fast, good for embeddings | Semantic similarity, retrieval |
| BGE-large-en | 335M | State-of-the-art embeddings | Dense retrieval, RAG |
Encoder-only models remain the best choice for high-throughput classification and embedding workloads: at comparable accuracy on classification tasks, they are far cheaper to run than decoder-only LLMs. A fine-tuned DeBERTa-v3-large can serve orders of magnitude more classification requests per second per GPU than a decoder-only model of equivalent quality, typically at 10-50x lower cost per request. Always benchmark a fine-tuned encoder-only baseline before defaulting to a decoder-only LLM for classification tasks.
The pretraining objective of encoder-only models — masked language modeling (MLM) — creates representations that are deeply contextual in both directions simultaneously. Unlike autoregressive models that build representations left-to-right, BERT-style models see the full sentence context when encoding each token. This bidirectional attention is what makes encoder models exceptionally strong for tasks requiring holistic understanding of a passage, such as determining whether a claim is supported by a document or extracting all entities from a paragraph.
Sentence transformers extend encoder-only architectures with a pooling layer that collapses variable-length token sequences into fixed-size dense vectors. Mean pooling over all token embeddings (rather than using the [CLS] token alone) typically produces more stable and semantically rich sentence representations. These dense vectors support efficient approximate nearest neighbor search at billion-scale using FAISS or similar indices, making sentence transformers the backbone of most semantic search and retrieval-augmented generation systems in production.
Fine-tuning encoder-only models for downstream classification tasks is computationally cheap compared to training from scratch. Adding a linear classification head on top of the frozen or partially unfrozen encoder and training for 3–5 epochs on a few thousand labeled examples routinely achieves strong performance. The key hyperparameter decisions are learning rate (1e-5 to 5e-5 is the typical range) and whether to freeze the lower encoder layers during early training to prevent catastrophic forgetting of pre-trained representations.
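Freezing the lower encoder layers amounts to turning off gradients for parameters whose names match those layers. A sketch, using the `encoder.layer.{i}` naming that Hugging Face BERT models follow, demonstrated on a stand-in module (`freeze_lower_layers` and `Toy` are illustrative names):

```python
import torch.nn as nn

def freeze_lower_layers(model: nn.Module, n_frozen: int) -> None:
    # Disable gradients for the embeddings and the first n_frozen encoder layers.
    for name, param in model.named_parameters():
        if name.startswith("embeddings."):
            param.requires_grad = False
        for i in range(n_frozen):
            if name.startswith(f"encoder.layer.{i}."):
                param.requires_grad = False

# Stand-in module with BERT-like parameter names (4 "encoder layers").
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(10, 4)
        self.encoder = nn.Module()
        self.encoder.layer = nn.ModuleList([nn.Linear(4, 4) for _ in range(4)])

model = Toy()
freeze_lower_layers(model, n_frozen=2)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only encoder.layer.2 and encoder.layer.3 remain trainable
```

Frozen parameters receive no gradient updates, so the optimiser only needs state for the upper layers and the classification head — cutting memory use as well as the risk of catastrophic forgetting.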