Encoder-decoder transformers for seq2seq tasks: translation, summarisation, question answering. T5 frames all NLP as text-to-text. The architecture that powers most production translation and summarisation systems.
Encoder-decoder transformers consist of two stacks. The encoder processes the full source sequence with bidirectional self-attention, producing contextualised representations for every input token. The decoder generates the output sequence autoregressively: at each step, it attends to its own previously generated tokens (via causal self-attention) and to the encoder's output (via cross-attention). This two-stage design naturally fits tasks where you read an entire input before generating output: translation, summarisation, question answering.
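The data flow above can be sketched with PyTorch's nn.MultiheadAttention (shapes only; a real decoder block adds layer norms, feed-forward sublayers, and residual connections):

```python
import torch
import torch.nn as nn

d_model, src_len, tgt_len = 64, 10, 7
enc_out = torch.randn(1, src_len, d_model)   # encoder output: bidirectional context per source token
tgt = torch.randn(1, tgt_len, d_model)       # decoder states for the tokens generated so far

self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

# Causal mask: decoder position i may only attend to positions <= i
causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
h, _ = self_attn(tgt, tgt, tgt, attn_mask=causal)

# Cross-attention: queries come from the decoder, keys/values from the encoder
out, weights = cross_attn(h, enc_out, enc_out)
print(out.shape, weights.shape)   # (1, 7, 64) and (1, 7, 10): each target position attends over all 10 source positions
```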
T5 (Raffel et al. 2020, Google) reframes all NLP tasks as text-to-text: both input and output are always text strings. Translation: "translate English to German: The cat sat on the mat" → "Die Katze saß auf der Matte". Summarisation: "summarize: [long article]" → "[summary]". Classification: "sst2 sentence: This movie was great" → "positive". This unification allows training a single model on diverse tasks with a shared format, which proved highly effective — T5-11B set state-of-the-art on many benchmarks in 2020.
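The shared format is just string concatenation with a task prefix; a sketch (format_example is an illustrative helper, not part of any library):

```python
def format_example(task_prefix: str, text: str) -> str:
    """Prepend a T5-style task prefix: every task becomes string -> string."""
    return f"{task_prefix}: {text}"

# One training set, many tasks, one loss (cross-entropy over output tokens)
pairs = [
    (format_example("translate English to German", "The cat sat on the mat"),
     "Die Katze saß auf der Matte"),
    (format_example("sst2 sentence", "This movie was great"), "positive"),
]
print(pairs[0][0])  # translate English to German: The cat sat on the mat
```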
T5 uses relative position biases instead of sinusoidal PE or learned absolute positions. A learned bias b(i-j) is added to attention scores, where i-j is the relative distance between query and key. Distances beyond a threshold are bucketed together. This is simpler than RoPE but similar in spirit — attention scores depend on relative rather than absolute positions, giving better generalisation to unseen lengths.
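A minimal scalar re-implementation of the bucketing scheme, following the logic of T5's relative position buckets (defaults are illustrative; T5 does this vectorised over the whole attention matrix):

```python
import math

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128, bidirectional=True):
    """Map a relative distance (key_pos - query_pos) to a bucket index.

    Half the buckets index negative distances and half positive (bidirectional
    case); small distances get exact buckets, larger ones share log-spaced
    buckets, and everything beyond max_distance falls into the last bucket.
    """
    bucket = 0
    if bidirectional:
        num_buckets //= 2
        if rel_pos > 0:
            bucket += num_buckets          # positive distances use the upper half
        rel_pos = abs(rel_pos)
    else:
        rel_pos = max(-rel_pos, 0)
    max_exact = num_buckets // 2
    if rel_pos < max_exact:
        return bucket + rel_pos            # one bucket per distance when close
    # Log-spaced buckets for larger distances, capped at the last bucket
    log_bucket = max_exact + int(
        math.log(rel_pos / max_exact) / math.log(max_distance / max_exact) * max_exact
    )
    return bucket + min(log_bucket, num_buckets - 1)
```

Each bucket indexes a learned scalar bias that is added to the attention score, so two query/key pairs at the same relative distance share one parameter regardless of their absolute positions.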
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from datasets import Dataset

model_name = "t5-small"  # or "t5-base", "t5-large", "google/flan-t5-xl"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Prepare summarisation data — T5 expects a "summarize: " prefix
raw_data = [
    {"input": "summarize: The quick brown fox jumps over the lazy dog. The dog didn't react.",
     "target": "Fox jumps over unresponsive dog."},
]

def tokenize(examples):
    inputs = tokenizer(examples["input"], max_length=512, truncation=True, padding="max_length")
    targets = tokenizer(examples["target"], max_length=64, truncation=True, padding="max_length")
    # Replace padding in the labels with -100 so it is ignored by the loss
    inputs["labels"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in targets["input_ids"]
    ]
    return inputs

dataset = Dataset.from_list(raw_data).map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./t5-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-4,  # T5 fine-tuning typically uses a higher LR than BERT-style models
    save_strategy="epoch",
)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()

# Inference
input_text = "summarize: Large language models are trained on vast text corpora."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Use encoder-decoder when: (1) you have a fixed source that needs to be fully read before generating output (translation, summarisation), (2) you need strong bidirectional context in the input, (3) you're fine-tuning a specialised seq2seq model for a specific task pair.
Use decoder-only (GPT-style) when: (1) you're doing open-ended generation, chat, or instruction following, (2) you want a single model that can handle diverse tasks via prompting, (3) you're working with frontier-scale models (GPT-4, Claude, Llama) which are all decoder-only.
The industry has largely converged on decoder-only for general-purpose LLMs, but encoder-decoder models like NLLB-200 and mBART remain dominant for production translation.
BART (Lewis et al. 2020, Meta): a denoising autoencoder — pre-trained by corrupting text (masking, shuffling, deletion) and training the model to reconstruct the original. Particularly strong at abstractive summarisation, and extended to multilingual tasks as mBART. T5: masked span prediction objective on the C4 corpus; strong across diverse tasks with the text-to-text framing. For summarisation specifically, BART and T5 are competitive; for translation, the encoder-decoder NLLB-200 is state-of-the-art.
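The corruption functions can be sketched in a few lines (simplified: real BART samples span lengths from a Poisson distribution and applies the noise at the subword level):

```python
import random

def token_delete(tokens, p=0.15, rng=random):
    """Randomly drop tokens; the model must infer which positions are missing."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or tokens[:1]   # never return an empty sequence

def span_infill(tokens, span_len=3, mask="<mask>", rng=random):
    """Text infilling: replace one contiguous span with a single <mask> token."""
    if len(tokens) <= span_len:
        return [mask]
    start = rng.randrange(len(tokens) - span_len)
    return tokens[:start] + [mask] + tokens[start + span_len:]

def sentence_permute(sentences, rng=random):
    """Shuffle sentence order; the decoder must restore the original order."""
    out = list(sentences)
    rng.shuffle(out)
    return out
```

In each case the training target is the original uncorrupted text, so the decoder learns to generate fluent output from degraded input — the property that transfers to abstractive summarisation.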
Pass input_ids for the encoder and labels for the decoder target. The library derives decoder_input_ids from the labels and handles teacher forcing automatically; don't construct them manually unless you need custom decoding.

The architectural choice between encoder-decoder and decoder-only transformers has practical implications for which tasks each model type handles most efficiently. Encoder-decoder models process the full input bidirectionally before generating output, enabling richer input representations at the cost of a separate encoding step. Decoder-only models generate autoregressively from the concatenated input and output, handling the same tasks without an architectural separation between input understanding and output generation. The gap between the two has narrowed as decoder-only models have scaled: today's large decoder-only models match or exceed encoder-decoder performance on conditional generation tasks that were previously dominated by T5 and BART.
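For decoder-only models, the same seq2seq supervision is expressed by concatenating source and target into one sequence and masking the loss over the source portion. A sketch with made-up token ids (-100 is the conventional ignore index for cross-entropy):

```python
IGNORE = -100  # positions labelled -100 are excluded from the loss

def build_decoder_only_example(prompt_ids, target_ids, eos_id=1):
    """Concatenate prompt and target; compute loss only on the target tokens."""
    input_ids = prompt_ids + target_ids + [eos_id]
    labels = [IGNORE] * len(prompt_ids) + target_ids + [eos_id]
    return input_ids, labels

ids, labels = build_decoder_only_example([101, 7, 8, 9], [42, 43])
print(ids)     # [101, 7, 8, 9, 42, 43, 1]
print(labels)  # [-100, -100, -100, -100, 42, 43, 1]
```

The model still attends over the prompt tokens causally; it just receives no gradient for predicting them — the functional analogue of the encoder-decoder split.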
| Architecture | Examples | Best for | Key limitation |
|---|---|---|---|
| Encoder-decoder | T5, BART, mT5 | Translation, summarization, structured generation | Higher memory, two forward passes |
| Decoder-only | GPT-4, Llama, Mistral | General text generation, chat, instruction following | Less efficient for classification |
| Encoder-only | BERT, RoBERTa | Classification, NER, embedding | Cannot generate text |
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# T5 uses task-prefix prompting
input_text = "summarize: The stock market fell sharply today amid concerns about inflation."
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Encoder processes the input once; the decoder generates the summary token by token
outputs = model.generate(**inputs, max_new_tokens=100, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
T5's text-to-text formulation unifies all NLP tasks under a single architecture by framing every task as a string-to-string transformation. Classification tasks become generation tasks with class labels as target strings, question answering becomes generation conditioned on the question and context, and translation becomes generation in the target language. This unification simplifies multi-task learning because all tasks share the same loss function (cross-entropy on output tokens) and the same training loop, enabling a single fine-tuning job to improve performance across multiple tasks simultaneously rather than training separate task-specific heads.
BART's denoising pre-training involves corrupting input text with various noise functions — token masking, token deletion, sentence permutation, text infilling — and training the model to reconstruct the original text. This denoising objective makes BART particularly well-suited for text generation tasks that involve transformation of noisy or partial input, including document summarization from partial information, dialogue generation from sparse context, and question generation from passages. Fine-tuned BART models for summarization held state-of-the-art positions on CNN/DailyMail benchmarks for an extended period, establishing the effectiveness of denoising pre-training for abstractive summarization.
Conditional generation quality with encoder-decoder models benefits from beam search decoding, which maintains a set of partial hypotheses at each generation step and selects the globally highest-scoring complete sequence. Beam widths of 4–6 typically provide most of the quality improvement over greedy decoding with acceptable latency overhead for summarization and translation tasks. Length penalty parameters control the tendency of beam search to prefer short sequences — increasing the length penalty (alpha > 1.0) encourages longer outputs, which is important for tasks requiring comprehensive summaries rather than concise extractions.
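The effect of the length penalty can be seen on a toy model — a three-token Markov chain standing in for the decoder's next-token distribution. This is a sketch of the scoring rule, not production beam search:

```python
import math

# Toy "language model": P(next token | last token)
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"<eos>": 0.9, "a": 0.1},
    "b":   {"<eos>": 0.5, "b": 0.5},
}

def beam_search(beam_size=3, max_len=4, alpha=0.0):
    """Best sequence under score = log P(seq) / len(seq)**alpha.

    Log-probs are negative, so alpha > 0 penalises short hypotheses less
    harshly than it would appear: longer sequences divide by a larger factor
    and end up with scores closer to zero.
    """
    alive = [(["<s>"], 0.0)]           # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in alive:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        alive = []
        for tokens, logp in sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]:
            (finished if tokens[-1] == "<eos>" else alive).append((tokens, logp))
    finished.extend(alive)             # force-finish anything still alive
    best, _ = max(finished, key=lambda h: h[1] / (len(h[0]) - 1) ** alpha)
    return best[1:]                    # drop <s>
```

With alpha = 0 the highest-probability hypothesis wins, which here is the short "a <eos>"; raising alpha shifts the winner to a longer "b b …" hypothesis, mirroring how length_penalty > 1.0 encourages longer summaries.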
Encoder-decoder model serving in production requires managing two separate execution phases — the encoder pass and the autoregressive decoder pass — which have different computational profiles. The encoder processes the full input in a single parallel pass, well-suited to batching multiple requests together. The decoder generates tokens autoregressively, with each step conditioned on all previous tokens, making it less efficient to batch than the encoder phase. Efficient encoder-decoder serving frameworks cache the encoder output (the key-value representations of the source sequence) and reuse it across all decoder steps, avoiding redundant re-encoding for each generation step.
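The caching pattern can be illustrated with PyTorch's generic transformer modules. This is a toy loop that feeds decoder states straight back as the next input; a real serving stack would embed sampled token ids and also cache decoder keys/values:

```python
import torch
import torch.nn as nn

d_model = 32
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

src = torch.randn(1, 12, d_model)   # embedded source sequence
memory = encoder(src)               # ONE parallel encoder pass, computed once

tgt = torch.randn(1, 1, d_model)    # running decoder input (e.g. a BOS embedding)
for _ in range(5):                  # every autoregressive step reuses the cached memory
    L = tgt.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    out = decoder(tgt, memory, tgt_mask=causal)
    tgt = torch.cat([tgt, out[:, -1:, :]], dim=1)  # toy feedback in place of embedding a sampled token

print(tgt.shape)  # torch.Size([1, 6, 32])
```

The key point is that `memory` (and hence the cross-attention keys/values derived from it) never changes during decoding, so re-running the encoder per step would be pure waste.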
mT5, the multilingual T5 variant, extends the text-to-text framework to 101 languages with a shared vocabulary of 250,000 SentencePiece tokens. Cross-lingual fine-tuning of mT5 on English task data transfers surprisingly effectively to other languages due to the shared representation space established during multilingual pre-training. For organizations handling documents in multiple languages without the resources to collect task-specific training data in every language, mT5 provides a practical foundation model that can be fine-tuned on English examples and deployed for cross-lingual inference with acceptable quality in the other supported languages.