IA³ — Infused Adapter by Inhibiting and Amplifying Inner Activations: learn three tiny scaling vectors per layer that rescale attention keys, attention values, and FFN intermediate activations. Far fewer parameters than LoRA, with strong few-shot performance.
IA³ (Liu et al. 2022, "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning") is an extremely lightweight PEFT method: learn three scaling vectors per transformer layer — one for the attention keys (l_k), one for the attention values (l_v), and one for the FFN intermediate activations (l_ff). These vectors are element-wise multiplied with the corresponding activations during the forward pass. The base model weights stay frozen; only l_k, l_v, and l_ff are trained.
For a layer with model dimension d_model, the vectors match the activations they scale: l_k ∈ R^{d_k}, l_v ∈ R^{d_v}, l_ff ∈ R^{d_ff}. Attention becomes softmax(Q(l_k ⊙ K)^T / √d_k)(l_v ⊙ V), and the FFN becomes (l_ff ⊙ γ(W_1 x)) W_2, where ⊙ denotes element-wise multiplication and γ the activation function.
All three vectors are initialised to ones (an identity transform), so at the start of training the model behaves identically to the frozen base. The scaling vectors learn to inhibit (values near 0) or amplify (values > 1) specific dimensions, steering which features each layer emphasises for the target task.
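To make the mechanism concrete, here is a minimal numpy sketch of IA³-scaled attention (illustrative only, not the HF implementation; `ia3_attention` and all shapes are toy stand-ins):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ia3_attention(q, k, v, l_k, l_v):
    d = q.shape[-1]
    scores = q @ (l_k * k).T / np.sqrt(d)   # keys rescaled element-wise by l_k
    return softmax(scores) @ (l_v * v)      # values rescaled element-wise by l_v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
ones = np.ones(8)

# With l_k = l_v = 1 (the initialisation), IA3 attention equals vanilla attention.
baseline = ia3_attention(q, k, v, ones, ones)
vanilla = softmax(q @ k.T / np.sqrt(8)) @ v
assert np.allclose(baseline, vanilla)
```

Driving entries of l_k toward zero suppresses those key dimensions entirely, while values above one sharpen their influence on the attention scores.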
Trainable parameters total roughly L × (d_k + d_v + d_ff) for a model with L transformer blocks (decoder blocks with cross-attention add another d_k + d_v each). For T5-XL (3B) this comes to on the order of 100K parameters — about 0.01% of the base model and far fewer than LoRA at typical ranks.
```python
from peft import IA3Config, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

ia3_config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wo"],  # key, value, and FFN output projections
    feedforward_modules=["wo"],       # modules treated as FFN (receive l_ff)
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# trainable params: 98,304 || all params: 2,849,546,240 || trainable%: 0.003%
```
```python
# Training is standard — gradients only flow to the l_k, l_v, l_ff vectors
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./ia3-finetuned",
    num_train_epochs=10,             # IA3 converges fast with so few params
    per_device_train_batch_size=16,
    learning_rate=3e-3,              # higher LR than full fine-tuning
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # your tokenized seq2seq dataset
    eval_dataset=eval_dataset,
)
trainer.train()
```
The T-Few recipe (from the IA³ paper) achieves strong few-shot performance by combining IA³ with a multitask pre-trained T5 base (T0). With only a few dozen labeled examples per task, T-Few outperforms GPT-3 175B few-shot in-context learning on held-out T0 tasks and on the RAFT benchmark — at orders of magnitude less inference compute. The recipe: start from T0 (a T5 model already multitask fine-tuned on a large prompted-task mixture), then train IA³ vectors on each new task's few examples for a small fixed budget (the paper uses 1,000 steps at batch size 8). The tiny parameter count makes overfitting on such small datasets much less of a problem than with LoRA or full fine-tuning.
```python
# T-Few-style setup: IA3 on a seq2seq model (flan-t5-base here for speed)
# pip install peft transformers datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from peft import IA3Config, get_peft_model, TaskType

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wo"],  # keys, values, FFN output
    feedforward_modules=["wo"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: 53,248 || all params: 247,577,856 || trainable%: 0.02%
```
```python
# Few-shot inference: prepend demonstrations directly in the prompt
few_shot_prompt = """Classify sentiment:
Input: The flight was delayed by 4 hours. Label: negative
Input: Great legroom and friendly crew. Label: positive
Input: {test_input} Label:"""

inputs = tokenizer(
    few_shot_prompt.format(test_input="The meal was surprisingly good."),
    return_tensors="pt"
)
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # e.g. "positive"
```
IA³ is the right choice when: (1) you have very few training examples (<100) and need to avoid overfitting; (2) you're running many task-specific adapters and need the smallest possible per-task overhead; (3) you're working with T5/encoder-decoder models where the T-Few recipe gives strong results out of the box. For general instruction tuning with 1000+ examples, LoRA typically performs better.
Two practical tips: if PEFT reports no matching modules, inspect model.named_modules() to find the right target names for your architecture; and call model = model.merge_and_unload() after training to bake the scaling vectors into the weight matrices, which eliminates the runtime multiplication overhead and lets you serve the model as a standard transformer.

Parameter-efficient fine-tuning methods differ in where they inject trainable parameters, how many parameters they add, and which adaptation types they suit best. IA³ adds among the fewest parameters of any common PEFT method, making it attractive for multi-task serving scenarios where many task-specific adapters must be loaded simultaneously. The table below summarizes key tradeoffs across the most commonly used PEFT approaches.
| Method | Trainable params | Inference overhead | Best for |
|---|---|---|---|
| IA³ | ~0.01% of base | Element-wise multiply | Few-shot, multi-task serving |
| LoRA (r=8) | ~0.1–0.5% of base | Low-rank matmul addition | Instruction following, style |
| Prefix tuning | ~0.1% of base | Extended KV cache | Conditional generation |
| Full fine-tune | 100% of base | None | Large datasets, major task shift |
IA³ is particularly well-suited to the T-Few setup, where the IA³ vectors are first pre-trained on the same multitask mixture as the base model and then briefly fine-tuned on each new task's handful of examples. Because each task adds only a few kilobytes of vectors, IA³ is a strong choice for applications requiring rapid task switching without heavyweight per-task adapter weights.
IA³ inference serving with multiple task adapters exploits the element-wise structure of the learned rescaling vectors. Because IA³ adapts the model by multiplying activations by learned vectors rather than adding low-rank matrices, adapter switching requires only replacing small vectors in memory rather than loading new weight matrices. A serving system can maintain adapters for dozens of tasks in CPU memory and copy the small IA³ vectors to GPU before each forward pass with negligible overhead, enabling efficient multi-task serving from a single model replica. This architecture is particularly valuable in settings where many specialized tasks must be served with low per-task memory cost.
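The vector-swap pattern above can be sketched with a toy in-memory registry — `serve`, `registry`, and all shapes here are illustrative names, not a real peft API:

```python
import numpy as np

d_model = 8
base_W = np.eye(d_model)  # stand-in for a frozen base weight matrix

# Per-task IA3 vectors: only d_model floats each, tiny next to base weights.
registry = {
    "sentiment": np.linspace(0.5, 1.5, d_model),
    "summarize": np.ones(d_model),  # identity adapter (untrained init)
}

def serve(task, x):
    l = registry[task]       # per-request "adapter load" is just a vector swap
    return l * (base_W @ x)  # IA3: element-wise rescale of the activation

x = np.ones(d_model)
assert np.allclose(serve("summarize", x), base_W @ x)      # behaves like base
assert not np.allclose(serve("sentiment", x), base_W @ x)  # task-adapted
```

For real models, peft exposes a similar mechanism via `load_adapter` and `set_adapter` on a `PeftModel`, which register multiple named adapters against one base model.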
IA³ gradient flow during fine-tuning is concentrated in the rescaling vectors applied to keys, values, and feedforward outputs, with the base model weights remaining frozen. This concentration means that IA³ training requires very few gradient update steps to converge, and learning rate schedules designed for full fine-tuning must be adjusted. Higher learning rates (1e-3 to 1e-2) and fewer training steps (typically 1,000–5,000 for instruction-following tasks) are appropriate for IA³ compared to LoRA or full fine-tuning. The rapid convergence makes IA³ well-suited for interactive adaptation scenarios where adapters must be trained quickly from small amounts of feedback data.
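The frozen-base gradient flow can be demonstrated with a toy numpy gradient descent — all shapes, the learning rate, and the step count are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))      # frozen base weight
x = rng.normal(size=(8, 32))     # batch of activations (d_model x batch)
target = 2.0 * (W @ x)           # the task wants this layer's output doubled

l = np.ones(8)                   # IA3 vector, initialised to identity
lr = 0.02                        # IA3 tolerates much higher LRs than full FT
W_before = W.copy()
h = W @ x                        # frozen path: computed once, never updated
for _ in range(400):
    y = l[:, None] * h                               # IA3 forward pass
    grad_l = (2.0 * (y - target) * h).mean(axis=1)   # dMSE/dl per dimension
    l -= lr * grad_l                                 # only l is updated

assert np.allclose(W, W_before)         # base weights untouched
assert np.allclose(l, 2.0, atol=1e-2)   # l learned to double each dimension
```

Because the optimisation is over a handful of well-conditioned per-dimension scalars, it converges in far fewer steps than updating full weight matrices would.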
Combining IA³ with quantized base models provides an extremely parameter-efficient deployment configuration. Running a 4-bit quantized base model with IA³ adapters achieves a memory footprint roughly 4x smaller than a 16-bit LoRA-adapted model, with competitive task performance on many NLP benchmarks. This quantized-base-plus-IA³ combination is particularly attractive for edge deployment scenarios where both model size and adaptation quality matter — the quantized base handles the bulk of the parameter cost, while IA³ provides the task-specific behavioral adjustment with minimal additional memory overhead.
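A sketch of that setup, under assumptions (a CUDA GPU with bitsandbytes installed; flan-t5-base and NF4 settings chosen only for illustration — not executed here):

```python
# Config sketch: 4-bit quantized base + IA3 adapters via transformers + peft.
# Requires a CUDA GPU with bitsandbytes installed.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import IA3Config, get_peft_model, prepare_model_for_kbit_training, TaskType

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-base", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # stabilises k-bit training
model = get_peft_model(model, IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wo"],
    feedforward_modules=["wo"],
))
```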
IA³ weight merging into the base model for deployment eliminates the runtime overhead of applying the rescaling vectors at inference time. Because IA³ modifications are element-wise multiplications of existing weight rows, the adapted weights can be computed by multiplying the base weight rows by the IA³ scaling vectors and storing the result as a new set of base weights. This merged deployment requires no adapter loading infrastructure and adds zero inference latency, making it the preferred deployment approach when the adapter will not need to be swapped at runtime.
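The merging identity is easy to verify numerically — rescaling a layer's output activations by l is exactly the same as scaling the rows of its weight matrix by l once, offline (toy shapes below):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 10))       # base projection (out_dim x in_dim)
l = rng.uniform(0.5, 1.5, size=6)  # learned IA3 vector over output dims
x = rng.normal(size=10)

runtime = l * (W @ x)              # adapter applied at inference time
W_merged = l[:, None] * W          # bake l into the weights ("merge")
assert np.allclose(runtime, W_merged @ x)
```

(For vectors applied to a module's input, such as l_ff before the FFN output projection, the merge scales columns instead of rows; peft's `merge_and_unload` handles the bookkeeping.)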
IA³'s strong few-shot results also stem from the T-Few training methodology: the IA³ vectors are pre-trained on a diverse multitask mixture before per-task fine-tuning, and training adds an unlikelihood loss on incorrect answer choices plus length-normalized scoring of candidate outputs. The multitask pre-training teaches the adapter reusable patterns rather than task-specific surface features, so the model adapts more effectively to new tasks from small numbers of examples than one fine-tuned on individual tasks in isolation — making IA³ with the T-Few recipe a strong baseline for low-resource task adaptation.