Google's Gemma 2 family: 2B, 9B, and 27B open models. Uses alternating local/global attention and knowledge distillation from larger models. Punches well above its weight class for small-model inference.
Gemma 2 (Google DeepMind, June 2024) is a family of open-weight models at 2B, 9B, and 27B parameters. All three variants use the same core architecture but differ in depth and width. Context window: 8192 tokens. The models are released with permissive terms for research and commercial use. Gemma 2 improves significantly over Gemma 1, particularly for instruction following and reasoning.
Gemma 2 introduces two key architectural changes over standard transformers: interleaving local sliding-window attention layers with global attention layers, and logit soft-capping to stabilize training. Loading and prompting the instruction-tuned 9B model with Hugging Face transformers looks like this:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "google/gemma-2-9b-it"  # 'it' = instruction-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of each G7 country?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
    )

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Gemma 2 2B is one of the strongest models for on-device and edge deployment. At bfloat16 it requires ~4.5 GB VRAM, and with int4 quantisation fits in ~1.3 GB, runnable on a smartphone with a GPU (Google Pixel 9, iPhone 15 Pro). Despite its small size, Gemma 2 2B-IT outperforms many 7B models on instruction-following benchmarks, thanks to distillation. It's used in Google's AI Edge SDK for on-device inference.
model_name = "google/gemma-2-2b-it"
# For deployment with llama.cpp / Ollama:
# ollama run gemma2:2b
# For mobile: Google AI Edge SDK uses Gemma 2 2B via MediaPipe LLM
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",  # base (non-instruct) checkpoint for fine-tuning
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # LoRA rank
    lora_alpha=32,    # scaling factor
    lora_dropout=0.05,
    # Apply LoRA to all attention and MLP projection matrices
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 39,976,960 || all params: 9,281,839,104 || trainable%: 0.43%
At the 9B scale, Gemma 2 9B outperforms Llama 3 8B on MMLU (71.3% vs 68.4%), MATH (36.6% vs 30.0%), and HumanEval (71.3% vs 62.2%). The 27B model is competitive with much larger models; it was near-state-of-the-art for open models under 30B when released. The 2B model outperforms Llama 3 8B on several benchmarks despite being 4× smaller, demonstrating the effectiveness of knowledge distillation.
Gemma 2's chat format wraps each turn in <start_of_turn>user ... <end_of_turn> tokens. Always use tokenizer.apply_chat_template() for instruct variants.

| Innovation | Purpose | Impact |
|---|---|---|
| Local-global attention | Reduce computation on long sequences | Faster inference, lower memory |
| Rotary embeddings (RoPE) | Better length extrapolation | Generalizes to longer contexts |
| Knowledge distillation | Train from larger teacher (Gemma 27B) | Better performance per parameter |
| Flash Attention | Memory-efficient attention | 2-3x faster attention computation |
| Grouped query attention | Reduce KV cache size | Better batching on inference |
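The grouped-query-attention row can be made concrete with back-of-envelope arithmetic: GQA stores one K/V pair per KV head rather than per query head, shrinking the cache proportionally. The layer and head counts below are illustrative assumptions, not Gemma 2's published configuration:

```python
# KV-cache size: 2 tensors (K and V) per layer, each [kv_heads, seq_len, head_dim]
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 42 layers, head_dim 256, 8192-token context, bf16 cache.
# MHA-style caching keeps all 16 heads; GQA shares each K/V pair across
# 2 query heads, so only 8 KV heads are cached.
mha = kv_cache_bytes(layers=42, kv_heads=16, head_dim=256, seq_len=8192)
gqa = kv_cache_bytes(layers=42, kv_heads=8, head_dim=256, seq_len=8192)
print(f"MHA-style cache: {mha / 1e9:.1f} GB, GQA cache: {gqa / 1e9:.1f} GB")
```

Halving the KV heads halves the cache, which directly increases how many concurrent sequences fit in memory at inference time.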
Gemma 2 model family scaling: Gemma 2 comes in three sizes targeting different deployment tiers: 2B (edge and on-device), 9B (consumer GPUs), and 27B (data center). Each size balances performance, latency, and memory constraints for its deployment target. The 2B variant achieves approximately 20-30% of the 27B variant's capability while using 7.4% of the parameters, making Gemma 2 2B suitable for on-device inference where latency and privacy are critical, such as running AI features directly on phones without cloud calls.
Knowledge distillation was central to Gemma 2's development. The smaller models were trained against the larger 27B model as a teacher, learning from the teacher's full probability distribution over next tokens rather than from one-hot labels. This richer training signal lets the smaller models match models with far more parameters. The technique also scales well: the same 27B teacher can be reused to train students of multiple sizes.
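A minimal sketch of logit distillation in PyTorch, showing the general technique rather than Gemma's actual training code: the student minimizes KL divergence to the teacher's temperature-softened next-token distribution. The temperature value and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Student log-probs and teacher probs, both softened by the temperature
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperature settings
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

torch.manual_seed(0)
student = torch.randn(4, 32)  # [batch, vocab] toy logits
teacher = torch.randn(4, 32)
loss = distillation_loss(student, teacher)
```

In practice this term is often mixed with the ordinary cross-entropy loss on ground-truth tokens, with a weighting hyperparameter between the two.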
Gemma 2's local-global attention pattern cuts the cost of full attention: layers alternate between local sliding-window attention, where each token attends only to a fixed-size window of recent tokens, and global attention layers that attend across the full context. For a fixed window, the local layers scale linearly with sequence length, so this hybrid maintains long-range modeling capability while reducing computation by 30-40% compared to full attention in every layer, making inference faster without sacrificing quality.
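The alternating pattern can be sketched as attention masks. The 4096-token window matches Gemma 2's published local window; the even/odd layer indexing here is illustrative:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    mask = j <= i                    # causal: no attending to future tokens
    if window is not None:
        mask &= (i - j) < window     # local: key must be within the window
    return mask

def mask_for_layer(layer_idx, seq_len, window=4096):
    # Even layers: sliding-window (local); odd layers: full causal (global)
    return causal_mask(seq_len, window if layer_idx % 2 == 0 else None)

local = mask_for_layer(0, seq_len=8, window=4)   # tiny sizes for illustration
global_ = mask_for_layer(1, seq_len=8)
```

In a real implementation the local layers never materialize the full mask; each query only computes scores against its window of keys, which is where the savings come from.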
Compared to similar-sized open models, Gemma 2 consistently achieves 5-10% higher scores on standard benchmarks like MMLU and GSM8K. Gemma 2 9B competes with models 3-4x larger through careful training, architecture, and distillation choices. This efficiency advantage makes Gemma 2 particularly valuable for commercial deployment where both inference cost and model capability matter.
Gemma 2's training data includes more recent information than many contemporaries (up to mid-2024), giving it advantages on current event questions and recently-published content. The engineering team invested heavily in training infrastructure, enabling thousands of experiments that refined every component. Reproducing this quality with 1/1000th the experiments would be extremely difficult.
Fine-tuning Gemma 2 on domain-specific data (legal documents, medical texts, code) can adapt the base model with modest data and compute. Quantized Gemma 2 2B running on consumer GPUs or TPUs (via Google Cloud) makes state-of-the-art AI accessible to small teams. This democratization of AI capability is one of Gemma's stated goals.
Gemma 2's training procedure involved multiple stages: pretraining on diverse web data, supervised fine-tuning on instruction-following data, and reinforcement learning from human feedback (RLHF) to improve alignment. Each stage builds on prior stages, with later stages using smaller, curated datasets. Understanding this multi-stage training helps practitioners design their own fine-tuning and alignment procedures.
Evaluation of Gemma 2 spans academic benchmarks (MMLU, TruthfulQA, BIG-bench), code benchmarks (HumanEval, MBPP), and real-world metrics like user preference scores. Benchmarks have limitations: they don't perfectly correlate with real-world performance. Practitioners should develop domain-specific evaluation metrics beyond standard benchmarks. Gemma 2's benchmark scores provide a starting point, not an end point.
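A domain-specific evaluation can be as simple as exact-match accuracy over a hand-built QA set. This is a generic sketch, not part of any Gemma tooling; `generate_answer` stands in for a real model call such as the generate() snippet earlier, and the data is illustrative:

```python
def exact_match_accuracy(examples, generate_answer):
    """Fraction of examples where the model's answer matches exactly
    (case-insensitive, whitespace-stripped)."""
    hits = sum(
        generate_answer(ex["question"]).strip().lower() == ex["answer"].lower()
        for ex in examples
    )
    return hits / len(examples)

# Toy usage with a stub "model" that always answers "paris":
toy = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Italy?", "answer": "Rome"},
]
acc = exact_match_accuracy(toy, lambda q: "paris")
```

Exact match is deliberately strict; for free-form answers you would typically swap in a fuzzier metric (substring match, an LLM judge, or task-specific scoring).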
Safety considerations in Gemma 2 included training to refuse harmful requests, reduce biases, and provide accurate information. Safety-alignment involves trade-offs: perfectly aligned models might be overly cautious and unhelpful on edge cases. Red-teaming (adversarial testing) identified failure modes before release. Understanding alignment challenges helps practitioners make informed choices about safety-capability trade-offs in their deployments.
Google's decision to open-source Gemma 2 creates an ecosystem of downstream models, fine-tuning tutorials, and community contributions. Developers fine-tune Gemma 2 on code, mathematics, medicine, and other domains. The open-source model enables rapid innovation: researchers can build on Gemma 2 without starting from scratch. This democratization of advanced AI capabilities accelerates research and enables small teams to compete with well-resourced institutions.
Gemma 2's code implementation in JAX and PyTorch is publicly available, enabling researchers to understand architectural details and propose improvements. Seeing exactly how local-global attention and other innovations are implemented demystifies the model. This transparency builds trust and enables principled extensions. Practitioners can implement similar techniques in their own models after studying Gemma 2.
Commercial and non-commercial uses of Gemma 2 span chatbots, code generation assistants, RAG systems, and fine-tuned task-specific models. The licensing model permits commercial use under certain conditions, enabling startups to build on Gemma 2. Understanding Gemma 2's capabilities and limitations helps practitioners decide whether to use it directly or as a foundation for customized models.