Google's Gemma 2 family: 2B, 9B, and 27B open models. Uses alternating local/global attention and knowledge distillation from larger models. Punches well above its weight class for small-model inference.
Gemma 2 (Google DeepMind, June 2024) is a family of open-weight models at 2B, 9B, and 27B parameters. All three variants use the same core architecture but differ in depth and width. Context window: 8192 tokens. The models are released with permissive terms for research and commercial use. Gemma 2 improves significantly over Gemma 1, particularly for instruction following and reasoning.
Gemma 2 introduces two key architectural changes over standard transformers: interleaving local sliding-window attention layers with global attention layers, and logit soft-capping to stabilize training. Loading and prompting the instruction-tuned 9B model with Hugging Face transformers looks like this:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "google/gemma-2-9b-it"  # 'it' = instruction-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of each G7 country?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
    )

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Gemma 2 2B is one of the strongest models for on-device and edge deployment. At bfloat16 it requires ~4.5 GB VRAM, and with int4 quantisation fits in ~1.3 GB, runnable on a smartphone with a GPU (Google Pixel 9, iPhone 15 Pro). Despite its small size, Gemma 2 2B-IT outperforms many 7B models on instruction-following benchmarks, thanks to distillation. It's used in Google's AI Edge SDK for on-device inference.
model_name = "google/gemma-2-2b-it"
# For deployment with llama.cpp / Ollama:
# ollama run gemma2:2b
# For mobile: Google AI Edge SDK uses Gemma 2 2B via MediaPipe LLM
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",  # base (non-instruct) checkpoint for fine-tuning
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,             # LoRA rank
    lora_alpha=32,    # scaling factor
    lora_dropout=0.05,
    # Apply LoRA to all attention and MLP projection matrices
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 39,976,960 || all params: 9,281,839,104 || trainable%: 0.43%
At the 9B scale, Gemma 2 9B outperforms Llama 3 8B on MMLU (71.3% vs 68.4%), MATH (36.6% vs 30.0%), and HumanEval (71.3% vs 62.2%). The 27B model is competitive with much larger models; it was near-state-of-the-art for open models under 30B when released. The 2B model outperforms Llama 3 8B on several benchmarks despite being 4× smaller, demonstrating the effectiveness of knowledge distillation.
Gemma 2's chat format wraps each turn in <start_of_turn>user ... <end_of_turn> tokens. Always use tokenizer.apply_chat_template() for instruct variants.

| Innovation | Purpose | Impact |
|---|---|---|
| Local-global attention | Reduce computation on long sequences | Faster inference, lower memory |
| Rotary embeddings (RoPE) | Better length extrapolation | Generalizes to longer contexts |
| Knowledge distillation | Train from larger teacher (Gemma 27B) | Better performance per parameter |
| Flash Attention | Memory-efficient attention | 2-3x faster attention computation |
| Grouped query attention | Reduce KV cache size | Better batching on inference |
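The grouped-query-attention row can be made concrete with back-of-envelope arithmetic: GQA stores one K/V pair per KV head rather than per query head, shrinking the cache proportionally. The layer and head counts below are illustrative assumptions, not Gemma 2's published configuration:

```python
# KV-cache size: 2 tensors (K and V) per layer, each [kv_heads, seq_len, head_dim]
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 42 layers, head_dim 256, 8192-token context, bf16 cache.
# MHA-style caching keeps all 16 heads; GQA shares each K/V pair across
# 2 query heads, so only 8 KV heads are cached.
mha = kv_cache_bytes(layers=42, kv_heads=16, head_dim=256, seq_len=8192)
gqa = kv_cache_bytes(layers=42, kv_heads=8, head_dim=256, seq_len=8192)
print(f"MHA-style cache: {mha / 1e9:.1f} GB, GQA cache: {gqa / 1e9:.1f} GB")
```

Halving the KV heads halves the cache, which directly increases how many concurrent sequences fit in memory at inference time.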
Gemma 2 model family scaling: Gemma 2 comes in three sizes targeting different deployment tiers: 2B (edge and on-device), 9B (consumer GPUs), and 27B (data center). Each size balances performance, latency, and memory constraints for its deployment target. The 2B variant achieves approximately 20-30% of the 27B variant's capability while using 7.4% of the parameters, making Gemma 2 2B suitable for on-device inference where latency and privacy are critical, such as running AI features directly on phones without cloud calls.
Knowledge distillation was central to Gemma 2's development. The smaller models were trained against the larger 27B model as a teacher, learning from the teacher's full probability distribution over next tokens rather than from one-hot labels. This richer training signal lets the smaller models match models with far more parameters. The technique also scales well: the same 27B teacher can be reused to train students of multiple sizes.
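A minimal sketch of logit distillation in PyTorch, showing the general technique rather than Gemma's actual training code: the student minimizes KL divergence to the teacher's temperature-softened next-token distribution. The temperature value and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Student log-probs and teacher probs, both softened by the temperature
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperature settings
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

torch.manual_seed(0)
student = torch.randn(4, 32)  # [batch, vocab] toy logits
teacher = torch.randn(4, 32)
loss = distillation_loss(student, teacher)
```

In practice this term is often mixed with the ordinary cross-entropy loss on ground-truth tokens, with a weighting hyperparameter between the two.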
Gemma 2's local-global attention pattern cuts the cost of full attention: layers alternate between local sliding-window attention, where each token attends only to a fixed-size window of recent tokens, and global attention layers that attend across the full context. For a fixed window, the local layers scale linearly with sequence length, so this hybrid maintains long-range modeling capability while reducing computation by 30-40% compared to full attention in every layer, making inference faster without sacrificing quality.
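The alternating pattern can be sketched as attention masks. The 4096-token window matches Gemma 2's published local window; the even/odd layer indexing here is illustrative:

```python
import numpy as np

def causal_mask(seq_len, window=None):
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    mask = j <= i                    # causal: no attending to future tokens
    if window is not None:
        mask &= (i - j) < window     # local: key must be within the window
    return mask

def mask_for_layer(layer_idx, seq_len, window=4096):
    # Even layers: sliding-window (local); odd layers: full causal (global)
    return causal_mask(seq_len, window if layer_idx % 2 == 0 else None)

local = mask_for_layer(0, seq_len=8, window=4)   # tiny sizes for illustration
global_ = mask_for_layer(1, seq_len=8)
```

In a real implementation the local layers never materialize the full mask; each query only computes scores against its window of keys, which is where the savings come from.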
Compared to similar-sized open models, Gemma 2 consistently achieves 5-10% higher scores on standard benchmarks like MMLU and GSM8K. Gemma 2 9B competes with models 3-4x larger through careful training, architecture, and distillation choices. This efficiency advantage makes Gemma 2 particularly valuable for commercial deployment where both inference cost and model capability matter.
Gemma 2's training data includes more recent information than many contemporaries (up to mid-2024), giving it advantages on current event questions and recently-published content. The engineering team invested heavily in training infrastructure, enabling thousands of experiments that refined every component. Reproducing this quality with 1/1000th the experiments would be extremely difficult.
Fine-tuning Gemma 2 on domain-specific data (legal documents, medical texts, code) can adapt the base model with modest data and compute. Quantized Gemma 2 2B running on consumer GPUs or TPUs (via Google Cloud) makes state-of-the-art AI accessible to small teams. This democratization of AI capability is one of Gemma's stated goals.
Gemma 2's training procedure involved multiple stages: pretraining on diverse web data, supervised fine-tuning on instruction-following data, and reinforcement learning from human feedback (RLHF) to improve alignment. Each stage builds on prior stages, with later stages using smaller, curated datasets. Understanding this multi-stage training helps practitioners design their own fine-tuning and alignment procedures.
Evaluation of Gemma 2 spans academic benchmarks (MMLU, TruthfulQA, BIG-bench), code benchmarks (HumanEval, MBPP), and real-world metrics like user preference scores. Benchmarks have limitations: they don't perfectly correlate with real-world performance. Practitioners should develop domain-specific evaluation metrics beyond standard benchmarks. Gemma 2's benchmark scores provide a starting point, not an end point.
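A domain-specific evaluation can be as simple as exact-match accuracy over a hand-built QA set. This is a generic sketch, not part of any Gemma tooling; `generate_answer` stands in for a real model call such as the generate() snippet earlier, and the data is illustrative:

```python
def exact_match_accuracy(examples, generate_answer):
    """Fraction of examples where the model's answer matches exactly
    (case-insensitive, whitespace-stripped)."""
    hits = sum(
        generate_answer(ex["question"]).strip().lower() == ex["answer"].lower()
        for ex in examples
    )
    return hits / len(examples)

# Toy usage with a stub "model" that always answers "paris":
toy = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Italy?", "answer": "Rome"},
]
acc = exact_match_accuracy(toy, lambda q: "paris")
```

Exact match is deliberately strict; for free-form answers you would typically swap in a fuzzier metric (substring match, an LLM judge, or task-specific scoring).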
Safety considerations in Gemma 2 included training to refuse harmful requests, reduce biases, and provide accurate information. Safety-alignment involves trade-offs: perfectly aligned models might be overly cautious and unhelpful on edge cases. Red-teaming (adversarial testing) identified failure modes before release. Understanding alignment challenges helps practitioners make informed choices about safety-capability trade-offs in their deployments.
Google's decision to open-source Gemma 2 creates an ecosystem of downstream models, fine-tuning tutorials, and community contributions. Developers fine-tune Gemma 2 on code, mathematics, medicine, and other domains. The open-source model enables rapid innovation: researchers can build on Gemma 2 without starting from scratch. This democratization of advanced AI capabilities accelerates research and enables small teams to compete with well-resourced institutions.
Gemma 2's code implementation in JAX and PyTorch is publicly available, enabling researchers to understand architectural details and propose improvements. Seeing exactly how local-global attention and other innovations are implemented demystifies the model. This transparency builds trust and enables principled extensions. Practitioners can implement similar techniques in their own models after studying Gemma 2.
Commercial and non-commercial uses of Gemma 2 span chatbots, code generation assistants, RAG systems, and fine-tuned task-specific models. The licensing model permits commercial use under certain conditions, enabling startups to build on Gemma 2. Understanding Gemma 2's capabilities and limitations helps practitioners decide whether to use it directly or as a foundation for customized models.