Microsoft's Phi small language model family achieves remarkable quality at 3.8B–14B parameters by training on "textbook-quality" synthetic data. Phi-3-mini beats GPT-3.5 on many benchmarks. Phi-4 at 14B rivals GPT-4o on reasoning tasks.
The Phi family (Microsoft Research) is built on a counterintuitive thesis: model quality depends more on training data quality than parameter count. Instead of training on Common Crawl web scrapes, Phi uses synthetic "textbook-quality" data — carefully generated educational content that teaches reasoning step-by-step, with minimal noise and high information density.
Phi-1 (2023) showed that a 1.3B code model trained on synthetic "Python textbooks" could outperform much larger models on HumanEval and MBPP. Phi-2 extended this to general reasoning. Phi-3 generalised the approach to instruction following and chat. The result: a 3.8B model (Phi-3-mini) that matches or beats GPT-3.5-turbo on MMLU, HumanEval, and GSM8K.
Phi-3 comes in three sizes, each available in short-context (4K or 8K) and long-context (128K) variants:
Architecture: a standard decoder-only transformer with RoPE position embeddings; the 128K variants extend context with LongRoPE. The key innovation is the training data pipeline, not the architecture.
Phi-4 (14B, released December 2024) pushes the synthetic data approach further with a focus on "reasoning-heavy" training data. Benchmarks show Phi-4 scoring above GPT-4o on GSM8K (93.4% vs 90.8%) and competitive with GPT-4o on MMLU (84.8%).
Key improvements: better multi-step reasoning, stronger coding performance, and more consistent instruction following. The 14B parameter count is identical to Phi-3-medium but the data quality improvements yield significantly better results.
Phi-4-mini (3.8B) is also available, optimised for on-device inference with a focus on math and reasoning tasks for embedded applications.
```shell
# Fastest: Ollama
ollama run phi3          # 3.8B  - runs on any laptop
ollama run phi3:medium   # 14B   - needs 12GB+ RAM
ollama run phi4          # Phi-4 14B
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # required for Phi-3
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe(messages, max_new_tokens=512)[0]["generated_text"]
print(output[-1]["content"])  # the assistant reply is the last message
```
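Under the hood, `pipeline` applies the tokenizer's chat template before generation. A minimal sketch of the prompt format it produces for Phi-3 — the special tokens (`<|user|>`, `<|assistant|>`, `<|end|>`) are assumptions from the Phi-3 tokenizer config, so verify against `tokenizer.apply_chat_template` on the model you actually load:

```python
# Sketch of the Phi-3 chat prompt format. The special tokens are assumed from
# the Phi-3 tokenizer config; in practice, prefer tokenizer.apply_chat_template.
def format_phi3_chat(messages):
    """Render a list of {role, content} dicts into a Phi-3-style prompt string."""
    prompt = ""
    for msg in messages:
        prompt += f"<|{msg['role']}|>\n{msg['content']}<|end|>\n"
    prompt += "<|assistant|>\n"  # generation prompt: the model continues from here
    return prompt

messages = [{"role": "user", "content": "Is 7 prime?"}]
print(format_phi3_chat(messages))
```

Rendering the template yourself is mainly useful for debugging tokenisation or for serving stacks that take raw prompt strings.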
The advantage of small models for fine-tuning: the full model fits in memory even without quantisation. Phi-3-mini (3.8B) can be fully fine-tuned on a single consumer GPU (RTX 4090 24GB), and QLoRA fine-tuning runs on an RTX 3080 10GB.
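The single-consumer-GPU claim can be sanity-checked with back-of-envelope arithmetic. Note that plain AdamW keeps two fp32 moment tensors per parameter, which alone would overflow 24GB for a 3.8B model — fitting a full fine-tune on an RTX 4090 assumes memory savers such as an 8-bit optimizer and gradient checkpointing; the figures below are weight and optimizer memory only (no activations):

```python
# Rough memory budget for fully fine-tuning a 3.8B model in bf16.
# Weight/gradient/optimizer memory only; activation memory is extra.
params = 3.8e9
weights_gb = params * 2 / 1e9     # bf16 weights, 2 bytes/param
grads_gb = params * 2 / 1e9       # bf16 gradients
adamw_fp32_gb = params * 8 / 1e9  # two fp32 moments, 4 bytes each
adamw_8bit_gb = params * 2 / 1e9  # two int8 moments, 1 byte each
print(f"bf16 + fp32 AdamW:  ~{weights_gb + grads_gb + adamw_fp32_gb:.0f} GB")
print(f"bf16 + 8-bit AdamW: ~{weights_gb + grads_gb + adamw_8bit_gb:.0f} GB")
```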
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable adapters are a small fraction of total params
```
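The trainable-parameter count can be estimated from the model's shapes. The Phi-3-mini dimensions used here (hidden size 3072, 32 layers, a fused `qkv_proj` of width 3 × 3072) are assumptions taken from the published config — verify them against the model you actually load:

```python
# Estimate LoRA trainable params for r=16 on qkv_proj + o_proj.
# Phi-3-mini shapes below are assumptions from the published config.
hidden, layers, r = 3072, 32, 16
qkv_out = 3 * hidden                 # fused query/key/value projection width
lora_qkv = r * (hidden + qkv_out)    # A: hidden x r, B: r x qkv_out
lora_o = r * (hidden + hidden)       # o_proj maps hidden -> hidden
trainable = layers * (lora_qkv + lora_o)
print(f"~{trainable / 1e6:.1f}M trainable params")  # ~9.4M against a 3.8B base
```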
Phi is the right choice when: (1) you need on-device inference (Phi-3-mini at 2GB is uniquely deployable on phones and edge devices), (2) you're doing tasks requiring careful step-by-step reasoning (Phi's training data emphasises this), (3) you're fine-tuning and want fast iteration (smaller model = faster training), or (4) you need a model that performs well with limited VRAM.
Llama 3.1 8B is generally better for: multilingual tasks (Llama has much better multilingual training data), instruction following variety (larger community fine-tunes), and tasks requiring world knowledge depth (smaller models have less capacity to store factual knowledge).
The practical heuristic: start with Phi-3-mini for quick prototyping (fast, cheap), upgrade to Llama 3.1 8B if you need multilingual or broader knowledge, and use Phi-4 14B when you need strong reasoning in a compact form factor.
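The "2GB on phones" figure follows from weight precision: at 4-bit quantisation a 3.8B model's weights fit in roughly 2GB. A quick sketch of the weight-memory footprints at common precisions (this ignores KV-cache and activation memory, which add to the real footprint):

```python
# Rough weight-only memory footprints for Phi-3-mini (3.8B params).
# KV cache and activations are not included, so real usage is higher.
params = 3.8e9
bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}
footprints = {name: params * b / 1e9 for name, b in bytes_per_param.items()}
for name, gb in footprints.items():
    print(f"{name}: {gb:.1f} GB")
```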
trust_remote_code=True: The Phi models require this flag because they use custom attention implementations. Always verify the model source before setting this — trust_remote_code=True runs arbitrary Python code from the model repo.
Context length variants: Phi-3-mini comes in 4K and 128K context variants (phi-3-mini-4k-instruct vs phi-3-mini-128k-instruct). The 128K variant is slower. Choose based on your use case.
Small model limitations: Despite benchmark performance, small models fail on tasks requiring broad world knowledge or complex multi-step reasoning chains. A 3.8B model has ~5% of a 70B model's parameter capacity. Benchmarks test narrow capabilities; real-world tasks are more varied.
Licensing: Phi-4 uses the MIT licence (fully open). The Phi-3 models are also MIT-licensed, allowing commercial use. Check the specific version's licence on its model card before deploying.
Microsoft's Phi-3 family demonstrates that careful training data curation can produce models that punch significantly above their parameter weight. The Phi models are trained on "textbook quality" data — carefully filtered educational and reasoning-focused text — rather than raw web crawl data, achieving strong performance on reasoning benchmarks despite being orders of magnitude smaller than frontier models.
| Model | Parameters | Context | MMLU | Deployment Target |
|---|---|---|---|---|
| Phi-3-mini | 3.8B | 4K / 128K | ~70% | Mobile, edge devices |
| Phi-3-small | 7B | 8K / 128K | ~75% | Laptop, single GPU |
| Phi-3-medium | 14B | 4K / 128K | ~78% | Workstation GPU |
| Phi-3.5-mini | 3.8B | 128K | ~69% | Long context mobile |
| Phi-3.5-MoE | 16 × 3.8B (6.6B active) | 128K | ~78% | High-quality edge |
Phi-3's training data philosophy focuses on synthetic data generation: rather than filtering a massive internet corpus, Microsoft generates educational text, coding exercises, and reasoning problems specifically designed to teach the model skills efficiently. This approach enables strong performance on math, coding, and logical reasoning relative to model size. The trade-off is that Phi models have less factual breadth than larger models trained on more diverse web corpora, making them stronger at reasoning tasks but potentially weaker at knowledge recall tasks about obscure or domain-specific facts.
ONNX export support for Phi-3 models enables deployment on mobile devices using ONNX Runtime, bypassing the need for heavy PyTorch or HuggingFace dependencies on constrained hardware. Microsoft provides official ONNX model variants for the Phi-3 family through the model catalog, with quantized versions that reduce model size to 2–4GB for on-device deployment. This makes Phi-3 one of the most practical choices for offline, privacy-preserving LLM applications that run entirely on user devices without network connectivity to a model hosting service.
Phi-3's instruction tuning used a combination of human-written demonstrations and model-generated synthetic data carefully filtered for quality and safety. The small parameter count creates particular challenges for safety alignment — smaller models have less capacity to reliably suppress harmful outputs across all contexts. Microsoft addressed this through additional safety fine-tuning focused on the specific failure modes most likely in small models, and provides detailed safety guidance for developers deploying Phi-3 in applications where safety properties need careful evaluation and testing.
Phi-3's strong performance on coding benchmarks makes it a compelling choice for code completion and generation tasks in resource-constrained environments. The 3.8B and 7B variants fit comfortably in browser-based inference environments using WebGPU (via transformers.js or similar) or in mobile applications using ONNX Runtime Mobile. For development tool integrations — code completions, error explanations, documentation generation — where sub-second latency is essential and network calls to cloud APIs are unacceptable, Phi-3 provides frontier-competitive coding quality in an on-device footprint.
Comparative evaluation between Phi-3 and larger models should distinguish between knowledge-recall tasks and reasoning tasks. On knowledge-recall tasks — "What is the capital of Kazakhstan?", "Who wrote Middlemarch?" — Phi-3's smaller training corpus means it will have lower factual coverage than 70B+ models with more training data. On structured reasoning tasks — mathematics, logical inference, code debugging — Phi-3's "textbook quality" training data enables strong performance that exceeds raw parameter count predictions, making it competitive with models 5–10× larger on reasoning-heavy applications.
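A minimal sketch of such a category-split evaluation: score answers separately per category rather than reporting one pooled accuracy. The `answer_fn` stub stands in for a real model call, and the items are illustrative, not a real benchmark:

```python
# Category-split evaluation sketch: per-category accuracy instead of one
# pooled score. answer_fn is a stand-in for a real model call.
def evaluate_by_category(items, answer_fn):
    """items: list of (category, question, expected). Returns accuracy per category."""
    totals, correct = {}, {}
    for category, question, expected in items:
        totals[category] = totals.get(category, 0) + 1
        if answer_fn(question) == expected:
            correct[category] = correct.get(category, 0) + 1
    return {c: correct.get(c, 0) / totals[c] for c in totals}

items = [
    ("recall", "Capital of Kazakhstan?", "Astana"),
    ("reasoning", "17 + 25 = ?", "42"),
    ("reasoning", "Is 9 prime?", "no"),
]
stub_answers = {"Capital of Kazakhstan?": "Astana", "17 + 25 = ?": "42", "Is 9 prime?": "yes"}
scores = evaluate_by_category(items, lambda q: stub_answers[q])
print(scores)  # {'recall': 1.0, 'reasoning': 0.5}
```

Reporting the split makes it visible when a small model's reasoning score holds up while its recall score lags a larger model's.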
Fine-tuning Phi-3 models is particularly efficient due to their small size, enabling rapid iteration on domain-specific datasets. A full fine-tuning run on a 10,000-example dataset takes hours rather than days, and QLoRA fine-tuning of the 3.8B model runs on a single consumer GPU with 12GB VRAM. This accessibility makes Phi-3 attractive for organizations that want to create domain-specific models but lack the infrastructure for fine-tuning larger models. The compact size also makes iterating on training data quality and hyperparameters significantly faster than with 13B+ models.
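The "hours rather than days" claim is straightforward arithmetic once you know your throughput. The examples-per-second figure below is an assumed ballpark for QLoRA on a single consumer GPU, not a measurement — substitute your own benchmark:

```python
# Rough fine-tuning wall-clock estimate for a 10,000-example dataset.
# examples_per_sec is an assumed ballpark, not a measured figure.
examples, epochs = 10_000, 3
examples_per_sec = 2.0  # assumed QLoRA throughput on a 24GB consumer GPU
hours = examples * epochs / examples_per_sec / 3600
print(f"~{hours:.1f} hours")
```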