Alibaba's Qwen 2.5 series: 0.5B to 72B, strong multilingual coverage (29 languages), leading code and math performance for its size class. Qwen2.5-Coder and Qwen2.5-Math are specialised variants.
Qwen 2.5 (Alibaba Cloud, September 2024) is a family of open-weight language models ranging from 0.5B to 72B parameters. The series uses grouped-query attention (GQA), RoPE positional encoding, and SwiGLU activation, with context windows of up to 128K tokens on the larger sizes. The family includes specialised variants: Qwen2.5-Coder (0.5B to 32B) for programming, Qwen2.5-Math (1.5B, 7B, 72B) for mathematics, and the general-purpose Qwen2.5 series for instruction following.
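To make the SwiGLU component concrete, here is a minimal NumPy sketch of a SwiGLU feed-forward block of the kind used in this architecture family. The weight names and dimensions are illustrative, not taken from the actual Qwen implementation.

```python
import numpy as np

def silu(z):
    # SiLU (swish) activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: the gated path (silu) modulates the up projection
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((1, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # (1, 8): output has the same width as the input
```

Note the three weight matrices per FFN block (gate, up, down), versus two in a classic ReLU/GELU MLP; this is why SwiGLU models often use a smaller `d_ff` for the same parameter budget.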
Qwen 2.5 is particularly notable for: (1) Multilingual coverage — 29 languages including Chinese, Japanese, Korean, Arabic, French, German, Spanish, and more; this is broader than Llama 3, which focuses primarily on English and code. (2) Coding — Qwen2.5-Coder-32B outperforms GPT-4o on HumanEval and several competitive programming benchmarks. (3) Math reasoning — Qwen2.5-Math-72B achieves top performance on MATH, GSM8K, and competition mathematics. (4) Instruction following — strong performance on IFEval and complex instruction chains.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between GQA and MHA."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
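The `apply_chat_template` call above renders the messages into Qwen's ChatML-style prompt format. As a rough illustration (the authoritative template lives in the tokenizer config, so treat this as a sketch rather than the exact string), the layout looks like this:

```python
def render_chatml(messages, add_generation_prompt=True):
    # Approximate sketch of the ChatML layout Qwen's chat template produces:
    # each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # The trailing assistant header cues the model to start its reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chatml(messages))
```

This is why skipping the template hurts quality: without the role markers, the instruction-tuned model never sees the prompt format it was trained on.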
Qwen2.5-Coder is pre-trained on 5.5T tokens of code and code-related text, then instruction-tuned for programming tasks. The 32B variant is particularly strong — it outperforms GPT-4o on HumanEval (92.7% vs 90.2% pass@1) and achieves state-of-the-art results on EvalPlus, LiveCodeBench, and SWE-Bench Verified. It supports 92 programming languages with dedicated tokenization. For code generation, code explanation, debugging, and code completion tasks, Qwen2.5-Coder-32B is one of the best open models available.
```python
model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
new_tokens = output[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
GPTQ and AWQ quantised versions are available on Hugging Face for reduced VRAM requirements. GGUF builds are also available for use with llama.cpp and Ollama (e.g. `ollama run qwen2.5:7b`).
```python
# Qwen 2.5 handles mixed-language contexts well
messages = [
    # Chinese system prompt: "You are a helpful assistant."
    {"role": "system", "content": "你是一个有帮助的助手。"},
    {"role": "user", "content": "Summarize this in Japanese: The capital of France is Paris."},
]
# The model will respond in Japanese as instructed.
# The same works for Arabic, Korean, Vietnamese, Thai, and 25 other languages.
```
Always build prompts with `tokenizer.apply_chat_template()`; feeding raw text without the template produces poor results. Some older Qwen releases required `trust_remote_code=True` when loading with HuggingFace; this executes code from the repo, so only enable it for the official Qwen HuggingFace repositories.

Qwen 2.5 models are designed to be fine-tuned efficiently. Even the full-size 72B model can be LoRA-finetuned on a single 80GB GPU, and quantized versions (GPTQ, AWQ) fit on smaller hardware. For domain adaptation (legal, medical, code), even modest fine-tuning (a few hundred examples) can improve performance significantly.
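The reason LoRA fine-tuning fits on a single GPU is that only two small low-rank matrices per adapted layer are trained while the base weights stay frozen. A minimal NumPy sketch of the arithmetic (dimensions are illustrative, not Qwen's actual layer sizes):

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16  # toy sizes; rank r << d

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

# Effective weight with the LoRA update applied; B starts at zero,
# so training begins exactly at the base model's behaviour.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.1%}")
```

At these toy sizes the saving is modest, but at LLM scale (d in the thousands, r of 8–64) the trainable fraction drops well below 1%, which is what makes single-GPU adaptation of a 72B model feasible.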
Quantization is the main path to deployment: Qwen 2.5 is available in 4-bit GPTQ, 8-bit, and AWQ formats. Quality loss is minimal (typically <2% on benchmarks), and memory/speed gains are dramatic. For edge deployment, the smaller 1.5B and 3B variants are popular; they trade some quality but fit on phones and embedded systems.
| Qwen Variant | Params | Memory (FP32) | Memory (4-bit) | Throughput (tok/s)* |
|---|---|---|---|---|
| Qwen 2.5 | 0.5B | ~2GB | ~0.3GB | 100+ |
| Qwen 2.5 | 1.5B | ~6GB | ~1GB | 80+ |
| Qwen 2.5 | 3B | ~12GB | ~2GB | 50+ |
| Qwen 2.5 | 7B | ~28GB | ~4GB | 30+ |
| Qwen 2.5 | 32B | ~128GB | ~18GB | 10+ |
| Qwen 2.5 | 72B | ~288GB | ~40GB | 5–10 |

*Throughput figures are rough single-GPU estimates and vary widely with hardware, batch size, and serving stack.
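The memory columns follow from simple bytes-per-parameter arithmetic: FP32 is 4 bytes per parameter and 4-bit is 0.5 bytes, plus quantization metadata, activations, and KV cache on top. A quick back-of-envelope helper (weights-only lower bound):

```python
def est_memory_gb(n_params_billion, bits_per_param):
    # Weights-only footprint: params x bytes/param.
    # Real usage is higher: quantization scales, activations, KV cache.
    return n_params_billion * 1e9 * (bits_per_param / 8) / 1e9

print(est_memory_gb(7, 32))   # 28.0 GB in FP32
print(est_memory_gb(7, 4))    # 3.5 GB weights-only at 4-bit
print(est_memory_gb(72, 16))  # 144.0 GB in BF16
```

This is why the BF16 checkpoints actually shipped (2 bytes/param) halve the FP32 numbers, and why a 7B model at 4-bit comfortably fits a consumer GPU.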
Qwen 2.5 training and architecture: Qwen 2.5 uses a Transformer architecture with rotary embeddings (RoPE), SwiGLU activation, and grouped-query attention (GQA). The training data is diverse (Chinese, English, code, multilingual), and the models support up to a 128K context window. Compared to Qwen 2, version 2.5 improves reasoning, code generation, and long-context performance through architectural refinements and data curation. The models are instruction-tuned out of the box and respond well to few-shot prompts.
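The GQA component is worth a concrete sketch: several query heads share one key/value head, shrinking the KV cache without giving up per-head queries. A minimal NumPy illustration (head counts and dimensions are toy values, not Qwen's):

```python
import numpy as np

def gqa_scores(q, k, n_kv_groups):
    # q: (n_q_heads, d); k: (n_kv_heads, seq, d).
    # Each KV head serves n_kv_groups query heads, so repeat K along the head axis.
    k_expanded = np.repeat(k, n_kv_groups, axis=0)  # (n_q_heads, seq, d)
    d = q.shape[-1]
    return np.einsum("hd,hsd->hs", q, k_expanded) / np.sqrt(d)

rng = np.random.default_rng(0)
n_q, n_kv, seq, d = 8, 2, 4, 16
q = rng.standard_normal((n_q, d))
k = rng.standard_normal((n_kv, seq, d))
scores = gqa_scores(q, k, n_q // n_kv)
print(scores.shape)  # (8, 4): 8 query heads attend over 4 positions using only 2 KV heads
```

With 8 query heads but only 2 KV heads, the KV cache is 4x smaller than full multi-head attention — the main reason GQA models serve long contexts cheaply.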
Qwen 2.5's strong performance on code and math benchmarks makes it popular for developer-focused applications. The open-source release includes pre-trained, instruction-tuned, and quantized variants, enabling deployment from edge to cloud. Community fine-tunes are abundant: domain-specific Qwen (medical, legal, e-commerce) variants appear on Hugging Face regularly, often outperforming larger closed-source models on specialized tasks.
Qwen 2.5 competitive positioning: Qwen 2.5 competes directly with Llama 3, Mistral, and other open-weight families. On public benchmarks (MMLU, GSM8K, MATH), Qwen 2.5 consistently ranks in the top tier, often exceeding similarly-sized models, and the 72B variant rivals GPT-4 on coding tasks. For teams preferring open-source models, Qwen 2.5 is a solid default, and a Qwen 2.5 model LoRA-finetuned on your own data often outperforms larger proprietary models on domain-specific tasks.
Licensing and commercial use: most Qwen 2.5 checkpoints are released under Apache 2.0; the exceptions are the 3B model (research licence) and the 72B model (Qwen License Agreement, which allows commercial use but imposes restrictions on very large-scale deployments). Always read the licence for the specific checkpoint before deployment; edge cases (using Qwen outputs to train another LLM, embedding Qwen in a competing commercial product) may carry additional terms.
Community and ecosystem: Hugging Face, ModelScope, and GitHub have thousands of Qwen 2.5 derivatives (fine-tuned models, LoRA adapters, quantized versions). The community is active and helpful. If you're stuck, the community Discord and GitHub discussions are good resources. For proprietary modifications, Alibaba's official support is available through commercial licensing.
Qwen 2.5 achieves state-of-the-art performance on reasoning and instruction-following benchmarks. Detailed benchmark results show strong performance on mathematics, coding, and knowledge-based tasks. The model family spans multiple sizes, from 0.5B to 72B parameters, enabling deployment across diverse hardware constraints. Understanding benchmark results relative to your specific application requirements helps guide model selection, and the publicly available benchmarks provide transparent performance data compared to competing models.
Beyond standard benchmarks, practical performance depends heavily on prompt engineering and fine-tuning. Models require careful instruction design to achieve their full potential, and task-specific fine-tuning often yields improvements beyond standard evaluation. Real-world performance testing on representative examples from your domain is essential, as benchmark performance doesn't always translate directly to production quality.
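For that kind of domain testing, even a trivial scorer over a handful of representative prompts is more informative than a leaderboard number. A minimal exact-match harness sketch (the metric and normalisation are illustrative; real evaluations usually add task-specific scoring):

```python
def exact_match_rate(predictions, references):
    # Fraction of model outputs matching the reference after light normalisation
    # (trim whitespace, lowercase, collapse internal spaces).
    norm = lambda s: " ".join(s.strip().lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Paris", " paris ", "Lyon"]
refs = ["Paris", "Paris", "Paris"]
print(exact_match_rate(preds, refs))  # 2 of 3 match after normalisation
```

Running the same fixed prompt set before and after a fine-tune or a quantization change gives a cheap, reproducible regression signal for your specific domain.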
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load Qwen 2.5
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Inference (for chat use, wrap the prompt with tokenizer.apply_chat_template first)
prompt = "Your question here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
Qwen 2.5 represents significant progress in open-source language modeling. The model family provides accessible alternatives to proprietary models while maintaining competitive performance. As the AI landscape evolves, strong open-source models like Qwen expand options for organizations seeking both performance and independence from proprietary vendors.
The quantization options for Qwen 2.5 enable deployment in resource-constrained environments without significant quality loss. 4-bit and 8-bit quantized versions require substantially less memory than full-precision models. These quantized variants enable running models on consumer GPUs or edge devices, expanding deployment possibilities. The trade-offs between memory usage and model quality vary depending on quantization schemes and the specific task.
Multi-lingual capabilities make Qwen 2.5 suitable for applications serving diverse user bases. Strong performance across numerous languages enables international deployments without separate models. This multilingual proficiency extends beyond simple translation to deep understanding of context and nuance across languages. For global applications, the multilingual capabilities represent significant value.
Vision capabilities in the companion Qwen-VL model line enable visual understanding alongside text processing. This multimodal functionality opens applications including document analysis, image captioning, and visual question answering. As vision capabilities mature in open-source models, applications previously requiring separate vision and language models can consolidate onto unified architectures.
The active development and community support around Qwen models ensure continued improvements and rapid incorporation of new techniques. The Alibaba team regularly releases new versions and capabilities, keeping the model family current with the evolving AI landscape. The strong community adoption provides extensive documentation, examples, and third-party integrations that reduce friction in adoption.