Open-Source LLMs

DeepSeek R1

Open-weight reasoning model from DeepSeek that matches o1-level performance on math and coding benchmarks — at a fraction of the cost. MIT licensed, runnable locally, and trained using GRPO reinforcement learning.


SECTION 01

Why R1 matters

DeepSeek R1 achieved near-parity with OpenAI's o1 on key reasoning benchmarks (AIME 2024, Codeforces, MATH-500) while being open-weight and MIT licensed. This was a landmark moment: previously, frontier reasoning capabilities were exclusively locked behind proprietary APIs. R1 proved that open-source models could match the best commercial reasoning models.

The cost difference is dramatic. DeepSeek's API charges roughly $0.55 per 1M input tokens — compared to $15/1M for o1. For high-volume reasoning tasks, this is a 27× cost reduction. And since the weights are open, you can run R1 locally with no per-token cost at all (just hardware).
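To make the arithmetic concrete (input-token prices as quoted above; real bills also depend on output-token rates, which differ between the two APIs):

```python
# Rough input-token cost comparison at the prices quoted above.
R1_INPUT_PER_M = 0.55   # USD per 1M input tokens (DeepSeek API)
O1_INPUT_PER_M = 15.00  # USD per 1M input tokens (o1)

def input_cost(tokens: int, price_per_m: float) -> float:
    """Cost in USD for a given number of input tokens."""
    return tokens / 1_000_000 * price_per_m

monthly_tokens = 500_000_000  # hypothetical high-volume workload
r1 = input_cost(monthly_tokens, R1_INPUT_PER_M)
o1 = input_cost(monthly_tokens, O1_INPUT_PER_M)
print(f"R1: ${r1:,.0f}  o1: ${o1:,.0f}  ratio: {o1 / r1:.1f}x")
# → R1: $275  o1: $7,500  ratio: 27.3x
```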

R1 was trained using GRPO (Group Relative Policy Optimization), a variant of reinforcement learning that rewards the model for correct final answers on verifiable tasks (math, code). Explicit chain-of-thought reasoning emerged during RL training rather than being directly taught: the model was never explicitly trained to show its reasoning steps, but learned to do so spontaneously.
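The core mechanic of GRPO can be sketched in a few lines: sample a group of completions per prompt, score each with a verifiable reward, and normalize each reward against the group's mean and standard deviation, so the group statistics replace the learned value baseline used in PPO. A minimal illustration with made-up rewards:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sample against its group.

    A_i = (r_i - mean(group)) / std(group). The group statistics act as
    the baseline, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0:
        return [0.0 for _ in rewards]  # all rewards equal: no learning signal
    return [(r - mu) / sigma for r in rewards]

# four sampled answers to one math prompt, scored 1.0 if the final
# answer verifies as correct and 0.0 otherwise
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # correct samples get positive advantage, wrong negative
```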

SECTION 02

R1 model variants

DeepSeek released the full model family:

DeepSeek-R1 (671B parameters, MoE architecture): the full model. Matches o1 on benchmarks. Requires significant hardware to run locally (impractical without a multi-GPU server). Available via DeepSeek API.

DeepSeek-R1-Distill-Qwen-32B: distilled version trained on R1's reasoning traces. 32B parameters — runs on a single A100 80GB or two A6000s. Scores close to o1-mini on most tasks. Best balance of quality and local runnability.

DeepSeek-R1-Distill-Qwen-7B: 7B version, runs on consumer GPUs (RTX 3090/4090) or fast CPU inference. Quality drops noticeably on hard math but remains strong for coding tasks.

DeepSeek-R1-Distill-Llama-8B / 70B: same distillation process but using Llama 3 as the base model. Useful if you need Llama 3's license terms rather than Qwen's.
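A rough rule of thumb behind those hardware recommendations: weights-only memory is parameter count times bytes per weight, plus headroom for the KV cache and activations. A sketch, where the 1.2 overhead factor is an assumption rather than a measured constant:

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: int = 16,
                     overhead: float = 1.2) -> float:
    """Back-of-the-envelope VRAM need: weights plus ~20% for KV cache etc."""
    weight_gb = params_billions * bits_per_weight / 8  # GB for weights alone
    return weight_gb * overhead

print(f"32B @ fp16:  ~{vram_estimate_gb(32):.0f} GB")  # near an A100 80GB's limit
print(f"7B  @ 4-bit: ~{vram_estimate_gb(7, bits_per_weight=4):.1f} GB")
```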

SECTION 03

Running via API

pip install openai  # DeepSeek API is OpenAI-compatible

from openai import OpenAI

# DeepSeek uses OpenAI-compatible API
client = OpenAI(
    api_key="sk-...",  # DeepSeek API key from platform.deepseek.com
    base_url="https://api.deepseek.com"
)

# R1 — reasoning model
response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1
    messages=[{
        "role": "user",
        "content": "Solve: A train leaves Chicago at 9am going 60mph. Another leaves NYC at 10am going 80mph. The cities are 790 miles apart. When do they meet?"
    }],
)

# R1 exposes the reasoning trace!
reasoning = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

print("Thinking:", reasoning[:500], "...")
print("\nAnswer:", answer)

# DeepSeek-V3 — non-reasoning model (cheaper, faster for simpler tasks)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a Python function to parse CSV."}],
)

SECTION 04

Running locally with Ollama

# Pull and run DeepSeek R1 distilled models with Ollama
# Terminal:
# ollama pull deepseek-r1:7b    # 4.7 GB, runs on most machines
# ollama pull deepseek-r1:32b   # 19 GB, needs 24GB+ VRAM or Apple M-series
# ollama run deepseek-r1:7b     # interactive chat

# Python client
import requests

def chat_deepseek(message: str, model: str = "deepseek-r1:7b") -> dict:
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": message}],
            "stream": False
        }
    )
    return response.json()

result = chat_deepseek("What is the time complexity of quicksort? Prove it.")
print(result["message"]["content"])

# Or use the OpenAI-compatible endpoint
from openai import OpenAI
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = local_client.chat.completions.create(
    model="deepseek-r1:7b",
    messages=[{"role": "user", "content": "Find bugs in this code: [paste code]"}]
)
print(response.choices[0].message.content)

SECTION 05

Visible chain-of-thought

R1's most distinctive feature: it shows its reasoning in <think>...</think> tags before the final answer. This is valuable for debugging (why did the model reach this conclusion?), building user trust (show the reasoning process), and distillation (use R1's thinking traces to train smaller models).

import re

def parse_r1_response(text: str) -> tuple[str, str]:
    """Separate the <think> trace from the final answer."""
    think_match = re.search(r'<think>(.*?)</think>', text, re.DOTALL)
    if think_match:
        thinking = think_match.group(1).strip()
        answer = text[think_match.end():].strip()
        return thinking, answer
    return "", text

# With DeepSeek API (reasoning_content is a separate field)
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is 10007 prime?"}]
)
thinking = response.choices[0].message.reasoning_content
answer = response.choices[0].message.content

print(f"Thinking ({len(thinking)} chars): {thinking[:200]}...")
print(f"Answer: {answer}")

# With Ollama local models (think tags are in the content):
result = chat_deepseek("Is 10007 prime?", model="deepseek-r1:7b")
thinking, answer = parse_r1_response(result["message"]["content"])

SECTION 06

Distilled R1 models

DeepSeek generated a large set of reasoning traces with R1 and used them to distill smaller models (the traces themselves were not released, only the resulting model weights). This technique — training a small model to imitate a large model's reasoning process — produces models that punch far above their parameter count.

# Using distilled R1 models via HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# R1 format: wrap user message in thinking prompt
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Prove that there are infinitely many primes."}],
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,  # R1 models work well at 0.6
    do_sample=True,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

SECTION 07

Gotchas

R1's reasoning traces can be very long. For hard problems, R1 may generate thousands of tokens of thinking before the final answer. This is fine for quality but can be jarring for users. In production, either hide the thinking display by default (reveal on demand), or use a distilled model with shorter reasoning chains.
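One way to implement hide-by-default is to keep the full trace but render only a collapsed preview. A sketch using a hypothetical UI block format (the `collapsible` block type is an assumption, not a real frontend API):

```python
def display_blocks(thinking: str, answer: str,
                   preview_chars: int = 200) -> list[dict]:
    """Build UI blocks: a collapsed reasoning preview plus the answer.

    The block schema here is hypothetical; the point is that the full
    trace is preserved but only a short preview renders by default.
    """
    blocks = []
    if thinking:
        blocks.append({
            "type": "collapsible",
            "label": f"Show reasoning ({len(thinking):,} chars)",
            "preview": thinking[:preview_chars],
            "full": thinking,
            "collapsed": True,
        })
    blocks.append({"type": "text", "content": answer})
    return blocks
```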

R1 struggles with non-reasoning tasks. Creative writing, conversational chat, and instruction following are weaker compared to models like Claude or GPT-4o that were specifically trained for these. Use R1 for its strengths (math, coding, logical reasoning) and a different model for generation-heavy tasks.

Local quantized R1 shows significant quality degradation on hard tasks. The 4-bit quantized 7B distilled model and the full 671B model are not comparable — the smaller model may fail on competition math problems the full model handles easily. Benchmark on your actual task before committing to a particular model size.

DeepSeek-R1 vs. Other Reasoning Models

DeepSeek-R1 achieved frontier-level mathematical and coding reasoning performance through reinforcement learning directly on reasoning traces, without relying on distillation from a larger teacher model for its core capabilities. This approach produced emergent reasoning behaviors including self-verification, backtracking, and extended chain-of-thought, reportedly at a fraction of the compute cost of comparable proprietary models.

Model | Reasoning Method | Training Data | Open Weights | AIME 2024
DeepSeek-R1 | RL on reasoning traces | Self-generated | Yes | ~79%
OpenAI o1 | RL (details proprietary) | Proprietary | No | ~83%
OpenAI o3 | RL + search | Proprietary | No | ~96%
QwQ-32B | RL on reasoning | Self-generated | Yes | ~50%

The training recipe for DeepSeek-R1 proceeds in stages: first, a cold-start SFT phase teaches the model basic chain-of-thought formatting using a small set of curated reasoning examples; then large-scale RL with rule-based rewards (correctness for math/code, format compliance) drives the emergence of more sophisticated reasoning strategies. The rule-based reward design avoids reward hacking more effectively than a learned reward model, since mathematical correctness is verifiable and unambiguous.
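A toy version of such a rule-based reward, combining a format check with exact-match correctness (the 0.1/1.0 weights are illustrative; real pipelines run test suites for code and normalize expressions for math):

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy R1-style reward: format compliance plus verifiable correctness.

    Exact string match on the final answer is a stand-in for a real
    verifier (unit tests for code, expression normalization for math).
    """
    reward = 0.0
    # format reward: exactly one well-formed <think>...</think> block
    if len(re.findall(r"<think>.*?</think>", completion, re.DOTALL)) == 1:
        reward += 0.1
    # accuracy reward: the text after the think block matches the reference
    final = re.sub(r"<think>.*?</think>", "", completion,
                   flags=re.DOTALL).strip()
    if final == reference_answer.strip():
        reward += 1.0
    return reward
```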

DeepSeek-R1's open-weight release (including distilled variants at 1.5B, 7B, 8B, 14B, 32B, and 70B) has significant implications for the open-source ecosystem. The distilled versions use R1's reasoning traces as SFT data to transfer reasoning capabilities to smaller base models like Qwen2.5 and Llama-3, making capable reasoning models accessible without the computational requirements of running the full 671B MoE architecture.

The emergent self-verification behavior in DeepSeek-R1 is one of its most practically useful properties. During generation, the model sometimes produces a tentative answer, then re-examines its reasoning and explicitly revises it — a pattern that mirrors how human experts check their work. This behavior was not explicitly trained but emerged from the RL reward structure: since correctness is rewarded and the model has budget to continue generating, it learns that re-checking its work before committing to a final answer increases expected reward.
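If you want to measure this behavior in your own traces, a crude heuristic is to count revision markers in the thinking text (the marker list below is a guess, not an established taxonomy):

```python
REVISION_MARKERS = ("wait", "let me re-check", "actually", "hmm",
                    "on second thought")

def count_revisions(thinking: str) -> int:
    """Heuristic count of self-correction moments in a reasoning trace."""
    text = thinking.lower()
    return sum(text.count(marker) for marker in REVISION_MARKERS)
```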

Deploying DeepSeek-R1 in production requires decisions about thinking token budgets. The model generates a long chain of reasoning (the "thinking" trace) before producing its final answer, and the quality of the final answer improves with longer reasoning traces up to a point. Setting a maximum thinking token budget controls latency and cost, but too tight a budget degrades reasoning quality on hard problems. Dynamic budgeting — allocating more thinking tokens for queries classified as complex by a lightweight router — balances quality and cost better than a single fixed limit.
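A sketch of such a router; the keyword heuristic and budget tiers are placeholders (in practice you'd use a small classifier), and it assumes the cap is applied through `max_tokens`, which bounds thinking and answer together:

```python
def thinking_budget(prompt: str) -> int:
    """Crude complexity router: allot more thinking tokens to harder queries.

    The keyword list and tiers are placeholders; a lightweight classifier
    model would do this job better in production.
    """
    hard_signals = ("prove", "optimize", "complexity", "debug", "integral")
    score = sum(s in prompt.lower() for s in hard_signals) + len(prompt) // 500
    if score >= 2:
        return 16_384  # hard: allow a long reasoning trace
    if score == 1:
        return 4_096   # medium
    return 1_024       # easy: keep latency and cost down

# assumption: max_tokens caps thinking + answer together on the DeepSeek API
# response = client.chat.completions.create(
#     model="deepseek-reasoner",
#     messages=[{"role": "user", "content": prompt}],
#     max_tokens=thinking_budget(prompt),
# )
```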

Cost benchmarking for DeepSeek-R1 versus proprietary reasoning models should account for both API pricing and thinking token consumption. Problems that require extensive reasoning chains consume substantially more tokens than the visible output, and API costs scale with total tokens generated including the thinking trace. Comparing cost-per-correct-answer rather than cost-per-output-token provides a more accurate picture of the economic trade-offs between open-weight and proprietary reasoning models.
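Cost-per-correct-answer falls out of three numbers: price per token, average tokens generated per query (thinking included), and accuracy. A sketch with made-up figures:

```python
def cost_per_correct(price_per_m_tokens: float,
                     avg_tokens_per_query: float,
                     accuracy: float) -> float:
    """USD per correct answer: generation cost divided by success rate.

    avg_tokens_per_query must include the thinking trace, not just the
    visible answer, since thinking often dominates for reasoning models.
    """
    cost_per_query = avg_tokens_per_query / 1_000_000 * price_per_m_tokens
    return cost_per_query / accuracy

# made-up figures for illustration only
open_model = cost_per_correct(price_per_m_tokens=2.19,
                              avg_tokens_per_query=8_000, accuracy=0.75)
closed_model = cost_per_correct(price_per_m_tokens=60.0,
                                avg_tokens_per_query=6_000, accuracy=0.85)
print(f"open: ${open_model:.4f}/correct vs closed: ${closed_model:.4f}/correct")
```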