One-command local LLM runner with a library of 100+ models, REST API compatible with OpenAI's format, and GPU acceleration out of the box. The fastest way to run open models on your own hardware.
Ollama wraps llama.cpp (and other inference backends) in a user-friendly tool that handles model downloading, GPU acceleration, and serving behind a REST API. Instead of configuring CUDA drivers, quantisation settings, and inference parameters manually, you run `ollama run llama3` and have a working LLM in 60 seconds.
The REST API is deliberately designed to be a drop-in replacement for the OpenAI API. The same Python code that calls api.openai.com can call localhost:11434 with a different base_url — enabling development and testing against local models before incurring API costs in production.
Ollama is the right tool when: you need total privacy (no data leaves your machine), you're developing offline, you want to avoid per-token costs during heavy development, or you need to serve open models in an on-premises enterprise environment.
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com

# Pull and run a model
ollama run llama3.2        # 3B model, fast on CPU
ollama run llama3.1:8b     # 8B, needs 8GB+ RAM
ollama run mistral         # 7B, good coding model
ollama run deepseek-r1:7b  # reasoning model
```

The server starts automatically; it is also accessible programmatically:
```python
import subprocess
import time

import requests

def is_ollama_running() -> bool:
    """Check whether the Ollama server is answering on its default port."""
    try:
        r = requests.get("http://localhost:11434/api/tags", timeout=2)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Start the Ollama server (if not already running as a service)
if not is_ollama_running():
    subprocess.Popen(["ollama", "serve"])
    time.sleep(2)

# List available models
models = requests.get("http://localhost:11434/api/tags").json()
for m in models["models"]:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
```
```python
# Option 1: OpenAI SDK (drop-in replacement)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a Python decorator for retry logic."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```
```python
# Option 2: Ollama Python library
# pip install ollama
import ollama

# Simple generation
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain GQA in 2 sentences."}],
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)

# Embeddings (for local RAG)
embed_response = ollama.embeddings(
    model="nomic-embed-text",
    prompt="The attention mechanism computes weighted sums of values.",
)
vector = embed_response["embedding"]  # 768-dimensional float list
print(f"Embedding dim: {len(vector)}")
```
```python
import ollama

# Pull a model (downloads if not present)
ollama.pull("llama3.1:8b")

# List local models
models = ollama.list()
for m in models["models"]:
    size_gb = m["size"] / 1e9
    print(f"{m['name']}: {size_gb:.1f} GB, modified {m['modified_at'][:10]}")

# Delete a model to free disk space
ollama.delete("llama3:latest")

# Copy a model (create an alias)
ollama.copy("llama3.2", "my-custom-llama")

# Show model details (architecture, parameters, context length)
info = ollama.show("mistral")
print(info["modelinfo"])
```
```python
# Popular models and their use cases
RECOMMENDED = {
    "llama3.2:3b": "Fast responses, low memory (3GB RAM)",
    "llama3.1:8b": "Good quality/speed balance (8GB RAM)",
    "mistral:7b": "Strong coding, instruction following (8GB RAM)",
    "deepseek-r1:7b": "Reasoning tasks, visible CoT (5GB RAM)",
    "qwen2.5:7b": "Multilingual, code generation (5GB RAM)",
    "nomic-embed-text": "Embeddings for local RAG (274MB)",
    "llava:7b": "Vision + language tasks (5GB RAM)",
}
```
Modelfiles let you customise models: system prompt, parameters, and base model. Save the content as `Modelfile`, then run `ollama create my-model -f Modelfile`:

````python
import subprocess

import ollama

MODELFILE_CONTENT = '''
FROM llama3.2

# Set a persistent system prompt
SYSTEM """
You are a senior Python engineer at a fintech company.
Always write type annotations. Prefer composition over inheritance.
Include docstrings for all public functions.
When reviewing code, check for: security vulnerabilities, edge cases, performance.
"""

# Model parameters (Modelfile comments must start a line, not trail a value)
# Low temperature for consistent code output
PARAMETER temperature 0.1
PARAMETER top_p 0.9
# Context window (tokens)
PARAMETER num_ctx 8192
# Max output tokens
PARAMETER num_predict 2048

# Add example conversations to guide behaviour. Multiline MESSAGE values need
# triple quotes, so the example code uses a single-line docstring to avoid
# clashing with the closing delimiter.
MESSAGE user Review this function: def divide(a, b): return a/b
MESSAGE assistant """The function lacks input validation. Here's an improved version:

```python
def divide(a: float, b: float) -> float:
    "Divide a by b, raising ValueError if b is zero."
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b
```
"""
'''

# Write and create the model
with open("/tmp/Modelfile", "w") as f:
    f.write(MODELFILE_CONTENT)

subprocess.run(["ollama", "create", "python-reviewer", "-f", "/tmp/Modelfile"])

# Now use the custom model
response = ollama.chat(
    model="python-reviewer",
    messages=[{"role": "user", "content": "Review: def get_user(id): return db.query(f'SELECT * FROM users WHERE id={id}')"}],
)
````
```python
import os

# GPU offload: control how many layers run on the GPU with the num_gpu
# parameter (in a Modelfile or per-request options); -1 offloads all layers.

# These variables configure the server, so set them in the environment that
# launches `ollama serve`, not in an already-running client process.
os.environ["OLLAMA_NUM_PARALLEL"] = "4"       # concurrent requests (default 1)
os.environ["OLLAMA_MAX_LOADED_MODELS"] = "2"  # keep 2 models resident in memory
```
```python
# Benchmark a model's throughput
import time

import ollama

def benchmark(model: str, prompt: str, runs: int = 5) -> dict:
    """Measure average wall-clock latency and tokens/second over several runs."""
    times = []
    token_counts = []
    for _ in range(runs):
        start = time.time()
        response = ollama.generate(model=model, prompt=prompt)
        times.append(time.time() - start)
        token_counts.append(response.get("eval_count", 0))
    avg_tokens = sum(token_counts) / len(token_counts)
    avg_time = sum(times) / len(times)
    return {
        "model": model,
        "avg_tokens_per_sec": avg_tokens / avg_time,
        "avg_latency_s": avg_time,
    }

result = benchmark("llama3.2", "Write a Python hello world function.")
print(f"{result['model']}: {result['avg_tokens_per_sec']:.1f} tok/s")
```
Typical throughput (approximate):

- CPU only (M2 MacBook): 15–30 tok/s for 7B models
- Single RTX 4090: 80–120 tok/s for 7B, 30–50 tok/s for 13B
- Apple M3 Max: 50–80 tok/s for 7B models
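The benchmark above uses wall-clock time, which includes prompt processing and any model-load latency. Ollama's generate response also reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds), which give decode-only throughput; the helper below is a small sketch using those two fields.

```python
def decode_tokens_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    """Decode-only throughput from the metrics in an Ollama generate response."""
    return eval_count / (eval_duration_ns / 1e9)

# With a real response:
# resp = ollama.generate(model="llama3.2", prompt="...")
# decode_tokens_per_sec(resp["eval_count"], resp["eval_duration"])
print(decode_tokens_per_sec(120, 2_000_000_000))  # 120 tokens in 2s -> prints 60.0
```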
Cold start latency on first request. Ollama loads the model into memory on the first request after starting (or after the model was evicted). This can take 5–30 seconds for large models. Warm up the model by sending a dummy request on startup, and configure OLLAMA_KEEP_ALIVE to prevent eviction between requests (default is 5 minutes).
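A warm-up sketch using only documented `/api/generate` behaviour: an empty prompt loads the model without generating, and `keep_alive: -1` keeps it resident indefinitely. The helper names are mine; only the endpoint and fields come from the Ollama API.

```python
import json
import urllib.error
import urllib.request

def warmup_payload(model: str) -> dict:
    # Empty prompt = load the model without generating; keep_alive=-1 pins it
    return {"model": model, "prompt": "", "keep_alive": -1}

def warm_up(model: str, host: str = "http://localhost:11434") -> bool:
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=60):
            return True
    except (urllib.error.URLError, OSError):
        return False  # server not reachable

if warm_up("llama3.2"):
    print("model loaded and pinned in memory")
```

Alternatively, set `OLLAMA_KEEP_ALIVE` in the server's environment to change the eviction timeout globally.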
Context window is not automatically set to maximum. By default, Ollama uses 2048 tokens of context even if the model supports 128K. If you need long context, explicitly set num_ctx in your Modelfile or pass it in the request. Note that larger context requires proportionally more memory.
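To raise the window for a single request, pass `num_ctx` in the request's `options` object (the field Ollama uses for runtime overrides). A minimal sketch of building such a request body; the helper name and the 32K value are illustrative.

```python
def chat_body(model: str, prompt: str, num_ctx: int = 32768) -> dict:
    # options.num_ctx overrides the 2048-token default for this request only;
    # memory use grows with the window, so set it no larger than needed
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }

# POST this dict as JSON to http://localhost:11434/api/chat
body = chat_body("llama3.1:8b", "Summarise this long report: ...")
print(body["options"])  # prints {'num_ctx': 32768}
```

With the Python library, the same override is `ollama.chat(..., options={"num_ctx": 32768})`.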
Quantised models have quality differences. When you run ollama pull llama3.1:8b, you get a Q4_K_M quantised version by default. This is noticeably worse on reasoning-heavy tasks than the FP16 model. For production use cases, benchmark the quantised vs full-precision versions on your specific task.
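One way to run that comparison is to pull two explicit quantisation tags and feed both the same prompt. The tag names below are illustrative; check the model's page on ollama.com/library for the variants actually published.

```python
import shutil
import subprocess

# Illustrative tags for two quantisation levels of the same base model
TAGS = ["llama3.1:8b-instruct-q4_K_M", "llama3.1:8b-instruct-q8_0"]
PROMPT = "A bat and a ball cost $1.10; the bat costs $1.00 more than the ball. What does the ball cost?"

def compare(tags: list[str] = TAGS, prompt: str = PROMPT) -> dict:
    """Run one prompt through each quantisation and collect the answers."""
    answers: dict[str, str] = {}
    if shutil.which("ollama") is None:
        return answers  # Ollama CLI not installed
    for tag in tags:
        out = subprocess.run(["ollama", "run", tag, prompt],
                             capture_output=True, text=True)
        answers[tag] = out.stdout.strip()
    return answers

for tag, answer in compare().items():
    print(f"--- {tag} ---\n{answer}\n")
```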
| Model | Parameters | VRAM (Q4) | Best For |
|---|---|---|---|
| llama3.2:1b | 1B | ~1 GB | Fast classification, simple completions |
| llama3.2:3b | 3B | ~2 GB | Lightweight chat, edge deployment |
| llama3.1:8b | 8B | ~5 GB | General purpose, fits most laptops |
| llama3.1:70b | 70B | ~40 GB | High quality, needs workstation GPU |
| mistral:7b | 7B | ~4 GB | Fast, efficient, good instruction following |
| qwen2.5-coder:7b | 7B | ~4 GB | Code generation and review |
For production deployments with Ollama, run it behind a reverse proxy (nginx or Caddy) rather than exposing port 11434 directly. This lets you add TLS, authentication headers, and rate limiting that Ollama itself does not provide. Set the OLLAMA_NUM_PARALLEL environment variable to allow concurrent requests; by default Ollama processes requests sequentially. Half the number of CPU cores is a reasonable starting point, tuned against your latency vs throughput requirements.
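A minimal nginx sketch of that setup. The server name, certificate paths, and shared secret are placeholders, and the `limit_req_zone` directive belongs in the enclosing `http` block:

```nginx
# Rate-limit by client IP: 10 requests/second, 10MB shared zone (http block)
limit_req_zone $binary_remote_addr zone=ollama:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name llm.example.internal;              # placeholder
    ssl_certificate     /etc/nginx/certs/llm.pem;  # placeholder
    ssl_certificate_key /etc/nginx/certs/llm.key;  # placeholder

    location / {
        limit_req zone=ollama burst=20;
        # Crude shared-secret check; replace with proper auth in production
        if ($http_authorization != "Bearer change-me") {
            return 401;
        }
        proxy_pass http://127.0.0.1:11434;
        proxy_read_timeout 300s;  # streamed generations can run for minutes
        proxy_buffering off;      # flush streamed tokens immediately
    }
}
```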
Ollama's Modelfile system allows you to bundle system prompts, stop tokens, and sampling parameters into a named model variant that users can call by name. This is useful for creating domain-specific assistants (a coding assistant, a customer service bot) that always use the right system prompt without requiring callers to pass it with every request.
Ollama's model format (GGUF-based, managed through Modelfiles) provides a portable, self-contained packaging system for local LLMs. A Modelfile specifies the base model, system prompt, template format, and inference parameters in a Dockerfile-like syntax. This abstraction makes it trivial to maintain multiple specialized variants of the same base model — a customer support persona, a coding assistant, a summarization agent — each as a distinct named model that can be pulled and run with a single command.
Memory requirements for local model deployment with Ollama depend primarily on the model's parameter count and quantization level. A 7B model at 4-bit quantization requires approximately 4–5GB of VRAM, fitting comfortably on consumer GPUs. Offloading layers to CPU RAM via the num_gpu parameter allows running models larger than available VRAM, but CPU inference is typically 10–20× slower per token than GPU. For interactive use, partial GPU offloading often provides a better latency/capacity tradeoff than full CPU inference.
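That sizing arithmetic can be made explicit. The overhead factor below is a rough assumption covering the KV cache and activations; real usage grows with context length.

```python
def estimate_vram_gb(params_billion: float, bits: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameter bytes times a ~20% overhead factor
    (the overhead value is an assumption, not a measurement)."""
    return params_billion * (bits / 8) * overhead

print(f"7B @ 4-bit:  ~{estimate_vram_gb(7):.1f} GB")      # ~4.2 GB
print(f"7B @ 16-bit: ~{estimate_vram_gb(7, 16):.1f} GB")  # ~16.8 GB
```

The 4-bit estimate matches the 4–5GB figure above; the FP16 estimate shows why full-precision 7B models need a workstation GPU.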
Ollama's REST API is intentionally compatible with the OpenAI API surface for chat completions, which means existing code that targets the OpenAI Python SDK can be redirected to a local Ollama instance by changing only the base URL and model name. This drop-in compatibility lowers the barrier to experimenting with local models during development and makes it straightforward to build cost-saving hybrid deployments that route cheaper queries to local models and reserve API calls for tasks requiring frontier-model capability.
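A hybrid router can be as simple as choosing a base URL per request. The model names, keywords, and length threshold below are illustrative assumptions, not benchmarked values:

```python
# Routing targets: local Ollama vs a frontier API (names illustrative)
LOCAL = {"base_url": "http://localhost:11434/v1", "model": "llama3.1:8b"}
FRONTIER = {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"}

HARD_KEYWORDS = ("prove", "derive", "security audit", "architecture review")

def route(prompt: str) -> dict:
    """Send long or demanding prompts to the API; everything else stays local."""
    hard = len(prompt) > 2000 or any(k in prompt.lower() for k in HARD_KEYWORDS)
    return FRONTIER if hard else LOCAL

print(route("Summarise this changelog.")["model"])                  # llama3.1:8b
print(route("Prove this invariant holds under retries.")["model"])  # gpt-4o
```

The returned dict plugs straight into the OpenAI SDK: `OpenAI(base_url=cfg["base_url"], api_key=...)` with `model=cfg["model"]` per request.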