Modular instruction + tool bundles loaded dynamically at runtime — extend agent capabilities without retraining. A skill packages a prompt fragment, a set of tools, and optionally example interactions into a reusable unit.
An agent's capabilities are typically defined at build time: you write the system prompt, define the tools, and deploy. To add new capabilities, you update code and redeploy.
Agent skills flip this model: a skill is a self-contained, loadable bundle of prompt instructions + tools that can be added to an agent at runtime without code changes. The agent discovers what skills are available, loads the ones relevant to the current task, and gains those capabilities on demand.
This is analogous to browser extensions, VS Code plugins, or MCP servers: a plugin architecture for agents. Examples: a "code review" skill adds a set of code analysis tools and a reviewing persona; a "customer support" skill adds tools to query order history and a support-tone system prompt; a "research" skill adds web search and citation tools.
```python
from dataclasses import dataclass, field

@dataclass
class AgentSkill:
    name: str                    # "code_review"
    description: str             # used by the agent to decide when to load
    system_prompt_fragment: str  # appended to the base system prompt
    tools: list[dict]            # tool definitions (Anthropic format)
    examples: list[dict] = field(default_factory=list)  # few-shot examples (optional)
    version: str = "1.0"
    tags: list[str] = field(default_factory=list)       # e.g. ["coding", "quality"]
```
```python
# Example: a code review skill
code_review_skill = AgentSkill(
    name="code_review",
    description="Review Python code for bugs, style, and security issues.",
    system_prompt_fragment="""When reviewing code:
- Check for security vulnerabilities (injection, auth, secrets in code)
- Verify error handling covers all edge cases
- Assess readability and adherence to PEP 8
- Look for performance issues (N+1 queries, unnecessary loops)
Always provide specific line references and improvement suggestions.""",
    tools=[
        {
            "name": "run_linter",
            "description": "Run pylint/ruff on a code snippet.",
            "input_schema": {"type": "object", "properties": {"code": {"type": "string"}}, "required": ["code"]},
        },
        {
            "name": "check_security",
            "description": "Check for common security issues using bandit.",
            "input_schema": {"type": "object", "properties": {"code": {"type": "string"}}, "required": ["code"]},
        },
    ],
    tags=["coding", "quality"],
)
```
```python
import json
from pathlib import Path

class SkillRegistry:
    def __init__(self, skills_dir: str):
        self._skills: dict[str, AgentSkill] = {}
        self._model = None  # embedding model, loaded lazily on first lookup
        self._load_from_dir(skills_dir)

    def _load_from_dir(self, directory: str):
        """Load skills from JSON files in a directory."""
        for path in Path(directory).glob("*.skill.json"):
            data = json.loads(path.read_text())
            skill = AgentSkill(**data)
            self._skills[skill.name] = skill
            print(f"Loaded skill: {skill.name} v{skill.version}")

    def register(self, skill: AgentSkill):
        self._skills[skill.name] = skill

    def get(self, name: str) -> AgentSkill | None:
        return self._skills.get(name)

    def find_by_task(self, task_description: str) -> list[AgentSkill]:
        """Use embedding similarity to find relevant skills."""
        if not self._skills:
            return []
        from sentence_transformers import SentenceTransformer
        if self._model is None:
            self._model = SentenceTransformer("all-MiniLM-L6-v2")
        # Normalize embeddings so the dot product below is cosine similarity;
        # otherwise the 0.5 threshold is meaningless.
        task_emb = self._model.encode([task_description], normalize_embeddings=True)
        skill_descs = [s.description for s in self._skills.values()]
        skill_embs = self._model.encode(skill_descs, normalize_embeddings=True)
        sims = (skill_embs @ task_emb.T).flatten()
        ranked = sorted(
            zip(self._skills.values(), sims),
            key=lambda x: x[1], reverse=True,
        )
        return [skill for skill, sim in ranked if sim > 0.5][:3]
```
```python
registry = SkillRegistry("./skills/")
registry.register(code_review_skill)
```
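On disk, `_load_from_dir` expects one `<name>.skill.json` file per skill, with keys mirroring the `AgentSkill` fields. A minimal sketch that writes such a file (the prompt text is abbreviated and illustrative):

```python
import json
from pathlib import Path

# Illustrative on-disk format for one skill; keys mirror the AgentSkill fields.
skill_data = {
    "name": "code_review",
    "description": "Review Python code for bugs, style, and security issues.",
    "system_prompt_fragment": "When reviewing code, check security, error handling, and style.",
    "tools": [],  # same Anthropic-format tool dicts as above
    "version": "1.0",
    "tags": ["coding", "quality"],
}
Path("skills").mkdir(exist_ok=True)
Path("skills/code_review.skill.json").write_text(json.dumps(skill_data, indent=2))
```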
```python
import anthropic

client = anthropic.Anthropic()
BASE_SYSTEM = "You are a helpful assistant."

def build_agent_context(task: str, registry: SkillRegistry) -> tuple[str, list[dict]]:
    """Load relevant skills and build the system prompt + tools."""
    relevant_skills = registry.find_by_task(task)
    if not relevant_skills:
        return BASE_SYSTEM, []
    # Compose the system prompt: base prompt plus one section per skill
    system_parts = [BASE_SYSTEM]
    all_tools = []
    for skill in relevant_skills:
        system_parts.append(f"\n## {skill.name.upper()} SKILL\n{skill.system_prompt_fragment}")
        all_tools.extend(skill.tools)
    return "\n".join(system_parts), all_tools

def skilled_agent(user_query: str) -> str:
    system, tools = build_agent_context(user_query, registry)
    kwargs = {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": system,
        "messages": [{"role": "user", "content": user_query}],
    }
    if tools:
        kwargs["tools"] = tools
    response = client.messages.create(**kwargs)
    # Note: if the model decides to call a tool, content[0] is a tool_use
    # block rather than text; a full agent would run a tool-use loop here.
    return response.content[0].text

# Skills are loaded automatically based on the task
print(skilled_agent(
    "Please review this Python function for security issues: "
    "def get_user(id): return db.query(f'SELECT * FROM users WHERE id={id}')"
))
# → the code_review skill is loaded for this request
```
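`skilled_agent` returns only the first content block, which breaks as soon as the model actually calls a skill tool. A minimal tool-use loop, sketched with the client and a `dispatch_tool` callback injected so the loop stays testable (`dispatch_tool` is an assumed helper that executes the named tool and returns a string result):

```python
def run_agent_loop(client, system: str, tools: list[dict], user_query: str, dispatch_tool) -> str:
    """Keep calling the model, executing tool calls, until it returns text."""
    messages = [{"role": "user", "content": user_query}]
    while True:
        kwargs = {"model": "claude-sonnet-4-5", "max_tokens": 1024,
                  "system": system, "messages": messages}
        if tools:
            kwargs["tools"] = tools
        response = client.messages.create(**kwargs)
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn, then answer every tool call with a tool_result
        messages.append({"role": "assistant", "content": response.content})
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": dispatch_tool(b.name, b.input)}
                   for b in response.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})
```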
Skills can be composed: a "research report" workflow might load the "web research" skill + "writing" skill + "citation" skill simultaneously. The agent gets the combined tools and prompt fragments from all three.
```python
def compose_skills(skill_names: list[str], registry: SkillRegistry) -> tuple[str, list[dict]]:
    """Explicitly compose named skills."""
    parts = [BASE_SYSTEM]
    tools = []
    for name in skill_names:
        skill = registry.get(name)
        if skill:
            parts.append(f"\n## {name.upper()}\n{skill.system_prompt_fragment}")
            tools.extend(skill.tools)
    return "\n".join(parts), tools

# For a complex research task, compose multiple skills
system, tools = compose_skills(
    ["web_research", "writing", "citation_formatting"],
    registry,
)
```
Skill composition requires careful prompt design to avoid conflicts — if two skills both give instructions about response format, they'll fight. Use clear section headers in the system prompt and test composed skill sets end-to-end before deploying.
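One mechanical check that catches part of this problem: scan a composed skill set for tools defined by more than one skill, before the request is ever sent. A small sketch, duck-typed over anything with `.name` and `.tools` attributes (such as `AgentSkill`):

```python
def tool_name_conflicts(skills) -> list[str]:
    """Return one warning per tool name that more than one skill defines."""
    owners: dict[str, list[str]] = {}
    for skill in skills:
        for tool in skill.tools:
            owners.setdefault(tool["name"], []).append(skill.name)
    return [f"tool '{tool}' defined by: {', '.join(names)}"
            for tool, names in owners.items() if len(names) > 1]
```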
MCP servers are the natural evolution of agent skills: they expose tools, resources, and prompts via a standard protocol. An MCP marketplace (like a skills registry) where agents can discover and load capabilities at runtime is the production-grade implementation of the skills pattern.
In Cowork mode (this application), skills are implemented exactly this way: each skill is a directory with a SKILL.md prompt file and optional tools, loaded dynamically when the agent determines it's relevant to the task. The skill registry is the /skills directory, and skill discovery is based on task description matching.
For your own agents, you can implement skills as: JSON files (simple), Python modules loaded via importlib (dynamic code), or MCP servers (most portable). MCP servers give you the best separation of concerns — the skill's tools are truly independent from the agent runtime.
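For the importlib route, a sketch of a loader; it assumes, as a convention, that each skill module exposes a module-level `SKILL` object (the attribute name is a choice here, not a standard):

```python
import importlib.util
from pathlib import Path

def load_skill_from_file(path: str):
    """Dynamically import a Python file and return its SKILL object."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module body
    return module.SKILL
```

This keeps skill code out of the agent's codebase, but unlike the JSON route it executes arbitrary Python, so only load modules from trusted sources.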
Prompt bloat degrades quality. Loading 5 skills each with 200 tokens of system prompt instructions adds 1,000 tokens to every request. Beyond about 3-4 loaded skills, the model can start to lose track of earlier instructions. Be selective: load only the skills directly relevant to the current task.
Tool name collisions break agents. If two skills both define a tool called "search", the model gets confused. Namespace skill tools: "code_review__lint" instead of "lint". Or use a skill prefix convention enforced by the registry.
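A registry can enforce the prefix convention mechanically at registration time. A sketch (the double-underscore separator is just a convention the dispatcher would split on to route calls back to the owning skill):

```python
def namespace_tools(skill_name: str, tools: list[dict]) -> list[dict]:
    """Return copies of the tool definitions, names prefixed by their skill."""
    prefixed = []
    for tool in tools:
        t = dict(tool)  # copy so the skill's own definition is untouched
        if not t["name"].startswith(f"{skill_name}__"):
            t["name"] = f"{skill_name}__{t['name']}"
        prefixed.append(t)
    return prefixed
```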
Skills need versioning. When you improve a skill's prompts or add new tools, you want to roll out the change gradually — not break all running agents simultaneously. Store skill version in the registry and let agents pin to a major version ("web_research@1").
Agent skills are modular capability units that encapsulate a specific ability — web search, code execution, API calls, file manipulation — behind a consistent interface. Designing skills well determines whether an agent can reliably compose them to solve complex tasks or gets stuck on ambiguous tool selection and incorrect parameter construction.
| Skill Type | Interface Pattern | Error Handling | Example |
|---|---|---|---|
| Retrieval | query → documents | Return empty list on miss | web_search, vector_lookup |
| Action | params → status | Raise on failure, return receipt | send_email, create_file |
| Transform | input → output | Validate schema in/out | parse_json, summarize |
| Compound | goal → result | Sub-skill error propagation | research_topic, book_meeting |
Retrieval skills should always return structured metadata alongside content: source URL, timestamp, confidence score. This allows the orchestrator to reason about result quality rather than treating all retrieved content as equally reliable. Action skills should return receipts — unique identifiers or confirmation tokens — so the agent can reference the completed action in subsequent steps without re-executing it.
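One way to make those conventions concrete is to give retrieval results and action receipts their own return types. A sketch with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    content: str
    source_url: str
    retrieved_at: str   # ISO 8601 timestamp
    confidence: float   # 0.0-1.0 relevance/reliability estimate

@dataclass
class ActionReceipt:
    action: str         # e.g. "send_email"
    receipt_id: str     # unique token the agent cites instead of re-executing
    succeeded: bool
```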
Compound skills compose simpler primitives into higher-level capabilities. A "research_topic" skill might internally invoke web_search, read_page, and summarize in sequence. Exposing compound skills to the top-level agent reduces the planning horizon required, but introduces a trade-off: the agent loses fine-grained control over intermediate steps and cannot recover gracefully if a sub-skill fails in an unexpected way.
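A sketch of that trade-off in code, with the sub-skills injected as plain callables (`web_search`, `read_page`, and `summarize` are assumed stand-ins, not real implementations). A sub-skill failure is caught and surfaced as a structured error rather than crashing the whole skill:

```python
def research_topic(topic: str, web_search, read_page, summarize) -> dict:
    """Hypothetical compound skill: search, read the top hits, summarize."""
    try:
        urls = web_search(topic)
        pages = [read_page(u) for u in urls[:3]]
        return {"status": "ok", "summary": summarize("\n\n".join(pages))}
    except Exception as exc:
        # The caller sees which step failed but cannot intervene mid-flight:
        # that is the fine-grained control the top-level agent gives up.
        return {"status": "error", "detail": f"{type(exc).__name__}: {exc}"}
```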
Skill versioning is essential for long-running agent deployments. When a skill interface changes — new required parameters, modified return schema — agents that were trained or prompted to use the old interface will fail silently or produce incorrect results. Semantic versioning for skill APIs, with backward compatibility guarantees within major versions, allows agents to safely call skills without needing to be retrained every time the underlying implementation is updated.
Skill discovery mechanisms allow agents to dynamically learn about available capabilities rather than having a fixed set of tools hardcoded at design time. MCP (Model Context Protocol) and OpenAI function calling both support listing available tools at runtime. Dynamic skill discovery enables agent architectures where new capabilities can be added to the tool registry without modifying the agent itself, and allows the agent to gracefully degrade when certain skills are temporarily unavailable.
Testing agent skills in isolation, before integrating them into a full agent loop, dramatically accelerates development. Unit tests for individual skills verify that correct inputs produce correct outputs and that invalid inputs are rejected gracefully. Integration tests verify that the agent correctly selects and parameterizes skills given realistic natural language inputs, catching schema mismatch errors before they surface in production conversations.
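A minimal sketch of the unit-test layer, using a stand-in `run_linter` implementation (the real one would shell out to ruff or pylint):

```python
def run_linter(code: str) -> dict:
    """Stand-in skill implementation with explicit input validation."""
    if not isinstance(code, str) or not code.strip():
        return {"status": "error", "detail": "code must be a non-empty string"}
    return {"status": "ok", "issues": []}

# pytest-style unit tests for the skill in isolation
def test_run_linter_rejects_empty_input():
    assert run_linter("")["status"] == "error"

def test_run_linter_accepts_valid_code():
    assert run_linter("x = 1")["status"] == "ok"
```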