LangChain, LangGraph, CrewAI, AutoGen, and when to build your own — compared
An agent framework gives you tool dispatch, memory management, multi-step planning, and observability out of the box, so you don't have to write them yourself. Rather than stringing together 500 lines of glue code around raw OpenAI SDK calls, a framework provides the scaffolding.
A framework typically provides four layers:

1. LLM interface layer: abstracts the specific model (Claude, GPT-4, Llama). Sends structured prompts, parses tool calls, handles errors.
2. Tool registry: defines which tools the agent has access to, their schemas, and how to execute them.
3. Memory/state store: persists conversation history, internal state, and past decisions.
4. Orchestration loop: handles the agentic loop itself (reason, act, observe, update state, repeat).
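As a hedged sketch, those four pieces can be miniaturized in plain Python. The `fake_llm` stub and `TOOLS` registry below are illustrative assumptions, not any framework's real API:

```python
from typing import Callable

# (1) LLM interface layer: stubbed with a canned script here; a real
# implementation would call Claude / GPT-4 / Llama and parse tool calls.
def fake_llm(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}  # model decides to act
    return {"final": f"The sum is {messages[-1]['content']}"}  # model answers

# (2) Tool registry: name -> callable (a real one would also hold schemas)
TOOLS: dict[str, Callable] = {"add": lambda a, b: a + b}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    # (3) Memory/state store: the message list persists across steps
    messages = [{"role": "user", "content": user_msg}]
    # (4) Orchestration loop: reason, act, observe, update state, repeat
    for _ in range(max_steps):
        decision = fake_llm(messages)
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("step budget exhausted")
```

Calling `run_agent("what is 2 + 3?")` walks one reason-act-observe cycle and returns `"The sum is 5"`; the 500 lines of glue code a framework saves you are mostly hardened versions of these four pieces.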
Build vs buy question: frameworks accelerate prototyping but add abstraction overhead. Many teams start with a framework, then rewrite the hot path in plain Python when they hit production constraints.
Here's a full comparison of the major frameworks used in production:
| Framework | Paradigm | State handling | Multi-agent | Best for |
|---|---|---|---|---|
| LangChain | DAG / chain | Stateless by default | Via agents | Quick prototypes, chains |
| LangGraph | Stateful graph | First-class state | Yes (native) | Complex workflows, human-in-loop |
| CrewAI | Role-based agents | Per-agent memory | Yes (crews) | Collaborative agent teams |
| AutoGen | Conversational | Conversation history | Yes (group chat) | Research, multi-agent chat |
| Pydantic AI | Type-safe agents | Structured I/O | Limited | Production, typed pipelines |
| LlamaIndex | Data-centric | Index + query | Via workflows | RAG-heavy, document QA |
| Plain Python + SDK | Manual | Whatever you build | Whatever you build | Full control, high scale |
LangGraph models agents as stateful graphs: nodes are actions or decisions, edges are transitions, state is a typed dict persisted across steps. This makes it powerful for workflows that need to pause, inspect, and resume.
- StateGraph: define your state schema as a TypedDict.
- Nodes: Python functions that take state and return modified state.
- Edges: deterministic or conditional transitions between nodes.
- Checkpointers: persist state to Postgres, Redis, memory, or a custom backend.
- Human-in-the-loop: use interrupt_before and interrupt_after to pause execution for human review.
CrewAI models each agent as a role with a goal, backstory, tools, and memory. A Crew orchestrates agents sequentially or hierarchically; a manager agent can delegate to specialized agents. Best for content pipelines, research tasks, and multi-perspective synthesis.
Each agent has a clear persona and responsibility. This works exceptionally well when different "experts" are conceptually clean — a researcher agent, an analyst agent, a writer agent. The abstraction maps cleanly to product workflows and makes prompting more intuitive.
CrewAI is particularly strong when you need sequential task dependencies and role-based decomposition. The trade-off is less fine-grained control over state compared to LangGraph.
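The role-based decomposition can be sketched without the framework; the `Agent` dataclass and `run_crew` below are illustrative stand-ins, not CrewAI's API:

```python
from dataclasses import dataclass, field

# Each agent is a role with a goal and its own memory; a crew runs them
# sequentially, passing each agent's output to the next as its task.
@dataclass
class Agent:
    role: str
    goal: str
    memory: list[str] = field(default_factory=list)

    def work(self, task: str) -> str:
        # A real agent would prompt an LLM with role, goal, backstory,
        # tools, and memory; here the role just tags its contribution.
        output = f"[{self.role}] {task}"
        self.memory.append(output)   # per-agent memory, as in the table above
        return output

def run_crew(agents: list[Agent], task: str) -> str:
    for agent in agents:
        task = agent.work(task)      # sequential hand-off between roles
    return task

crew = [Agent("researcher", "gather facts"),
        Agent("analyst", "find patterns"),
        Agent("writer", "draft the report")]
```

Running `run_crew(crew, "q3 sales")` hands the task from researcher to analyst to writer; a hierarchical crew would add a manager agent that chooses which specialist to delegate to instead of a fixed order.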
AutoGen treats agents as conversational participants: each agent has a system prompt, can send and receive messages, and can be a human-proxy or an LLM-backed assistant. GroupChat enables round-robin or managed multi-agent conversations with code execution loops.
AssistantAgent can generate code; UserProxyAgent executes it locally and returns feedback. This creates a tight loop: the LLM generates code, the proxy executes it and reports results, and the LLM refines. Exceptionally effective for code-generation and problem-solving tasks where you can test ideas immediately.
| Aspect | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Mental model | State machine | Role-based crew | Chat participants |
| State | Typed dict, persistent | Per-agent memory | Conversation history |
| Control flow | Graph edges | Sequential/hierarchical | Conversation-driven |
| Human in loop | Yes (interrupt) | Limited | Yes (UserProxyAgent) |
| Best fit | Production workflows | Content/research tasks | Code generation, research |
The right choice depends on your task shape, team maturity, and scale constraints:

- CrewAI: the role-based mental model maps cleanly to product workflows. Researcher, analyst, writer, each with its own backstory and tools, is a natural decomposition for many tasks.
- Plain Python + async + OpenAI SDK: frameworks add latency and debugging friction. Write the abstraction you need, not a general one. Most successful production agents eventually do this.
Multi-step agents are hard to debug. Which step failed? Which LLM call was slow? What context caused the wrong decision? Tracing becomes critical in production.
- Every LLM call: input tokens, output tokens, latency, model used.
- Every tool call: tool name, arguments, result, execution duration.
- Every state transition: node entry, node exit, state delta.
- Errors: exception type, stack trace, recovery action.
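One way to capture those fields is a span-style context manager; the sketch below is a hypothetical minimal tracer, not any specific observability library's API:

```python
import time
from contextlib import contextmanager

# Every span records a name, duration, attributes (tokens, tool args,
# state deltas), and the exception type if the wrapped step failed.
TRACE: list[dict] = []

@contextmanager
def span(name: str, **attrs):
    start = time.perf_counter()
    record = {"name": name, **attrs}
    try:
        yield record                          # callers attach results later
        record["error"] = None
    except Exception as e:
        record["error"] = type(e).__name__    # capture exception type
        raise                                 # recovery happens upstream
    finally:
        record["ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE.append(record)                  # flush to a real backend here

# Usage: wrap each LLM call, tool call, and state transition.
with span("llm_call", model="gpt-4", input_tokens=812) as rec:
    rec["output_tokens"] = 64                 # filled from the model response
with span("tool_call", tool="search", args={"q": "latency"}) as rec:
    rec["result_size"] = 3
```

Because the span appends in `finally`, failed steps still leave a record with their exception type, so "which step failed and which call was slow" becomes a query over `TRACE` rather than a debugging session.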