Agent Infrastructure

LLM Agent Frameworks

LangChain, LangGraph, CrewAI, AutoGen, and when to build your own — compared

tool → memory → plan → act: the agent loop
orchestrate vs. code: the key choice
5 major frameworks in active use
Contents
  1. What makes a framework
  2. Landscape & comparison
  3. LangGraph deep dive
  4. CrewAI multi-role
  5. AutoGen conversational
  6. Decision guide
  7. Observability & tracing
01 — Foundation

What Makes a Framework?

An agent framework gives you tool dispatch, memory management, multi-step planning, and observability — instead of writing it yourself. Rather than stringing together 500 lines of glue code with OpenAI SDK calls, a framework provides the scaffolding.

Four Layers Every Agent Needs

(1) LLM interface layer: abstracts the specific model (Claude, GPT-4, Llama); sends structured prompts, parses tool calls, handles errors.
(2) Tool registry: defines which tools the agent can access, their schemas, and how to execute them.
(3) Memory/state store: persists conversation history, internal state, and past decisions.
(4) Orchestration loop: runs the agentic loop itself: reason, act, observe, update state, repeat.
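The four layers can be sketched in a few lines of pure Python. This is an illustrative toy, not any framework's API: `fake_llm` is a stub standing in for the LLM interface layer, and the tool registry is just a dict.

```python
# Minimal sketch of the four layers, in plain Python with a stubbed model.
# All names here are illustrative, not from any specific framework.

def fake_llm(messages):
    """(1) LLM interface layer: stand-in for a real model call."""
    last = messages[-1]["content"]
    if "2+2" in last:
        # "decide" to call a tool
        return {"tool": "calculator", "args": {"expr": "2+2"}}
    return {"final": last.upper()}  # no tool needed: answer directly

# (2) Tool registry: tool name -> callable
TOOLS = {"calculator": lambda args: eval(args["expr"], {"__builtins__": {}})}

def run_agent(user_input, max_steps=5):
    memory = [{"role": "user", "content": user_input}]  # (3) memory/state store
    for _ in range(max_steps):                          # (4) orchestration loop
        decision = fake_llm(memory)                     # reason
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](decision["args"])        # act
        memory.append({"role": "tool", "content": str(result)})   # observe
    raise RuntimeError("agent did not finish")
```

Every framework below is, at its core, a more robust version of this loop with persistence, retries, and observability bolted on.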

Build vs buy question: frameworks accelerate prototyping but add abstraction overhead. Many teams start with a framework, then rewrite the hot path in plain Python when they hit production constraints.

⚠️ Every framework has an "easy path" and a "production path". They are not the same. Read production case studies before committing. The tutorial examples often hide the complexity of scaling, latency debugging, and cost optimization.
02 — Landscape

Framework Comparison

Here's a full comparison of the major frameworks used in production:

| Framework | Paradigm | State handling | Multi-agent | Best for |
|---|---|---|---|---|
| LangChain | DAG / chain | Stateless by default | Via agents | Quick prototypes, chains |
| LangGraph | Stateful graph | First-class state | Yes (native) | Complex workflows, human-in-loop |
| CrewAI | Role-based agents | Per-agent memory | Yes (crews) | Collaborative agent teams |
| AutoGen | Conversational | Conversation history | Yes (group chat) | Research, multi-agent chat |
| Pydantic AI | Type-safe agents | Structured I/O | Limited | Production, typed pipelines |
| LlamaIndex | Data-centric | Index + query | Via workflows | RAG-heavy, document QA |
| Plain Python + SDK | Manual | Whatever you build | Whatever you build | Full control, high scale |
03 — Deep Dive

LangGraph

LangGraph models agents as stateful graphs: nodes are actions or decisions, edges are transitions, state is a typed dict persisted across steps. This makes it powerful for workflows that need to pause, inspect, and resume.

Key Concepts

StateGraph: define your state schema as a TypedDict.
Nodes: Python functions that take state and return modified state.
Edges: deterministic or conditional transitions between nodes.
Checkpointers: persist state to Postgres, Redis, memory, or a custom backend.
Human-in-the-loop: use interrupt_before and interrupt_after to pause execution for human review.

Minimal LangGraph Agent

from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator

# `llm`, `has_tool_calls`, and `call_tools` are assumed to be defined elsewhere.

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # appended, not overwritten
    tool_results: list

def call_llm(state: AgentState) -> AgentState:
    # call the LLM with the accumulated messages
    response = llm.invoke(state['messages'])
    return {"messages": [response]}

def should_continue(state: AgentState) -> str:
    # route to the tools node if the last message requested a tool call
    if has_tool_calls(state['messages'][-1]):
        return "tools"
    return END

graph = StateGraph(AgentState)
graph.add_node("llm", call_llm)
graph.add_node("tools", call_tools)
graph.add_conditional_edges("llm", should_continue)
graph.add_edge("tools", "llm")  # loop back to the LLM after executing tools
graph.set_entry_point("llm")
app = graph.compile()
LangGraph's persistence layer is critical for production agents. You can pause mid-execution, route to a human reviewer, and resume from exactly where you left off — essential for systems that touch real data or high-stakes decisions.
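The pause/resume mechanics can be illustrated without LangGraph at all. The toy checkpointer below is a hand-rolled sketch of the idea (snapshot state after each step, interrupt before a named step, resume from the last snapshot); it is not LangGraph's actual API.

```python
import copy

class MemoryCheckpointer:
    """Toy checkpointer: snapshot state after every completed step."""
    def __init__(self):
        self.store = {}

    def save(self, thread_id, step, state):
        self.store[thread_id] = (step, copy.deepcopy(state))

    def load(self, thread_id):
        return self.store.get(thread_id, (0, None))

def run(steps, state, cp, thread_id, interrupt_before=None):
    """Run named steps in order, checkpointing after each one."""
    start, saved = cp.load(thread_id)
    if saved is not None:
        state = saved  # resume from the last checkpoint
    for i in range(start, len(steps)):
        name, fn = steps[i]
        if name == interrupt_before:
            return "paused", state  # hand off to a human reviewer
        state = fn(state)
        cp.save(thread_id, i + 1, state)
    return "done", state

# Pause before "publish" for review, then resume from the checkpoint.
steps = [("draft", lambda s: s + ["drafted"]),
         ("publish", lambda s: s + ["published"])]
cp = MemoryCheckpointer()
status, state = run(steps, [], cp, "t1", interrupt_before="publish")  # paused
status, state = run(steps, [], cp, "t1")  # human approved: resumes at step 2
```

LangGraph's checkpointers do the same thing per thread ID, with durable backends instead of an in-process dict.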
04 — Deep Dive

CrewAI and Multi-Role Agents

CrewAI models each agent as a role with a goal, backstory, tools, and memory. A Crew orchestrates agents sequentially or hierarchically; a manager agent can delegate to specialized agents. Best for content pipelines, research tasks, and multi-perspective synthesis.

Role-Based Mental Model

Each agent has a clear persona and responsibility. This works exceptionally well when different "experts" are conceptually clean — a researcher agent, an analyst agent, a writer agent. The abstraction maps cleanly to product workflows and makes prompting more intuitive.

3-Agent Research Crew Example

from crewai import Agent, Task, Crew

# search_tool is assumed to be a tool instance defined elsewhere
researcher = Agent(
    role='Senior Researcher',
    goal='Find latest AI papers',
    backstory='Expert at finding academic sources',
    tools=[search_tool],
)
analyst = Agent(
    role='Data Analyst',
    goal='Synthesize findings',
    backstory='Turns raw research into actionable insights',
)
writer = Agent(
    role='Technical Writer',
    goal='Write clear summary',
    backstory='Makes complex topics accessible',
)

tasks = [
    Task(description='Research LLM benchmarks', agent=researcher),
    Task(description='Analyze trends', agent=analyst),
    Task(description='Write 500-word brief', agent=writer),
]

crew = Crew(agents=[researcher, analyst, writer], tasks=tasks)
result = crew.kickoff()

CrewAI is particularly strong when you need sequential task dependencies and role-based decomposition. The trade-off is less fine-grained control over state compared to LangGraph.

05 — Deep Dive

AutoGen Conversational Agents

AutoGen treats agents as conversational participants: each agent has a system prompt, can send and receive messages, and can be a human-proxy or an LLM-backed assistant. GroupChat enables round-robin or managed multi-agent conversations with code execution loops.

Code Execution Loop

AssistantAgent can generate code; UserProxyAgent executes it locally and returns feedback. This creates a tight loop: the LLM generates code, the proxy executes it, the results are reported back, and the LLM refines. Exceptionally effective for code-generation and problem-solving tasks where ideas can be tested immediately.

AutoGen Code-Writing Loop

import autogen

assistant = autogen.AssistantAgent(
    "assistant",
    llm_config={"model": "gpt-4o", "api_key": "..."},
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    code_execution_config={"work_dir": "coding", "use_docker": False},
)
user_proxy.initiate_chat(
    assistant,
    message="Write and test a Python function to compute Fibonacci numbers",
)

Comparison: LangGraph vs CrewAI vs AutoGen

| Aspect | LangGraph | CrewAI | AutoGen |
|---|---|---|---|
| Mental model | State machine | Role-based crew | Chat participants |
| State | Typed dict, persistent | Per-agent memory | Conversation history |
| Control flow | Graph edges | Sequential/hierarchical | Conversation-driven |
| Human in loop | Yes (interrupt) | Limited | Yes (UserProxyAgent) |
| Best fit | Production workflows | Content/research tasks | Code generation, research |
06 — Decision Guide

Which Framework?

The right choice depends on your task shape, team maturity, and scale constraints:

1. Starting a prototype — fastest path

Use LangChain for simple chains; LangGraph the moment you need state or branching. LangChain is easiest to onboard with; LangGraph becomes necessary as soon as your workflow requires conditional logic or pause/resume.

2. Collaboration between specialized agents — role-based

CrewAI. The role-based mental model maps cleanly to product workflows. Researcher, analyst, writer — each with their own backstory and tools — is a natural decomposition for many tasks.

3. Code-writing or research agents — execution feedback

AutoGen. Conversational loop + code execution is uniquely good here. The agent proposes, you test, it refines — tight feedback loop beats iterative planning for these domains.

4. High-scale production — full control

Plain Python + async + OpenAI SDK. Frameworks add latency and debugging friction. Write the abstraction you need, not a general one. Most successful production agents eventually do this.
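To make the trade-off concrete, here is a sketch of such a hand-rolled loop. `call_model` is a stub standing in for a real SDK call (e.g. a chat-completions request with tool definitions), so the control flow is visible; production versions are typically async and add retries, timeouts, and budget caps.

```python
# Hand-rolled agent loop: the abstraction you need, and nothing else.

def call_model(messages, tools):
    # Stub for a real SDK call, e.g.:
    #   client.chat.completions.create(model=..., messages=..., tools=...)
    # Here: request one tool call, then answer once a tool result exists.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search",
                              "arguments": {"q": messages[0]["content"]}}}
    return {"content": "answer based on: " + messages[-1]["content"]}

def agent_loop(question, tools, max_turns=8):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = call_model(messages, tools)
        if "tool_call" not in reply:
            return reply["content"]  # model answered directly
        call = reply["tool_call"]
        result = tools[call["name"]](**call["arguments"])  # dispatch
        messages.append({"role": "tool", "content": str(result)})
    raise TimeoutError("exceeded max_turns")

tools = {"search": lambda q: f"results for {q!r}"}
```

The whole loop fits on one screen, which is precisely why debugging and profiling it is easier than tracing through framework internals.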

⚠️ Framework churn is real. AutoGen 0.4 completely broke the AutoGen 0.2 API. LangChain rewrote its expression language twice. Pin versions, test upgrades in staging, and maintain a fallback plan to rewrite in plain Python if the framework becomes a constraint.
07 — Observability

Tracing and Debugging Agents

Multi-step agents are hard to debug. Which step failed? Which LLM call was slow? What context caused the wrong decision? Tracing becomes critical in production.

What to Trace

Every LLM call: input tokens, output tokens, latency, model used.
Every tool call: tool name, arguments, result, execution duration.
Every state transition: node entry, node exit, state delta.
Errors: exception type, stack trace, recovery action.
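As a minimal illustration, a decorator can capture these events as structured records. This is a hand-rolled sketch: real deployments export spans to a tracing backend (LangSmith, Langfuse, etc.) rather than appending to a global list.

```python
import functools
import time

TRACE = []  # in a real system: an exporter/span processor, not a global list

def traced(kind):
    """Record name, args, success, and duration for every call of `fn`."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                TRACE.append({"kind": kind, "name": fn.__name__,
                              "args": args, "ok": True,
                              "ms": (time.perf_counter() - t0) * 1000})
                return result
            except Exception as e:
                TRACE.append({"kind": kind, "name": fn.__name__,
                              "args": args, "ok": False, "error": repr(e),
                              "ms": (time.perf_counter() - t0) * 1000})
                raise  # record the failure, then propagate it
        return inner
    return wrap

@traced("tool")
def lookup(city):
    # stand-in for a real tool call
    return {"city": city, "temp_c": 21}
```

The same decorator applied to LLM-call and node-transition functions yields exactly the event stream described above: one record per call, tagged with kind, outcome, and latency.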

Recommended Tools

⚠️ Always log the full state at each node, not just the final output. You cannot debug an agent from its last message alone. State transitions and intermediate results are where bugs live.

Observability Tools

| Category | Tool | Notes |
|---|---|---|
| Framework | LangGraph | Native checkpointing and state persistence for debugging |
| Framework | LangChain | Chain-level tracing and run management |
| Framework | CrewAI | Task-level logging and agent output tracking |
| Framework | AutoGen | Conversation history and code execution logs |
| Tracing | LangSmith | Native LangGraph integration, evaluation, datasets |
| Tracing | Langfuse | Open-source, provider-agnostic, SDKs for all frameworks |
| Tracing | Arize Phoenix | Open-source, optimized for large trace volumes |
| Tracing | W&B Traces | Lightweight integration with Weights & Biases |