An agent architecture that separates planning from execution — a planner LLM creates a multi-step plan, then an executor carries out each step, with optional replanning based on results.
In ReAct, the same model does everything: it decides what to do, does it, evaluates the result, and decides the next step — all interleaved. For short tasks (2–3 steps) this works well. For longer tasks (10+ steps), it breaks down: the model gets distracted by intermediate results and loses sight of the original goal, and the growing context window makes each step more expensive and slower.
Plan-and-Execute separates concerns: a planner model sees the task and produces a complete step-by-step plan upfront (high-level, goal-oriented thinking). An executor model then carries out each step independently (local, focused thinking). This mirrors how humans work: a manager defines the project plan; individual contributors execute tasks without needing the full strategic context.
User query
     │
     ▼
┌─────────────┐
│   Planner   │ → Creates ordered list of steps
│    (LLM)    │   e.g.: ["Search for X", "Summarise results",
└─────────────┘   "Format as report", "Email to team"]
     │
     ▼ step list
┌─────────────┐
│  Executor   │ → Executes each step with tools
│    (LLM)    │ → Accumulates results in shared state
└─────────────┘
     │
     ├── step 1 result → [optional: Replanner checks if plan still valid]
     ├── step 2 result
     ├── ...
     └── final result → User
The executor uses a compact context: just the current step + accumulated results, not the full history. This keeps each execution cheap and focused.
import anthropic
import json
client = anthropic.Anthropic()
def create_plan(task: str) -> list[str]:
    '''Ask the planner LLM to decompose the task into concrete steps.'''
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": f'''You are a task planner. Break down the following task
into a numbered list of concrete, executable steps. Each step should be a single,
independent action. Be specific about what information is needed at each step.

Task: {task}

Return ONLY a JSON array of step strings, e.g.:
["Search for recent sales data", "Calculate total revenue", "Create a summary"]'''}],
    )
    text = response.content[0].text.strip()
    # Extract the JSON array from the response, ignoring any surrounding prose
    start = text.find('[')
    end = text.rfind(']') + 1
    return json.loads(text[start:end])
# Test
task = "Research the top 3 Python web frameworks, compare their GitHub stars, and write a recommendation."
plan = create_plan(task)
for i, step in enumerate(plan, 1):
    print(f"Step {i}: {step}")
# Step 1: Search for Flask GitHub repository and get star count
# Step 2: Search for Django GitHub repository and get star count
# Step 3: Search for FastAPI GitHub repository and get star count
# Step 4: Compare the three frameworks based on stars, use cases, and community
# Step 5: Write a concise recommendation paragraph
import anthropic
client = anthropic.Anthropic()
def execute_step(step: str, context: str, tools: list) -> str:
    '''Execute a single plan step, given accumulated context.'''
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # cheaper model for execution
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": f'''Previous results:
{context if context else "None yet."}

Current step to execute: {step}

Use tools as needed to complete this step. Return a concise result.'''}],
    )
    # Handle tool calls recursively (simplified)
    if response.stop_reason == "tool_use":
        # ... handle tool calls as shown in Tool Use page ...
        pass
    return response.content[0].text if response.content else ""
def run_plan_execute(task: str, tools: list) -> str:
    # Phase 1: Plan
    plan = create_plan(task)
    print(f"Plan created: {len(plan)} steps")

    # Phase 2: Execute
    context = ""
    results = []
    for i, step in enumerate(plan, 1):
        print(f"\nExecuting step {i}/{len(plan)}: {step}")
        result = execute_step(step, context, tools)
        results.append(f"Step {i} ({step}): {result}")
        context = "\n".join(results)  # accumulate results as context
        print(f"Result: {result[:100]}...")
    return context  # final accumulated result
final = run_plan_execute(
    "Compare Flask and FastAPI GitHub stars and write a recommendation.",
    tools=[]  # add your real tools here
)
print("\nFinal result:")
print(final)
def replan_if_needed(original_plan: list[str], completed_steps: list[str],
                     last_result: str, remaining_steps: list[str]) -> list[str]:
    '''Check if the plan needs updating based on new information.'''
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": f'''
Original plan: {original_plan}
Completed steps: {completed_steps}
Last result: {last_result}
Remaining steps: {remaining_steps}

Does the last result reveal that the remaining steps need to be adjusted?
If yes, return a JSON array with the updated remaining steps.
If no, return the remaining steps unchanged as a JSON array.'''}],
    )
    text = response.content[0].text.strip()
    start, end = text.find('['), text.rfind(']') + 1
    return json.loads(text[start:end])
# Use in the executor loop after each step result:
# remaining = replan_if_needed(plan, completed, result, remaining)
Replanning is optional but valuable for tasks where intermediate results reveal the original plan was wrong (e.g., a search returns no results, requiring a different search strategy).
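The replanning check slots into the executor loop between steps. Here is a minimal sketch of that loop, with execution and replanning injected as callables so the control flow is testable independently of any LLM client (`run_with_replanning` and its parameter names are illustrative, not from the code above):

```python
from typing import Callable

def run_with_replanning(plan: list[str],
                        execute_fn: Callable[[str, str], str],
                        replan_fn: Callable[[list[str], list[str], str, list[str]], list[str]]) -> str:
    """Execute a plan step by step, letting replan_fn rewrite the tail after each result."""
    completed: list[str] = []
    results: list[str] = []
    remaining = list(plan)
    while remaining:
        step = remaining.pop(0)
        result = execute_fn(step, "\n".join(results))
        completed.append(step)
        results.append(f"{step}: {result}")
        # Give the replanner a chance to rewrite the remaining steps
        remaining = replan_fn(plan, completed, result, remaining)
    return "\n".join(results)
```

In the full system, `execute_fn` would wrap `execute_step` and `replan_fn` would wrap `replan_if_needed`.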
from langchain_experimental.plan_and_execute import (
PlanAndExecute, load_agent_executor, load_chat_planner
)
from langchain_anthropic import ChatAnthropic
from langchain.tools import tool
@tool
def web_search(query: str) -> str:
    '''Search the web for current information.'''
    return f"Search results for: {query}"

@tool
def calculator(expression: str) -> str:
    '''Evaluate arithmetic.'''
    return str(eval(expression, {"__builtins__": {}}, {}))
tools = [web_search, calculator]
# Planner uses a more capable model; executor uses a faster one
planner_llm = ChatAnthropic(model="claude-3-5-sonnet-20241022")
executor_llm = ChatAnthropic(model="claude-3-5-haiku-20241022")
planner = load_chat_planner(planner_llm)
executor = load_agent_executor(executor_llm, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
result = agent.run("What is the square root of the number of days in a non-leap year?")
print(result)
Plans can be over-optimistic. The planner generates a plan without knowing what the executor will actually find. Step 3 might depend on data that Step 2 discovers doesn't exist. Replanning (or at minimum, having the executor handle "step not achievable" gracefully) is essential for robustness.
Context accumulation blows up. If you concatenate all step results into the executor's context, by step 10 you're sending thousands of tokens for background that may be irrelevant. Use a summariser after every 3–4 steps to compress history.
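One way to bound the context is to keep only the last few step results verbatim and fold everything older into a running summary. A sketch, with the summariser injected as a callable (in practice it would be a cheap LLM call; `compress_context` is an illustrative name):

```python
from typing import Callable

def compress_context(results: list[str],
                     summarise: Callable[[str], str],
                     keep_last: int = 3) -> str:
    """Summarise all but the most recent results; keep recent ones verbatim."""
    if len(results) <= keep_last:
        return "\n".join(results)
    older, recent = results[:-keep_last], results[-keep_last:]
    summary = summarise("\n".join(older))
    return f"Summary of earlier steps:\n{summary}\n\nRecent steps:\n" + "\n".join(recent)
```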
Planner and executor need aligned vocabularies. If the planner says "Search GitHub API" but the executor only has a generic web_search tool, the step may succeed but produce the wrong format. Either make the executor tools match the planner's vocabulary or give the planner a tool manifest to plan against.
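A lightweight way to align vocabularies is to render the executor's tool schemas into a manifest that gets prepended to the planner prompt, so every planned step maps to a tool that actually exists. A sketch (`build_tool_manifest` is an illustrative helper; the tool dicts follow the Anthropic tools schema):

```python
def build_tool_manifest(tools: list[dict]) -> str:
    """Render tool names and descriptions so the planner only plans achievable steps."""
    lines = ["Available tools (plan ONLY with these):"]
    for t in tools:
        params = ", ".join(t.get("input_schema", {}).get("properties", {}))
        lines.append(f"- {t['name']}({params}): {t['description']}")
    return "\n".join(lines)

# Prepend the manifest to the planner prompt
tools = [{
    "name": "web_search",
    "description": "Search the web for current information.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}]
manifest = build_tool_manifest(tools)
```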
Use a cheap model for execution. Each step is a small, focused task. Using Claude Sonnet for execution (when Haiku would suffice) wastes money. Reserve the expensive model for planning, where quality matters most.
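The savings are easy to estimate back-of-the-envelope. A sketch of the arithmetic (the per-million-token prices below are assumptions for illustration only; check current pricing before relying on them):

```python
# Illustrative per-million-token prices (ASSUMED, not authoritative)
PRICES = {"sonnet": {"in": 3.00, "out": 15.00}, "haiku": {"in": 0.80, "out": 4.00}}

def run_cost(model: str, steps: int, in_tokens: int, out_tokens: int) -> float:
    """Rough dollar cost of executing `steps` steps, each with the given token counts."""
    p = PRICES[model]
    return steps * (in_tokens * p["in"] + out_tokens * p["out"]) / 1_000_000
```

For a 10-step plan at ~2,000 input and ~500 output tokens per step, the ratio between the two models matters more than the absolute numbers.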
| Architecture | Planning | Execution | Best For | Weakness |
|---|---|---|---|---|
| Plan-and-Execute | Upfront full plan | Sequential step execution | Well-defined multi-step tasks | Poor plan propagates; no mid-plan adaptation |
| ReAct | One step at a time | Interleaved with reasoning | Exploratory, unknown state | Reasoning drift over many steps |
| Hierarchical (planner + executor) | High-level subgoals | Executor sub-plans each subgoal | Complex long-horizon tasks | Coordination overhead |
| Reflection loop | Plan, reflect, replan | Execute revised plan | Quality-critical tasks | High token cost |
The key engineering choice in plan-and-execute is how to represent the plan. A flat ordered list works for simple sequential tasks. For tasks with dependencies and parallelism opportunities, a directed acyclic graph (DAG) representation is superior: it allows the executor to identify independent steps and run them concurrently. Represent the plan as a JSON object with steps, dependencies, and status fields so it can be serialised for checkpointing and resumed after failures without re-running completed steps.
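A minimal sketch of that DAG representation, with a helper that returns the steps whose dependencies are all complete and which can therefore run concurrently (the `steps`/`depends_on`/`status` field names are one plausible schema, not a standard; the whole structure is plain JSON, so it serialises for checkpointing):

```python
plan = {
    "steps": {
        "s1": {"action": "Search for Flask stars",   "depends_on": [],           "status": "done"},
        "s2": {"action": "Search for FastAPI stars", "depends_on": [],           "status": "pending"},
        "s3": {"action": "Compare the frameworks",   "depends_on": ["s1", "s2"], "status": "pending"},
    }
}

def runnable_steps(plan: dict) -> list[str]:
    """Steps that are pending and whose dependencies have all completed."""
    steps = plan["steps"]
    return [sid for sid, s in steps.items()
            if s["status"] == "pending"
            and all(steps[d]["status"] == "done" for d in s["depends_on"])]
```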
Invest in plan quality evaluation as a separate metric from task completion quality. Have an LLM judge rate generated plans on three dimensions: completeness (does the plan cover all necessary steps?), feasibility (can each step realistically be executed with available tools?), and efficiency (does the plan avoid redundant steps?). Plans that score below threshold on any dimension should trigger a replan before execution begins, not after a costly failed execution run.
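The judging step can be one extra LLM call that returns per-dimension scores, plus a gate that triggers a replan before execution starts. A sketch of the gate and judge prompt (the JSON score format and the threshold of 3 are assumptions, not a standard):

```python
import json

# Hypothetical judge prompt; {task} and {plan} are filled in with str.format
JUDGE_PROMPT = """Rate this plan for the task on a 1-5 scale per dimension.
Task: {task}
Plan: {plan}
Return ONLY JSON: {{"completeness": n, "feasibility": n, "efficiency": n}}"""

def plan_passes(judge_response: str, threshold: int = 3) -> bool:
    """Parse the judge's JSON scores; fail if any dimension is below threshold."""
    scores = json.loads(judge_response)
    return all(scores[d] >= threshold
               for d in ("completeness", "feasibility", "efficiency"))
```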
The plan-and-execute pattern separates reasoning from action, which provides a key debugging advantage: you can inspect the plan before execution begins and catch logical errors early. In multi-step workflows where errors compound — an incorrect intermediate result corrupts all downstream steps — front-loading the planning phase and validating the plan structure before any tools are called dramatically reduces wasted API calls and time.
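Structural problems can be caught mechanically before the first tool call. A sketch of a validator for the dependency-based plan format described earlier, rejecting unknown dependencies and cycles via Kahn's algorithm (`validate_plan` and the `depends_on` field are illustrative):

```python
def validate_plan(steps: dict[str, dict]) -> list[str]:
    """Return a list of structural problems; an empty list means the plan is safe to start."""
    problems = []
    for sid, step in steps.items():
        for dep in step.get("depends_on", []):
            if dep not in steps:
                problems.append(f"{sid} depends on unknown step {dep}")
    # Cycle check: repeatedly remove steps whose remaining deps are all gone (Kahn's algorithm)
    remaining = dict(steps)
    while remaining:
        free = [sid for sid, s in remaining.items()
                if all(d not in remaining for d in s.get("depends_on", []))]
        if not free:
            problems.append(f"dependency cycle among: {sorted(remaining)}")
            break
        for sid in free:
            del remaining[sid]
    return problems
```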
Dynamic re-planning is the critical enhancement that makes plan-and-execute practical for real-world tasks. A static plan generated upfront cannot anticipate every possible outcome of intermediate steps. When an execution step returns an unexpected result — a tool call fails, an API returns empty data, or a sub-task turns out to be more complex than anticipated — the agent must update the remaining plan to account for the new information. This feedback loop between execution results and plan revision is what distinguishes robust agents from brittle ones.