Quickstart: CrewAI Durable Workflows
Run CrewAI agents with durable execution using Dapr Workflows. Each tool call becomes a checkpoint — if your agent crashes, it resumes exactly where it left off without re-executing completed work.
What You'll Build
A CrewAI agent that:
- Survives crashes — Workflow state is checkpointed after each tool execution
- Retries on failure — Failed LLM calls and tools retry with exponential backoff
- Resumes from checkpoint — Restart the app and pick up where you left off
Why Durable Agent Execution?
Long-running agents are fragile. Network errors, API rate limits, or process restarts can lose hours of work. Dapr Workflows make your agents production-ready:
- Automatic checkpointing — Every tool call is persisted before execution
- Crash recovery — On restart, the workflow replays from the last checkpoint
- Built-in retries — Configurable retry policies handle transient failures
- Observability — Track workflow state and debug failures
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Dapr Workflow Runtime │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ CrewAI Agent Workflow │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ LLM Call │────▶│Tool Exec │────▶│ LLM Call │ │ │
│ │ │(Activity)│ │(Activity)│ │(Activity)│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ [Checkpoint] [Checkpoint] [Checkpoint] │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Each LLM call and tool execution runs as a workflow activity. If the process crashes after any checkpoint, the workflow resumes from that point — previous activities won't re-execute.
Prerequisites
- Dapr CLI installed and initialized (dapr init)
- Python 3.11+
- An LLM API key (OpenAI, Anthropic, or other LiteLLM-supported provider)
Setup
Create Project Structure
Create a new project directory:
mkdir crewai-durable && cd crewai-durable
mkdir -p components
Your project structure:
crewai-durable/
├── components/
│ └── statestore.yaml
├── agent.py
└── requirements.txt
Configure Dapr State Store
Create components/statestore.yaml:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore
spec:
  type: state.redis
  version: v1
  metadata:
  - name: redisHost
    value: localhost:6379
  - name: redisPassword
    value: ""
  - name: actorStateStore
    value: "true"
This uses Redis for workflow state persistence. Dapr initializes Redis automatically with dapr init.
Install Dependencies
Create requirements.txt:
dapr-ext-crewai>=0.1.0
crewai>=0.28.0
Set up a virtual environment and install:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Set Your API Key
export OPENAI_API_KEY="your-api-key-here"
Or for other providers (see LiteLLM docs):
export ANTHROPIC_API_KEY="your-key" # For Claude
export GEMINI_API_KEY="your-key" # For Gemini
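A common failure mode is starting the runner with no key exported, then hitting an authentication error several steps in. A minimal fail-fast guard you could call at the top of main() — the variable names match the exports above; extend the list for other LiteLLM providers:

```python
import os

def require_llm_key() -> str:
    """Return the name of the first configured LLM API key, or raise early."""
    # These names match the providers shown above; add others as needed.
    candidates = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]
    for name in candidates:
        if os.getenv(name):
            return name
    raise RuntimeError("No LLM API key found. Set one of: " + ", ".join(candidates))
```

This is an optional convenience, not part of the runner's API — it just surfaces a missing key before any workflow state is created.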
How It Works
The DaprWorkflowAgentRunner wraps your CrewAI agent in a Dapr workflow:
- Agent configuration is serialized and stored in workflow state
- Each LLM call runs as a workflow activity with checkpointing
- Each tool execution runs as a separate activity
- On crash, the workflow replays from the last checkpoint — already-completed activities return cached results
This means:
- LLM API calls that succeeded won't be repeated
- Tool executions that completed won't re-run
- You only pay for the work that actually needs to be done
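The replay mechanism can be illustrated with a toy model — this is not the Dapr runtime, just a sketch of the idea: completed activity results are appended to a persisted history, and after a restart the workflow re-executes from the top, but any step already in the history returns its cached result instead of running again.

```python
class ToyWorkflow:
    """Toy model of checkpoint-and-replay (illustration only)."""

    def __init__(self):
        self.history = []    # persisted activity results, in order
        self.cursor = 0      # current replay position
        self.executions = 0  # how many times real work actually ran

    def call_activity(self, fn):
        if self.cursor < len(self.history):
            # Replay: return the checkpointed result, skip the real work
            result = self.history[self.cursor]
        else:
            # First execution: do the work, then checkpoint the result
            self.executions += 1
            result = fn()
            self.history.append(result)
        self.cursor += 1
        return result

    def restart(self):
        # Simulate a crash/restart: history survives, replay position resets
        self.cursor = 0

wf = ToyWorkflow()
wf.call_activity(lambda: "llm: decide to call get_weather")
wf.call_activity(lambda: "tool: Sunny, 22°C")

wf.restart()  # "crash" after two checkpoints
wf.call_activity(lambda: "llm: decide to call get_weather")  # replayed
wf.call_activity(lambda: "tool: Sunny, 22°C")                # replayed
wf.call_activity(lambda: "llm: final answer")                # new work

print(wf.executions)  # 3 — five calls, but only three actually executed
```

The real runtime persists history in the state store you configured above, which is why the same pattern holds across process restarts.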
Build the Agent
Create the Durable Agent
Create agent.py:
import asyncio
import os

from crewai import Agent, Task
from crewai.tools import tool
from dapr.ext.crewai import DaprWorkflowAgentRunner


# Define tools using CrewAI's @tool decorator
@tool("Get the current weather for a city")
def get_weather(city: str) -> str:
    """Get the current weather for a specified city."""
    # In production, call a real weather API
    weather_data = {
        "Tokyo": "Sunny, 22°C",
        "London": "Cloudy, 15°C",
        "New York": "Partly cloudy, 18°C",
        "Paris": "Rainy, 12°C",
    }
    return weather_data.get(city, f"Weather data not available for {city}")


@tool("Search for information on the web")
def search_web(query: str) -> str:
    """Search the web for information."""
    # In production, call a real search API
    return f"Search results for '{query}': Found relevant information about {query}."


@tool("Get the current date and time")
def get_datetime() -> str:
    """Get the current date and time."""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


async def main():
    # Create a CrewAI agent with tools
    agent = Agent(
        role="Research Assistant",
        goal="Help users find accurate and up-to-date information",
        backstory="""You are an expert research assistant with access to
        various information sources. You excel at finding and synthesizing
        information to provide comprehensive answers.""",
        tools=[get_weather, search_web, get_datetime],
        llm=os.getenv("CREWAI_LLM", "openai/gpt-4o-mini"),
        verbose=True,
    )

    # Define a task
    task = Task(
        description="""Find out the current weather in Tokyo and search for
        recent news about AI developments. Provide a brief summary.""",
        expected_output="""A summary containing:
        1. Current weather in Tokyo
        2. Key recent AI news highlights""",
        agent=agent,
    )

    # Create the Dapr Workflow runner
    runner = DaprWorkflowAgentRunner(
        agent=agent,
        max_iterations=10,
    )

    try:
        # Start the workflow runtime
        print("Starting Dapr Workflow runtime...")
        runner.start()

        # Run the agent
        session_id = "demo-session-001"
        print(f"\nExecuting agent task with session: {session_id}")
        print("=" * 60)

        async for event in runner.run_async(task=task, session_id=session_id):
            event_type = event["type"]
            if event_type == "workflow_started":
                print(f"Workflow started: {event.get('workflow_id')}")
            elif event_type == "workflow_status_changed":
                print(f"Status: {event.get('status')}")
            elif event_type == "workflow_completed":
                print("\n" + "=" * 60)
                print("AGENT COMPLETED")
                print("=" * 60)
                print(f"Iterations: {event.get('iterations')}")
                print(f"\nFinal Response:\n{event.get('final_response')}")
            elif event_type == "workflow_failed":
                print(f"Workflow FAILED: {event.get('error')}")
    finally:
        print("\nShutting down...")
        runner.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
Run the Agent
Start with Dapr
Run the agent with Dapr:
dapr run --app-id crewai-agent \
--dapr-grpc-port 50001 \
--resources-path ./components \
-- python agent.py
You'll see:
- Dapr initializing the workflow runtime
- The agent executing tool calls (each as a durable activity)
- The final response
Test Crash Recovery
To see durability in action, try these scenarios:
Scenario 1: Normal completion
- Run the agent and let it complete
- Note the workflow ID in the output
Scenario 2: Crash and resume
- Start the agent
- While it's running, press Ctrl+C to simulate a crash
- Restart the agent with the same command
- The workflow will resume from the last checkpoint
Key Concepts
| Concept | Description |
|---|---|
| Workflow | A durable orchestration that survives restarts |
| Activity | A single unit of work (LLM call or tool execution) that can be retried |
| Checkpoint | Automatic state persistence after each activity completes |
| Retry Policy | Built-in retry with exponential backoff (3 attempts, 1s to 30s) |
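The retry schedule in the table (3 attempts, delays growing from 1s toward a 30s cap) is standard exponential backoff. A sketch of the arithmetic — the backoff coefficient of 2 and the exact capping behavior are illustrative assumptions, not documented values:

```python
def backoff_delays(max_attempts: int = 3,
                   first_interval: float = 1.0,
                   coefficient: float = 2.0,
                   max_interval: float = 30.0) -> list[float]:
    """Delay in seconds before each retry that follows a failed attempt."""
    delays = []
    delay = first_interval
    # No delay is scheduled after the final attempt, hence max_attempts - 1.
    for _ in range(max_attempts - 1):
        delays.append(min(delay, max_interval))  # cap at max_interval
        delay *= coefficient
    return delays

print(backoff_delays())   # [1.0, 2.0] — the default 3-attempt schedule
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With enough attempts the cap kicks in: every delay past the fifth retry stays pinned at 30 seconds rather than doubling indefinitely.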
Next Steps
- Add session management for persistent agent memory
- Use pub/sub for multi-agent coordination
- Deploy to Kubernetes with Catalyst
Clean Up
Stop the Dapr application:
dapr stop --app-id crewai-agent