Durable execution
Durable execution is the property that makes Catalyst workflows and agents resilient to crashes, restarts, deploys, and infrastructure failures. Every step a workflow takes is persisted, so when a process dies mid-flight, the execution resumes from the last successful step rather than starting from scratch. This page explains the mechanics: what gets persisted, how state is reconstructed, and what rules your orchestrator code has to follow.
This model is inherited from Dapr Workflows, the open-source runtime Catalyst is built on. The semantics described here apply equally to workflows you author directly (see Develop workflows) and to AI agents running on Catalyst (see AI agents). Agents are workflows underneath.
What "durable" actually means
A durable workflow can crash at any point (process kill, pod eviction, region failover, deploy rollout) and on restart it resumes execution from exactly where it stopped. The work executed so far is not repeated, no execution state is lost, and no external side effects of completed work are duplicated. From the orchestrator's perspective the crash is invisible: the code sees the same return value from activities after the restart and continues to the next line.
This isn't checkpointing in the snapshot sense. Catalyst does not pause your code to serialize the call stack. It's a much more powerful property, achieved through event sourcing and deterministic replay.
Event sourcing: the history log
Every workflow instance has a history: an append-only log of every event that has happened to it. Events look like:
- WorkflowStarted(input=...)
- ActivityScheduled(name="charge_card", input=...)
- ActivityCompleted(name="charge_card", output=...)
- TimerCreated(fires_at=...)
- TimerFired(...)
- ExternalEventReceived(name="human_approval", payload=...)
- WorkflowCompleted(output=...)
Those are the friendly names. Under the hood, Dapr's workflow engine records these as ExecutionStarted, TaskScheduled, TaskCompleted / TaskFailed, TimerCreated, TimerFired, EventRaised, and ExecutionCompleted, plus SubOrchestrationInstanceCreated / Completed / Failed for child workflows. You'll see those names in traces, engine logs, and the Catalyst console when inspecting an instance.
The history is written to a durable store (managed by Catalyst) every time the workflow code reaches an await point: every time it pauses to wait for an activity, a timer, or an external event. The history is the workflow's only source of truth. There is no snapshot, no serialized stack, no pickled state, just the log of what has already happened.
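As a rough illustration of the idea, the following sketch models a history as an append-only list of events, using the friendly event names from above. The `Event` and `History` classes are hypothetical and exist only for this example; they are not part of Dapr or Catalyst, and a real history lives in a durable store rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Event:
    """One immutable entry in a workflow's history (toy model)."""
    kind: str            # e.g. "ActivityCompleted"
    name: str = ""       # activity, timer, or event name, if any
    payload: Any = None  # recorded input or output


@dataclass
class History:
    """Append-only log: events are added, never edited or removed."""
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.events.append(event)

    def completed_outputs(self) -> dict[str, Any]:
        # Everything the workflow "knows" is derived from the log.
        return {e.name: e.payload for e in self.events
                if e.kind == "ActivityCompleted"}


history = History()
history.append(Event("WorkflowStarted", payload={"order_id": "o-1"}))
history.append(Event("ActivityScheduled", "charge_card", {"amount": 42}))
history.append(Event("ActivityCompleted", "charge_card", {"receipt": "r-9"}))

outputs = history.completed_outputs()
```

Nothing besides the log is stored: the recorded output of `charge_card` is recovered by reading the events back.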
Replay: reconstructing state without snapshots
When a workflow needs to resume (whether after a normal await, a process crash, or a region failover), the runtime does not restore a snapshot of variables. Instead, it re-runs the orchestrator function from the top, replaying the history event by event:
- The orchestrator starts executing from line 1.
- When it reaches a call to the charge_card activity, the runtime checks the history.
- If the history shows ActivityCompleted(name="charge_card", output={...}), the runtime returns that recorded output immediately. The activity is not re-executed.
- The orchestrator continues to the next await (for an activity, a timer, or an event), and the same replay check happens.
- Eventually the orchestrator catches up to the end of history and either pauses again (to await the next event) or returns (completing the workflow).
The implication is profound: a workflow that has been running for 30 days, has called 200 activities, and has waited on 50 timers can be resumed on a fresh process in milliseconds, by replaying the history through your orchestrator code until execution catches up to the present.
The local variables in your orchestrator function (order id, customer, accumulated results, etc.) are reconstructed naturally as a side effect of replay. You did not have to ask Catalyst to persist them. They're just whatever your code computes from the inputs and the activity results, both of which are recorded.
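The replay steps above can be sketched with a toy engine built on Python generators. This is a simplified model, not the Dapr implementation: the `run` function, the positional history list, and the `crash_after` parameter are assumptions made for illustration, and a real engine matches events by type and ID rather than by position.

```python
from typing import Any, Generator, Optional

calls = []  # real activity invocations, to show that replay skips them


def charge_card(amount: int) -> str:
    calls.append(("charge_card", amount))
    return f"receipt-for-{amount}"


def send_receipt(receipt: str) -> str:
    calls.append(("send_receipt", receipt))
    return "sent"


def order_workflow(order: dict) -> Generator[tuple, Any, str]:
    # Each yield is an await point: (activity, input) goes out, a result comes back.
    receipt = yield (charge_card, order["amount"])
    status = yield (send_receipt, receipt)
    return f"{order['id']}:{receipt}:{status}"


def run(workflow, wf_input, history: list, crash_after: Optional[int] = None):
    """Run the workflow from the top, serving recorded results before live work."""
    gen = workflow(wf_input)
    step = 0
    try:
        request = next(gen)
        while True:
            activity, arg = request
            if step < len(history):
                result = history[step]      # replay: recorded output, no re-execution
            else:
                if crash_after is not None and step >= crash_after:
                    raise RuntimeError("simulated crash")
                result = activity(arg)      # live: execute, then record
                history.append(result)
            step += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value


history = []
try:
    run(order_workflow, {"id": "o-1", "amount": 42}, history, crash_after=1)
except RuntimeError:
    pass  # the "process" died after the first activity was recorded

# A fresh run replays the recorded step and continues with the second activity.
result = run(order_workflow, {"id": "o-1", "amount": 42}, history)
```

After the second run, `charge_card` has been invoked only once, and the local variable `receipt` was rebuilt from the recorded output rather than from a snapshot.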
The determinism requirement
For replay to produce the same state as the original execution, the orchestrator function must be deterministic: given the same history, it must always take the same path. This is the single most important rule of durable execution, and it has direct consequences for the code you write inside an orchestrator.
What you CANNOT do directly in orchestrator code
- Random numbers. They produce a different value on every call.
- UUIDs. They may be non-deterministic (for example, they may depend on the current timestamp). Generate IDs inside an activity, or derive them from the workflow instance ID using a deterministic function.
- Current time. Reading the system clock returns the real time when the workflow is re-executed, not the time when it was originally executed.
- I/O. Network calls, file reads, environment variables, HTTP clients, anything that depends on the outside world.
- Blocking primitives. A workflow should only wait on events in its own execution (activities, external events, and durable timers). It must not wait on wall-clock sleeps or on signals that arrive outside the workflow runtime.
Each of these would produce a different result on replay than on first execution, which would silently corrupt the workflow's state.
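To make the UUID case concrete, here is a minimal sketch of deriving an ID deterministically from the workflow instance ID. The `deterministic_id` helper and the `label` parameter are hypothetical names chosen for this example; the only library calls used are `uuid.uuid5` and `uuid.uuid4` from the Python standard library.

```python
import uuid


def deterministic_id(instance_id: str, label: str) -> str:
    """Derive a stable ID from the workflow instance ID and a per-use label."""
    # uuid5 is a name-based hash: the same inputs always give the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{instance_id}/{label}"))


# Same inputs on first execution and on replay: the IDs match.
first_run = deterministic_id("order-123", "shipment")
replay = deterministic_id("order-123", "shipment")

# uuid4 is random: a replay would see a different value than the first run did.
random_first, random_replay = str(uuid.uuid4()), str(uuid.uuid4())
```

The label lets one workflow derive several distinct IDs (for example one per step) while each stays stable across replays.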
What you CAN do
- Plain in-memory computation. Arithmetic, string manipulation, looping over recorded activity results, building structured outputs.
- Activities. Anything non-deterministic (every line in the "cannot" list above) goes inside an activity. Activities run as regular function calls in your worker process; their result is recorded in history before the orchestrator sees it, so on replay the recorded value is returned without re-invoking the function.
- Durable timers. Wait for durable timers instead of waiting for wall clock timers. The durable timer is persisted; it survives crashes and replays as a recorded event.
- External events. Use external events to pause the workflow until something outside signals it (a webhook, a CLI command, a UI action). The optional timeout makes the wait give up after a bounded duration without losing replay-safety.
- The workflow's recorded clock. Returns the time the orchestrator was first executed: the same value on replay.
- The replay flag. Is True while the runtime is walking the history and False once execution has caught up to live execution. Gate any unavoidable orchestrator-side side effects (debug prints, log lines) on this flag so they fire once per real step instead of once per replay.
The deterministic-replay rule is what makes long-running, fault-tolerant workflows possible at all. Once you put every non-deterministic operation inside an activity, the orchestrator becomes a pure function from history to outcome, and pure functions can be replayed safely forever.
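The recorded clock and the replay flag from the list above can be illustrated with a toy context object. `ToyContext` and its attribute names are stand-ins invented for this sketch; check your SDK's workflow context for the actual property names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ToyContext:
    """Hypothetical stand-in for an SDK workflow context."""
    is_replaying: bool       # True while the runtime walks the history
    current_time: datetime   # recorded clock, identical on every replay


log_lines = []


def orchestrator_step(ctx: ToyContext, order_id: str) -> str:
    # Gate orchestrator-side logging on the replay flag.
    if not ctx.is_replaying:
        log_lines.append(f"charging {order_id}")
    # Use the recorded clock, never datetime.now(), for time-based branches.
    return "weekend" if ctx.current_time.weekday() >= 5 else "weekday"


recorded = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)  # a Saturday

live = orchestrator_step(ToyContext(False, recorded), "o-1")
replayed = [orchestrator_step(ToyContext(True, recorded), "o-1") for _ in range(3)]
```

The branch taken is the same on the live run and on every replay, and the log line is written once instead of four times.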
Activities: the escape hatch for the real world
Activities are where the workflow touches reality. They run in your application process (same code, same runtime as your orchestrator), but each activity invocation is treated as an atomic, recorded unit. The runtime:
1. Schedules the activity (records ActivityScheduled in history).
2. Invokes your activity function (which can call any API, do any I/O, take any time).
3. Records the result in history (ActivityCompleted or ActivityFailed).
4. Resumes the orchestrator, returning the recorded result.
If the process crashes between steps 2 and 3, or if you've configured a retry policy and the activity raises a transient error, the runtime will retry the activity when the workflow resumes.
The Dapr Workflow engine guarantees that each called activity is executed at least once. This means activities must be idempotent for any operation with external side effects: if an activity sends an email or charges a card, a second invocation must not duplicate the action. Common approaches: idempotency keys derived from the workflow instance ID, conditional writes, or "check-then-act" with an idempotent receiver.
Also, activities should handle external side effects atomically: if an activity fails partway through, the changes it has already made should be reverted. The Dapr workflow runtime does not guarantee this atomicity. It is the responsibility of the activity's code to handle this logic.
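One way to picture the idempotency-key approach is the sketch below. `FakePaymentGateway` is an in-memory stand-in for an external service that deduplicates by key, and the key format `instance_id:charge_card` is an assumption for illustration; real payment providers define their own idempotency mechanisms.

```python
class FakePaymentGateway:
    """In-memory stand-in for an external service that honors idempotency keys."""

    def __init__(self):
        self.charges = {}      # idempotency key -> receipt
        self.charge_count = 0  # how many real charges were made

    def charge(self, idempotency_key: str, amount: int) -> str:
        # A repeated key returns the original receipt instead of charging again.
        if idempotency_key in self.charges:
            return self.charges[idempotency_key]
        self.charge_count += 1
        receipt = f"receipt-{self.charge_count}"
        self.charges[idempotency_key] = receipt
        return receipt


def charge_card_activity(gateway: FakePaymentGateway, instance_id: str, amount: int) -> str:
    # Derive the key from the workflow instance ID so a retry reuses it.
    key = f"{instance_id}:charge_card"
    return gateway.charge(key, amount)


gateway = FakePaymentGateway()
first = charge_card_activity(gateway, "order-123", 42)
retry = charge_card_activity(gateway, "order-123", 42)  # e.g. after a crash
```

The retried invocation returns the same receipt and the card is charged once, which is the behavior at-least-once execution requires of the activity.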
Child workflows: composition with durability
For a sub-process that is itself long-running, retryable, or needs its own timers and external events, call a child workflow instead of an activity. The child runs as its own durable instance with its own history. From the parent's side the call has the same shape as an activity call: yield once, then receive the recorded result when the child completes.
In history, the parent records SubOrchestrationInstanceCreated, then SubOrchestrationInstanceCompleted (or Failed) when the child returns. On replay the parent doesn't re-run the child; it returns the recorded result, the same way it does for activities.
Rule of thumb:
- Activity when the work is a single unit that either succeeds or fails and doesn't need internal checkpoints.
- Child workflow when the sub-process needs its own durability (multiple steps, timers, retries, external events, or its own fan-out) and you want it visible as a separate instance in the console.
Durable timers and external events
When a workflow creates a timer it does not block a thread or hold a connection. It writes a TimerCreated event to history with a wake-up time, then the workflow pauses entirely. The process can scale down, redeploy, or crash. When the wake-up time arrives, the runtime schedules the workflow to resume; replay walks the history (including the TimerFired event) and execution continues. The same model applies to external events: the workflow sleeps until something raises the named event.
This is how a Catalyst workflow can durably wait for a human approval over the course of a week, polling no resources and consuming no compute while waiting, then resume the moment the approval arrives. See the human-in-the-loop example in Workflow patterns.
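The following toy model shows why a durable wait costs nothing while it is pending: the timer is just a stored record with a wake-up time, and something checks which records are due. `ToyTimerStore` is hypothetical and keeps timers in memory; it does not reflect how the Dapr runtime stores or fires timers internally.

```python
from datetime import datetime, timedelta, timezone


class ToyTimerStore:
    """Timers as stored records; nothing sleeps or blocks while waiting."""

    def __init__(self):
        self.timers = []  # (fires_at, instance_id); persisted in a real system

    def create_timer(self, instance_id: str, fires_at: datetime) -> None:
        self.timers.append((fires_at, instance_id))

    def pop_due(self, now: datetime) -> list:
        # Return the instances whose wake-up time has arrived, earliest first.
        due = sorted(t for t in self.timers if t[0] <= now)
        self.timers = [t for t in self.timers if t[0] > now]
        return [instance_id for _, instance_id in due]


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
store = ToyTimerStore()
store.create_timer("approval-7", start + timedelta(days=7))
store.create_timer("reminder-1", start + timedelta(hours=1))

woken_after_two_hours = store.pop_due(start + timedelta(hours=2))
woken_after_eight_days = store.pop_due(start + timedelta(days=8))
```

Between the two checks no workflow code runs at all; each instance is resumed only when its record comes due.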
Continue-as-new: bounding history
For workflows that run forever (monitor loops, long-running agents, scheduled poll-and-act patterns) the history would grow unboundedly.
The fix is continue-as-new: the workflow ends itself and starts a fresh instance with a new (small) input, truncating the history. The instance ID is preserved across the generation boundary, so from the caller's perspective the workflow keeps running; under the hood the runtime is rotating instances to bound storage. Use it for any workflow whose history would otherwise grow without bound.
There's one important footgun: by default, any external events that have already arrived but haven't been consumed yet are dropped when the workflow continues as new, unless you instruct the workflow runtime to pass them to the new instance. Check the documentation of your SDK of choice for the specific options.
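A toy simulation can make both the history bound and the dropped-events footgun visible. Everything here is hypothetical: `run_generation`, `run_with_continue_as_new`, and the `carry_over_events` flag are invented for this sketch and do not correspond to SDK functions or options.

```python
def run_generation(state: dict, pending_events: list, max_history: int = 3):
    """Process events until the history cap is reached, then stop this generation."""
    history = []
    queue = list(pending_events)
    while queue and len(history) < max_history:
        event = queue.pop(0)
        history.append(event)
        state = {"processed": state["processed"] + 1, "last": event}
    # Only the small state is handed on; the history is discarded.
    return state, queue, history


def run_with_continue_as_new(events: list, carry_over_events: bool = True):
    state, generations = {"processed": 0, "last": None}, 0
    pending = list(events)
    while pending:
        state, leftover, history = run_generation(state, pending)
        generations += 1
        # The footgun: unconsumed events vanish unless explicitly carried over.
        pending = leftover if carry_over_events else []
    return state, generations


events = [f"e{i}" for i in range(1, 8)]  # seven events arrive
kept_state, kept_generations = run_with_continue_as_new(events, carry_over_events=True)
lost_state, lost_generations = run_with_continue_as_new(events, carry_over_events=False)
```

With carry-over, all seven events are processed across three generations, each with at most three history entries. Without it, the four events still queued at the first rotation are lost.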
What this means in practice
The durable-execution model has three practical implications you should internalise before writing your first workflow:
- Anything that touches the outside world goes in an activity. When in doubt, wrap it.
- The orchestrator function will run many times. Print statements, log lines, and side effects in the orchestrator will fire on every replay. Use activities for logging too if you care about not flooding logs.
- Code changes need to be replay-safe. A workflow that's been running for a week against version N of your code might replay against version N+1 after a deploy. See Workflow versioning for the safe-change rules and migration strategies.
See also
- AI agents: LLM-driven agents
- Workflow concept: overview of Catalyst Workflows
- Workflow patterns: chaining, fan-out/fan-in, monitor, async HTTP, external events, compensation
- Workflow versioning: evolving workflow code without breaking in-flight instances
- Develop workflows: SDK guides in .NET, Go, Java, JavaScript, Python
- Operate workflows: inspecting running workflows in the console
- Dapr Workflow documentation: the open-source runtime underneath