
Workflow versioning

Because Catalyst Workflows reconstruct state by replaying history through your orchestrator code (see Durable execution), changes to that code must preserve determinism for in-flight executions. A change that's "safe" against fresh starts can still corrupt a workflow that's halfway through its history.

This page covers the rules for what is and isn't replay-safe, and the three migration strategies teams use to evolve workflows without breaking running instances.

What replay actually checks

When a workflow resumes, the runtime walks your orchestrator code top to bottom, comparing each call (call_activity, create_timer, wait_for_external_event) to the next event in history:

  • Same call as recorded → return the recorded result, advance to the next call.
  • Different call than recorded → the runtime cannot reconcile the code with the history, and the workflow fails (typically with a non-determinism error).

So the rule of thumb is: changes that don't alter the calls already made by in-flight instances are safe; changes that do are not.
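The comparison can be illustrated with a toy replayer. This is a minimal sketch, not the runtime's implementation: the `replay` function, the `NonDeterminismError` class, and the `(kind, name)` tuples standing in for calls are all hypothetical, and real history events carry far more detail.

```python
from typing import Any, Generator, List, Tuple

# Hypothetical stand-in: a call is (kind, name); history pairs each
# recorded call with the result that was recorded for it.
Call = Tuple[str, str]


class NonDeterminismError(Exception):
    """The code requested a call that differs from the recorded one."""


def replay(orchestrator: Generator[Call, Any, Any],
           history: List[Tuple[Call, Any]]):
    """Drive a generator-based orchestrator against recorded history."""
    position = 0
    try:
        requested = next(orchestrator)
        while True:
            if position == len(history):
                # Past the checkpoint: this call would become a new event.
                return ("scheduled", requested)
            recorded_call, recorded_result = history[position]
            if requested != recorded_call:
                raise NonDeterminismError(
                    f"code requested {requested}, history has {recorded_call}")
            position += 1
            # Same call as recorded: feed back the recorded result.
            requested = orchestrator.send(recorded_result)
    except StopIteration:
        return ("completed", None)


def order_flow():
    valid = yield ("call_activity", "validate")
    if valid:
        yield ("call_activity", "charge_card")


def reordered_flow():
    # Same calls as order_flow, but in a different order.
    yield ("call_activity", "charge_card")
    yield ("call_activity", "validate")


# An in-flight instance that has completed validation only.
history = [(("call_activity", "validate"), True)]
```

Replaying `order_flow()` against `history` matches the recorded call and then schedules `charge_card` as new work, while `reordered_flow()` raises `NonDeterminismError` on its first call.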

Safe changes

You can deploy these against running workflows with no migration:

  • Add new code paths that in-flight instances will not enter. A new branch in a match / if statement, gated on a new input value or new state, is fine — existing instances will never see the gate flip.
  • Add new activities or sub-workflows after the current checkpoint. Anything that runs after what in-flight instances have already executed is replay-safe; the new calls become new history events when the workflow advances past its old checkpoint.
  • Add new external-event handlers. A new wait_for_external_event reachable only by future code paths is fine.
  • Extend activity input or output schemas in a backwards-compatible way. Add optional fields; never remove or rename required ones.
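As one illustration of the backwards-compatible schema rule, the sketch below adds an optional field with a default so that payloads recorded before the change still parse. The `ChargeInput` type, its fields, and `parse_charge_input` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChargeInput:
    order_id: str
    amount_cents: int
    # New optional field. Payloads recorded before this change omit it,
    # so it must have a default.
    currency: Optional[str] = None


def parse_charge_input(payload: dict) -> ChargeInput:
    """Build the activity input from a recorded or freshly created payload."""
    return ChargeInput(**payload)


old_payload = {"order_id": "o-1", "amount_cents": 500}
new_payload = {"order_id": "o-2", "amount_cents": 700, "currency": "EUR"}
```

Both payloads parse; the old one simply yields `currency=None`. Removing `amount_cents`, by contrast, would make `new_payload`-style data from running instances fail to parse.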

Unsafe changes

These will break in-flight executions and require a migration strategy:

  • Reordering activity calls. The order in history is the order replay expects.
  • Adding, removing, or renaming activity calls in the part of the code that in-flight instances have already executed. Anything that inserts a call ahead of, or changes, a call_activity already in history.
  • Changing the parameters passed to an existing activity call. The runtime keys replay on the call shape.
  • Changing the duration of a timer that an in-flight instance has already set. The recorded TimerCreated.fires_at is the source of truth; new code reading a different value desynchronises replay.
  • Reordering parallel calls in a fan-out. Even though they run in parallel, the orchestrator records them in a specific order.

When in doubt, ask: "Does this change alter any call that an existing instance has already recorded?" If yes, use one of the strategies below.
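That question amounts to a prefix check: the calls an instance has already recorded must be a prefix of the calls the new code would make on the same path. A rough sketch, assuming you can list both sequences by hand or from a test run (`first_divergence` and the `(kind, name)` tuples are hypothetical):

```python
from typing import Optional, Sequence, Tuple

Call = Tuple[str, str]


def first_divergence(recorded: Sequence[Call],
                     new_code_calls: Sequence[Call]) -> Optional[int]:
    """Return the index where new code departs from recorded calls.

    Returns None when the recorded calls are a prefix of the new code's
    calls, meaning replay would match every recorded event.
    """
    for index, recorded_call in enumerate(recorded):
        if index >= len(new_code_calls) or new_code_calls[index] != recorded_call:
            return index
    return None


recorded = [("call_activity", "validate"), ("call_activity", "charge_card")]

# Appending after the checkpoint keeps the recorded prefix intact.
appended = recorded + [("call_activity", "send_receipt")]

# Inserting a call before an already-recorded one breaks the prefix.
inserted = [("call_activity", "validate"),
            ("call_activity", "fraud_check"),
            ("call_activity", "charge_card")]

# Reordering breaks it at the very first call.
reordered = [("call_activity", "charge_card"), ("call_activity", "validate")]
```

Here `first_divergence(recorded, appended)` returns `None`, while the inserted and reordered variants diverge at indexes 1 and 0 respectively.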

Strategy 1 — Version gates

Branch the workflow code on a version flag passed as workflow input, so that old instances follow the old path and new instances follow the new path. Workflow input is recorded in history on the first run and replayed back on every subsequent run, so reading it inside the orchestrator is replay-safe. Instances started before the flag existed have no version field in their recorded input, which is why the example below defaults it to 1. What is NOT safe is reading the version from an external source inside the orchestrator (env var, config file, database): those can return different values on replay.

@wfr.workflow(name="order_processing")
def order_processing(ctx: DaprWorkflowContext, input: dict):
    version = input.get("version", 1)
    order = input["order"]

    if version >= 2:
        # New path: refactored validation + new compliance check.
        valid = yield ctx.call_activity(validate_v2, input=order)
        yield ctx.call_activity(compliance_check, input=order)
    else:
        # Old path: must remain byte-identical for in-flight v1 instances.
        valid = yield ctx.call_activity(validate, input=order)

    if not valid:
        return {"status": "rejected"}

    yield ctx.call_activity(charge_card, input=order)
    return {"status": "completed"}

When starting a new workflow, callers set "version": 2 in the workflow input. Existing v1 instances continue running through the v1 branch. Eventually you remove the v1 branch, but only once you're sure no v1 instances are still running (see Strategy 2).
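The gate's behaviour for legacy and new inputs can be checked without a runtime. In this sketch, `build_start_input`, `resolve_version`, and `planned_validation_calls` are hypothetical helpers that mirror the branching above; how the input actually reaches the workflow depends on the client you use to start it.

```python
from typing import List


def build_start_input(order: dict, version: int = 2) -> dict:
    """Wrap an order in the input shape the gated workflow expects."""
    return {"version": version, "order": order}


def resolve_version(workflow_input: dict) -> int:
    # Inputs recorded before the gate existed carry no "version" key,
    # so they must resolve to 1 to keep replaying the old path.
    return workflow_input.get("version", 1)


def planned_validation_calls(workflow_input: dict) -> List[str]:
    """Name the validation activities each version would call, in order."""
    if resolve_version(workflow_input) >= 2:
        return ["validate_v2", "compliance_check"]
    return ["validate"]


legacy_input = {"order": {"id": "o-1"}}          # started before the gate
new_input = build_start_input({"id": "o-2"})     # started after the gate
```

The legacy input resolves to the v1 path and the new input to the v2 path, which is the property the default in `input.get("version", 1)` exists to guarantee.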

Strategy 2 — Drain and redeploy

Stop starting new instances of the old version, wait for existing instances to complete (or terminate them explicitly), then deploy the new version.

  1. Deploy a build that rejects new starts of the old workflow (or routes them to the new name — see Strategy 3).
  2. Monitor in-flight instances in the Workflows console until the count drops to zero, or use diagrid workflow list --status running from the workflow CLI.
  3. For long-running workflows that would never drain, use diagrid workflow terminate on stragglers.
  4. Deploy the new version. There are no in-flight instances against the old code, so the new code can be a clean rewrite.

This is the simplest strategy when the workflow's typical duration is short (minutes to hours).
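The monitoring step can be automated with a polling loop. The sketch below is deliberately decoupled from any real API: `count_running` is a hypothetical callable you would supply (for example, one that wraps whatever listing mechanism you already use), and the intervals are placeholder values.

```python
import time
from typing import Callable


def wait_for_drain(count_running: Callable[[], int],
                   poll_seconds: float = 30.0,
                   timeout_seconds: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no instances are running.

    Returns True once the count reaches zero, or False if roughly
    timeout_seconds of waiting pass first (stragglers remain).
    """
    waited = 0.0
    while True:
        if count_running() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

A `False` result is the signal to decide whether to keep waiting or terminate the remaining instances before deploying the new version.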

Strategy 3 — New workflow name

Deploy the updated logic as a new workflow name (order_processing_v2) and route new starts to it. Old instances keep running on the old name. Once the old workflow drains, retire the name.

@wfr.workflow(name="order_processing")
def order_processing_v1(ctx: DaprWorkflowContext, order: dict):
    # Preserved unchanged so in-flight instances continue to replay correctly.
    ...

@wfr.workflow(name="order_processing_v2")
def order_processing_v2(ctx: DaprWorkflowContext, order: dict):
    # New logic, free to differ arbitrarily from v1.
    ...
The deploy carries minimal replay risk because the v1 code is untouched. The downside is operational: two workflow names coexist until v1 drains, and any external system that starts workflows must know which name to call.

This is the safest strategy when the change is large (architectural, not just additive) or when the workflow's typical duration is long (days to weeks).
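One way to keep callers from needing to know the current name is a small routing table consulted at start time. This is a sketch under assumptions: `NEW_START_ROUTES` and `name_for_new_start` are hypothetical, and where such a table lives (a gateway, a shared client library) is a design choice the strategy itself does not dictate.

```python
# Maps names that no longer accept new starts to their replacements.
# In-flight instances are unaffected: they keep replaying under the
# name they were started with.
NEW_START_ROUTES = {
    "order_processing": "order_processing_v2",
}


def name_for_new_start(requested_name: str) -> str:
    """Return the workflow name a new start should actually use."""
    return NEW_START_ROUTES.get(requested_name, requested_name)
```

Once the old workflow has drained and every caller requests the new name directly, the route entry can be deleted along with the v1 code.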

Choosing a strategy

| Strategy | When to use | Effort | Risk |
|---|---|---|---|
| Version gate | Additive change, both versions can coexist in the same function | Low | Low, if you preserve the v1 branch byte-for-byte |
| Drain and redeploy | Short-duration workflows, willing to pause new starts briefly | Low | Low, but only viable if drain is feasible |
| New workflow name | Large refactor, long-duration workflows, can't risk replay corruption | Medium | Lowest, because old code is fully isolated |

For most teams, start with the version-gate strategy for small additive changes, and reach for a new workflow name when the refactor is significant.

See also