Resiliency policies

A resiliency policy declares how Catalyst should react when an outbound call fails. Instead of scattering retry loops and timeout handling across application code, you describe the desired recovery behaviour once (retries, timeouts, and circuit breakers) and bind it to one or more App IDs. Catalyst applies the policy on every relevant call, and you can tune policies per environment without redeploying the application.

What a resiliency policy targets

A resiliency policy attaches to outbound calls of two kinds:

Outbound application calls: service invocation from one App ID to another.
Components: calls to a state store, pub/sub broker, binding, or secret store.

Targets are matched by name and by kind, which means one policy can cover every call from a workload to a specific database while a separate policy covers calls to every other component. Policies are project-scoped; the App ID binding determines which workloads they apply to.

Retry strategies

When a call fails for a transient reason, such as a momentary network blip or a brief overload on the destination, retrying after a short delay is often enough to succeed. Catalyst supports two retry shapes:

Constant retries wait the same duration between every attempt. Use them when the cause of failure is short and uncorrelated with load, for example a single network hop dropping a packet.
Exponential retries grow the wait between attempts up to a configurable maximum interval. Use them when the destination might itself be overloaded: backing off gives it time to recover instead of piling on more load.
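
The two shapes differ only in how the wait between attempts is computed. The following is a minimal sketch of that difference, not Catalyst's implementation. The function and parameter names are hypothetical, and it assumes the attempt count includes the first try, so a call with N attempts waits N - 1 times.

```python
def constant_delays(interval: float, max_attempts: int) -> list[float]:
    """Waits between attempts when every retry uses the same interval."""
    # Assumption: max_attempts counts the first try, so there are
    # max_attempts - 1 waits in total.
    return [interval] * max(max_attempts - 1, 0)


def exponential_delays(initial: float, multiplier: float,
                       max_interval: float, max_attempts: int) -> list[float]:
    """Waits that grow by `multiplier` each time, capped at `max_interval`."""
    delays = []
    wait = initial
    for _ in range(max(max_attempts - 1, 0)):
        delays.append(min(wait, max_interval))  # never exceed the cap
        wait *= multiplier
    return delays


print(constant_delays(1.0, 4))               # [1.0, 1.0, 1.0]
print(exponential_delays(0.5, 2.0, 3.0, 6))  # [0.5, 1.0, 2.0, 3.0, 3.0]
```

With the exponential shape, the later waits settle at the maximum interval, which bounds how long a single retry sequence can take.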

Every retry strategy carries a maximum attempt count (after which the call surfaces the failure to the caller) and an optional matching rule that limits retries to specific HTTP or gRPC status codes. Matching is important: retrying a 400 Bad Request will never succeed, and retrying a 500 Internal Server Error only sometimes will. A good policy retries on transient signals (503, 504, and certain other 5xx codes) and lets permanent failures through immediately.
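
The interaction between the attempt limit and the matching rule can be sketched as a small loop. This is an illustration only: `call_with_retries`, `RETRYABLE_STATUS`, and the `send` callable are hypothetical names, and the set of retryable codes is an assumption chosen for the example.

```python
import time
from typing import Callable

# Assumed matching rule for this sketch: only these codes count as transient.
RETRYABLE_STATUS = {503, 504}


def call_with_retries(send: Callable[[], int], max_attempts: int,
                      delay_seconds: float = 0.0) -> tuple[int, int]:
    """Call `send` until it succeeds, hits a non-retryable status,
    or runs out of attempts. Returns (final_status, attempts_used)."""
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status < 400:
            return status, attempt      # success: stop immediately
        if status not in RETRYABLE_STATUS:
            return status, attempt      # permanent failure: do not retry
        if attempt < max_attempts:
            time.sleep(delay_seconds)   # transient failure: wait, then retry
    return status, max_attempts         # attempts exhausted


responses = iter([503, 503, 200])
print(call_with_retries(lambda: next(responses), max_attempts=5))  # (200, 3)

responses = iter([400, 200])
print(call_with_retries(lambda: next(responses), max_attempts=5))  # (400, 1)
```

The second call shows the point of matching: the 400 is surfaced after a single attempt instead of consuming the whole retry budget.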

Timeouts

A retry strategy without a timeout will happily wait forever. A timeout policy gives Catalyst the authority to terminate an in-flight call that is taking too long, and either retry it or fail fast depending on the retry policy bound to the same target. Pick a timeout short enough that the caller doesn't block on an unhealthy destination, but long enough that healthy calls complete on a normal day. Match the shape of the timeout to the shape of the failure you expect.
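
As a rough illustration of a timeout and a retry working together, the sketch below cancels any attempt that exceeds its deadline and tries again until the attempts run out. It uses Python's asyncio and is not how Catalyst implements timeouts. `call_with_timeout` and `flaky` are hypothetical names, and the durations are arbitrary.

```python
import asyncio


async def call_with_timeout(call, timeout_seconds: float, max_attempts: int):
    """Run `call()` up to `max_attempts` times, cancelling any attempt
    that takes longer than `timeout_seconds`."""
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the in-flight coroutine when the deadline passes.
            return await asyncio.wait_for(call(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # no attempts left: fail fast to the caller


async def demo():
    calls = {"n": 0}

    async def flaky():
        # First attempt is slow (simulating an unhealthy destination),
        # later attempts return immediately.
        calls["n"] += 1
        await asyncio.sleep(0.2 if calls["n"] == 1 else 0.0)
        return "ok"

    result = await call_with_timeout(flaky, timeout_seconds=0.05, max_attempts=2)
    return result, calls["n"]


print(asyncio.run(demo()))  # ('ok', 2)
```

With `max_attempts=1` the same slow call would raise a timeout error instead, which corresponds to the fail-fast case.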

Circuit breakers

Retries and timeouts handle transient failures. They do not, on their own, handle sustained ones, and retrying aggressively against a failing destination can make the problem worse by adding load just as the destination is trying to recover. A circuit breaker solves this by tracking recent failures for a target and tripping when they cross a threshold (for example, more than five consecutive failures).

A breaker has three states:

Closed: the breaker is healthy and lets every call through. Failures are counted.
Open: the breaker has tripped. Calls fail fast immediately, without dispatching a request. This protects the destination and lets Catalyst surface failure quickly to the caller.
Half open: after a cooldown, the breaker lets a small number of probe requests through. If they succeed, the breaker returns to closed; if they fail, it returns to open.

Circuit breakers compose naturally with retries: while the breaker is open, retries are short-circuited so no attempt reaches the failing destination. The result is a system that recovers quickly when the destination heals, and that absorbs partial outages without retry storms.
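
The three states can be sketched as a small state machine. This is a simplified model for illustration, not Catalyst's implementation: the class, method, and parameter names are hypothetical, the trip condition is "more than N consecutive failures", and the outcome of a single probe decides whether the half-open breaker closes or reopens.

```python
import time


class CircuitBreaker:
    def __init__(self, max_consecutive_failures: int = 5,
                 cooldown_seconds: float = 30.0, clock=time.monotonic):
        self.max_consecutive_failures = max_consecutive_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the cooldown can be tested
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if a call may be dispatched to the destination."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = "half-open"   # cooldown elapsed: allow a probe
                return True
            return False                   # still open: fail fast
        return True                        # closed or half-open

    def record_success(self) -> None:
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._trip()                   # probe failed: reopen
            return
        self.failures += 1
        if self.failures > self.max_consecutive_failures:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.failures = 0
        self.opened_at = self.clock()
```

A retry loop that checks `allow_request()` before each attempt never reaches the destination while the breaker is open, which is the short-circuiting behaviour described above.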

Composition with workflow activity retries

Catalyst Workflows layer their own retry on top of resiliency policies. A workflow activity has its own retry policy expressed in the workflow code (number of attempts, backoff, conditions), and that policy governs whether the activity as a whole is rerun by the workflow engine. When the activity body calls out through Catalyst (a service invocation, a state store read, a pub/sub publish), the resiliency policy bound to the App ID applies to that individual call.

Both layers can be in play at once, so it is worth being deliberate about which one owns which concern:

Resiliency policies own infrastructure-level transients: momentary network failures, a state store hiccup, a pub/sub broker reconnecting. These should be invisible to the workflow.
Workflow activity retries own business-level outcomes: a downstream API responding with a domain error that might clear if you wait longer, or a third-party rate limit that needs a longer backoff window than makes sense for an inline retry.

If both layers retry the same failure, the worst-case number of attempts is the product of the two attempt counts. That is easy to reason about for small numbers, but a surprise for large ones. A common pattern is a tight, short resiliency policy for transients and a sparser, longer workflow activity retry for business outcomes.
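
The multiplication is worth seeing with numbers. The helper below is a hypothetical illustration and assumes both counts include the first attempt at their layer.

```python
def worst_case_calls(policy_attempts: int, activity_attempts: int) -> int:
    """Upper bound on calls reaching the destination when every attempt fails.

    Each run of the activity makes up to `policy_attempts` calls, and the
    workflow engine runs the activity up to `activity_attempts` times.
    """
    return policy_attempts * activity_attempts


print(worst_case_calls(3, 3))    # 9
print(worst_case_calls(10, 10))  # 100
```

Three attempts at each layer is a modest nine calls, while ten at each layer means a single persistent failure can hit the destination a hundred times.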

Where policies are defined and bound

Resiliency policies live in the project, alongside components, App IDs, and the other declarative resources. Each policy is named, and bindings are expressed as a list of App ID scopes on the policy itself. One policy can govern many App IDs by listing them in its scope, and one App ID can be governed by multiple named policies (typically a retry policy, a timeout policy, and a circuit breaker policy together).
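
The many-to-many relationship between policies and App IDs can be modelled in a few lines. This is an in-memory sketch for illustration only. The `Policy` class, its fields, and the policy and App ID names are all hypothetical and do not reflect Catalyst's actual schema; see the Policies reference for the real YAML.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    name: str
    kind: str                                        # e.g. "retry", "timeout", "circuit-breaker"
    scopes: list[str] = field(default_factory=list)  # App IDs the policy is bound to


def policies_for(app_id: str, policies: list[Policy]) -> list[str]:
    """Names of every policy whose scope list includes `app_id`."""
    return [p.name for p in policies if app_id in p.scopes]


policies = [
    Policy("fast-retry", "retry", scopes=["orders", "payments"]),
    Policy("short-timeout", "timeout", scopes=["orders"]),
    Policy("db-breaker", "circuit-breaker", scopes=["orders"]),
]

print(policies_for("orders", policies))    # ['fast-retry', 'short-timeout', 'db-breaker']
print(policies_for("payments", policies))  # ['fast-retry']
```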

The diagrid resiliency list and diagrid resiliency get CLI commands surface the policy definition along with its current App ID scopes, which is the canonical view of "who is bound to what". For creating, editing, and scoping policies with the CLI, see Apply resiliency policies to your apps; for binding from the App ID side, see App IDs.

For the declarative YAML for every policy kind, see the Policies reference.
