Production Planning

A production installation of Catalyst Enterprise Self-Hosted runs on a dedicated Kubernetes cluster with access to external PostgreSQL and, optionally, Kafka. Use the guidance below to size your infrastructure for the expected workload.

Cluster requirements

Kubernetes 1.24 or later, installed in a dedicated cluster.
A CNI that supports NetworkPolicy resources.
At least 3 worker nodes for production, to support High Availability of the system components.
Outbound connectivity to the Diagrid Cloud endpoints listed in Architecture.

Reference sizing profiles

The following profiles are verified minimum configurations for running Catalyst. Pick the profile that best matches your workload; individual components can be scaled independently from there.

Profile	Use case	Kubernetes worker nodes	AWS node equivalent	Workflow PostgreSQL	AWS database equivalent
Dev / PoC	Evaluation and development. Not suitable for production.	2 × (2 vCPU, 4 GiB)	`c5.large`	2 vCPU, 4 GiB	`db.t3.medium`
Small production	Low-volume workflows.	3 × (8 vCPU, 16 GiB)	`c5.2xlarge`	4 vCPU, 16 GiB	`db.m5.xlarge`
Large production	High-volume workflows with independent scheduler state.	3 or more × (8 vCPU, 16 GiB)	`c5.2xlarge`	16 vCPU, 64 GiB	`db.m5.4xlarge`

For the Large profile, we recommend a separate PostgreSQL instance for the Dapr Scheduler (8 vCPU, 32 GiB; for example, AWS `db.m5.2xlarge`) so that scheduler state does not contend with workflow state.

Component resource footprint

The following are the default resource requests and limits shipped with the Catalyst Helm chart. Use them to validate that your node pool has sufficient capacity. All values can be tuned via Helm; refer to the Helm Reference for the full set of options.

Component	Replicas	CPU (request / limit)	Memory (request / limit)
Agent	1	40m / —	500Mi / 1200Mi
Management	2	40m / —	100Mi / 1200Mi
Gateway (Envoy)	1 (2 with HA)	100m / 1000m	512Mi / 2048Mi
Gateway (Control Plane)	1 (2 with HA)	50m / 100m	50Mi / 100Mi
Identity Injector	1	50m / 200m	64Mi / 128Mi
Dapr Server (per App ID)	per app	10m / 300m	25Mi / 256Mi
OpenTelemetry Collector (metrics)	per project	75m / —	500Mi / 1100Mi
OpenTelemetry Collector (logs)	per project	50m / —	500Mi / 750Mi
Dapr Scheduler	1 (3 with HA)	— / —	130Mi / 175Mi
Dapr Sentry	1	— / —	— / 100Mi

Dapr Servers and OpenTelemetry Collectors are provisioned per App ID and Project at runtime, so their aggregate footprint scales with the number of Apps and Projects you deploy.
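As a rough illustration of how that aggregate footprint grows, the following sketch sums the default resource requests from the table above for a given number of App IDs and Projects. It is an estimate under stated assumptions, not a sizing tool: a dash in the table is treated as a request of 0, the `ha` flag assumes every "with HA" replica count in the table is in effect at once, and the function and variable names are hypothetical.

```python
# Default requests from the table above, per replica.
# Tuple layout: (replicas, replicas with HA, CPU request in millicores, memory request in MiB).
# Assumption: a dash in the table means no request is set, counted as 0 here.
FIXED_COMPONENTS = {
    "agent": (1, 1, 40, 500),
    "management": (2, 2, 40, 100),
    "gateway_envoy": (1, 2, 100, 512),
    "gateway_control_plane": (1, 2, 50, 50),
    "identity_injector": (1, 1, 50, 64),
    "dapr_scheduler": (1, 3, 0, 130),
    "dapr_sentry": (1, 1, 0, 0),
}
PER_APP_ID = (10, 25)                    # one Dapr Server per App ID
PER_PROJECT = (75 + 50, 500 + 500)       # metrics + logs OpenTelemetry Collectors


def total_requests(app_ids, projects, ha=False):
    """Return (CPU millicores, memory MiB) requested across the installation."""
    cpu = mem = 0
    for replicas, ha_replicas, cpu_m, mem_mi in FIXED_COMPONENTS.values():
        count = ha_replicas if ha else replicas
        cpu += count * cpu_m
        mem += count * mem_mi
    cpu += app_ids * PER_APP_ID[0] + projects * PER_PROJECT[0]
    mem += app_ids * PER_APP_ID[1] + projects * PER_PROJECT[1]
    return cpu, mem


cpu_m, mem_mi = total_requests(app_ids=20, projects=5, ha=True)
print(f"{cpu_m}m CPU, {mem_mi}Mi memory requested")
```

Requests drive scheduling, so compare the result with the allocatable capacity of your node pool. Limits are higher than requests for most components, so leave headroom for bursts.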

Scale limits

Each Catalyst installation enforces the following internal limits to prevent resource exhaustion:

Limit	Default	Helm value
Projects per installation	50	`agent.config.placement.max_project_count`
App IDs per installation	300	`agent.config.placement.max_appid_count`
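
A planned deployment can be checked against these defaults before installation. The sketch below is a minimal pre-flight check, assuming a count equal to the limit is still allowed; the function name and dictionary keys are hypothetical, and only the two default values come from the table above.

```python
# Default limits from the table above. If you override the Helm values,
# pass the overridden numbers in instead.
DEFAULT_LIMITS = {"projects": 50, "app_ids": 300}


def over_limit(planned, limits=DEFAULT_LIMITS):
    """Return the names of planned counts that exceed their limit."""
    return [name for name, limit in limits.items() if planned.get(name, 0) > limit]


print(over_limit({"projects": 60, "app_ids": 120}))
```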

High availability

We recommend enabling High Availability for production installations. Set `gateway.ha.enabled: true` in your Helm values to run two replicas of the Gateway Envoy and Control Plane. The Management service runs two replicas by default. See the Helm Reference for the full set of HA-related options.
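
Dotted paths such as `gateway.ha.enabled` correspond to nested keys in a Helm values file. The sketch below shows that expansion for the HA setting; the `expand` helper is hypothetical, and the output is printed as JSON on the assumption that your tooling accepts JSON wherever it accepts YAML.

```python
import json


def expand(dotted):
    """Turn {'a.b.c': value} into {'a': {'b': {'c': value}}}."""
    root = {}
    for key, value in dotted.items():
        node = root
        *parents, leaf = key.split(".")
        for part in parents:
            # Walk down the tree, creating intermediate mappings as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


values = expand({"gateway.ha.enabled": True})
print(json.dumps(values, indent=2))
```

The same nesting applies to any other dotted Helm value named on this page.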

We also recommend running external dependencies in a Multi-AZ configuration:

Workflow PostgreSQL.
Dapr Scheduler PostgreSQL, when using the Large profile.
Kafka, when Managed Pub/Sub is enabled.

Refer to the Helm Reference for configuration of external PostgreSQL and Kafka instances.

Private container images

If your environment cannot reach public container registries, or you prefer to use your own registry, pull the artifacts from our public registry, re-tag them, and push them to your private registry. Then configure the Helm chart to use the private registry. See the Helm Reference for the relevant chart values.

A script and documentation for this process are available in the Catalyst Enterprise Self-Hosted Helm Chart repository.
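
To illustrate the pull, re-tag, and push sequence, the sketch below generates the corresponding `docker` commands for a list of images. It is not the script from the repository, which remains the supported path. The registry hosts and image name are placeholders, and the helper assumes every image reference starts with a registry host followed by `/`.

```python
def mirror_commands(images, private_registry):
    """Build docker pull/tag/push commands that copy images to a private registry."""
    commands = []
    for image in images:
        # Drop the source registry host; keep the repository path and tag.
        _, _, path = image.partition("/")
        target = f"{private_registry}/{path}"
        commands += [
            f"docker pull {image}",
            f"docker tag {image} {target}",
            f"docker push {target}",
        ]
    return commands


# Placeholder names for illustration only; use the image list from the chart repository.
for command in mirror_commands(
    ["public.example.com/catalyst/agent:1.0.0"],
    "registry.internal.example.com",
):
    print(command)
```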

Next steps

AWS Deployment — reference architecture on AWS (VPC, EKS, RDS, Bastion host).
Azure Deployment — reference architecture on Azure (VNet, AKS, Azure Firewall, management VM).
