Observability

Catalyst emits metrics, distributed traces, and structured logs for every workflow, agent, and Dapr API call running on the platform. Because the Catalyst data plane handles those calls, the instrumentation lives in the sidecar rather than in your application code. From the moment an App ID starts running you get request rates, latencies, error counts, trace context that flows across services, and per-call payload logs, with no SDK setup, no exporter wiring, and no manual span creation.

This page explains the conceptual model: what each signal contains, how Catalyst surfaces it, where it can be exported, and how sampling and retention work. For step-by-step configuration, see Observability in Catalyst.

The three signals

Observability on Catalyst rests on the same three signals as any modern system, with each one answering a different question:

  • Metrics. Numeric aggregates over time. Best for "how is the system behaving in aggregate right now?" Examples include request rate, error rate, latency percentiles, and resource pressure.
  • Distributed traces. Causally linked spans across services. Best for "what was the path of this one request, and where did it slow down or fail?"
  • Logs. Discrete, structured records of individual events. Best for "what exactly happened on this call, with the full payload?"

Catalyst captures all three for every App ID by default. The rest of this page describes what each looks like on the platform.

Metrics

Catalyst's metrics describe the behaviour of an App ID and its components in aggregate. The set you get out of the box covers:

  • Request volume per App ID and per Dapr API (service invocation, state, pub/sub, workflows, Conversation, and so on).
  • Error rate broken down by status code and component.
  • Latency percentiles (p50, p95, p99) for each API.
  • Component health. Successes, failures, and timeouts on calls to backing infrastructure like state stores, brokers, and LLM providers.
  • Resource utilisation against any limit set at the project or App ID level.

Metrics are scoped by project and App ID and are visible in the Catalyst console's Metrics page. They are the right signal for dashboards, alerts, and any "is something broken right now" investigation. They are not the right signal for inspecting an individual request; for that, use traces or API logs.
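To make the latency percentiles concrete, here is a minimal sketch of how p50, p95, and p99 can be derived from raw latency samples. It uses the nearest-rank method, which is an assumption chosen for illustration; this page does not state how Catalyst aggregates latencies, and the `percentile` helper and sample values are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical per-call latencies in milliseconds for one API on one App ID.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

In this sample the median stays low while the tail percentiles are dominated by the two slow calls, which is why dashboards usually track the tail alongside the median.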

Distributed traces

Every Dapr API call carries W3C TraceContext headers, and the sidecar emits a span for the operation. Because the sidecar is on both sides of every call between services within a project, traces stitch together automatically across App IDs, even when the call goes through pub/sub, a workflow activity, or an MCP server. You see the full causal chain for a single request: which App ID started it, which components it touched, where time was spent, and where errors propagated.
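The trace context travels in the W3C `traceparent` header, which for version 00 has the form `version-traceid-parentid-flags`. The following sketch parses that header so you can see which trace a given request belongs to. The `parse_traceparent` helper is hypothetical and handles only the version 00 layout; it is not part of any Catalyst or Dapr SDK.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))

# Example header taken from the W3C Trace Context specification.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

Every span in the same causal chain shares the `trace_id`, while `parent_id` changes at each hop; that shared ID is what lets spans from different App IDs be joined into one trace.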

Trace export is the one observability surface that is configured by the user rather than by the platform. You attach a Configuration resource to your App IDs to set the sampling rate and the OTLP endpoint of the backend that should receive the traces:

apiVersion: cra.diagrid.io/v1beta1
kind: Configuration
metadata:
  name: tracing
spec:
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: "otlp.example.com:4317"
      isSecure: true
      protocol: grpc
      headers:
        - name: "api-key"
          secretKeyRef:
            name: otlp-credentials
            key: api-key

Any backend that speaks OTLP works, including Datadog, New Relic, Honeycomb, Grafana Tempo, Jaeger, or your own OpenTelemetry Collector. See Distributed tracing for the full configuration workflow.

API logs

API logs are specific to Catalyst. They are not application logs (those come from your code) and they are not the same as traces (those carry timing and span relationships, but not full payloads). An API log is a structured record of one Dapr API call as it crossed the data plane, and it captures:

  • Request and response bodies (subject to size limits) so you can replay or diff a call without re-running it.
  • Status code, latency, and timestamps for success and failure analysis.
  • Originating App ID and the component invoked for attribution.
  • LLM token counts (token_prompt, token_completion, and token_total) for every call through the Conversation API.

API logs are the fastest path from "an LLM gave me a strange answer" or "a state store write looked off" to a copy of the exact bytes Catalyst saw on the wire. They sit alongside metrics and traces in the console; see API Logs for the inspection UI.
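As an illustration of what the token fields make possible, the sketch below totals token usage per App ID from a list of API log records. The field names `token_prompt`, `token_completion`, and `token_total` come from the list above; the record shape (plain dictionaries with an `app_id` key) and the `tokens_by_app` helper are assumptions made for this example, not a documented export format.

```python
from collections import defaultdict

TOKEN_FIELDS = ("token_prompt", "token_completion", "token_total")

def tokens_by_app(records):
    """Sum the token counters per App ID; records without a counter count as 0."""
    totals = defaultdict(lambda: dict.fromkeys(TOKEN_FIELDS, 0))
    for record in records:
        bucket = totals[record["app_id"]]
        for field in TOKEN_FIELDS:
            bucket[field] += record.get(field, 0)
    return dict(totals)

# Hypothetical API log records for Conversation API calls.
records = [
    {"app_id": "planner", "token_prompt": 120, "token_completion": 40, "token_total": 160},
    {"app_id": "planner", "token_prompt": 300, "token_completion": 90, "token_total": 390},
    {"app_id": "summarizer", "token_prompt": 50, "token_completion": 10, "token_total": 60},
]

usage = tokens_by_app(records)
print(usage)
```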

App Graph and workflow visualizer

The console builds two visualisations on top of the signals above. They are derived views, not separate signals.

The App Graph (Application Graph) shows the live topology of a project: which App IDs invoke which other App IDs, which components each touches, and which pub/sub topics carry traffic between them. Edges are weighted by call volume and decorated with error rates, so unexpected dependencies and silently failing edges show up at a glance. It is the right place to start when you need to understand how a project actually fits together, not just how it was intended to fit together.
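The idea of edges weighted by call volume and decorated with error rates can be sketched in a few lines. This is only a conceptual model of such a graph, not how the console computes it; the `build_edges` helper and the `(source, target, ok)` call tuples are hypothetical.

```python
from collections import Counter

def build_edges(calls):
    """Aggregate (source, target, ok) call records into weighted edges."""
    volume = Counter()
    errors = Counter()
    for source, target, ok in calls:
        volume[(source, target)] += 1
        if not ok:
            errors[(source, target)] += 1
    return {
        edge: {"calls": count, "error_rate": errors[edge] / count}
        for edge, count in volume.items()
    }

# Hypothetical observed calls between App IDs and components.
calls = [
    ("orders", "payments", True),
    ("orders", "payments", False),
    ("orders", "statestore", True),
    ("payments", "statestore", True),
]

edges = build_edges(calls)
print(edges[("orders", "payments")])
```

An edge that appears here but not in your architecture diagram is an unexpected dependency; an edge with a high `error_rate` and steady volume is a silently failing one.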

The workflow visualizer is the per-instance view of a durable workflow. Every workflow execution has a recorded history (activity scheduled, activity completed, timer created, external event received), and the console renders that history as a step-by-step graph with inputs, outputs, durations, and failure reasons at each step. Because agents on Catalyst are workflows underneath, the same visualizer is the per-run view for agents as well.

Agent observability

Agents inherit two layers of observability:

  • As workflows. Every agent run has a workflow history and replay view, with each tool call surfaced as an activity in the step graph.
  • As LLM clients. Every call to the Conversation API appears in API logs with the full prompt, the model's response, and per-call token counts.

That combination is the difference between "the agent looped five times" and "the agent looped five times, here is what the model returned each time, here is which tool was selected, and here is how many tokens it cost." See Develop agents and AI agents for how those surfaces map to your code.

Exporting telemetry

The export story differs by signal and by deployment model. On Catalyst Cloud:

  • Traces can be exported to any OTLP backend through the Configuration resource shown above. This is the surface the user configures.
  • Metrics and API logs are queryable from the console and the Catalyst control plane API. Exporting them to a backend you own is not configurable on Catalyst Cloud today.
  • Application logs (Dapr runtime and app stdout/stderr) are accessible through diagrid appid logs and the console.
  • Audit logs are available on Catalyst Enterprise plans; contact Diagrid to enable export.

On Catalyst Enterprise Self-Hosted, the data plane runs in your own cluster, and you control the OpenTelemetry Collector that sits in front of it. You can forward every signal (metrics, traces, and logs) to any destination you choose.

Sampling and retention

Trace sampling is the one sampling knob you control. Set samplingRate to a value between "0" (no traces) and "1" (every call) on the Configuration resource bound to an App ID. You cannot set a sampling rate for metrics or API logs; they are collected at granularities defined by the platform.
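To build intuition for what a fractional sampling rate means, here is a sketch of one common approach: a deterministic, ratio-based decision derived from the trace ID, so that every service seeing the same trace reaches the same verdict. This is an illustration only; the page does not describe the algorithm the sidecar uses, and the `should_sample` helper is hypothetical.

```python
def should_sample(trace_id_hex: str, sampling_rate: str) -> bool:
    """Decide whether to keep a trace, given a rate string between "0" and "1"."""
    rate = float(sampling_rate)
    if not 0.0 <= rate <= 1.0:
        raise ValueError("sampling rate must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform number in [0, 2**64).
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < rate * 2**64

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, "1"))    # rate "1" keeps every trace
print(should_sample(trace_id, "0"))    # rate "0" keeps none
print(should_sample(trace_id, "0.5"))  # depends on the trace ID
```

Because the decision depends only on the trace ID and the rate, a trace is either kept in full or dropped in full, which avoids traces with missing spans.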

On Catalyst Cloud, retention depends on your plan:

  • Metrics are retained for 7 days on the free tier; longer windows are available on paid plans.
  • Application logs are retained for 3 days on the free tier; longer windows are configurable on paid plans.
  • Trace retention is governed by the backend you export to. Catalyst Cloud does not store traces beyond the export pipeline.

See Plans & support for the retention windows and limits on each plan.
