Troubleshooting and FAQ

Common issues and their fixes. For runtime issues with a running workflow or agent, also check the Workflows and Agents console pages, which surface most failures inline. For the conceptual model behind observability, see Observability.

Install and login

diagrid login doesn't open a browser

Symptom: Login appears to hang, and you see:

WARNING: error opening browser: exec: "xdg-open,x-www-browser,www-browser": executable file not found in $PATH

Cause: No default browser is available to auto-open the device-confirmation page. This is common over SSH, in CI, and inside dev containers.

Fix: The CLI prints a fallback URL and a user code. On any machine with a browser, visit https://login.diagrid.io/activate, enter the code shown in your terminal, and confirm it matches. Authentication completes as soon as the code is confirmed — no browser on the CLI host is required.

See the Diagrid CLI reference for the full command surface.

Connecting an app to Catalyst

Health check fails with an SSL certificate error

Symptom: Your app can't establish connectivity and the Dapr health check times out:

[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate
TimeoutError: Dapr health check timed out, after 60.0.

Cause: Some runtimes — Python in particular — don't read the operating-system certificate store, so they can't verify the TLS chain on your project endpoint. The 60-second health-check timeout is a downstream symptom, not the root cause.

Fix: Point the runtime at a valid CA bundle. For Python, install certifi and set both certificate environment variables to the bundle path, which python -m certifi prints:

export SSL_CERT_FILE="$(python -m certifi)"
export REQUESTS_CA_BUNDLE="$(python -m certifi)"
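
To confirm the variables took effect before starting the app, a small check along these lines may help. It is a sketch that uses only the standard library; the helper name check_ca_bundle is hypothetical and not part of any Diagrid or Dapr SDK, and it only verifies that each variable is set and points at an existing file.

```python
import os


def check_ca_bundle(env=None):
    """Return a list of problems with the CA bundle variables (empty if none).

    Hypothetical helper for illustration: it checks only that each variable
    is set and points at an existing file, not that the bundle is valid.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("SSL_CERT_FILE", "REQUESTS_CA_BUNDLE"):
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{name} points at a missing file: {path}")
    return problems


if __name__ == "__main__":
    # Print one line per problem; no output means both variables look usable.
    for problem in check_ca_bundle():
        print(problem)
```

Run it in the same shell that will start the app, so it sees the same environment.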

dapr-api-token header is missing or invalid

Symptom: Requests from your app are rejected with dapr-api-token header is missing or invalid.

Cause: The app isn't presenting the project API token, so Catalyst refuses the connection.

Fix: Ensure DAPR_API_TOKEN is set in the app's environment. When running locally, this token is supplied through your generated dev configuration — re-run your diagrid dev scaffold so it's injected, or set DAPR_API_TOKEN explicitly before starting the app.
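
One way to catch this earlier is to fail fast at startup when the variable is absent. The following is a minimal sketch, assuming a Python app; require_api_token is a hypothetical helper name, and the check only confirms that the variable is present, not that the token is valid for your project.

```python
import os
import sys


def require_api_token(env=None):
    """Return the DAPR_API_TOKEN value, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    token = (env.get("DAPR_API_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "DAPR_API_TOKEN is not set; Catalyst will reject requests from this app"
        )
    return token


if __name__ == "__main__":
    try:
        require_api_token()
    except RuntimeError as err:
        sys.exit(str(err))  # exit non-zero with the reason on stderr
    # Deliberately never print the token itself.
    print("DAPR_API_TOKEN is present")
```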

No App ID exists to connect to

Symptom: You're ready to connect your app but there's nothing to point it at — diagrid appid list returns an empty list, and commands or scaffolds that expect an App ID (such as diagrid dev scaffold) have nothing to resolve connection details against.

Cause: An App ID is the identity your app authenticates as and the entry point Catalyst routes traffic through. Until you create one in your project, there's no endpoint or API token to connect with.

Fix: Create an App ID, then generate your dev configuration against it:

diagrid appid create my-app

Confirm it's ready with diagrid appid list, then run your diagrid dev scaffold so the endpoint and DAPR_API_TOKEN are injected into your app's environment. See diagrid appid create and Manage App IDs.

See Local development for the dev-loop tools.

Workflow runtime

A workflow is stuck in Running and won't progress

Symptom: A workflow sits in Running — often with an activity that never advances past its scheduled state — and console or CLI actions against it (terminate, rerun, pause, resume, raise event) seem to do nothing or return an API error.

Cause: The workflow worker application — the App ID that hosts your orchestrator and activity code — isn't running or connected. Catalyst's managed workflow engine records and schedules the work, but activities only execute while your worker is connected to pick them up, and management actions require a running worker to take effect.

Fix: Confirm the worker App ID is up and connected, and redeploy or restart it if not. The Workflows console shows the stalled step and its last recorded event. Once the worker is healthy, in-flight instances resume on their own; if one stays wedged, terminate it and start it again with diagrid workflow rerun.

A code change breaks in-flight workflows (non-determinism error)

Symptom: After deploying a new build, existing in-flight workflows fail — typically with a non-determinism error — while brand-new instances run fine.

Cause: Catalyst Workflows rebuild state by replaying history through your orchestrator code. If your change altered a call that in-flight instances have already recorded — reordering activities; adding, removing, or renaming an already-scheduled activity; changing the parameters of an existing activity call; or changing the duration of a timer already set — replay can no longer reconcile the code with the history, and the instance fails.

Fix: Deploy only replay-safe (additive) changes against running instances. For unsafe changes, use a migration strategy — a version gate or a new workflow name — and clear any stragglers with diagrid workflow terminate. See Workflow versioning for the full replay-safety rules and migration strategies.
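
The replay behavior described above can be illustrated with a toy model. This is not the Catalyst engine or any SDK API: replay, the orchestrator functions, the activity names, and the gated helper are all hypothetical, and history is reduced to an ordered list of activity names. It shows why a reordered build fails against recorded history while an appended step or a version-gated path does not.

```python
def replay(orchestrator, history):
    """Toy replay: compare what the code requests with what history recorded.

    history is the ordered list of activity names an in-flight instance has
    already scheduled. Returns the activities still left to schedule, or
    raises if the code no longer matches the recorded steps.
    """
    requested = list(orchestrator())
    for step, recorded in enumerate(history):
        got = requested[step] if step < len(requested) else None
        if got != recorded:
            raise RuntimeError(
                f"non-determinism at step {step}: "
                f"history recorded {recorded!r}, code requested {got!r}"
            )
    return requested[len(history):]


def v1():
    yield "reserve_stock"
    yield "charge_card"


def v2_additive():  # appends a step after the existing ones
    yield "reserve_stock"
    yield "charge_card"
    yield "send_receipt"


def v2_reordered():  # swaps steps that history may already contain
    yield "charge_card"
    yield "reserve_stock"


def gated(started_on_version):
    """Version gate: an instance keeps the code path it started on."""
    def orchestrator():
        if started_on_version >= 2:
            yield from v2_reordered()
        else:
            yield from v1()
    return orchestrator


# An instance started on v1 that has already scheduled its first activity.
history = ["reserve_stock"]

print(replay(v2_additive, history))  # additive change replays cleanly
print(replay(gated(1), history))     # gated instance stays on the v1 path
try:
    replay(v2_reordered, history)
except RuntimeError as err:
    print(err)                       # reordered build no longer matches history
```

In this model the gate takes the version as an argument; how a real workflow records and reads the version an instance started on depends on your SDK and migration strategy.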

A long-running activity fails with DEADLINE_EXCEEDED

Symptom: An activity — commonly an LLM call in an agent workflow — fails with StatusCode.DEADLINE_EXCEEDED / "Deadline Exceeded", and the workflow then schedules a retry of that activity.

Cause: The activity didn't return within its execution deadline. The engine retries it per the activity's retry policy — expected durable-execution behavior — but a non-idempotent activity can repeat its side effects on each attempt.

Fix: Keep individual activities within the deadline by splitting long work into smaller activities (for agent calls, trim the prompt or context that's inflating call duration), and make activities idempotent so retries are safe. Set an appropriate retry policy on the activity call in your SDK.
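
As an illustration of making an activity safe to retry, the sketch below keys the side effect on an identifier and returns the recorded result on repeat attempts. Everything here is hypothetical: charge_card, the order_id key, and the ledger dict stand in for your own activity and a durable store, and no Dapr or Catalyst API is used.

```python
def charge_card(order_id, amount, ledger):
    """Charge once per order_id, even if the engine retries the activity.

    ledger is a hypothetical stand-in for a durable store (for example a
    table keyed by order_id); a plain dict is used here for illustration.
    """
    if order_id in ledger:
        # Retry of an attempt that already succeeded: reuse its result.
        return ledger[order_id]
    receipt = {"order_id": order_id, "amount": amount, "status": "charged"}
    ledger[order_id] = receipt  # record the result so later retries find it
    return receipt


ledger = {}
first = charge_card("order-42", 1999, ledger)
retry = charge_card("order-42", 1999, ledger)  # simulated retry after a timeout
print(first is retry, len(ledger))
```

With a real store, the existence check and the write would need to be atomic (for example a conditional insert), otherwise two concurrent attempts could both perform the charge.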

A burst of workflow starts overwhelms the worker

Symptom: Many workflows start at once (for example, one per incoming request); activities queue up and throughput degrades, even though no single instance has failed.

Cause: By default, workflow and activity invocations are unbounded — a spike of starts runs to the full parallelism the worker and backing store can sustain.

Fix: Bound concurrency with a workflow Configuration policy. Set maxConcurrentWorkflowInvocations and maxConcurrentActivityInvocations (enforced per sidecar; default unbounded) in a Configuration manifest and apply it with diagrid configuration create -f <file> --project <project>. See the Policies reference for the manifest shape and a worked example.

Agent runtime

An agent's LLM call times out, errors, or returns a truncated response

Symptom: An agent stalls, fails, or produces an incomplete answer, and it's unclear whether the model call is at fault.

Cause: Agent model calls go through the Catalyst Conversation API. A failed, slow, or truncated call surfaces there — not in your application logs.

Fix: Open API Logs and filter Dapr API = conversation (add Status = failure) for the agent's App ID. The detail panel shows the status, error message, end-to-end Execution time, and token counts. Sort by Execution time to surface the slowest calls; a small response paired with a large Completion tokens count points to truncation or an early stop rather than a hang. From the Agents page, an agent's Model configuration panel jumps straight to that agent's Conversation API calls.

Symptoms not yet documented in detail:

  • Agent loops on the same tool call: check whether the tool returns non-deterministic output. See Agent patterns.
  • Memory or session state not persisting: check the durable-agent configuration.

See Operate AI agents for inspecting running agents.

Components and managed services

max number of connections reached

Symptom: Creating or applying a component fails with:

Failed processing component "<name>": max number of connections reached, current 10 max 10

Cause: Component (infrastructure connection) limits are enforced per organization, not per project. A new or empty project still counts against the org-wide total, so you can hit the cap even when the current project has few components.

Fix: Review component usage across all projects in the organization — not just the one you're working in — and remove any that are unused, or upgrade your plan to raise the cap.

See the Components reference.

MCP

Requests to an MCP server are rejected with 401 Unauthorized

Symptom: Calls to an MCP server endpoint return 401 Unauthorized before reaching your server code.

Cause: A Dapr bearer middleware on the MCP server's appHttpPipeline validates the inbound JWT — its signature against the issuer's JWKS, plus the iss (issuer) and aud (audience) claims. A missing, expired, or mismatched token is rejected at the pipeline.

Fix: Ensure the caller presents a valid token, and that the middleware's configured JWKS endpoint, iss, and aud match what your OAuth provider issues — an aud/iss mismatch is the most common cause of a token that "looks valid" but is still rejected. See MCP authentication and Securing MCP with OAuth.
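
When a token "looks valid" but is rejected, it can help to look at the claims it actually carries. The sketch below decodes a JWT payload with the standard library and compares iss, aud, and exp against expected values. It does not verify the signature, so it is a debugging aid only and no substitute for the middleware's validation; the function names and the expected values you pass in are assumptions to fill in from your own OAuth provider.

```python
import base64
import json
import time


def decode_claims(token):
    """Decode a JWT payload WITHOUT verifying its signature (debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claim_mismatches(claims, expected_iss, expected_aud, now=None):
    """Return human-readable reasons the claims would fail iss/aud/exp checks."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != expected_iss:
        problems.append(f"iss is {claims.get('iss')!r}, expected {expected_iss!r}")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be str or list
    if expected_aud not in audiences:
        problems.append(f"aud is {aud!r}, expected {expected_aud!r}")
    exp = claims.get("exp")
    if exp is not None and exp <= now:
        problems.append("token is expired")
    return problems
```

Call decode_claims on the bearer token the caller sends, then pass the result to claim_mismatches with the issuer and audience configured on the middleware. Avoid pasting production tokens into shared logs or online decoders.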

An MCP request fails with 403 / ACCESS_DENIED, or no tools are discovered

Symptom: A client can't reach the MCP server — no tools are discovered and calls fail — with:

{
  "detail": {
    "error": "ACCESS_DENIED",
    "message": "MCP server returned HTTP 403"
  }
}

Or the server is reachable but one tool call returns a 403 Forbidden.

Cause: An App ID access policy denied the request. Authorization is enforced in Catalyst Cloud, so a denied call never reaches the server — with diagrid dev run, Catalyst returns the 403 before forwarding through the tunnel. A whole-server 403 means the calling App ID isn't allow-listed to reach the server; a tool-level 403 means the server's App ID isn't allowed to reach the downstream service that tool calls.

Fix: Allow-list the calling App ID — and, for a failing downstream tool, the server's own App ID — in the access policy. See MCP access policies and App ID access control.

Where to get help