Error Handling
How Triggo handles step failures — retries, continue-on-failure, circuit breakers, and auto-pause.
Real integrations fail. APIs go down, tokens expire, rate limits bite, networks hiccup. Triggo's executor has four layers of error handling that kick in when a step fails — each with a different purpose and a different default. Understanding them is the difference between a workflow that recovers quietly and one that wakes you up at 3am.
This page walks through what happens when a step fails, what you can configure, and how the system protects your workflows (and your wallet) from a broken integration.
When a step fails (default)
By default, a failed step halts the run. The executor records a step_failed event in the journal, marks the pipeline run as failed, and stops. Anything downstream of that step never executes.
Two important properties of this halt:
- No rollback. Completed steps keep their outputs. If step A wrote a row to Google Sheets and step B failed, the row stays written. Triggo is event-sourced, not transactional — there is no "undo" for side effects a connector already performed. Downstream steps simply don't run.
- The journal keeps everything. All prior step_completed events, their inputs, their outputs — all still there, visible in the run detail view. That's what lets you debug the failure and replay from the point of failure once you fix it.
You can change this default on a per-step basis with two knobs: retries and continue-on-failure. Both are opt-in.
Retries
Retries are off by default. The executor's retry config sets maxRetries: 0. A step that throws runs exactly once and fails. This surprises people who come from platforms where every step retries three times automatically — Triggo does not do that.
To turn retries on, set maxRetries on the step. When retries are enabled, the executor uses exponential backoff:
delay(attempt) = baseIntervalMs * 2^(attempt - 1)

The delay is capped at 30 seconds (DEFAULT_MAX_DELAY_MS). An optional symmetric jitter (jitterMs) adds ±jitterMs of random noise to spread out retries from many parallel runs.
Example: maxRetries=3, baseIntervalMs=1000
| Attempt | Raw delay (ms) | After 30 s cap |
|---|---|---|
| 1st retry (attempt 1 failed) | 1000 × 2⁰ = 1 000 | 1 000 ms (1 s) |
| 2nd retry (attempt 2 failed) | 1000 × 2¹ = 2 000 | 2 000 ms (2 s) |
| 3rd retry (attempt 3 failed) | 1000 × 2² = 4 000 | 4 000 ms (4 s) |
After the 4th attempt (original + 3 retries), the error is re-thrown and the step is marked failed. Total wall-clock cost: about 7 seconds of sleeping, plus the connector calls themselves.
The cap matters more when baseIntervalMs is large or maxRetries is high. With baseIntervalMs: 5000, maxRetries: 5 the sequence would be 5 s, 10 s, 20 s, 30 s (capped from 40 s), 30 s (capped from 80 s) — the last two attempts sit at the ceiling.
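The schedule above can be sketched in a few lines. This is a reconstruction from the formula and constants documented here, not the executor's actual source; the function name retryDelayMs is made up for illustration.

```typescript
// Backoff schedule sketch: exponential growth from baseIntervalMs,
// capped at DEFAULT_MAX_DELAY_MS, with optional symmetric jitter.
const DEFAULT_MAX_DELAY_MS = 30_000;

function retryDelayMs(
  attempt: number, // 1-based index of the attempt that just failed
  baseIntervalMs: number,
  jitterMs = 0,
): number {
  const raw = baseIntervalMs * 2 ** (attempt - 1);
  const capped = Math.min(raw, DEFAULT_MAX_DELAY_MS);
  // Symmetric jitter: +/- jitterMs of random noise, floored at 0.
  const noise = jitterMs === 0 ? 0 : (Math.random() * 2 - 1) * jitterMs;
  return Math.max(0, capped + noise);
}
```

With baseIntervalMs: 5000 you can confirm the capped tail: retryDelayMs(4, 5000) and retryDelayMs(5, 5000) both return 30 000.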
Per-step overrides
Retry config is per-step. There is no workspace-wide default and no way to turn on retries globally — you turn them on for the specific steps where retrying makes sense (transient network errors, third-party flakiness) and leave them off where it doesn't (auth errors, validation failures that won't change on a second try). An optional shouldRetry(error) => boolean predicate can further filter which errors are worth retrying; connector-level retry classification uses RETRYABLE_ERROR_CODES from @triggo/shared.
Continue-on-failure
Sometimes you want a step to fail without halting the run — a non-critical notification, an analytics write, a best-effort enrichment. Set the step's continueOnFailure flag and the executor will:
- Catch the failure (including thrown errors, returned failure results, and timeouts).
- Write a step_failed_continued journal event instead of step_failed.
- Keep the run going. Downstream nodes still execute.
This behaviour is covered by tests verifying that timeouts, thrown errors, and explicit failure results all get routed through this path when the flag is set. When it's off (the default), the same errors halt the run normally.
What downstream nodes see
A failed-but-continued step has no output. Field mappings that reference its outputs ({{failed_step.foo}}) resolve to undefined — the same behaviour as referencing a node that hasn't run. See Field Mapping for how undefined propagates: inline templates stringify it to an empty string, whole-value templates preserve it as undefined, and required-field validation at the next connector boundary is where it usually surfaces.
If you use continueOnFailure, assume downstream nodes may need to handle missing data. Put a Code node or a condition after the maybe-failing step to branch on whether it produced useful output.
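A minimal guard along those lines might look like this — the step name "enrichment" and the output shape are made up for illustration:

```typescript
// Guard sketch for a Code node placed after a continueOnFailure step.
// If the upstream step failed-but-continued, its output is undefined,
// so we fall back to a safe default instead of crashing downstream.
type StepOutputs = Record<string, unknown>;

function pickEnrichment(outputs: StepOutputs): { company: string } {
  const value = outputs["enrichment"];
  if (value === undefined) {
    // Failed-but-continued: no output was produced.
    return { company: "unknown" };
  }
  return value as { company: string };
}
```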
Circuit breaker
The circuit breaker protects integrations — not individual steps. It's per-integration (keyed by integration code, e.g. google-sheets), Redis-backed, and shared across every workflow in the workspace that uses that integration.
Thresholds:
- Failure threshold: 5 failures within the failure window trips the breaker.
- Failure window: 300 seconds (5 minutes). Failures outside this window don't count toward the threshold.
- Cooldown: 60 seconds. While open, every call to that integration short-circuits with a breaker-open error instead of hitting the remote API.
A successful call resets the failure counter. Once the cooldown expires, the breaker closes on its own — no manual intervention needed.
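The semantics can be summarised with an in-memory sketch. The real breaker is Redis-backed and shared across workers, so this class is only a model of the documented thresholds, not the implementation:

```typescript
// In-memory model of the breaker semantics described above.
const FAILURE_THRESHOLD = 5;
const FAILURE_WINDOW_MS = 300_000; // 5 minutes
const COOLDOWN_MS = 60_000;        // 1 minute

class CircuitBreaker {
  private failures: number[] = [];     // timestamps of recent failures
  private openedAt: number | null = null;

  isOpen(now: number): boolean {
    if (this.openedAt === null) return false;
    if (now - this.openedAt >= COOLDOWN_MS) {
      // Cooldown expired: the breaker closes on its own.
      this.openedAt = null;
      this.failures = [];
      return false;
    }
    return true;
  }

  recordFailure(now: number): void {
    // Only failures inside the window count toward the threshold.
    this.failures = this.failures.filter((t) => now - t < FAILURE_WINDOW_MS);
    this.failures.push(now);
    if (this.failures.length >= FAILURE_THRESHOLD) this.openedAt = now;
  }

  recordSuccess(): void {
    this.failures = []; // any success resets the counter
  }
}
```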
What you see when the breaker is open
Steps targeting the broken integration fail fast with a breaker-open error code, instead of hanging on timeouts. This is the system protecting the remote API from being hammered during an outage, and protecting your workflows from long cascading waits. If you see breaker-open errors across several runs, that's the signal — the integration itself is in trouble, not your individual step.
Recovery
Two paths:
- Wait. After 60 seconds with no new calls counted, the breaker allows requests through again. If the upstream API has recovered, the next call succeeds and the counter resets.
- Fix the root cause. If the integration is broken (expired credentials, rotated keys, schema changes), the breaker will just keep tripping when the cooldown expires. Fix the credential or the config on the Connections page, then let the next scheduled run validate the fix.
Auto-pause
The outermost layer. If a pipeline has 3 consecutive failed runs, the executor automatically stops it — status goes to stopped, webhook subscriptions are deactivated, and a system message is posted to the linked chat thread explaining why.
The threshold is CONSECUTIVE_FAILURE_THRESHOLD = 3. The counter lives in Redis (autopause:{pipelineId}:failures) and is reset on the next successful run or when the pipeline is paused.
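The counter's behaviour can be modelled in a few lines — here with an in-memory map standing in for the Redis key, and a made-up function name:

```typescript
// Model of the consecutive-failure counter behind auto-pause.
// A Map stands in for the Redis key autopause:{pipelineId}:failures.
const CONSECUTIVE_FAILURE_THRESHOLD = 3;
const failures = new Map<string, number>();

// Returns true when the pipeline should be auto-paused.
function recordRunResult(pipelineId: string, succeeded: boolean): boolean {
  if (succeeded) {
    failures.delete(pipelineId); // a successful run resets the counter
    return false;
  }
  const count = (failures.get(pipelineId) ?? 0) + 1;
  failures.set(pipelineId, count);
  return count >= CONSECUTIVE_FAILURE_THRESHOLD;
}
```

The key property is that the failures must be consecutive: a single successful run anywhere in the sequence resets the count to zero.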
Why it exists
Auto-pause is a blast-radius limiter. A pipeline that's broken — wrong credentials, a connector schema mismatch, a trigger firing on bad data — can burn through your execution quota, flood you with error notifications, and rack up spend on downstream services (OpenAI tokens, webhook deliveries, email sends) for as long as the schedule or webhook keeps firing it. Three strikes and the platform stops the bleeding.
This is a hard stop, not a retry. Once auto-paused, the pipeline stays stopped until you reactivate it.
How to recover
When a pipeline is auto-paused or a run has failed:
- Open the run detail view for the most recent failure. Read the error message on the failed step. See Debugging Runs for how to trace data flow through the run.
- Fix the underlying issue. Rotate the credential, update the field mapping, fix the downstream schema — whatever the error actually is. A retry or reactivation without a fix will just hit the same wall.
- Replay the failed run if you want to pick up where it stopped. Replay reuses the journal from the original run, so steps that already succeeded don't run twice — the executor resumes from the point of failure.
- Reactivate the pipeline from the pipeline detail view once you're confident in the fix. This re-enables webhook subscriptions and schedules.
Related
- Debugging Runs — reading the journal, finding the failed step, tracing data.
- Limits — step timeout (30 s), pipeline timeout (300 s), per-user run rate limit.
- Code Node Reference — how errors thrown from a Code node are reported.
- Field Mapping — how undefined outputs from a failed-but-continued step propagate.