Incidents

An incident is an operational failure that halts one branch of a running process and flags it for a human to resolve, rather than tearing the whole process down. It is Orchestra's answer to the "poison token" problem: a task that throws every time it runs.

How a token fails

Tokens advance asynchronously through the advance queue. If a token's task throws, the engine rolls the step back and retries it on the next queue run. A transient failure (a downstream service briefly down) clears on a later try. A deterministic failure (a bug, bad data, a permanently missing dependency) never clears: left unchecked, it would retry forever and keep the process from ever completing.

To stop that, the engine counts failed attempts. Once a token has failed to advance MAX_ADVANCE_ATTEMPTS times, it is dead-lettered: moved to the terminal error status so that it is no longer retried. The limit is configurable; see Retry policy.
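The counting rule above can be sketched as a small standalone model. This is only an illustration, not Orchestra's implementation: TokenSketch, advanceOnce() and the 'active' and 'done' status strings are hypothetical names, and the limit is passed in as a plain integer.

```php
<?php
// Illustrative model only; these are not Orchestra's actual classes.
final class TokenSketch {
  public int $failedAttempts = 0;
  public string $status = 'active'; // hypothetical status names
}

/**
 * Runs one advance attempt, as one queue run would.
 *
 * Returns TRUE if the task succeeded, FALSE if it threw or was not run.
 */
function advanceOnce(TokenSketch $token, callable $task, int $maxAttempts): bool {
  if ($token->status !== 'active') {
    // Dead-lettered (or finished) tokens are never picked up again.
    return false;
  }
  try {
    $task();
    $token->status = 'done';
    return true;
  }
  catch (\Throwable $e) {
    $token->failedAttempts++;
    if ($token->failedAttempts >= $maxAttempts) {
      // Dead-letter: terminal error status, no further retries.
      $token->status = 'error';
    }
    return false;
  }
}

// A "poison" task that throws on every run.
$poison = function (): void {
  throw new \RuntimeException('always fails');
};

$token = new TokenSketch();
for ($run = 1; $run <= 5; $run++) {
  advanceOnce($token, $poison, 3);
}
```

With a limit of 3, the token in this model is attempted three times and then left alone: the fourth and fifth queue runs skip it because it is already in error.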

What dead-lettering does

What happens next is set by the on_unrecoverable_failure setting:

incident (the default): an incident is raised for the failed branch and the instance keeps running. Its other branches advance normally; only the failed branch is parked, in error, waiting for an operator. The instance cannot complete while an incident is open.
fail: the whole instance is failed and its remaining live tokens are cancelled. This is the older all-or-nothing behavior.

Failing the whole process for one recoverable branch is usually the wrong response, so incident is the default.
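The difference between the two settings can be made concrete with a toy model of one instance and its branches. Everything here is an assumption for illustration: onDeadLetter(), the array shape, and the 'active', 'cancelled', 'running' and 'failed' labels are hypothetical, not Orchestra's data model. Only the 'incident' and 'fail' values and the error status come from the text.

```php
<?php
/**
 * Toy model of what happens to an instance when one branch is dead-lettered.
 *
 * $tokens maps a branch name to a status string (hypothetical labels).
 * $policy is the on_unrecoverable_failure value: 'incident' or 'fail'.
 */
function onDeadLetter(array $tokens, string $failedBranch, string $policy): array {
  if (!in_array($policy, ['incident', 'fail'], true)) {
    throw new \InvalidArgumentException("Unknown policy: $policy");
  }
  // Either way, the failed branch ends up in the terminal error status.
  $tokens[$failedBranch] = 'error';

  if ($policy === 'fail') {
    // All-or-nothing: cancel every token that is still live.
    foreach ($tokens as $branch => $status) {
      if ($status === 'active') {
        $tokens[$branch] = 'cancelled';
      }
    }
    return ['instance' => 'failed', 'incident_open' => false, 'tokens' => $tokens];
  }

  // 'incident': park only the failed branch; the others keep advancing.
  return ['instance' => 'running', 'incident_open' => true, 'tokens' => $tokens];
}

$branches = ['billing' => 'active', 'shipping' => 'active'];
$withIncident = onDeadLetter($branches, 'billing', 'incident');
$withFail = onDeadLetter($branches, 'billing', 'fail');
```

In this sketch the shipping branch stays active under incident and is cancelled under fail, which is the trade-off the default is chosen to avoid.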

Resolving an incident

A user with the Resolve Orchestra incidents permission sees an incident's recovery actions on the process instance page. Each acts on just that branch:

Retry: clears the failure count and re-queues the token, so its node runs again. Use it once the cause has been fixed.
Resume: corrects one or more process variables, then retries (available through the API; see below).
Skip: advances past the failing node without running its task, for when the step is no longer needed.
Cancel branch: abandons just the failed branch, letting the rest of the instance finish.
Fail instance: the explicit "give up on this run", failing the whole instance.

Resolving the last open incident lets the instance complete if nothing else is outstanding.

From code

The recovery actions are methods on WorkflowEngineInterface, so anything (a controller, a Views bulk action, a remote integration) can drive them:

$engine->retryIncident($incident, $uid);
$engine->resumeIncident($incident, ['amount' => 42], $uid);
$engine->skipIncident($incident, $uid);
$engine->cancelIncident($incident, $uid);
$engine->failFromIncident($incident, $uid);

Incidents are orchestra_incident content entities, tenant-scoped like the rest of the runtime and exposed to Views, so a multi-tenant dashboard of open incidents is just a view.
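As a rough sketch of how a bulk action might drive these methods, the helper below retries a batch of incidents and collects failures instead of stopping at the first one. Only retryIncident($incident, $uid) is taken from the list above. retryAll() and the stand-in engine are hypothetical, and it is an assumption that a recovery action signals failure by throwing. How the real engine service and incident entities are loaded is left out.

```php
<?php
/**
 * Retries every incident in a batch and reports what happened.
 *
 * $engine is any object exposing retryIncident($incident, $uid).
 */
function retryAll(object $engine, iterable $incidents, int $uid): array {
  $result = ['retried' => 0, 'failed' => []];
  foreach ($incidents as $key => $incident) {
    try {
      $engine->retryIncident($incident, $uid);
      $result['retried']++;
    }
    catch (\Throwable $e) {
      // Assumption: a failed recovery action surfaces as an exception.
      $result['failed'][$key] = $e->getMessage();
    }
  }
  return $result;
}

// Stand-in engine for demonstration; a real caller would pass the
// WorkflowEngineInterface implementation and orchestra_incident entities.
$fakeEngine = new class {
  public array $calls = [];

  public function retryIncident($incident, $uid): void {
    if ($incident === 'broken') {
      throw new \RuntimeException('cannot retry');
    }
    $this->calls[] = [$incident, $uid];
  }
};

$summary = retryAll($fakeEngine, ['a' => 'inc-1', 'b' => 'broken', 'c' => 'inc-2'], 1);
```

Catching per incident keeps one bad record from aborting the rest of the batch, which matters when an operator selects many incidents at once.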

Retry policy

How many times a failing advance is retried before dead-lettering is configurable. The site default is the max_advance_attempts setting (3). A node can override it, and add a backoff between retries, through its Retry policy in the modeler:

Max attempts: the cutoff for this node; empty uses the site default.
Backoff: a delay between retries, in seconds or as an ISO-8601 duration (e.g. PT5M); empty retries immediately. The delay is honored by cron's queue runner.
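The two fields can be read as a fallback rule and a small parsing rule. The sketch below shows one way to express them in plain PHP using the built-in DateInterval class. The function names effectiveMaxAttempts() and backoffSeconds() are hypothetical, and Orchestra may normalize these values differently.

```php
<?php
/**
 * Picks the cutoff for a node: its own value, or the site default if empty.
 */
function effectiveMaxAttempts(?int $nodeMax, int $siteDefault): int {
  return $nodeMax ?? $siteDefault;
}

/**
 * Converts a backoff value to a delay in seconds.
 *
 * Accepts an empty value (retry immediately), a whole number of seconds,
 * or an ISO-8601 duration such as PT5M.
 */
function backoffSeconds(?string $backoff): int {
  $backoff = trim((string) $backoff);
  if ($backoff === '') {
    return 0;
  }
  if (ctype_digit($backoff)) {
    return (int) $backoff;
  }
  // DateInterval throws an exception on a malformed duration string.
  $interval = new \DateInterval($backoff);
  // Measure the interval from the Unix epoch (UTC) to get plain seconds.
  $epoch = new \DateTimeImmutable('@0');
  return $epoch->add($interval)->getTimestamp();
}

$delay = backoffSeconds('PT5M');
$cutoff = effectiveMaxAttempts(null, 3);
```

Measuring from a fixed UTC instant sidesteps daylight-saving shifts for day-sized durations. Durations containing months or years have no fixed length, so a real implementation would need to decide how to treat them.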

Configuration

# orchestra.settings
on_unrecoverable_failure: incident   # or: fail
max_advance_attempts: 3

# a workflow node, under its retry key
retry:
  max_attempts: 5
  backoff: PT2M