
The Evaluation Pipeline

Every Fly-by-Wire evaluation follows the same pipeline: domain metrics enter, pass through translation and simulation, and exit as a deterministic safety decision. This page traces that flow from start to finish.

Domain Metrics → Validation → Rosetta Scaling → Engine Tick → Gate Chain → Decision

The diagram traces the pipeline's major stages. Validation and Rosetta scaling run on every evaluation. The remaining stages are conditional — some run only in the session path, some only when specific modes or features are enabled.

Before any physics runs, the pipeline validates the incoming evaluation request:

  • The request envelope must be well-formed (correct version, request ID present)
  • The license must be valid, unexpired, and matched to the current machine and domain
  • The metric snapshot must not be stale — its timestamp is checked against the policy’s maximum staleness window

If the policy requires metric signatures, the pipeline verifies an HMAC-SHA256 signature over the snapshot. This ensures metrics have not been tampered with between the source system and the evaluator.

Any validation failure short-circuits the pipeline immediately. The system does not proceed to scaling or simulation with invalid inputs. This is the first expression of the fail-closed principle: when in doubt, reject.

Valid metrics pass through the Rosetta translation layer. Each domain metric is mapped to Λ or Γ using the scaling functions defined in the calibration artifact. When secondary proxies and composite aggregation are configured, multiple metrics are combined into the final parameter values.

The output of this stage is a pair of simulation parameters — a single Λ and a single Γ — with all domain semantics stripped away.
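
As a rough illustration of the scaling step — assuming clamped linear scaling functions and a weighted-average composite, which are simplifications of whatever the calibration artifact actually defines:

```python
# Illustrative only: the scaling shape and the weighted-average composite
# are assumptions; real calibration artifacts may define other functions.

def scale(value: float, lo: float, hi: float) -> float:
    """Map a raw domain metric onto [0, 1] with a clamped linear scaling."""
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

def composite(scaled: dict[str, float], weights: dict[str, float]) -> float:
    """Combine a primary metric and secondary proxies into one parameter."""
    total = sum(weights.values())
    return sum(scaled[k] * weights[k] for k in weights) / total

# Hypothetical domain: error_rate drives Γ (inverted — more errors, lower
# stability), with latency as a secondary proxy.
gamma = composite(
    {"error_rate": 1.0 - scale(0.02, 0.0, 0.10),
     "latency": 1.0 - scale(120, 0, 500)},
    {"error_rate": 0.7, "latency": 0.3},
)
```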

In the session path, the engine advances one simulation step with the computed Λ and Γ values. This updates the agent’s position on the simulation lattice, recalculates warning signals (severity, imminence, risk, criticality), and refreshes reachability maps and projection previews.

The stateless runtime path skips this stage entirely — it proceeds directly to the state gate with the scaled values.

The gate chain is the core enforcement mechanism. Three gates run in strict precedence order. A higher-precedence gate overrides any lower gate’s decision.

The session path checks for structural hazards — catastrophic failures in the simulation topology itself:

Basin collapse occurs when the engine preview predicts a total future collapse. Every forward branch leads to a loss event. The agent has no viable path. Decision: REJECT_BASIN_COLLAPSE.

Paradox occurs in multi-agent sessions when two actors make mutually exclusive moves that create an irreconcilable state. Decision: REJECT_PARADOX.

Both hazard decisions are non-overrideable. No operator token, no policy exception, no override of any kind can convert these to a pass. They represent structural invariant violations where the simulation topology is broken. The system enforces an absolute boundary here.

In observe mode, hazards are still detected and reported in the response telemetry, but the decision remains PASS.

The state gate compares the current Γ against the configured gamma floor — a threshold set by the deployment policy (and potentially tightened by the operator).

If Γ is at or above the floor, the state gate passes. If Γ has fallen below the floor, the decision is REJECT_STATE with reason code GAMMA_BELOW_FLOOR.

This gate runs in both state_gate and state_plus_action_gate modes. In observe mode, the comparison still runs but the decision is always PASS.

In state_plus_action_gate mode, when the state gate passes and a proposed action is present, the engine previews the action. Rosetta’s ActionPhysicsMapper translates the domain action into a simulation move direction, and the engine runs a preview tick — a simulation step that does not advance real time.

If the preview shows adverse warning signals or loss events, the decision is REJECT_ACTION with reason code ACTION_PREVIEW_UNSAFE. If the preview is clean, the action passes.

The preview is non-destructive: the engine forks its state, runs the tick, examines the result, and discards the fork. The real engine state is unchanged.

If no ActionPhysicsMapper is registered for the action type, the gate falls back to the state-only result. The action is not blocked for lack of a mapper — it proceeds with whatever the state gate decided.

When the session path produces a rejection, the system generates an escalation directive based on how close Γ is to the floor — a value called gamma headroom.

  • Comfortable headroom (≥ 0.5), or no active warning: no escalation
  • Moderate headroom (0.1 to 0.5) with an active warning: REFORMULATE
  • Critical headroom (below 0.1) with an active warning: HUMAN_ESCALATION

REFORMULATE tells the integrating system that the agent should try a different approach. The metrics are close enough to the floor that the current path is dangerous, but there is room for the agent to adjust.

HUMAN_ESCALATION tells the integrating system that the situation requires human review. The system is too close to the floor for autonomous recovery. The Sentinel provides a narrative assessment, and the HITL system manages the operator workflow.

The escalation directive includes the current gamma headroom and the projected steps to breach, giving the integrating system quantitative context for routing the escalation.

When adaptive escalation is enabled in the deployment policy, the session tracks retry attempts per actor and per intent. This prevents an agent from submitting the same failing request repeatedly and overwhelming operators with duplicate escalations.

Adaptive tracking adds three capabilities:

Retry budgets. Each intent gets a configurable number of reformulation attempts for state rejections and action rejections independently. The budget decrements with each attempt. When the budget is exhausted, the escalation upgrades to HUMAN_ESCALATION.

Novelty scoring. Each retry attempt is compared against recent attempts using a weighted similarity model (40% strategy, 30% action, 20% mapped effect, 10% target). Low-novelty retries — attempts that barely differ from previous ones — consume more budget. Near-duplicate attempts consume the budget fastest. This discourages repetitive submissions.

Stall detection. If the agent’s retries are not improving gamma headroom, or if the intent has been open too long, the system escalates to human review. Specifically: consecutive attempts with insufficient headroom improvement, or an intent exceeding the maximum age, both trigger escalation.
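
The novelty model can be sketched as a weighted field match — the 40/30/20/10 weights come from the description above, while the field names, equality-based similarity, and budget cost schedule are assumptions for illustration:

```python
# Weights from the text; everything else here is an assumed simplification.
WEIGHTS = {"strategy": 0.4, "action": 0.3, "mapped_effect": 0.2, "target": 0.1}

def novelty(attempt: dict, previous: list[dict]) -> float:
    """1.0 = completely novel, 0.0 = exact duplicate of a recent attempt."""
    if not previous:
        return 1.0
    def similarity(a: dict, b: dict) -> float:
        return sum(w for field, w in WEIGHTS.items() if a.get(field) == b.get(field))
    return 1.0 - max(similarity(attempt, p) for p in previous)

def budget_cost(attempt: dict, previous: list[dict]) -> int:
    """Low-novelty retries consume more budget (cost schedule assumed)."""
    n = novelty(attempt, previous)
    if n < 0.2:
        return 3   # near-duplicate: drains the budget fastest
    if n < 0.6:
        return 2   # barely different from recent attempts
    return 1       # genuinely new approach
```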

The adaptive system also provides a suggested adjustment direction when the escalation is REFORMULATE: change target, reduce magnitude, try a different action type, try a different strategy, or wait and retry. This heuristic helps integrating systems guide agent behavior.

Adaptive tracking uses a monotonic merge rule: once an intent reaches HUMAN_ESCALATION, it never downgrades. This prevents oscillation between reformulation and escalation.

The evaluation response contains:

  • Decision: PASS, REJECT_STATE, REJECT_ACTION, REJECT_BASIN_COLLAPSE, REJECT_PARADOX, or a validation error
  • Reason code: the specific reason for the decision
  • Evaluation detail: current Γ, Λ, stability, gamma floor, engine tick, warning signals, action gate outcome, hazard gate outcome, adaptive tracking data
  • Escalation: type (REFORMULATE or HUMAN_ESCALATION), gamma headroom, steps to breach
  • Override outcome: if an override token was submitted, whether it was applied or rejected, and why
  • Policy and adapter versions: for audit trail correlation

For the full request and response schemas, see the Evaluation Lifecycle technical reference.