Evaluation Lifecycle

Every evaluation — whether through SubstrateRuntime or SubstrateSession — follows the same pipeline. The session path adds engine telemetry enrichment and additional gates.

Pipeline Stages

1. Request Validation

The pipeline validates the EvaluationRequest envelope:

envelopeVersion must be 1
requestId must be present
snapshot.metrics must contain the metrics expected by the calibration artifact
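
The three envelope checks above can be sketched as a small function. This is an illustrative sketch only: `validate_request` and `expected_metrics` are hypothetical names, and the real pipeline's error codes for this stage are not documented here, so the sketch returns plain messages.

```python
from typing import Any


def validate_request(request: dict[str, Any], expected_metrics: set[str]) -> list[str]:
    """Return a list of validation problems; an empty list means the envelope passed."""
    problems = []
    if request.get("envelopeVersion") != 1:
        problems.append("envelopeVersion must be 1")
    if not request.get("requestId"):
        problems.append("requestId must be present")
    # Tolerate a missing snapshot or metrics object so every problem is reported at once.
    metrics = (request.get("snapshot") or {}).get("metrics") or {}
    missing = sorted(expected_metrics - set(metrics))
    if missing:
        problems.append("snapshot.metrics is missing: " + ", ".join(missing))
    return problems
```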

2. License Check

The runtime verifies the license is valid:

Not expired
Machine fingerprint matches
Domain is permitted for the licensed artifact

If any check fails, the pipeline short-circuits with REJECT_LICENSE.

3. Metric Freshness

The snapshot.timestamp is compared against the policy’s metricStalenessMaxMs. If the snapshot is older than the allowed window, the pipeline returns REJECT_STALE_METRICS.
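
A minimal sketch of the freshness comparison, assuming the timestamp is an RFC 3339 string with an explicit offset (a timestamp without one would make the subtraction fail) and that the clock is passed in by the caller. `is_stale` is a hypothetical helper, not part of the documented API.

```python
from datetime import datetime, timezone


def is_stale(snapshot_timestamp: str, max_staleness_ms: int, now: datetime) -> bool:
    """True when the snapshot is older than the allowed staleness window."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11, so normalize it.
    taken_at = datetime.fromisoformat(snapshot_timestamp.replace("Z", "+00:00"))
    age_ms = (now - taken_at).total_seconds() * 1000
    return age_ms > max_staleness_ms


# Usage: pass the policy's metricStalenessMaxMs and the current UTC time.
stale = is_stale("2024-01-01T12:00:00Z", 2000, datetime.now(timezone.utc))
```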

4. Metric Signature (Optional)

If the policy has requireMetricSignature: true, the snapshot.signature is verified against the HMAC shared secret. Failure returns REJECT_INVALID_SIGNATURE.
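
The sketch below shows one way a caller might produce and check a signature in the `hmac-sha256:<base64>` form used by the request schema. The exact bytes that are signed are not specified in this section; the canonical form used here (timestamp and metrics as compact, key-sorted JSON) is an assumption, as are all function names.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def canonical_bytes(timestamp: str, metrics: dict) -> bytes:
    # ASSUMPTION: the signed payload is compact JSON with sorted keys.
    body = {"timestamp": timestamp, "metrics": metrics}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_snapshot(secret: bytes, timestamp: str, metrics: dict) -> str:
    digest = hmac.new(secret, canonical_bytes(timestamp, metrics), hashlib.sha256).digest()
    return "hmac-sha256:" + base64.b64encode(digest).decode("ascii")


def verify_snapshot(secret: bytes, timestamp: str, metrics: dict,
                    signature: Optional[str]) -> bool:
    if not signature or not signature.startswith("hmac-sha256:"):
        return False
    expected = sign_snapshot(secret, timestamp, metrics)
    # Constant-time comparison avoids leaking how many leading bytes matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```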

5. Metric Scaling

Domain metrics are mapped to simulation parameters using the calibration artifact’s scaling functions:

Primary proxies: lambdaScaling maps a metric to $\lambda$; gammaScaling maps a metric to $\gamma$
Secondary proxies: Additional metrics contribute to $\lambda$ or $\gamma$ via their own scaling functions
Composite aggregation (optional): Multiple scaled values are combined using weighted mean, weighted max, weighted min, or product aggregation
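
As an illustration of composite aggregation, the sketch below combines already-scaled values under the four named modes. The mode strings and the reading of "weighted max" and "weighted min" as the extreme of weight × value are assumptions; the artifact format may define them differently.

```python
import math


def aggregate(values: list[float], weights: list[float], method: str) -> float:
    """Combine scaled proxy values into a single parameter value."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and the same length")
    weighted = [v * w for v, w in zip(values, weights)]
    if method == "weighted_mean":
        return sum(weighted) / sum(weights)
    if method == "weighted_max":  # ASSUMPTION: max over weight * value
        return max(weighted)
    if method == "weighted_min":  # ASSUMPTION: min over weight * value
        return min(weighted)
    if method == "product":       # weights are ignored in this sketch
        return math.prod(values)
    raise ValueError(f"unknown aggregation method: {method}")
```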

Each scaling function applies one of five deterministic functions:

Function	Behavior
`linear`	Linear interpolation between input and output ranges
`log`	Logarithmic scaling (compresses high-end inputs)
`sigmoid`	S-curve transition with configurable steepness ($k$)
`inverse`	Inverse relationship (high input → low output)
`step`	Binary threshold — below midpoint maps to output min, above maps to output max

When clamp is true, inputs outside inputRange are clamped to the nearest bound before scaling.
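
To make the clamping behavior concrete, here is a sketch of the `linear` and `step` functions. The function names are hypothetical, and two details are assumptions: the step midpoint is taken as the center of inputRange, and an input exactly at the midpoint maps to the output max. The `log`, `sigmoid`, and `inverse` formulas are not given in this section and are left out.

```python
Range = tuple[float, float]


def clamp_input(x: float, input_range: Range) -> float:
    lo, hi = input_range
    return min(max(x, lo), hi)


def scale_linear(x: float, input_range: Range, output_range: Range, clamp: bool = True) -> float:
    """Linear interpolation from input_range onto output_range."""
    if clamp:
        x = clamp_input(x, input_range)
    in_lo, in_hi = input_range
    out_lo, out_hi = output_range
    t = (x - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)


def scale_step(x: float, input_range: Range, output_range: Range, clamp: bool = True) -> float:
    """Binary threshold at the midpoint of input_range (assumed)."""
    if clamp:
        x = clamp_input(x, input_range)
    midpoint = (input_range[0] + input_range[1]) / 2
    return output_range[0] if x < midpoint else output_range[1]
```

With clamping on, an input of 150 against an input range of (0, 100) is treated as 100; with clamping off, the linear function extrapolates past the output range.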

6. Engine Tick (Session Only)

In the session path, the engine advances one tick with the computed $\lambda$ and $\gamma$ values. This updates:

Agent positions and reachability maps
Warning signals (severity, imminence, risk inertia, risk optimized, criticality)
Projection previews (drift path, optimal path)
Timeline event markers

7. Escalation Check (Session Only)

The session examines gamma headroom — the distance between current $\gamma$ and the floor — to generate escalation directives:

Condition	Escalation
Warning inactive or headroom $\geq 0.5$	None
Warning active and headroom in $[0.1, 0.5)$	`REFORMULATE`
Warning active and headroom $< 0.1$	`HUMAN_ESCALATION`

The escalation directive includes gammaHeadroom and stepsToBreach for downstream routing.
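
The table above translates directly into a small decision function. This is a sketch: `baseline_escalation` is a hypothetical name, and headroom is computed as current gamma minus the floor, following the definition in this step.

```python
from typing import Optional


def baseline_escalation(warning_active: bool, gamma: float, gamma_floor: float) -> Optional[str]:
    """Map gamma headroom to an escalation directive, or None when no escalation applies."""
    headroom = gamma - gamma_floor
    if not warning_active or headroom >= 0.5:
        return None
    if headroom >= 0.1:
        return "REFORMULATE"
    return "HUMAN_ESCALATION"
```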

7a. Adaptive Tracking (Session Only, `adaptiveEscalation.enabled`)

When adaptive escalation is active, the session performs additional tracking after the baseline escalation check. This step runs only for REJECT_STATE and REJECT_ACTION decisions; terminal decisions (REJECT_LICENSE, REJECT_STALE_METRICS, REJECT_BASIN_COLLAPSE, REJECT_PARADOX) and PASS decisions skip adaptive tracking.

Validation: The request must include intentId (returns MISSING_INTENT_ID if absent). If strategyFingerprint exceeds 4 KiB serialized, the evaluation returns STRATEGY_FINGERPRINT_TOO_LARGE.

Retry tracking: The session maintains a per-actor, per-intent retry ledger. Each attempt:

Computes a deterministic failureFingerprint (SHA-256 of actor, intent, action, strategy, mapped move, and outcome)
Scores novelty against the attemptWindowSize most recent attempts using weighted similarity (40% strategy, 30% action, 20% mapped effect, 10% target)
Applies a budget cost: baseline 1.0 for novel attempts, lowScoreBudgetCost for weak reformulations, veryLowScoreBudgetCost for near-duplicates
Decrements the appropriate retry budget (rejectStateMaxReformulations or rejectActionMaxReformulations)
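
The retry-tracking steps above can be sketched as follows. Several details are assumptions made for illustration: the fingerprint input is serialized as a JSON array, per-component similarity is an exact-match test, novelty is one minus the highest similarity in the window, and the two novelty thresholds that select a budget cost are hypothetical parameters.

```python
import hashlib
import json
from typing import Optional

# Weights from the text: 40% strategy, 30% action, 20% mapped effect, 10% target.
WEIGHTS = {"strategy": 0.4, "action": 0.3, "mapped_effect": 0.2, "target": 0.1}


def failure_fingerprint(actor, intent, action, strategy, mapped_move, outcome) -> str:
    # ASSUMPTION: fields are serialized as a compact JSON array before hashing.
    payload = json.dumps([actor, intent, action, strategy, mapped_move, outcome],
                         sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def similarity(a: dict, b: dict) -> float:
    # ASSUMPTION: a component counts as similar only when it matches exactly.
    return sum(w for key, w in WEIGHTS.items() if a.get(key) == b.get(key))


def novelty_score(attempt: dict, history: list[dict], window_size: int) -> Optional[float]:
    recent = history[-window_size:] if window_size > 0 else []
    if not recent:
        return None  # first attempt: nothing to compare against
    return 1.0 - max(similarity(attempt, prev) for prev in recent)


def retry_cost(novelty: Optional[float], low_threshold: float, very_low_threshold: float,
               low_cost: float, very_low_cost: float) -> float:
    if novelty is None or novelty >= low_threshold:
        return 1.0            # novel attempt: baseline cost
    if novelty < very_low_threshold:
        return very_low_cost  # near-duplicate
    return low_cost           # weak reformulation
```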

Escalation routing: The adaptive step produces an escalationRecommended value using a monotonic merge: once an intent qualifies for HUMAN_ESCALATION, the recommendation is never downgraded. The recommendation is HUMAN_ESCALATION if any of these conditions is true:

Baseline escalation is already HUMAN_ESCALATION
A prior attempt for this intent already reached HUMAN_ESCALATION (sticky)
Immediate-human thresholds are met (gammaHeadroomLte, stepsToBreachLte, criticalityGte)
Retry budget is exhausted
Repeated fingerprint limit reached
Stall detection triggered (flat attempts or intent age exceeded)

Otherwise, the recommendation is REFORMULATE and the response includes a suggestedAdjustmentDirection heuristic:

Direction	When suggested
`CHANGE_TARGET`	Same target tried, different action type available
`REDUCE_MAGNITUDE`	Same strategy but payload could be smaller
`DIFFERENT_ACTION_TYPE`	Same action type exhausted
`DIFFERENT_STRATEGY`	Default or first attempt
`WAIT_AND_RETRY`	Strategy changed but headroom is negative
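
The routing rule described in this step (HUMAN_ESCALATION if any qualifying condition holds, REFORMULATE otherwise) can be sketched as a pure function over precomputed signals. The `AdaptiveSignals` type and its field names are hypothetical; how each signal is derived is outside this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveSignals:
    baseline: Optional[str] = None          # result of the baseline escalation check
    prior_human_escalation: bool = False    # sticky: an earlier attempt already escalated
    immediate_threshold_met: bool = False   # gammaHeadroomLte / stepsToBreachLte / criticalityGte
    budget_exhausted: bool = False
    repeated_fingerprint_limit: bool = False
    stalled: bool = False                   # flat attempts or intent age exceeded


def escalation_recommended(s: AdaptiveSignals) -> str:
    """Monotonic merge: any qualifying signal yields HUMAN_ESCALATION."""
    if (s.baseline == "HUMAN_ESCALATION"
            or s.prior_human_escalation
            or s.immediate_threshold_met
            or s.budget_exhausted
            or s.repeated_fingerprint_limit
            or s.stalled):
        return "HUMAN_ESCALATION"
    return "REFORMULATE"
```

Because `prior_human_escalation` feeds back into the same check, a later attempt with otherwise calm signals still resolves to HUMAN_ESCALATION, which is what keeps the merge monotonic.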

The adaptive result is written to evaluation.adaptive in the response (see Response Schema).

8. State Gate

The computed $\gamma$ is compared against the resolved gamma floor (base policy minimum, tightened by operator override if present):

$\gamma \geq$ floor → proceed to action gate
$\gamma <$ floor → REJECT_STATE / GAMMA_BELOW_FLOOR

In observe mode, the decision stays PASS even when $\gamma$ is below the floor; the evaluation detail still reports the gamma values.
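
A minimal sketch of the state gate, assuming that "tightened" means an operator override can only raise the floor (a lower override is ignored). Both function names are hypothetical.

```python
from typing import Optional


def resolve_gamma_floor(base_floor: float, operator_override: Optional[float] = None) -> float:
    # ASSUMPTION: an override may tighten (raise) the floor but never loosen it.
    if operator_override is None:
        return base_floor
    return max(base_floor, operator_override)


def state_gate_decision(gamma: float, floor: float, observe: bool = False) -> str:
    """PASS when gamma meets the floor, or unconditionally in observe mode."""
    if gamma >= floor or observe:
        return "PASS"
    return "REJECT_STATE"  # paired with reason code GAMMA_BELOW_FLOOR
```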

9. Action Gate (Session Only, `state_plus_action_gate`)

If a ProposedAction is present and an ActionPhysicsMapper is registered for the action type, the session previews the action:

The mapper translates the domain action into a simulation move direction
The engine executes a preview tick (without advancing real time)
The preview result is checked for adverse warning signals or loss events

The response includes an evaluation.actionGate block with the preview outcome.

10. Hazard Gate (Session Only)

The session checks for structural hazards:

Basin collapse: Engine preview predicts a total future collapse (loss event). Decision: REJECT_BASIN_COLLAPSE / TOTAL_FUTURE_COLLAPSE
Paradox: Multi-agent preview detects a dual-administrator paradox. Decision: REJECT_PARADOX / DUAL_ADMINISTRATOR_PARADOX

Both hazard gates use coordinated baseline policy moves for non-evaluated actors (via decide_all_moves()) rather than implicit stay, so the preview reflects realistic multi-agent dynamics.

11. HITL Override Check

If the request contains an overrideToken, the pipeline verifies it (see HITL Protocol for the full verification chain). A valid token converts REJECT_STATE or REJECT_ACTION to PASS, with the original decision preserved in overrideOutcome.

Basin collapse and paradox decisions cannot be overridden.
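
The override rule can be sketched as below. Token verification itself is covered by the HITL Protocol, so this sketch takes the verification result as a boolean input. The function name and the tuple return shape are illustrative, not the actual overrideOutcome format.

```python
from typing import Optional

OVERRIDABLE = {"REJECT_STATE", "REJECT_ACTION"}


def apply_override(decision: str, token_valid: bool) -> tuple[str, Optional[str]]:
    """Return (final decision, original decision if an override was applied)."""
    if token_valid and decision in OVERRIDABLE:
        return "PASS", decision
    # Hazard decisions and all other outcomes pass through unchanged.
    return decision, None
```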

12. Response Assembly

The pipeline assembles the EvaluationResponse with:

decision and reasonCode
evaluation detail (gamma, lambda, stability, engine tick, warning signal, action gate, hazard gate, adaptive)
escalation directive (if applicable — may be modified by adaptive tracking)
overrideOutcome (if a token was present)
policyVersion and adapterVersion for audit
timestamp of evaluation completion

Request Schema

{
  "envelopeVersion": 1,
  "requestId": "string",
  "snapshot": {
    "timestamp": "ISO 8601 / RFC 3339",
    "signature": "hmac-sha256:<base64> | null",
    "metrics": { "<metricName>": 0.0 }
  },
  "action": {
    "type": "string",
    "target": "string",
    "payload": {}
  },
  "actorId": "string | null",
  "overrideToken": { "...see HITL docs..." },
  "intentId": "string | null",
  "strategyFingerprint": {}
}

Field	Required	Notes
`envelopeVersion`	Yes	Must be `1`
`requestId`	Yes	Caller-supplied correlation ID
`snapshot.timestamp`	Yes	RFC 3339 timestamp
`snapshot.signature`	Only if policy requires	HMAC-SHA256 signature
`snapshot.metrics`	Yes	Must include metrics referenced by artifact
`action`	No	Only evaluated in `state_plus_action_gate` mode
`actorId`	No	Required for multi-actor sessions
`overrideToken`	No	Signed HITL override token
`intentId`	When adaptive enabled	Stable retry grouping key
`strategyFingerprint`	No	Opaque strategy descriptor for novelty scoring (max 4 KiB)

Response Schema

{
  "envelopeVersion": 1,
  "requestId": "string",
  "decision": "PASS | REJECT_*",
  "reasonCode": "NONE | GAMMA_BELOW_FLOOR | ...",
  "mode": "state_gate | state_plus_action_gate | observe",
  "policyVersion": 1,
  "adapterVersion": 1,
  "evaluation": {
    "currentGamma": 0.0,
    "gammaFloor": 0.0,
    "currentLambda": 0.0,
    "stability": 0.0,
    "predictedGamma": null,
    "engineTick": null,
    "warningSignal": null,
    "actionGate": null,
    "hazardGate": null,
    "adaptive": null
  },
  "escalation": null,
  "overrideOutcome": null,
  "timestamp": "ISO 8601"
}

When adaptive escalation is active and the decision is REJECT_STATE or REJECT_ACTION, the evaluation.adaptive block is populated:

{
  "adaptive": {
    "intentId": "intent-abc-123",
    "failureFingerprint": "sha256:a1b2c3...",
    "noveltyScore": 0.72,
    "retryCostApplied": 1.0,
    "retryBudgetRemaining": 2.0,
    "reformulationCount": 1,
    "escalationRecommended": "REFORMULATE",
    "suggestedAdjustmentDirection": "DIFFERENT_STRATEGY"
  }
}

Field	Type	Description
`intentId`	`string`	The intent grouping key for this attempt
`failureFingerprint`	`string`	Deterministic SHA-256 hash of this failure for dedupe
`noveltyScore`	`f64?`	Novelty score (0.0–1.0). `null` on first attempt
`retryCostApplied`	`f64`	Budget cost charged for this attempt
`retryBudgetRemaining`	`f64`	Remaining retry budget after this attempt
`reformulationCount`	`u32`	Total reformulation attempts for this intent
`escalationRecommended`	`string`	`REFORMULATE` or `HUMAN_ESCALATION`
`suggestedAdjustmentDirection`	`string?`	Actor guidance. Only present when `REFORMULATE`
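
A client consuming the response might read the adaptive block along these lines. This is a sketch under the field table above: `AdaptiveResult` and `parse_adaptive` are hypothetical client-side names, and the two nullable fields are read with `.get` so that a missing key and an explicit `null` are handled the same way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveResult:
    intent_id: str
    failure_fingerprint: str
    novelty_score: Optional[float]
    retry_cost_applied: float
    retry_budget_remaining: float
    reformulation_count: int
    escalation_recommended: str
    suggested_adjustment_direction: Optional[str]


def parse_adaptive(response: dict) -> Optional[AdaptiveResult]:
    """Return the adaptive block as a typed object, or None when it is not populated."""
    block = (response.get("evaluation") or {}).get("adaptive")
    if block is None:
        return None
    return AdaptiveResult(
        intent_id=block["intentId"],
        failure_fingerprint=block["failureFingerprint"],
        novelty_score=block.get("noveltyScore"),
        retry_cost_applied=block["retryCostApplied"],
        retry_budget_remaining=block["retryBudgetRemaining"],
        reformulation_count=block["reformulationCount"],
        escalation_recommended=block["escalationRecommended"],
        suggested_adjustment_direction=block.get("suggestedAdjustmentDirection"),
    )
```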