Evaluation Lifecycle

Every evaluation — whether through SubstrateRuntime or SubstrateSession — follows the same pipeline. The session path adds engine telemetry enrichment and additional gates.

The pipeline validates the EvaluationRequest envelope:

  • envelopeVersion must be 1
  • requestId must be present
  • snapshot.metrics must contain the metrics expected by the calibration artifact
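
The three envelope checks can be sketched as follows. This is a minimal illustration, not the runtime's actual validator: the function name `validate_envelope` and the `expected_metrics` argument (standing in for the metric names the calibration artifact references) are assumptions.

```python
from typing import Any


def validate_envelope(request: dict[str, Any], expected_metrics: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the envelope passed.

    Hypothetical helper: `expected_metrics` stands in for the metric names
    required by the calibration artifact.
    """
    problems: list[str] = []
    if request.get("envelopeVersion") != 1:
        problems.append("envelopeVersion must be 1")
    if not request.get("requestId"):
        problems.append("requestId must be present")
    snapshot = request.get("snapshot")
    metrics = snapshot.get("metrics") if isinstance(snapshot, dict) else None
    if not isinstance(metrics, dict):
        problems.append("snapshot.metrics must be an object")
    else:
        missing = sorted(expected_metrics - set(metrics))
        if missing:
            problems.append("snapshot.metrics is missing: " + ", ".join(missing))
    return problems
```

Collecting every problem instead of stopping at the first one is a design choice of this sketch; it makes caller-side debugging easier but is not something the source specifies.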

The runtime verifies the license is valid:

  • Not expired
  • Machine fingerprint matches
  • Domain is permitted for the licensed artifact

If any check fails, the pipeline short-circuits with REJECT_LICENSE.
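
A short sketch of the short-circuit behavior, assuming a hypothetical `License` record whose field names are invented for illustration (the real license format is not described here):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class License:
    # Hypothetical shape; field names are assumptions, not the real format.
    expires_at: datetime
    machine_fingerprint: str
    permitted_domains: frozenset


def check_license(lic: License, now: datetime, machine_fingerprint: str,
                  domain: str) -> Optional[str]:
    """Return "REJECT_LICENSE" if any of the three checks fails, else None."""
    checks = (
        now < lic.expires_at,                            # not expired
        machine_fingerprint == lic.machine_fingerprint,  # fingerprint matches
        domain in lic.permitted_domains,                 # domain is permitted
    )
    return None if all(checks) else "REJECT_LICENSE"
```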

The snapshot.timestamp is compared against the policy’s metricStalenessMaxMs. If the snapshot is older than the allowed window, the pipeline returns REJECT_STALE_METRICS.
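
The staleness comparison might look like the sketch below. The helper name `is_stale` is hypothetical, and the code assumes the timestamp carries a UTC offset or a trailing `Z`; whether a snapshot exactly at the limit counts as stale is also an assumption (here it does not).

```python
from datetime import datetime, timezone


def is_stale(snapshot_timestamp: str, metric_staleness_max_ms: int, now: datetime) -> bool:
    """True when the snapshot is older than the policy's allowed window."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    ts = datetime.fromisoformat(snapshot_timestamp.replace("Z", "+00:00"))
    age_ms = (now - ts).total_seconds() * 1000.0
    return age_ms > metric_staleness_max_ms
```

A caller would map a `True` result to `REJECT_STALE_METRICS`.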

If the policy has requireMetricSignature: true, the snapshot.signature is verified against the HMAC shared secret. Failure returns REJECT_INVALID_SIGNATURE.
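
A minimal sketch of HMAC-SHA256 verification using the `hmac-sha256:<base64>` format from the request schema. The canonicalization (sorted-key compact JSON over `timestamp` and `metrics`) is an assumption; the bytes the real runtime signs are not specified here.

```python
import base64
import hashlib
import hmac
import json

PREFIX = "hmac-sha256:"


def canonical_bytes(snapshot: dict) -> bytes:
    # Assumption: the signed payload is the timestamp plus metrics,
    # serialized as compact JSON with sorted keys.
    payload = {"timestamp": snapshot["timestamp"], "metrics": snapshot["metrics"]}
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_snapshot(snapshot: dict, secret: bytes) -> str:
    digest = hmac.new(secret, canonical_bytes(snapshot), hashlib.sha256).digest()
    return PREFIX + base64.b64encode(digest).decode("ascii")


def verify_snapshot(snapshot: dict, secret: bytes) -> bool:
    """False maps to REJECT_INVALID_SIGNATURE when the policy requires a signature."""
    signature = snapshot.get("signature")
    if not isinstance(signature, str) or not signature.startswith(PREFIX):
        return False
    expected = sign_snapshot(snapshot, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
```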

Domain metrics are mapped to simulation parameters using the calibration artifact’s scaling functions:

  • Primary proxies: lambdaScaling maps a metric to λ; gammaScaling maps a metric to γ
  • Secondary proxies: Additional metrics contribute to λ or γ via their own scaling functions
  • Composite aggregation (optional): Multiple scaled values are combined using weighted mean, weighted max, weighted min, or product aggregation
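
One possible reading of the four aggregation modes is sketched below. The exact formulas are assumptions: "weighted max" and "weighted min" are taken as the extreme of `value * weight`, and "product" as the product of `value ** weight`. The real artifact may define them differently.

```python
import math


def aggregate(scaled: list, method: str) -> float:
    """Combine (value, weight) pairs into one parameter value.

    The formulas are illustrative assumptions, not the artifact's definitions.
    """
    if not scaled:
        raise ValueError("at least one scaled value is required")
    if method == "weighted_mean":
        total_weight = sum(w for _, w in scaled)
        return sum(v * w for v, w in scaled) / total_weight
    if method == "weighted_max":
        return max(v * w for v, w in scaled)
    if method == "weighted_min":
        return min(v * w for v, w in scaled)
    if method == "product":
        return math.prod(v ** w for v, w in scaled)
    raise ValueError(f"unknown aggregation method: {method}")
```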

Each scaling function applies one of five deterministic functions:

| Function | Behavior |
| --- | --- |
| linear | Linear interpolation between input and output ranges |
| log | Logarithmic scaling (compresses high-end inputs) |
| sigmoid | S-curve transition with configurable steepness (k) |
| inverse | Inverse relationship (high input → low output) |
| step | Binary threshold: below midpoint maps to output min, above maps to output max |

When clamp is true, inputs outside inputRange are clamped to the nearest bound before scaling.
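
The five functions and the clamp behavior can be illustrated with the sketch below. Only the qualitative behavior comes from the table above; the concrete curves (a `log1p`-based log curve, a sigmoid centered on the midpoint, `1 - t` for inverse) and the default `k` are assumptions chosen for illustration.

```python
import math


def scale(x: float, fn: str, input_range: tuple, output_range: tuple,
          clamp: bool = True, k: float = 10.0) -> float:
    """Map x from input_range to output_range with one of five curve shapes."""
    in_min, in_max = input_range
    out_min, out_max = output_range
    if clamp:
        # Clamp to the nearest input bound before scaling.
        x = min(max(x, in_min), in_max)
    t = (x - in_min) / (in_max - in_min)  # normalized position; 0..1 when clamped
    if fn == "linear":
        u = t
    elif fn == "log":
        # Concave curve: compresses the high end (assumes t > -1/9).
        u = math.log1p(9.0 * t) / math.log(10.0)
    elif fn == "sigmoid":
        u = 1.0 / (1.0 + math.exp(-k * (t - 0.5)))
    elif fn == "inverse":
        u = 1.0 - t
    elif fn == "step":
        u = 0.0 if t < 0.5 else 1.0
    else:
        raise ValueError(f"unknown scaling function: {fn}")
    return out_min + u * (out_max - out_min)
```

For example, `scale(15, "linear", (0, 10), (0, 1))` clamps the input to 10 and returns the output maximum.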

In the session path, the engine advances one tick with the computed λ and γ values. This updates:

  • Agent positions and reachability maps
  • Warning signals (severity, imminence, risk inertia, risk optimized, criticality)
  • Projection previews (drift path, optimal path)
  • Timeline event markers

The session examines gamma headroom (the distance between the current γ and the floor) to generate escalation directives:

| Condition | Escalation |
| --- | --- |
| Warning inactive or headroom ≥ 0.5 | None |
| Warning active and headroom in [0.1, 0.5) | REFORMULATE |
| Warning active and headroom < 0.1 | HUMAN_ESCALATION |

The escalation directive includes gammaHeadroom and stepsToBreach for downstream routing.
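
The headroom thresholds translate directly into a small decision function. In this sketch the function name and the `type` key of the returned directive are assumptions; `gammaHeadroom` and `stepsToBreach` are the fields named above.

```python
from typing import Optional


def baseline_escalation(warning_active: bool, gamma: float, gamma_floor: float,
                        steps_to_breach: Optional[int] = None) -> Optional[dict]:
    """Return an escalation directive, or None when no escalation applies."""
    headroom = gamma - gamma_floor
    if not warning_active or headroom >= 0.5:
        return None
    kind = "REFORMULATE" if headroom >= 0.1 else "HUMAN_ESCALATION"
    return {
        "type": kind,  # key name is hypothetical
        "gammaHeadroom": headroom,
        "stepsToBreach": steps_to_breach,
    }
```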

7a. Adaptive Tracking (Session Only, adaptiveEscalation.enabled)

When adaptive escalation is active, the session performs additional tracking after the baseline escalation check. This step only runs for REJECT_STATE and REJECT_ACTION decisions — terminal decisions (REJECT_LICENSE, REJECT_STALE_METRICS, REJECT_BASIN_COLLAPSE, REJECT_PARADOX) and PASS decisions skip adaptive tracking.

Validation: The request must include intentId (returns MISSING_INTENT_ID if absent). If strategyFingerprint exceeds 4 KiB serialized, the evaluation returns STRATEGY_FINGERPRINT_TOO_LARGE.

Retry tracking: The session maintains a per-actor, per-intent retry ledger. Each attempt:

  1. Computes a deterministic failureFingerprint (SHA-256 of actor, intent, action, strategy, mapped move, and outcome)
  2. Scores novelty against the attemptWindowSize most recent attempts using weighted similarity (40% strategy, 30% action, 20% mapped effect, 10% target)
  3. Applies a budget cost: baseline 1.0 for novel attempts, lowScoreBudgetCost for weak reformulations, veryLowScoreBudgetCost for near-duplicates
  4. Decrements the appropriate retry budget (rejectStateMaxReformulations or rejectActionMaxReformulations)
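
The ledger steps above can be sketched as follows. The 40/30/20/10 weights and the baseline cost of 1.0 come from the list; everything else is an assumption made for illustration: the `Attempt` record, exact-match component similarity, novelty as one minus the highest similarity in the window, and the threshold and cost defaults in `retry_cost`.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    # Hypothetical record; field names are assumptions.
    actor: str
    intent: str
    action: str
    strategy: str
    mapped_move: str
    target: str
    outcome: str


# Percent weights: strategy 40, action 30, mapped effect 20, target 10.
WEIGHTS = {"strategy": 40, "action": 30, "mapped_move": 20, "target": 10}


def failure_fingerprint(a: Attempt) -> str:
    """Deterministic SHA-256 over actor, intent, action, strategy, move, outcome."""
    fields = [a.actor, a.intent, a.action, a.strategy, a.mapped_move, a.outcome]
    return "sha256:" + hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()


def novelty_score(attempt: Attempt, history: list, window: int) -> Optional[float]:
    """1.0 is fully novel, 0.0 repeats a recent attempt; None on a first attempt."""
    recent = history[-window:] if window > 0 else []
    if not recent:
        return None

    def similarity(other: Attempt) -> float:
        matched = sum(w for f, w in WEIGHTS.items()
                      if getattr(attempt, f) == getattr(other, f))
        return matched / 100.0

    return 1.0 - max(similarity(o) for o in recent)


def retry_cost(novelty: Optional[float], low_threshold: float = 0.5,
               very_low_threshold: float = 0.2, low_score_budget_cost: float = 1.5,
               very_low_score_budget_cost: float = 2.0) -> float:
    """Thresholds and cost defaults are placeholders, not documented values."""
    if novelty is None or novelty >= low_threshold:
        return 1.0  # baseline cost for novel attempts
    if novelty < very_low_threshold:
        return very_low_score_budget_cost  # near-duplicate
    return low_score_budget_cost  # weak reformulation
```

The remaining budget is then the previous budget minus `retry_cost(...)`, tracked separately for `REJECT_STATE` and `REJECT_ACTION` outcomes.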

Escalation routing: The adaptive step produces an escalationRecommended value using monotonic merge — once HUMAN_ESCALATION is qualified, it is never downgraded. The recommendation is HUMAN_ESCALATION if any of these conditions is true:

  • Baseline escalation is already HUMAN_ESCALATION
  • A prior attempt for this intent already reached HUMAN_ESCALATION (sticky)
  • Immediate-human thresholds are met (gammaHeadroomLte, stepsToBreachLte, criticalityGte)
  • Retry budget is exhausted
  • Repeated fingerprint limit reached
  • Stall detection triggered (flat attempts or intent age exceeded)
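
The any-of rule above can be sketched as a single predicate. The `AdaptiveSignals` container and its field names are hypothetical; each field stands for one bullet in the list, and the sticky flag is what makes the merge monotonic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveSignals:
    # Hypothetical container; one field per condition in the list above.
    baseline: Optional[str] = None
    prior_reached_human: bool = False            # sticky from an earlier attempt
    immediate_threshold_met: bool = False        # gammaHeadroomLte / stepsToBreachLte / criticalityGte
    retry_budget_remaining: float = 1.0
    repeated_fingerprint_limit_reached: bool = False
    stall_detected: bool = False


def escalation_recommended(s: AdaptiveSignals) -> str:
    """HUMAN_ESCALATION if any trigger fires; otherwise REFORMULATE."""
    triggers = (
        s.baseline == "HUMAN_ESCALATION",
        s.prior_reached_human,
        s.immediate_threshold_met,
        s.retry_budget_remaining <= 0.0,
        s.repeated_fingerprint_limit_reached,
        s.stall_detected,
    )
    return "HUMAN_ESCALATION" if any(triggers) else "REFORMULATE"
```

A caller that persists `prior_reached_human = True` whenever the result is `HUMAN_ESCALATION` gets the never-downgrade behavior for later attempts on the same intent.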

Otherwise, the recommendation is REFORMULATE and the response includes a suggestedAdjustmentDirection heuristic:

| Direction | When suggested |
| --- | --- |
| CHANGE_TARGET | Same target tried, different action type available |
| REDUCE_MAGNITUDE | Same strategy but payload could be smaller |
| DIFFERENT_ACTION_TYPE | Same action type exhausted |
| DIFFERENT_STRATEGY | Default or first attempt |
| WAIT_AND_RETRY | Strategy changed but headroom is negative |

The adaptive result is written to evaluation.adaptive in the response (see Response Schema).

The computed γ is compared against the resolved gamma floor (base policy minimum, tightened by operator override if present):

  • γ ≥ floor → proceed to action gate
  • γ < floor → REJECT_STATE / GAMMA_BELOW_FLOOR

In observe mode, the decision stays PASS but the evaluation detail still contains the gamma values.
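
A compact sketch of the floor resolution and the gate. Two assumptions are made here: "tightened" is read as "the override can only raise the floor", and in observe mode the reason code is reported as `NONE`.

```python
from typing import Optional


def resolve_gamma_floor(policy_minimum: float,
                        operator_override: Optional[float] = None) -> float:
    # Assumption: an override may tighten (raise) the floor, never loosen it.
    if operator_override is None:
        return policy_minimum
    return max(policy_minimum, operator_override)


def state_gate(gamma: float, floor: float, mode: str) -> dict:
    """Compare gamma to the floor; observe mode reports values but never rejects."""
    rejected = gamma < floor and mode != "observe"
    return {
        "decision": "REJECT_STATE" if rejected else "PASS",
        "reasonCode": "GAMMA_BELOW_FLOOR" if rejected else "NONE",
        "currentGamma": gamma,
        "gammaFloor": floor,
    }
```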

9. Action Gate (Session Only, state_plus_action_gate)

If a ProposedAction is present and an ActionPhysicsMapper is registered for the action type, the session previews the action:

  1. The mapper translates the domain action into a simulation move direction
  2. The engine executes a preview tick (without advancing real time)
  3. The preview result is checked for adverse warning signals or loss events

The response includes an evaluation.actionGate block with the preview outcome.

The session checks for structural hazards:

  • Basin collapse: Engine preview predicts a total future collapse (loss event). Decision: REJECT_BASIN_COLLAPSE / TOTAL_FUTURE_COLLAPSE
  • Paradox: Multi-agent preview detects a dual-administrator paradox. Decision: REJECT_PARADOX / DUAL_ADMINISTRATOR_PARADOX

Both hazard gates use coordinated baseline policy moves for non-evaluated actors (via decide_all_moves()) rather than implicit stay, so the preview reflects realistic multi-agent dynamics.

If the request contains an overrideToken, the pipeline verifies it (see HITL Protocol for the full verification chain). A valid token converts REJECT_STATE or REJECT_ACTION to PASS, with the original decision preserved in overrideOutcome.

Basin collapse and paradox decisions are not overrideable.
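
The conversion rule can be sketched as below, with token verification reduced to a boolean because the verification chain lives in the HITL Protocol. The keys inside `overrideOutcome` are hypothetical, as is resetting the reason code to `NONE` after a successful override.

```python
OVERRIDEABLE = {"REJECT_STATE", "REJECT_ACTION"}


def apply_override(decision: str, reason_code: str, token_valid: bool) -> dict:
    """Convert an overrideable rejection to PASS when the token is valid."""
    applied = token_valid and decision in OVERRIDEABLE
    return {
        "decision": "PASS" if applied else decision,
        "reasonCode": "NONE" if applied else reason_code,
        "overrideOutcome": {
            # Hypothetical keys; the original decision is preserved for audit.
            "applied": applied,
            "originalDecision": decision,
            "originalReasonCode": reason_code,
        },
    }
```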

The pipeline assembles the EvaluationResponse with:

  • decision and reasonCode
  • evaluation detail (gamma, lambda, stability, engine tick, warning signal, action gate, hazard gate, adaptive)
  • escalation directive (if applicable — may be modified by adaptive tracking)
  • overrideOutcome (if a token was present)
  • policyVersion and adapterVersion for audit
  • timestamp of evaluation completion

An EvaluationRequest envelope has the following shape:

    {
      "envelopeVersion": 1,
      "requestId": "string",
      "snapshot": {
        "timestamp": "ISO 8601 / RFC 3339",
        "signature": "hmac-sha256:<base64> | null",
        "metrics": { "<metricName>": 0.0 }
      },
      "action": {
        "type": "string",
        "target": "string",
        "payload": {}
      },
      "actorId": "string | null",
      "overrideToken": { "...see HITL docs..." },
      "intentId": "string | null",
      "strategyFingerprint": {}
    }

| Field | Required | Notes |
| --- | --- | --- |
| envelopeVersion | Yes | Must be 1 |
| requestId | Yes | Caller-supplied correlation ID |
| snapshot.timestamp | Yes | RFC 3339 timestamp |
| snapshot.signature | Only if policy requires | HMAC-SHA256 signature |
| snapshot.metrics | Yes | Must include metrics referenced by artifact |
| action | No | Only evaluated in state_plus_action_gate mode |
| actorId | No | Required for multi-actor sessions |
| overrideToken | No | Signed HITL override token |
| intentId | When adaptive enabled | Stable retry grouping key |
| strategyFingerprint | No | Opaque strategy descriptor for novelty scoring (max 4 KiB) |

Response Schema

    {
      "envelopeVersion": 1,
      "requestId": "string",
      "decision": "PASS | REJECT_*",
      "reasonCode": "NONE | GAMMA_BELOW_FLOOR | ...",
      "mode": "state_gate | state_plus_action_gate | observe",
      "policyVersion": 1,
      "adapterVersion": 1,
      "evaluation": {
        "currentGamma": 0.0,
        "gammaFloor": 0.0,
        "currentLambda": 0.0,
        "stability": 0.0,
        "predictedGamma": null,
        "engineTick": null,
        "warningSignal": null,
        "actionGate": null,
        "hazardGate": null,
        "adaptive": null
      },
      "escalation": null,
      "overrideOutcome": null,
      "timestamp": "ISO 8601"
    }

When adaptive escalation is active and the decision is REJECT_STATE or REJECT_ACTION, the evaluation.adaptive block is populated:

    {
      "adaptive": {
        "intentId": "intent-abc-123",
        "failureFingerprint": "sha256:a1b2c3...",
        "noveltyScore": 0.72,
        "retryCostApplied": 1.0,
        "retryBudgetRemaining": 2.0,
        "reformulationCount": 1,
        "escalationRecommended": "REFORMULATE",
        "suggestedAdjustmentDirection": "DIFFERENT_STRATEGY"
      }
    }

| Field | Type | Description |
| --- | --- | --- |
| intentId | string | The intent grouping key for this attempt |
| failureFingerprint | string | Deterministic SHA-256 hash of this failure for dedupe |
| noveltyScore | f64? | Novelty score (0.0–1.0). null on first attempt |
| retryCostApplied | f64 | Budget cost charged for this attempt |
| retryBudgetRemaining | f64 | Remaining retry budget after this attempt |
| reformulationCount | u32 | Total reformulation attempts for this intent |
| escalationRecommended | string | REFORMULATE or HUMAN_ESCALATION |
| suggestedAdjustmentDirection | string? | Actor guidance. Only present when REFORMULATE |
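
On the caller side, the adaptive block can drive routing. The sketch below is one way a client might consume the response; the function name and the returned labels (`proceed`, `reformulate`, `human`, `stop`) are assumptions, not part of the API.

```python
from typing import Optional, Tuple


def next_step(response: dict) -> Tuple[str, Optional[str]]:
    """Map an evaluation response to a caller action and optional guidance."""
    decision = response["decision"]
    if decision == "PASS":
        return ("proceed", None)
    adaptive = (response.get("evaluation") or {}).get("adaptive")
    if decision in ("REJECT_STATE", "REJECT_ACTION") and adaptive:
        if adaptive["escalationRecommended"] == "HUMAN_ESCALATION":
            return ("human", None)
        # suggestedAdjustmentDirection is only present for REFORMULATE.
        return ("reformulate", adaptive.get("suggestedAdjustmentDirection"))
    # Terminal rejections, or rejections without adaptive data: do not retry.
    return ("stop", None)
```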