Evaluation Lifecycle
Every evaluation — whether through SubstrateRuntime or SubstrateSession — follows the same pipeline. The session path adds engine telemetry enrichment and additional gates.
Pipeline Stages
1. Request Validation
The pipeline validates the EvaluationRequest envelope:
- envelopeVersion must be 1
- requestId must be present
- snapshot.metrics must contain the metrics expected by the calibration artifact
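The three envelope checks can be sketched as follows. This is a minimal illustration, not the runtime's validator: the function name `validate_request` and the error strings are hypothetical, and the list of expected metric names is assumed to come from the calibration artifact.

```python
from typing import Any, Iterable


def validate_request(request: dict[str, Any], expected_metrics: Iterable[str]) -> list[str]:
    """Return a list of problems; an empty list means the envelope passed.

    Hypothetical helper: the message strings are illustrative only.
    """
    problems: list[str] = []
    if request.get("envelopeVersion") != 1:
        problems.append("envelopeVersion must be 1")
    if not request.get("requestId"):
        problems.append("requestId must be present")
    snapshot = request.get("snapshot")
    metrics = snapshot.get("metrics") if isinstance(snapshot, dict) else None
    if not isinstance(metrics, dict):
        metrics = {}
    # Every metric the calibration artifact expects must be in the snapshot.
    missing = sorted(name for name in expected_metrics if name not in metrics)
    if missing:
        problems.append("snapshot.metrics is missing: " + ", ".join(missing))
    return problems
```

Collecting all problems instead of stopping at the first one is a design choice of this sketch; it lets a caller fix a malformed envelope in one pass.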
2. License Check
The runtime verifies the license is valid:
- Not expired
- Machine fingerprint matches
- Domain is permitted for the licensed artifact
If any check fails, the pipeline short-circuits with REJECT_LICENSE.
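As a rough sketch of the short-circuit, the three conditions can be combined into one check. The `License` shape and the `check_license` name are assumptions made for illustration; the actual license format is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class License:
    # Hypothetical shape, used only for this illustration.
    expires_at: datetime
    machine_fingerprint: str
    permitted_domains: frozenset


def check_license(lic: License, machine_fingerprint: str, domain: str,
                  now: datetime) -> Optional[str]:
    """Return None when all three checks pass, otherwise "REJECT_LICENSE"."""
    valid = (
        now < lic.expires_at                                  # not expired
        and lic.machine_fingerprint == machine_fingerprint    # fingerprint matches
        and domain in lic.permitted_domains                   # domain is permitted
    )
    return None if valid else "REJECT_LICENSE"
```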
3. Metric Freshness
The snapshot.timestamp is compared against the policy’s metricStalenessMaxMs. If the snapshot is older than the allowed window, the pipeline returns REJECT_STALE_METRICS.
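A minimal sketch of the staleness comparison, assuming the timestamp carries an explicit UTC offset and that a snapshot exactly at the limit still counts as fresh (the boundary behavior is an assumption). The function name `is_stale` is hypothetical.

```python
from datetime import datetime, timezone


def is_stale(snapshot_timestamp: str, metric_staleness_max_ms: int, now: datetime) -> bool:
    """True when the snapshot is older than the allowed window.

    `now` must be timezone-aware. A trailing "Z" is rewritten to "+00:00"
    so that older Python versions can parse it with fromisoformat().
    """
    taken_at = datetime.fromisoformat(snapshot_timestamp.replace("Z", "+00:00"))
    age_ms = (now - taken_at).total_seconds() * 1000.0
    return age_ms > metric_staleness_max_ms
```

A caller would map a `True` result to REJECT_STALE_METRICS.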
4. Metric Signature (Optional)
If the policy has requireMetricSignature: true, the snapshot.signature is verified against the HMAC shared secret. Failure returns REJECT_INVALID_SIGNATURE.
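The verification can be illustrated with Python's standard `hmac` module. The `hmac-sha256:<base64>` prefix follows the request schema; the canonical byte form that gets signed (compact JSON with sorted keys over timestamp and metrics) is purely an assumption of this sketch, as are the function names.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def canonical_bytes(timestamp: str, metrics: dict) -> bytes:
    # ASSUMPTION: the signed message is compact, key-sorted JSON.
    # The real canonicalization may differ.
    body = {"metrics": metrics, "timestamp": timestamp}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_snapshot(secret: bytes, timestamp: str, metrics: dict) -> str:
    digest = hmac.new(secret, canonical_bytes(timestamp, metrics), hashlib.sha256).digest()
    return "hmac-sha256:" + base64.b64encode(digest).decode("ascii")


def verify_snapshot(secret: bytes, timestamp: str, metrics: dict,
                    signature: Optional[str]) -> bool:
    """False maps to REJECT_INVALID_SIGNATURE when the policy requires a signature."""
    if not signature or not signature.startswith("hmac-sha256:"):
        return False
    expected = sign_snapshot(secret, timestamp, metrics)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```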
5. Metric Scaling
Domain metrics are mapped to simulation parameters using the calibration artifact’s scaling functions:
- Primary proxies: lambdaScaling maps a metric to λ; gammaScaling maps a metric to γ
- Secondary proxies: Additional metrics contribute to λ or γ via their own scaling functions
- Composite aggregation (optional): Multiple scaled values are combined using weighted mean, weighted max, weighted min, or product aggregation
Each scaling function applies one of five deterministic functions:
| Function | Behavior |
|---|---|
| linear | Linear interpolation between input and output ranges |
| log | Logarithmic scaling (compresses high-end inputs) |
| sigmoid | S-curve transition with configurable steepness |
| inverse | Inverse relationship (high input → low output) |
| step | Binary threshold: below midpoint maps to output min, above maps to output max |
When clamp is true, inputs outside inputRange are clamped to the nearest bound before scaling.
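To make the five function types concrete, here is one possible shape for a scaling function. Only the linear, step, and clamp behavior follow directly from the descriptions above; the exact log, sigmoid, and inverse formulas, the default steepness, the treatment of an input exactly at the midpoint, and the `scale` signature are all assumptions of this sketch.

```python
import math


def scale(value: float, input_range: tuple, output_range: tuple,
          function: str = "linear", clamp: bool = True, steepness: float = 10.0) -> float:
    in_lo, in_hi = input_range
    out_lo, out_hi = output_range
    if in_hi == in_lo:
        raise ValueError("inputRange must have nonzero width")
    if clamp:
        # Clamp to the nearest bound before scaling.
        value = min(max(value, in_lo), in_hi)
    t = (value - in_lo) / (in_hi - in_lo)  # normalized position, 0..1 when clamped
    if function == "linear":
        u = t
    elif function == "log":
        # ASSUMED curve: concave, so high-end inputs are compressed.
        u = math.log1p(9.0 * t) / math.log(10.0)
    elif function == "sigmoid":
        # ASSUMED logistic centered on the midpoint of the input range.
        u = 1.0 / (1.0 + math.exp(-steepness * (t - 0.5)))
    elif function == "inverse":
        # ASSUMED mirrored linear map: high input gives low output.
        u = 1.0 - t
    elif function == "step":
        # ASSUMPTION: an input exactly at the midpoint maps to output max.
        u = 0.0 if t < 0.5 else 1.0
    else:
        raise ValueError("unknown scaling function: " + function)
    return out_lo + u * (out_hi - out_lo)
```

With clamping disabled, the log and sigmoid branches of this sketch can fail on far out-of-range inputs, which is one practical reason to keep clamp enabled.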
6. Engine Tick (Session Only)
In the session path, the engine advances one tick with the computed λ and γ values. This updates:
- Agent positions and reachability maps
- Warning signals (severity, imminence, risk inertia, risk optimized, criticality)
- Projection previews (drift path, optimal path)
- Timeline event markers
7. Escalation Check (Session Only)
The session examines gamma headroom (the distance between the current γ and the floor) to generate escalation directives:
| Condition | Escalation |
|---|---|
| Warning inactive or headroom | None |
| Warning active and headroom in | REFORMULATE |
| Warning active and headroom | HUMAN_ESCALATION |
The escalation directive includes gammaHeadroom and stepsToBreach for downstream routing.
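The three-way split can be sketched as a banded check on headroom. The band boundaries `reformulate_below` and `human_below` are hypothetical parameters, and both the ordering (less headroom means stronger escalation) and the inclusive or exclusive edges are assumptions of this illustration.

```python
from typing import Optional


def baseline_escalation(warning_active: bool, gamma_headroom: float,
                        reformulate_below: float, human_below: float) -> Optional[str]:
    """Return None, "REFORMULATE", or "HUMAN_ESCALATION".

    ASSUMPTION: human_below <= reformulate_below, and headroom shrinking
    toward the floor moves the result from None to HUMAN_ESCALATION.
    """
    if not warning_active or gamma_headroom >= reformulate_below:
        return None
    if gamma_headroom >= human_below:
        return "REFORMULATE"
    return "HUMAN_ESCALATION"
```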
7a. Adaptive Tracking (Session Only, adaptiveEscalation.enabled)
When adaptive escalation is active, the session performs additional tracking after the baseline escalation check. This step runs only for REJECT_STATE and REJECT_ACTION decisions. Terminal decisions (REJECT_LICENSE, REJECT_STALE_METRICS, REJECT_BASIN_COLLAPSE, REJECT_PARADOX) and PASS decisions skip adaptive tracking.
Validation: The request must include intentId (returns MISSING_INTENT_ID if absent). If strategyFingerprint exceeds 4 KiB serialized, the evaluation returns STRATEGY_FINGERPRINT_TOO_LARGE.
Retry tracking: The session maintains a per-actor, per-intent retry ledger. Each attempt:
- Computes a deterministic failureFingerprint (SHA-256 of actor, intent, action, strategy, mapped move, and outcome)
- Scores novelty against the attemptWindowSize most recent attempts using weighted similarity (40% strategy, 30% action, 20% mapped effect, 10% target)
- Applies a budget cost: baseline 1.0 for novel attempts, lowScoreBudgetCost for weak reformulations, veryLowScoreBudgetCost for near-duplicates
- Decrements the appropriate retry budget (rejectStateMaxReformulations or rejectActionMaxReformulations)
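The fingerprint and novelty steps above can be sketched as follows. The 40/30/20/10 weights come from the text; everything else is an assumption: the JSON serialization fed to SHA-256, exact-match comparison per component, and novelty defined as one minus the highest similarity in the window. The function and key names are hypothetical.

```python
import hashlib
import json
from typing import Optional

# Similarity weights from the retry-tracking description.
WEIGHTS = {"strategy": 0.4, "action": 0.3, "mapped_effect": 0.2, "target": 0.1}


def failure_fingerprint(actor, intent, action, strategy, mapped_move, outcome) -> str:
    # ASSUMED serialization: compact, key-sorted JSON keeps the hash deterministic.
    payload = json.dumps([actor, intent, action, strategy, mapped_move, outcome],
                         sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def similarity(a: dict, b: dict) -> float:
    # ASSUMED: a component contributes its full weight only on an exact match.
    return sum(w for field, w in WEIGHTS.items() if a.get(field) == b.get(field))


def novelty_score(attempt: dict, history: list, attempt_window_size: int) -> Optional[float]:
    """None on the first attempt, otherwise a value between 0.0 and 1.0."""
    window = history[-attempt_window_size:] if attempt_window_size > 0 else []
    if not window:
        return None
    return 1.0 - max(similarity(attempt, prev) for prev in window)
```

Under these assumptions, an attempt that changes only its strategy scores 0.4, while an exact repeat scores approximately 0.0 and would be charged as a near-duplicate.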
Escalation routing: The adaptive step produces an escalationRecommended value using monotonic merge: once HUMAN_ESCALATION is qualified, it is never downgraded. The recommendation is HUMAN_ESCALATION if any of these conditions is true:
- Baseline escalation is already HUMAN_ESCALATION
- A prior attempt for this intent already reached HUMAN_ESCALATION (sticky)
- Immediate-human thresholds are met (gammaHeadroomLte, stepsToBreachLte, criticalityGte)
- Retry budget is exhausted
- Repeated fingerprint limit reached
- Stall detection triggered (flat attempts or intent age exceeded)
Otherwise, the recommendation is REFORMULATE and the response includes a suggestedAdjustmentDirection heuristic:
| Direction | When suggested |
|---|---|
| CHANGE_TARGET | Same target tried, different action type available |
| REDUCE_MAGNITUDE | Same strategy but payload could be smaller |
| DIFFERENT_ACTION_TYPE | Same action type exhausted |
| DIFFERENT_STRATEGY | Default or first attempt |
| WAIT_AND_RETRY | Strategy changed but headroom is negative |
The adaptive result is written to evaluation.adaptive in the response (see Response Schema).
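The escalation routing rule in this step amounts to an "any trigger wins" check combined with a merge that can only move upward. A minimal sketch, with hypothetical function and parameter names standing in for the six conditions:

```python
from typing import Optional

# Ordering used by the monotonic merge: a higher rank is never replaced by a lower one.
RANK = {None: 0, "REFORMULATE": 1, "HUMAN_ESCALATION": 2}


def merge_escalation(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    """Keep whichever value ranks higher, so HUMAN_ESCALATION is sticky."""
    return candidate if RANK[candidate] > RANK[current] else current


def recommend(baseline: Optional[str], prior_reached_human: bool,
              immediate_thresholds_met: bool, budget_exhausted: bool,
              repeat_limit_reached: bool, stalled: bool) -> str:
    triggers = (
        baseline == "HUMAN_ESCALATION",
        prior_reached_human,
        immediate_thresholds_met,
        budget_exhausted,
        repeat_limit_reached,
        stalled,
    )
    return "HUMAN_ESCALATION" if any(triggers) else "REFORMULATE"
```

Folding each new recommendation through `merge_escalation` is one way to guarantee that a later, milder result cannot downgrade an intent that already qualified for human review.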
8. State Gate
The computed γ is compared against the resolved gamma floor (base policy minimum, tightened by operator override if present):
- γ ≥ floor → proceed to action gate
- γ < floor → REJECT_STATE / GAMMA_BELOW_FLOOR
In observe mode, the decision stays PASS but the evaluation detail still contains the gamma values.
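A sketch of the state gate, assuming that "tightened" means an operator override can raise the floor but never lower it, and that a gamma exactly equal to the floor passes. The function names and the shape of the returned dictionary are illustrative; only the decision and reason-code strings come from the text.

```python
from typing import Optional


def resolve_gamma_floor(policy_floor: float, operator_override: Optional[float] = None) -> float:
    # ASSUMPTION: an override may only tighten (raise) the policy minimum.
    if operator_override is None:
        return policy_floor
    return max(policy_floor, operator_override)


def state_gate(gamma: float, gamma_floor: float, mode: str = "state_gate") -> dict:
    detail = {"currentGamma": gamma, "gammaFloor": gamma_floor}
    if gamma < gamma_floor and mode != "observe":
        return {"decision": "REJECT_STATE", "reasonCode": "GAMMA_BELOW_FLOOR",
                "evaluation": detail}
    # In observe mode the decision stays PASS, but the gamma values are still reported.
    return {"decision": "PASS", "evaluation": detail}
```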
9. Action Gate (Session Only, state_plus_action_gate)
If a ProposedAction is present and an ActionPhysicsMapper is registered for the action type, the session previews the action:
- The mapper translates the domain action into a simulation move direction
- The engine executes a preview tick (without advancing real time)
- The preview result is checked for adverse warning signals or loss events
The response includes an evaluation.actionGate block with the preview outcome.
10. Hazard Gate (Session Only)
The session checks for structural hazards:
- Basin collapse: Engine preview predicts a total future collapse (loss event). Decision: REJECT_BASIN_COLLAPSE / TOTAL_FUTURE_COLLAPSE
- Paradox: Multi-agent preview detects a dual-administrator paradox. Decision: REJECT_PARADOX / DUAL_ADMINISTRATOR_PARADOX
Both hazard gates use coordinated baseline policy moves for non-evaluated actors (via decide_all_moves()) rather than implicit stay, so the preview reflects realistic multi-agent dynamics.
11. HITL Override Check
If the request contains an overrideToken, the pipeline verifies it (see HITL Protocol for the full verification chain). A valid token converts REJECT_STATE or REJECT_ACTION to PASS, with the original decision preserved in overrideOutcome.
Basin collapse and paradox decisions are not overrideable.
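The override rule can be sketched as a small decision function. Token verification itself is out of scope here and is reduced to a boolean; the field names inside `overrideOutcome` (`applied`, `originalDecision`) are hypothetical, since the actual structure is not shown in this section.

```python
from typing import Optional

# Only these two decisions can be converted to PASS by a valid token.
OVERRIDEABLE = frozenset({"REJECT_STATE", "REJECT_ACTION"})


def apply_override(decision: str, token_valid: Optional[bool]) -> dict:
    """token_valid is None when the request carried no overrideToken."""
    if token_valid is None:
        return {"decision": decision, "overrideOutcome": None}
    applied = bool(token_valid) and decision in OVERRIDEABLE
    return {
        "decision": "PASS" if applied else decision,
        # Hypothetical field names; the original decision is preserved for audit.
        "overrideOutcome": {"applied": applied, "originalDecision": decision},
    }
```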
12. Response Assembly
The pipeline assembles the EvaluationResponse with:
- decision and reasonCode
- evaluation detail (gamma, lambda, stability, engine tick, warning signal, action gate, hazard gate, adaptive)
- escalation directive (if applicable; may be modified by adaptive tracking)
- overrideOutcome (if a token was present)
- policyVersion and adapterVersion for audit
- timestamp of evaluation completion
Request Schema
```json
{
  "envelopeVersion": 1,
  "requestId": "string",
  "snapshot": {
    "timestamp": "ISO 8601 / RFC 3339",
    "signature": "hmac-sha256:<base64> | null",
    "metrics": { "<metricName>": 0.0 }
  },
  "action": { "type": "string", "target": "string", "payload": {} },
  "actorId": "string | null",
  "overrideToken": { "...see HITL docs..." },
  "intentId": "string | null",
  "strategyFingerprint": {}
}
```

| Field | Required | Notes |
|---|---|---|
| envelopeVersion | Yes | Must be 1 |
| requestId | Yes | Caller-supplied correlation ID |
| snapshot.timestamp | Yes | RFC 3339 timestamp |
| snapshot.signature | Only if policy requires | HMAC-SHA256 signature |
| snapshot.metrics | Yes | Must include metrics referenced by artifact |
| action | No | Only evaluated in state_plus_action_gate mode |
| actorId | No | Required for multi-actor sessions |
| overrideToken | No | Signed HITL override token |
| intentId | When adaptive enabled | Stable retry grouping key |
| strategyFingerprint | No | Opaque strategy descriptor for novelty scoring (max 4 KiB) |
Response Schema
```json
{
  "envelopeVersion": 1,
  "requestId": "string",
  "decision": "PASS | REJECT_*",
  "reasonCode": "NONE | GAMMA_BELOW_FLOOR | ...",
  "mode": "state_gate | state_plus_action_gate | observe",
  "policyVersion": 1,
  "adapterVersion": 1,
  "evaluation": {
    "currentGamma": 0.0,
    "gammaFloor": 0.0,
    "currentLambda": 0.0,
    "stability": 0.0,
    "predictedGamma": null,
    "engineTick": null,
    "warningSignal": null,
    "actionGate": null,
    "hazardGate": null,
    "adaptive": null
  },
  "escalation": null,
  "overrideOutcome": null,
  "timestamp": "ISO 8601"
}
```

When adaptive escalation is active and the decision is REJECT_STATE or REJECT_ACTION, the evaluation.adaptive block is populated:

```json
{
  "adaptive": {
    "intentId": "intent-abc-123",
    "failureFingerprint": "sha256:a1b2c3...",
    "noveltyScore": 0.72,
    "retryCostApplied": 1.0,
    "retryBudgetRemaining": 2.0,
    "reformulationCount": 1,
    "escalationRecommended": "REFORMULATE",
    "suggestedAdjustmentDirection": "DIFFERENT_STRATEGY"
  }
}
```

| Field | Type | Description |
|---|---|---|
| intentId | string | The intent grouping key for this attempt |
| failureFingerprint | string | Deterministic SHA-256 hash of this failure for dedupe |
| noveltyScore | f64? | Novelty score (0.0–1.0). null on first attempt |
| retryCostApplied | f64 | Budget cost charged for this attempt |
| retryBudgetRemaining | f64 | Remaining retry budget after this attempt |
| reformulationCount | u32 | Total reformulation attempts for this intent |
| escalationRecommended | string | REFORMULATE or HUMAN_ESCALATION |
| suggestedAdjustmentDirection | string? | Actor guidance. Only present when REFORMULATE |
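On the caller side, the adaptive block can drive what an actor does next. The sketch below reads only fields listed in the table above; the returned labels ("proceed", "stop", "escalate", "reformulate:...") and the fallback to DIFFERENT_STRATEGY when no direction is present are assumptions of this example, not part of the response contract.

```python
from typing import Any


def next_step(response: dict[str, Any]) -> str:
    """Map an EvaluationResponse-shaped dict to a hypothetical caller action."""
    if response.get("decision") == "PASS":
        return "proceed"
    adaptive = (response.get("evaluation") or {}).get("adaptive")
    if adaptive is None:
        # Terminal rejections carry no adaptive block, so there is nothing to retry on.
        return "stop"
    if adaptive.get("escalationRecommended") == "HUMAN_ESCALATION":
        return "escalate"
    direction = adaptive.get("suggestedAdjustmentDirection") or "DIFFERENT_STRATEGY"
    return "reformulate:" + direction
```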