
Calibration & Validation

Calibration defines the mathematical mapping between raw domain metrics and the engine's simulation parameters. A well-calibrated system produces Λ and Γ values that faithfully represent the domain's actual stability dynamics. A poorly calibrated system either misses real hazards or triggers false rejections.

The output of calibration is a calibration artifact — a JSON file that contains:

  • A primary scaling function mapping one domain metric to Λ
  • A primary scaling function mapping one domain metric to Γ
  • Optional secondary proxies that contribute additional metrics to Λ or Γ
  • Optional composite aggregation rules for combining multiple scaled values
  • Optional corpus quality metrics from validation

The artifact is loaded at runtime by the evaluation pipeline. Every evaluation uses it to translate the incoming metric snapshot into simulation parameters.

Each scaling function specifies five things: which metric to read, what mathematical function to apply, the expected input range, the desired output range, and whether to clamp out-of-range values.
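As a sketch, a single scaling-function entry in the artifact might look like the following (the field names here are illustrative assumptions, not the authoritative schema — see the Calibration Artifact reference for that):

```json
{
  "metric": "capabilityIndex",
  "function": "logarithmic",
  "inputRange": [1, 1000],
  "outputRange": [0.0, 1.0],
  "clamp": true
}
```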

Linear scaling maps the input range directly to the output range. A metric at 50% of its input range produces an output at 50% of the output range. This is the simplest and most predictable mapping — use it when every unit of change in the domain metric produces a proportional change in simulation behavior.
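As a sketch in Python (function and parameter names are illustrative, not the engine's API):

```python
def scale_linear(x, in_lo, in_hi, out_lo, out_hi):
    # Position of x within the input range, re-expressed in the output range.
    t = (x - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)

# 50% of the way through the input range lands at 50% of the output range.
print(scale_linear(5.0, 0.0, 10.0, 0.0, 1.0))  # → 0.5
```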

Logarithmic scaling compresses the upper portion of the input range. The first doubling of the input produces a larger output change than the last doubling. This is appropriate for metrics with diminishing sensitivity at high values. The AI Safety adapter uses logarithmic scaling for capabilityIndex because the stability impact of going from capability 1 to 100 is far greater than going from 900 to 1000.
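The exact curve is an implementation detail the text does not pin down; one plausible sketch normalises with `log1p` so the floor of the input range maps to the floor of the output range:

```python
import math

def scale_log(x, in_lo, in_hi, out_lo, out_hi):
    # Compress the upper portion of the input range: early increases in x
    # move the output more than equal-sized late increases.
    t = math.log1p(x - in_lo) / math.log1p(in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)

# capabilityIndex on [1, 1000]: the climb from 1 to 100 moves the output
# far more than the climb from 900 to 1000.
low_jump = scale_log(100, 1, 1000, 0, 1) - scale_log(1, 1, 1000, 0, 1)
high_jump = scale_log(1000, 1, 1000, 0, 1) - scale_log(900, 1, 1000, 0, 1)
```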

Sigmoid scaling produces an S-curve controlled by a steepness parameter (k). Output changes slowly at both extremes and rapidly near the midpoint. Higher k values produce a sharper transition. This is useful for metrics that have a critical zone — a narrow band where small changes in the metric produce large changes in stability.
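A sketch of the logistic form, assuming the curve is centred on the midpoint of the normalised input range (names are illustrative):

```python
import math

def scale_sigmoid(x, in_lo, in_hi, out_lo, out_hi, k=10.0):
    # Normalise to [0, 1], then apply a logistic curve centred on the
    # midpoint. Larger k sharpens the transition through the critical zone.
    t = (x - in_lo) / (in_hi - in_lo)
    s = 1.0 / (1.0 + math.exp(-k * (t - 0.5)))
    # Note: the extremes approach, but never exactly reach, out_lo / out_hi.
    return out_lo + s * (out_hi - out_lo)
```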

Inverse scaling reverses the relationship: higher input values produce lower output values. Use this when the domain metric has an inverse relationship with the simulation parameter. The Military adapter uses inverse scaling for supplyLineLength because longer supply lines reduce structural stability.
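As a sketch, this is simply the linear map with the output range flipped: a long supply line (high input) contributes a low stability value.

```python
def scale_inverse(x, in_lo, in_hi, out_lo, out_hi):
    # Flipped linear map: the top of the input range produces the bottom
    # of the output range, and vice versa.
    t = (x - in_lo) / (in_hi - in_lo)
    return out_hi - t * (out_hi - out_lo)
```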

Step scaling creates a binary threshold at the midpoint of the input range. Below the midpoint, the output is at its minimum; above, it is at its maximum. This is appropriate for discrete compliance metrics or hard pass/fail conditions.
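A sketch — the behaviour of a value exactly at the midpoint is not specified above, so treating it as "at maximum" here is an assumption:

```python
def scale_step(x, in_lo, in_hi, out_lo, out_hi):
    # Hard threshold at the midpoint of the input range.
    # Boundary rule (>= vs >) is an assumption, not part of the spec.
    midpoint = (in_lo + in_hi) / 2.0
    return out_hi if x >= midpoint else out_lo
```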

When clamping is enabled, metrics that fall outside the defined input range are pinned to the nearest bound before the scaling function runs. A capability index of 1500 with an input range of [1, 1000] is treated as 1000.

When clamping is disabled, out-of-range values pass through the scaling function unconstrained and may produce output values outside the specified output range.
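Clamping itself reduces to pinning the value to the nearest bound before the scaling function runs; as a sketch:

```python
def clamp(x, lo, hi):
    # Pin out-of-range values to the nearest bound of the input range.
    return max(lo, min(hi, x))

# A capability index of 1500 with an input range of [1, 1000] becomes 1000.
print(clamp(1500, 1, 1000))  # → 1000
```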

Many domains require more than one metric to adequately capture Λ or Γ. Secondary proxies provide this. Each proxy specifies a metric, which parameter it contributes to (Λ or Γ), and its own scaling function configuration.

For example, an AI Safety deployment might define:

| Metric | Maps to | Scaling | Rationale |
| --- | --- | --- | --- |
| capabilityIndex | Λ (primary) | Logarithmic | Core growth pressure |
| autonomyLevel | Λ (secondary) | Sigmoid | Additional destabilizing factor |
| alignmentScore | Γ (primary) | Linear | Core structural stability |
| guardrailCoverage | Γ (secondary) | Linear | Additional stabilizing factor |
| humanOversightFreq | Γ (secondary) | Logarithmic | Additional stabilizing factor |
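Expressed in the artifact, a configuration like the table above might look roughly as follows (field names are illustrative assumptions, not the actual schema):

```json
{
  "lambda": {
    "primary": { "metric": "capabilityIndex", "function": "logarithmic" },
    "secondary": [
      { "metric": "autonomyLevel", "function": "sigmoid" }
    ]
  },
  "gamma": {
    "primary": { "metric": "alignmentScore", "function": "linear" },
    "secondary": [
      { "metric": "guardrailCoverage", "function": "linear" },
      { "metric": "humanOversightFreq", "function": "logarithmic" }
    ]
  }
}
```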

When secondary proxies are present, the system needs a rule for combining multiple scaled values into a single Λ or Γ. The composite aggregation block defines this rule.

Four aggregation strategies are available:

Weighted mean — each proxy’s scaled value is multiplied by its weight, and the results are summed and divided by the total weight. This is the most common strategy and produces a balanced combination.

Weighted max — each proxy’s scaled value is multiplied by its weight, and the highest result is used. This strategy is conservative: the most dangerous contributor dominates.

Weighted min — each proxy’s scaled value is multiplied by its weight, and the lowest result is used. This strategy captures bottleneck dynamics: the weakest contributor dominates.

Product — proxy values are raised to the power of their weights and multiplied together. This strategy produces strong interaction effects: if any proxy drops to zero, the entire result drops to zero.
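The four strategies can be sketched in a few lines of Python (the strategy names here are illustrative, not the artifact's actual identifiers):

```python
import math

def aggregate(values, weights, strategy):
    # Combine the scaled proxy values into a single Λ or Γ.
    weighted = [v * w for v, w in zip(values, weights)]
    if strategy == "weighted_mean":
        return sum(weighted) / sum(weights)
    if strategy == "weighted_max":
        return max(weighted)  # most dangerous contributor dominates
    if strategy == "weighted_min":
        return min(weighted)  # weakest contributor dominates
    if strategy == "product":
        # Exponent-weighted product: any zero-valued proxy zeroes the result.
        return math.prod(v ** w for v, w in zip(values, weights))
    raise ValueError(f"unknown strategy: {strategy!r}")
```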

When no composite block is defined, the system uses a default aggregation strategy for combining primary and secondary proxy values.

A calibration corpus is a dataset of known-outcome scenarios used to validate that the scaling functions produce accurate simulation results. Each scenario specifies input metrics and the expected evaluation outcome (pass, reject, hazard class).

The corpus serves two purposes:

  1. Validation. After defining scaling functions, run the corpus through the evaluation pipeline and compare predicted outcomes against known outcomes. This catches calibration errors before deployment.

  2. Regression testing. When scaling functions are modified — adjusted ranges, changed function types, added secondary proxies — the corpus ensures the changes do not break previously correct evaluations.
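A minimal sketch of a corpus validation pass, assuming the evaluation pipeline can be invoked as a function from a metrics dict to an outcome string (the scenario field names are illustrative):

```python
def validate_corpus(scenarios, evaluate):
    # Run every known-outcome scenario through the evaluation pipeline and
    # collect mismatches between predicted and expected outcomes.
    failures = []
    for scenario in scenarios:
        predicted = evaluate(scenario["metrics"])
        if predicted != scenario["expected"]:
            failures.append((scenario["id"], scenario["expected"], predicted))
    return failures

# Toy stand-in pipeline: reject whenever risk exceeds 0.5.
corpus = [
    {"id": "safe-01", "metrics": {"risk": 0.1}, "expected": "pass"},
    {"id": "hazard-01", "metrics": {"risk": 0.9}, "expected": "reject"},
]
toy_evaluate = lambda m: "reject" if m["risk"] > 0.5 else "pass"
print(validate_corpus(corpus, toy_evaluate))  # → []
```

An empty failure list means the calibration reproduces every known outcome; a non-empty list pinpoints exactly which scenarios regressed.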

Three metrics quantify calibration quality:

Mean agent F1 — the average F1 score across all agents in the corpus. F1 balances precision (of the scenarios the system rejected, how many truly warranted rejection?) with recall (of the scenarios that warranted rejection, how many did the system catch?). A mean F1 above 0.90 indicates strong calibration.

Mean absolute delta — the average absolute difference between the predicted Γ and the observed Γ across all corpus scenarios. This measures how closely the scaling functions track reality. A mean absolute delta below 0.05 indicates tight calibration.

Within-tolerance percentage — the percentage of corpus evaluations where the predicted outcome matches the expected outcome within a defined tolerance band. This is the broadest quality measure. A within-tolerance percentage above 90% is the minimum for production deployment.
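The three measures are straightforward to compute; a sketch, with the per-agent aggregation and the tolerance-band definition simplified to flat lists (an assumption for illustration):

```python
def f1(tp, fp, fn):
    # Balance precision (were the rejections correct?) against
    # recall (were all required rejections caught?).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mean_absolute_delta(predicted, observed):
    # Average |predicted Γ − observed Γ| across corpus scenarios.
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def within_tolerance_pct(predicted, observed, tol=0.05):
    # Percentage of evaluations landing inside the tolerance band.
    hits = sum(abs(p - o) <= tol for p, o in zip(predicted, observed))
    return 100.0 * hits / len(predicted)
```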

Good calibration has three observable properties:

  1. No false rejections in the corpus. The system does not block actions that the corpus defines as safe. False rejections erode operator trust and cause workarounds.

  2. No missed hazards in the corpus. The system catches every scenario that the corpus defines as dangerous. Missed hazards defeat the purpose of the safety layer.

  3. Stable behavior at boundaries. Metrics near the edges of their input ranges produce sensible Λ and Γ values — no spikes, no discontinuities, no extreme sensitivity to small changes.

For the full calibration artifact schema, see the Calibration Artifact reference.