
Calibration & Validation

Calibration defines the mathematical mapping between raw domain metrics and the engine's simulation parameters. A well-calibrated system produces Λ and Γ values that faithfully represent the domain's actual stability dynamics. A poorly calibrated system either misses real hazards or triggers false rejections.

The output of calibration is a calibration artifact — a JSON file that contains:

  • A primary scaling function mapping one domain metric to Λ
  • A primary scaling function mapping one domain metric to Γ
  • Optional secondary proxies that contribute additional metrics to Λ or Γ
  • Optional composite aggregation rules for combining multiple scaled values
  • Optional corpus quality metrics from validation

The artifact is loaded at runtime by the evaluation pipeline. Every evaluation uses it to translate the incoming metric snapshot into simulation parameters.

Each scaling function specifies five things: which metric to read, what mathematical function to apply, the expected input range, the desired output range, and whether to clamp out-of-range values.
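As a sketch, a single scaling-function entry in the artifact might look like the following (the field names here are illustrative assumptions, not the authoritative schema — see the Calibration Artifact reference for that):

```json
{
  "metric": "capabilityIndex",
  "function": "logarithmic",
  "inputRange": [1, 1000],
  "outputRange": [0.0, 1.0],
  "clamp": true
}
```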

Linear scaling maps the input range directly to the output range. A metric at 50% of its input range produces an output at 50% of the output range. This is the simplest and most predictable mapping — use it when every unit of change in the domain metric produces a proportional change in simulation behavior.
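As a sketch in Python (function and parameter names are illustrative, not the engine's API):

```python
def scale_linear(x, in_lo, in_hi, out_lo, out_hi):
    # Position of x within the input range, re-expressed in the output range.
    t = (x - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)

# 50% of the way through the input range lands at 50% of the output range.
print(scale_linear(5.0, 0.0, 10.0, 0.0, 1.0))  # → 0.5
```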

Logarithmic scaling compresses the upper portion of the input range. The first doubling of the input produces a larger output change than the last doubling. This is appropriate for metrics with diminishing sensitivity at high values. The AI Safety adapter uses logarithmic scaling for capabilityIndex because the stability impact of going from capability 1 to 100 is far greater than going from 900 to 1000.
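The exact curve is an implementation detail the text does not pin down; one plausible sketch normalises with `log1p` so the floor of the input range maps to the floor of the output range:

```python
import math

def scale_log(x, in_lo, in_hi, out_lo, out_hi):
    # Compress the upper portion of the input range: early increases in x
    # move the output more than equal-sized late increases.
    t = math.log1p(x - in_lo) / math.log1p(in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)

# capabilityIndex on [1, 1000]: the climb from 1 to 100 moves the output
# far more than the climb from 900 to 1000.
low_jump = scale_log(100, 1, 1000, 0, 1) - scale_log(1, 1, 1000, 0, 1)
high_jump = scale_log(1000, 1, 1000, 0, 1) - scale_log(900, 1, 1000, 0, 1)
```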

Sigmoid scaling produces an S-curve controlled by a steepness parameter (k). Output changes slowly at both extremes and rapidly near the midpoint. Higher k values produce a sharper transition. This is useful for metrics that have a critical zone — a narrow band where small changes in the metric produce large changes in stability.
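A sketch of the logistic form, assuming the curve is centred on the midpoint of the normalised input range (names are illustrative):

```python
import math

def scale_sigmoid(x, in_lo, in_hi, out_lo, out_hi, k=10.0):
    # Normalise to [0, 1], then apply a logistic curve centred on the
    # midpoint. Larger k sharpens the transition through the critical zone.
    t = (x - in_lo) / (in_hi - in_lo)
    s = 1.0 / (1.0 + math.exp(-k * (t - 0.5)))
    # Note: the extremes approach, but never exactly reach, out_lo / out_hi.
    return out_lo + s * (out_hi - out_lo)
```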

Inverse scaling reverses the relationship: higher input values produce lower output values. Use this when the domain metric has an inverse relationship with the simulation parameter. The Military adapter uses inverse scaling for supplyLineLength because longer supply lines reduce structural stability.
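As a sketch, this is simply the linear map with the output range flipped: a long supply line (high input) contributes a low stability value.

```python
def scale_inverse(x, in_lo, in_hi, out_lo, out_hi):
    # Flipped linear map: the top of the input range produces the bottom
    # of the output range, and vice versa.
    t = (x - in_lo) / (in_hi - in_lo)
    return out_hi - t * (out_hi - out_lo)
```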

Step scaling creates a binary threshold at the midpoint of the input range. Below the midpoint, the output is at its minimum; above, it is at its maximum. This is appropriate for discrete compliance metrics or hard pass/fail conditions.
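A sketch — the behaviour of a value exactly at the midpoint is not specified above, so treating it as "at maximum" here is an assumption:

```python
def scale_step(x, in_lo, in_hi, out_lo, out_hi):
    # Hard threshold at the midpoint of the input range.
    # Boundary rule (>= vs >) is an assumption, not part of the spec.
    midpoint = (in_lo + in_hi) / 2.0
    return out_hi if x >= midpoint else out_lo
```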

When clamping is enabled, metrics that fall outside the defined input range are pinned to the nearest bound before the scaling function runs. A capability index of 1500 with an input range of [1, 1000] is treated as 1000.

When clamping is disabled, out-of-range values pass through the scaling function unconstrained and may produce output values outside the specified output range.
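Clamping itself reduces to pinning the value to the nearest bound before the scaling function runs; as a sketch:

```python
def clamp(x, lo, hi):
    # Pin out-of-range values to the nearest bound of the input range.
    return max(lo, min(hi, x))

# A capability index of 1500 with an input range of [1, 1000] becomes 1000.
print(clamp(1500, 1, 1000))  # → 1000
```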

Many domains require more than one metric to adequately capture Λ or Γ. Secondary proxies provide this. Each proxy specifies a metric, which parameter it contributes to (Λ or Γ), and its own scaling function configuration.

For example, an AI Safety deployment might define:

| Metric | Maps to | Scaling | Rationale |
| --- | --- | --- | --- |
| capabilityIndex | Λ (primary) | Logarithmic | Core growth pressure |
| autonomyLevel | Λ (secondary) | Sigmoid | Additional destabilizing factor |
| alignmentScore | Γ (primary) | Linear | Core structural stability |
| guardrailCoverage | Γ (secondary) | Linear | Additional stabilizing factor |
| humanOversightFreq | Γ (secondary) | Logarithmic | Additional stabilizing factor |
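Expressed in the artifact, a configuration like the table above might look roughly as follows (field names are illustrative assumptions, not the actual schema):

```json
{
  "lambda": {
    "primary": { "metric": "capabilityIndex", "function": "logarithmic" },
    "secondary": [
      { "metric": "autonomyLevel", "function": "sigmoid" }
    ]
  },
  "gamma": {
    "primary": { "metric": "alignmentScore", "function": "linear" },
    "secondary": [
      { "metric": "guardrailCoverage", "function": "linear" },
      { "metric": "humanOversightFreq", "function": "logarithmic" }
    ]
  }
}
```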

When secondary proxies are present, the system needs a rule for combining multiple scaled values into a single Λ or Γ. The composite aggregation block defines this rule.

Four aggregation strategies are available:

Weighted mean — each proxy’s scaled value is multiplied by its weight, and the results are summed and divided by the total weight. This is the most common strategy and produces a balanced combination.

Weighted max — each proxy’s scaled value is multiplied by its weight, and the highest result is used. This strategy is conservative: the most dangerous contributor dominates.

Weighted min — each proxy’s scaled value is multiplied by its weight, and the lowest result is used. This strategy captures bottleneck dynamics: the weakest contributor dominates.

Product — proxy values are raised to the power of their weights and multiplied together. This strategy produces strong interaction effects: if any proxy drops to zero, the entire result drops to zero.
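The four strategies can be sketched in a few lines of Python (the strategy names here are illustrative, not the artifact's actual identifiers):

```python
import math

def aggregate(values, weights, strategy):
    # Combine the scaled proxy values into a single Λ or Γ.
    weighted = [v * w for v, w in zip(values, weights)]
    if strategy == "weighted_mean":
        return sum(weighted) / sum(weights)
    if strategy == "weighted_max":
        return max(weighted)  # most dangerous contributor dominates
    if strategy == "weighted_min":
        return min(weighted)  # weakest contributor dominates
    if strategy == "product":
        # Exponent-weighted product: any zero-valued proxy zeroes the result.
        return math.prod(v ** w for v, w in zip(values, weights))
    raise ValueError(f"unknown strategy: {strategy!r}")
```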

When no composite block is defined, the system uses a default aggregation strategy for combining primary and secondary proxy values.

A calibration corpus is a dataset of known-outcome scenarios used to validate that the scaling functions produce accurate simulation results. Each scenario specifies input metrics and the expected evaluation outcome (pass, reject, hazard class).

The corpus serves two purposes:

  1. Validation. After defining scaling functions, run the corpus through the evaluation pipeline and compare predicted outcomes against known outcomes. This catches calibration errors before deployment.

  2. Regression testing. When scaling functions are modified — adjusted ranges, changed function types, added secondary proxies — the corpus ensures the changes do not break previously correct evaluations.
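A minimal sketch of a corpus validation pass, assuming the evaluation pipeline can be invoked as a function from a metrics dict to an outcome string (the scenario field names are illustrative):

```python
def validate_corpus(scenarios, evaluate):
    # Run every known-outcome scenario through the evaluation pipeline and
    # collect mismatches between predicted and expected outcomes.
    failures = []
    for scenario in scenarios:
        predicted = evaluate(scenario["metrics"])
        if predicted != scenario["expected"]:
            failures.append((scenario["id"], scenario["expected"], predicted))
    return failures

# Toy stand-in pipeline: reject whenever risk exceeds 0.5.
corpus = [
    {"id": "safe-01", "metrics": {"risk": 0.1}, "expected": "pass"},
    {"id": "hazard-01", "metrics": {"risk": 0.9}, "expected": "reject"},
]
toy_evaluate = lambda m: "reject" if m["risk"] > 0.5 else "pass"
print(validate_corpus(corpus, toy_evaluate))  # → []
```

An empty failure list means the calibration reproduces every known outcome; a non-empty list pinpoints exactly which scenarios regressed.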

Three metrics quantify calibration quality:

Mean agent F1 — the average F1 score across all agents in the corpus. F1 balances precision (of the scenarios the system rejected, how many truly warranted rejection?) with recall (of the scenarios that warranted rejection, how many did the system catch?). A mean F1 above 0.90 indicates strong calibration.

Mean absolute delta — the average absolute difference between the predicted Γ and the observed Γ across all corpus scenarios. This measures how closely the scaling functions track reality. A mean absolute delta below 0.05 indicates tight calibration.

Within-tolerance percentage — the percentage of corpus evaluations where the predicted outcome matches the expected outcome within a defined tolerance band. This is the broadest quality measure. A within-tolerance percentage above 90% is the minimum for production deployment.
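The three measures are straightforward to compute; a sketch, with the per-agent aggregation and the tolerance-band definition simplified to flat lists (an assumption for illustration):

```python
def f1(tp, fp, fn):
    # Balance precision (were the rejections correct?) against
    # recall (were all required rejections caught?).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mean_absolute_delta(predicted, observed):
    # Average |predicted Γ − observed Γ| across corpus scenarios.
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def within_tolerance_pct(predicted, observed, tol=0.05):
    # Percentage of evaluations landing inside the tolerance band.
    hits = sum(abs(p - o) <= tol for p, o in zip(predicted, observed))
    return 100.0 * hits / len(predicted)
```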

Good calibration has three observable properties:

  1. No false rejections in the corpus. The system does not block actions that the corpus defines as safe. False rejections erode operator trust and cause workarounds.

  2. No missed hazards in the corpus. The system catches every scenario that the corpus defines as dangerous. Missed hazards defeat the purpose of the safety layer.

  3. Stable behavior at boundaries. Metrics near the edges of their input ranges produce sensible Λ and Γ values — no spikes, no discontinuities, no extreme sensitivity to small changes.

For the full calibration artifact schema, see the Calibration Artifact reference.