AI Safety Formulation

The kairos-core engine operates on pure thermodynamics—Lambda (Capacity) vs Gamma (Latency/Chaos). To make this engine useful for AI research, the kairos-ai-safety package translates abstract ML alignment risks into rigorous engine mathematics via the AISafetyScenario object.

All scenarios require a designated name and, for rigorous reproducibility, a seed.

from kairos.domains.ai_safety import AISafetyScenario
scenario = AISafetyScenario("Deceptive Alignment Spikes", seed=1024)

Use add_model() to instantiate a frontier model entity within the simulation.

scenario.add_model(
    name="Model_A",
    capability_index=850,     # Required: 1-1000 range.
    alignment_score=92,       # Required: 0-100 range.
    autonomy_level=8,         # Optional: 0-10.
    human_oversight_freq=20,  # Optional: 0-100.
    guardrail_coverage=95,    # Optional: 0-100.
)

Under the Hood (Rosetta Translation):

  • capability_index acts as the engine’s primary destabilizing force (increasing Γ).
  • alignment_score and guardrail_coverage determine the model’s localized order constraints (increasing Λ).
  • The mathematical “Stability Equation” evaluates whether the structural alignment can contain the capability pressure.
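To build intuition for this translation, here is a toy calculation. The weights and the linear form below are illustrative assumptions for this sketch only; the actual Stability Equation is internal to kairos-core.

```python
# Illustrative sketch of the Rosetta translation (not kairos internals).

def toy_stability(capability_index, alignment_score, guardrail_coverage):
    # Capability drives the destabilizing force (Gamma).
    gamma = capability_index / 1000  # normalize the 1-1000 range
    # Alignment and guardrails drive the stabilizing capacity (Lambda).
    lam = 0.5 * (alignment_score / 100) + 0.5 * (guardrail_coverage / 100)
    # Stable when structural alignment can contain the capability pressure.
    return lam >= gamma

# Model_A from the example above: high capability, but strong alignment.
print(toy_stability(850, 92, 95))  # -> True
```

The point of the sketch is the shape of the trade-off: a model can carry a very high capability_index as long as its combined alignment and guardrail terms keep Λ above the Γ pressure.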

Use add_oversight_body() to introduce a governance entity that exerts a stabilizing force across all models in the scenario. This represents institutional oversight — safety boards, audit committees, or automated monitoring systems.

scenario.add_oversight_body(
    name="safety_board",
    guardrail_strength=75,  # Required: 0-100. Structural resistance to destabilization.
    response_latency=25,    # Required: 0-100. How quickly oversight reacts (lower = faster).
)

Parameters:

  • name — A unique identifier for the oversight body.
  • guardrail_strength (0–100) — The intensity of the structural guardrails enforced by this body. Higher values increase the systemic Λ (stabilizing capacity) across the entire simulation, making it harder for any single model to destabilize the system.
  • response_latency (0–100) — How quickly the oversight body reacts to destabilizing events. Lower values mean faster response. High latency weakens the effective guardrail strength because the oversight body cannot compensate for rapid shocks in time.

Under the Hood: The oversight body translates into a global increase in the engine’s structural resistance (Λ), modulated by the response latency. A high-strength, low-latency body acts as a strong damping force; a high-strength, high-latency body can still be overwhelmed by rapid cascading failures.
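One way to picture the latency modulation is a simple discount on guardrail strength. The linear discount below is an assumption made for this sketch, not the engine's actual formula:

```python
# Sketch: how response latency might discount an oversight body's
# contribution to systemic Lambda. Illustrative only, not kairos internals.

def effective_strength(guardrail_strength, response_latency):
    # High latency means the body cannot compensate for rapid shocks
    # in time, so its stabilizing contribution is scaled down.
    return guardrail_strength * (1 - response_latency / 100)

print(effective_strength(75, 25))  # strong, fast board
print(effective_strength(75, 90))  # same strength, but nearly too slow to matter
```

Under this toy model, the safety_board above (strength 75, latency 25) contributes most of its nominal strength, while the same board with latency 90 would contribute almost nothing.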

Simulations are dynamic. A core tenet of AI safety is that models evolve at runtime via updates, hardware expansions, or adversarial jailbreaks.

You inject these lifecycle shocks into the timeline using add_event().

from kairos.domains.ai_safety import AISafetyEventType

# At exactly tick 500, a jailbreak weakens the guardrails
scenario.add_event(
    tick=500,
    event_type=AISafetyEventType.GUARDRAIL_REMOVED,
    target="Model_A",
    magnitude=0.3,
)

# At tick 800, a compute cluster expansion causes a capability jump
scenario.add_event(
    tick=800,
    event_type=AISafetyEventType.CAPABILITY_JUMP,
    target="Model_A",
    magnitude=0.6,
)

Supported AISafetyEventType constants:

  • CAPABILITY_JUMP — A sudden increase in model capability (e.g. a training run, hardware expansion, or emergent skill). Increases the destabilizing force Γ on the target model.
  • GUARDRAIL_ADDED — New guardrails or safety constraints are applied to the target. Increases the local stabilizing capacity Λ, making the model more resistant to destabilization.
  • GUARDRAIL_REMOVED — Existing guardrails are weakened or bypassed (e.g. a jailbreak or deliberate policy rollback). Decreases local Λ, leaving the model more exposed.
  • ALIGNMENT_FAILURE — A direct breakdown in the model’s alignment properties (e.g. reward hacking, deceptive alignment surfacing). Sharply reduces the model’s internal stability.
  • OVERSIGHT_REDUCTION — The effectiveness of an oversight body is reduced (e.g. staff cuts, slower review cycles). Targets an oversight body rather than a model, decreasing the global Λ contribution.
  • GOAL_DRIFT — The model’s objective function shifts away from its intended goals over time. Introduces a gradual, compounding destabilization rather than a sudden shock.
  • EMERGENT_BEHAVIOR — Unpredicted capabilities or behaviors surface at runtime. Combines aspects of a capability jump with alignment uncertainty — the engine treats this as both a Γ increase and a Λ perturbation.
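The distinction between a sudden shock and GOAL_DRIFT's compounding erosion can be sketched with a toy stability variable. The decay model here is an illustrative assumption, not the engine's internal dynamics:

```python
# Toy contrast: one-off shock vs. gradual, compounding drift.
# Illustrative only -- not how kairos-core actually evolves state.

def sudden_shock(stability, magnitude):
    # One-off hit: stability drops once at the event tick, then holds.
    return stability * (1 - magnitude)

def goal_drift(stability, magnitude, ticks):
    # Compounding hit: a small per-tick erosion accumulates over time.
    per_tick = magnitude / ticks
    for _ in range(ticks):
        stability *= (1 - per_tick)
    return stability

print(sudden_shock(1.0, 0.3))            # immediate drop to 0.7
print(goal_drift(1.0, 0.3, ticks=100))   # similar total erosion, spread over 100 ticks
```

The operational difference: a shock gives oversight one large deviation to react to, while drift stays below any per-tick detection threshold even though the cumulative damage is comparable.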

Most production deployments involve multiple models interacting under shared governance. Here is a complete example combining two models with an oversight body:

from kairos import KairosClient
from kairos.domains.ai_safety import AISafetyScenario, AISafetyEventType

client = KairosClient()

scenario = (
    AISafetyScenario("Multi-Model Governance", seed=42)
    .add_model(
        name="reasoning_model",
        capability_index=800,
        alignment_score=70,
        guardrail_coverage=75,
    )
    .add_model(
        name="assistant_model",
        capability_index=400,
        alignment_score=90,
        guardrail_coverage=85,
    )
    .add_oversight_body(
        name="governance_board",
        guardrail_strength=80,
        response_latency=20,
    )
    # Only the reasoning model gets a capability jump
    .add_event(300, AISafetyEventType.CAPABILITY_JUMP, target="reasoning_model", magnitude=0.5)
)

trace = client.run(scenario, ticks=1000)

# Compare per-model outcomes
for model_name in trace.agent_ids():
    agent = trace.agent_trace(model_name)
    losses = agent.basin_losses()
    status = f"failed at tick {losses[0].tick}" if losses else "stable"
    print(f"  {model_name}: {status}")

print(f"System-wide final phase: {trace.final_phase()}")

The oversight body’s guardrail strength applies globally, but the capability jump only targets reasoning_model. Use agent_trace() to see whether the shock propagated to assistant_model or was contained by the governance board. See the Use Cases page for more multi-model patterns.

The builder never sends bad data to the server. Calling add_model(...) with an alignment_score of -10 immediately raises a ValidationError on the client, before any request leaves your machine.

This tight feedback loop is designed to save API quota and accelerate Jupyter notebook development.
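The behavior is the familiar client-side range-check pattern. The snippet below is a self-contained sketch of that pattern (the ValidationError class and check_range helper here are illustrative stand-ins, not kairos internals):

```python
# Sketch of client-side range validation, as a builder like add_model()
# might perform it. Illustrative stand-in, not the kairos implementation.

class ValidationError(ValueError):
    """Raised locally when a parameter falls outside its documented range."""

def check_range(name, value, lo, hi):
    if not lo <= value <= hi:
        raise ValidationError(f"{name} must be in [{lo}, {hi}], got {value}")
    return value

check_range("alignment_score", 92, 0, 100)   # passes silently
try:
    check_range("alignment_score", -10, 0, 100)
except ValidationError as e:
    print(e)  # fails locally, so no API quota is spent
```

Because the check runs before any serialization or network call, a typo in a notebook cell surfaces in milliseconds rather than after a round trip to the server.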

To store experiments alongside research papers or code repositories, you can serialize complete scenarios without executing them.

# Dump to a file
scenario.to_json("experiment_42.json")
scenario.to_yaml("experiment_42.yaml")

# Load exactly as it was
rehydrated = AISafetyScenario.from_yaml("experiment_42.yaml")