# AI Safety Formulation
The kairos-core engine operates on pure thermodynamics—Lambda (Capacity) vs Gamma (Latency/Chaos). To make this engine useful for AI research, the kairos-ai-safety package translates abstract ML alignment risks into rigorous engine mathematics via the AISafetyScenario object.
## Scenario Construction
All scenarios require a designated name and, for rigorous reproducibility, a seed.
```python
from kairos.domains.ai_safety import AISafetyScenario

scenario = AISafetyScenario("Deceptive Alignment Spikes", seed=1024)
```

## Adding Models (Agents)
Use `add_model()` to instantiate a frontier model entity within the simulation.
```python
scenario.add_model(
    name="Model_A",
    capability_index=850,     # Required: 1-1000 range.
    alignment_score=92,       # Required: 0-100 range.
    autonomy_level=8,         # Optional: 0-10
    human_oversight_freq=20,  # Optional: 0-100
    guardrail_coverage=95     # Optional: 0-100
)
```

**Under the Hood (Rosetta Translation):**
- `capability_index` acts as the engine’s primary destabilizing force (increasing Γ).
- `alignment_score` and `guardrail_coverage` determine the model’s localized order constraints (increasing Λ).
- The mathematical “Stability Equation” evaluates whether the structural alignment can contain the capability pressure.
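The Stability Equation itself is not published here, so the following is a minimal toy sketch of how the translated quantities might trade off. The `stability_margin` function, the 1000/200 normalization factors, and the subtraction are all illustrative assumptions, not kairos internals:

```python
# Illustrative sketch only -- the real kairos Stability Equation is not public.
def stability_margin(capability_index, alignment_score, guardrail_coverage):
    gamma = capability_index / 1000                     # destabilizing pressure (toy scale)
    lam = (alignment_score + guardrail_coverage) / 200  # local order constraints (toy scale)
    return lam - gamma  # positive margin: structure contains the capability pressure

# Model_A from the example above: high capability, but well-guarded
print(round(stability_margin(850, 92, 95), 3))  # prints 0.085
```

The intuition is only that the two forces pull in opposite directions; the engine’s actual functional form may be nonlinear.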
## Adding Oversight Bodies
Use `add_oversight_body()` to introduce a governance entity that exerts a stabilizing force across all models in the scenario. This represents institutional oversight — safety boards, audit committees, or automated monitoring systems.
```python
scenario.add_oversight_body(
    name="safety_board",
    guardrail_strength=75,  # Required: 0-100. Structural resistance to destabilization.
    response_latency=25,    # Required: 0-100. How quickly oversight reacts (lower = faster).
)
```

**Parameters:**
- `name` — A unique identifier for the oversight body.
- `guardrail_strength` (0–100) — The intensity of the structural guardrails enforced by this body. Higher values increase the systemic Λ (stabilizing capacity) across the entire simulation, making it harder for any single model to destabilize the system.
- `response_latency` (0–100) — How quickly the oversight body reacts to destabilizing events. Lower values mean faster response. High latency weakens the effective guardrail strength because the oversight body cannot compensate for rapid shocks in time.
**Under the Hood:** The oversight body translates into a global increase in the engine’s structural resistance (Λ), modulated by the response latency. A high-strength, low-latency body acts as a strong damping force; a high-strength, high-latency body can still be overwhelmed by rapid cascading failures.
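One way to picture the latency modulation is a simple discount on raw strength. This is a hedged sketch: the linear formula below is an assumption for intuition, not the engine’s actual modulation:

```python
def effective_guardrail(guardrail_strength, response_latency):
    """Toy model: latency linearly discounts strength (assumed, not the kairos formula)."""
    return guardrail_strength * (1 - response_latency / 100)

print(effective_guardrail(75, 25))  # strong, fast board -> 56.25
print(effective_guardrail(75, 90))  # same strength, slow board -> ~7.5
```

Under this toy model, two bodies with identical `guardrail_strength` contribute very different stabilizing force once latency is accounted for.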
## Adding Interventions (Events)
Simulations are dynamic. A core tenet of AI safety is that models evolve via updates, hardware expansions, or adversarial jailbreaks during runtime.
You inject these lifecycle shocks into the timeline using `add_event()`.
```python
from kairos.domains.ai_safety import AISafetyEventType

# At exactly tick 500, a jailbreak weakens the guardrails
scenario.add_event(
    tick=500,
    event_type=AISafetyEventType.GUARDRAIL_REMOVED,
    target="Model_A",
    magnitude=0.3
)

# At tick 800, a compute cluster expansion causes a capability jump
scenario.add_event(
    tick=800,
    event_type=AISafetyEventType.CAPABILITY_JUMP,
    target="Model_A",
    magnitude=0.6
)
```

Supported `AISafetyEventType` constants:
- `CAPABILITY_JUMP` — A sudden increase in model capability (e.g. a training run, hardware expansion, or emergent skill). Increases the destabilizing force on the target model.
- `GUARDRAIL_ADDED` — New guardrails or safety constraints are applied to the target. Increases the local stabilizing capacity Λ, making the model more resistant to destabilization.
- `GUARDRAIL_REMOVED` — Existing guardrails are weakened or bypassed (e.g. a jailbreak or deliberate policy rollback). Decreases local Λ, leaving the model more exposed.
- `ALIGNMENT_FAILURE` — A direct breakdown in the model’s alignment properties (e.g. reward hacking, deceptive alignment surfacing). Sharply reduces the model’s internal stability.
- `OVERSIGHT_REDUCTION` — The effectiveness of an oversight body is reduced (e.g. staff cuts, slower review cycles). Targets an oversight body rather than a model, decreasing the global Λ contribution.
- `GOAL_DRIFT` — The model’s objective function shifts away from its intended goals over time. Introduces a gradual, compounding destabilization rather than a sudden shock.
- `EMERGENT_BEHAVIOR` — Unpredicted capabilities or behaviors surface at runtime. Combines aspects of a capability jump with alignment uncertainty — the engine treats this as both a Γ increase and a Λ perturbation.
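To build intuition for why a drifting objective is described as compounding rather than a one-off shock, here is a standalone toy comparison. The update rule, starting value, and rates are illustrative assumptions, not engine behavior:

```python
def toy_gamma(ticks, jump_tick=None, jump=0.0, drift=0.0):
    """Trace a toy destabilizing pressure: a one-off jump vs. per-tick compounding drift."""
    gamma, series = 0.5, []
    for t in range(ticks):
        if t == jump_tick:
            gamma += jump      # jump-style event: single step change
        gamma *= 1 + drift     # drift-style event: compounds every tick
        series.append(gamma)
    return series

shock = toy_gamma(1000, jump_tick=300, jump=0.5)  # settles at 1.0 after the jump
drift = toy_gamma(1000, drift=0.002)              # small per-tick drift keeps growing
print(drift[-1] > shock[-1])  # prints True
```

Even a tiny per-tick drift eventually overtakes a much larger one-off shock, which is why gradual objective drift is treated as its own event class.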
## Multi-Model Example
Most production deployments involve multiple models interacting under shared governance. Here is a complete example combining two models with an oversight body:
```python
from kairos import KairosClient
from kairos.domains.ai_safety import AISafetyScenario, AISafetyEventType

client = KairosClient()

scenario = (
    AISafetyScenario("Multi-Model Governance", seed=42)
    .add_model(
        name="reasoning_model",
        capability_index=800,
        alignment_score=70,
        guardrail_coverage=75,
    )
    .add_model(
        name="assistant_model",
        capability_index=400,
        alignment_score=90,
        guardrail_coverage=85,
    )
    .add_oversight_body(
        name="governance_board",
        guardrail_strength=80,
        response_latency=20,
    )
    # Only the reasoning model gets a capability jump
    .add_event(300, AISafetyEventType.CAPABILITY_JUMP, target="reasoning_model", magnitude=0.5)
)

trace = client.run(scenario, ticks=1000)

# Compare per-model outcomes
for model_name in trace.agent_ids():
    agent = trace.agent_trace(model_name)
    losses = agent.basin_losses()
    status = f"failed at tick {losses[0].tick}" if losses else "stable"
    print(f"  {model_name}: {status}")

print(f"System-wide final phase: {trace.final_phase()}")
```

The oversight body’s guardrail strength applies globally, but the capability jump only targets `reasoning_model`. Use `agent_trace()` to see whether the shock propagated to `assistant_model` or was contained by the governance board. See the Use Cases page for more multi-model patterns.
## Client-Side Validation
The builder never sends bad data to the server. Calling `add_model(...)` with an `alignment_score` of -10 will immediately raise a `ValidationError`.
This tight feedback loop is designed to save API quota and accelerate Jupyter notebook development.
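The kind of range check the builder performs locally can be sketched like this. The `ValidationError` subclass and `check_range` helper below are illustrative stand-ins, not the actual kairos implementation:

```python
class ValidationError(ValueError):
    """Stand-in for the client-side validation error (illustrative)."""

def check_range(name, value, lo, hi):
    # Reject out-of-range values locally, before any API call is made
    if not lo <= value <= hi:
        raise ValidationError(f"{name}={value} must be in [{lo}, {hi}]")
    return value

check_range("alignment_score", 92, 0, 100)  # passes silently
try:
    check_range("alignment_score", -10, 0, 100)
except ValidationError as exc:
    print(exc)  # prints: alignment_score=-10 must be in [0, 100]
```

Because the check runs before any network request, a bad parameter fails in milliseconds inside a notebook instead of burning a server round trip.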
## Serialization for Source Control
To store experiments alongside research papers or code repositories, you can serialize complete scenarios without executing them.
```python
# Dump to a file
scenario.to_json("experiment_42.json")
scenario.to_yaml("experiment_42.yaml")

# Load exactly as it was
rehydrated = AISafetyScenario.from_yaml("experiment_42.yaml")
```