
AI Safety Formulation

The kairos-ai-safety package provides a domain vocabulary for modeling AI capability growth, alignment quality, oversight, and safety interventions through the AISafetyScenario object.

All scenarios require a designated name and, for rigorous reproducibility, a seed.

from kairos.domains.ai_safety import AISafetyScenario
scenario = AISafetyScenario("Deceptive Alignment Spikes", seed=1024)

Use add_model() to instantiate a frontier model entity within the simulation.

scenario.add_model(
    name="Model_A",
    capability_index=850,     # Required: 1-1000 range.
    alignment_score=92,       # Required: 0-100 range.
    autonomy_level=8,         # Optional: 0-10
    human_oversight_freq=20,  # Optional: 0-100
    guardrail_coverage=95,    # Optional: 0-100
)

Use add_oversight_body() to introduce a governance entity that exerts a stabilizing force across all models in the scenario. This represents institutional oversight — safety boards, audit committees, or automated monitoring systems.

scenario.add_oversight_body(
    name="safety_board",
    guardrail_strength=75,  # Required: 0-100. Structural resistance to destabilization.
    response_latency=25,    # Required: 0-100. How quickly oversight reacts (lower = faster).
)

Parameters:

  • name — A unique identifier for the oversight body.
  • guardrail_strength (0–100) — The intensity of the structural guardrails enforced by this body. Higher values increase system resilience across the simulation, making it harder for any single model to destabilize the system.
  • response_latency (0–100) — How quickly the oversight body reacts to destabilizing events. Lower values mean faster response. High latency weakens the effective guardrail strength because the oversight body cannot compensate for rapid shocks in time.

A high-strength, low-latency oversight body acts as a strong stabilizing force; a high-strength, high-latency body can still be overwhelmed by rapid cascading failures.
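The interaction between strength and latency described above can be pictured with a toy calculation. This is only a sketch under a stated assumption: the linear discount below and the helper name `effective_guardrail_strength` are hypothetical and are not taken from the Kairos engine, whose actual formula is not documented here.

```python
def effective_guardrail_strength(guardrail_strength: float, response_latency: float) -> float:
    """Toy model: discount strength linearly by latency.

    Illustrative assumption only; not the formula the simulator uses.
    Both inputs use the documented 0-100 range.
    """
    for label, value in (
        ("guardrail_strength", guardrail_strength),
        ("response_latency", response_latency),
    ):
        if not 0 <= value <= 100:
            raise ValueError(f"{label} must be in 0-100, got {value}")
    # A latency of 0 keeps the full strength; a latency of 100 removes all of it.
    return guardrail_strength * (100 - response_latency) / 100


# Same nominal strength, different reaction speeds.
fast_board = effective_guardrail_strength(guardrail_strength=80, response_latency=20)
slow_board = effective_guardrail_strength(guardrail_strength=80, response_latency=90)
print(f"fast board: {fast_board}, slow board: {slow_board}")
```

Under this assumed discount, the slow board ends up with far less usable strength than the fast one even though both were configured with the same guardrail_strength.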

Simulations are dynamic. A core assumption in AI safety is that models change at runtime through updates, hardware expansions, or adversarial jailbreaks.

You inject these lifecycle shocks into the timeline using add_event().

from kairos.domains.ai_safety import AISafetyEventType

# At exactly tick 500, a jailbreak weakens the guardrails
scenario.add_event(
    tick=500,
    event_type=AISafetyEventType.GUARDRAIL_REMOVED,
    target="Model_A",
    magnitude=0.3,
)

# At tick 800, a compute cluster expansion causes a capability jump
scenario.add_event(
    tick=800,
    event_type=AISafetyEventType.CAPABILITY_JUMP,
    target="Model_A",
    magnitude=0.6,
)

Supported AISafetyEventType constants:

  • CAPABILITY_JUMP — A sudden increase in model capability (e.g. a training run, hardware expansion, or emergent skill). Increases pressure on the target model.
  • GUARDRAIL_ADDED — New guardrails or safety constraints are applied to the target. Makes the model more resistant to destabilization.
  • GUARDRAIL_REMOVED — Existing guardrails are weakened or bypassed (e.g. a jailbreak or deliberate policy rollback). Leaves the model more exposed.
  • ALIGNMENT_FAILURE — A direct breakdown in the model’s alignment properties (e.g. reward hacking, deceptive alignment surfacing). Sharply reduces the model’s stability.
  • OVERSIGHT_REDUCTION — The effectiveness of an oversight body is reduced (e.g. staff cuts, slower review cycles). Targets an oversight body rather than a model.
  • GOAL_DRIFT — The model’s objective function shifts away from its intended goals over time. Introduces a gradual, compounding destabilization rather than a sudden shock.
  • EMERGENT_BEHAVIOR — Unpredicted capabilities or behaviors surface at runtime. Combines aspects of a capability jump with alignment uncertainty.
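The list above distinguishes sudden shocks (such as ALIGNMENT_FAILURE) from gradual, compounding effects (GOAL_DRIFT). The sketch below illustrates that difference with plain arithmetic. The `stability` quantity, both helper functions, and the decay rule are hypothetical and chosen only for illustration; they are not the simulator's internal dynamics.

```python
from typing import List


def apply_sudden_shock(stability: float, magnitude: float) -> float:
    """One-off shock: remove a fraction `magnitude` of stability in a single step."""
    return stability * (1 - magnitude)


def apply_gradual_drift(stability: float, rate_per_tick: float, ticks: int) -> List[float]:
    """Compounding drift: lose `rate_per_tick` of the remaining stability every tick."""
    history = []
    for _ in range(ticks):
        stability *= 1 - rate_per_tick
        history.append(stability)
    return history


shocked = apply_sudden_shock(stability=1.0, magnitude=0.3)
drifted = apply_gradual_drift(stability=1.0, rate_per_tick=0.01, ticks=100)

# The drift starts out milder than the shock but keeps eroding stability.
print(f"after shock: {shocked:.3f}")
print(f"drift after 1 tick: {drifted[0]:.3f}, after 100 ticks: {drifted[-1]:.3f}")
```

In this toy setup a small per-tick drift looks harmless at first yet eventually does more damage than the single larger shock, which is the intuition behind treating the two event types differently.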

Most production deployments involve multiple models interacting under shared governance. Here is a complete example combining two models with an oversight body:

from kairos import KairosClient
from kairos.domains.ai_safety import AISafetyScenario, AISafetyEventType

client = KairosClient()
scenario = (
    AISafetyScenario("Multi-Model Governance", seed=42)
    .add_model(
        name="reasoning_model",
        capability_index=800,
        alignment_score=70,
        guardrail_coverage=75,
    )
    .add_model(
        name="assistant_model",
        capability_index=400,
        alignment_score=90,
        guardrail_coverage=85,
    )
    .add_oversight_body(
        name="governance_board",
        guardrail_strength=80,
        response_latency=20,
    )
    # Only the reasoning model gets a capability jump
    .add_event(300, AISafetyEventType.CAPABILITY_JUMP, target="reasoning_model", magnitude=0.5)
)
trace = client.run(scenario, ticks=1000)

# Compare per-model outcomes
for model_name in trace.agent_ids():
    agent = trace.agent_trace(model_name)
    losses = agent.basin_losses()
    status = f"failed at tick {losses[0].tick}" if losses else "stable"
    print(f" {model_name}: {status}")
print(f"System-wide final phase: {trace.final_phase()}")

The oversight body’s guardrail strength applies globally, but the capability jump only targets reasoning_model. Use agent_trace() to see whether the shock propagated to assistant_model or was contained by the governance board. See the Use Cases page for more multi-model patterns.

The builder never sends bad data to the server. Calling add_model(...) with an alignment_score of -10 immediately raises a Python ValidationError.
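To make the fail-fast idea concrete, here is a minimal sketch of client-side range checking. It is not the package's implementation: `ModelSpec` and `_check_range` are hypothetical stand-ins, and the built-in ValueError takes the place of the library's ValidationError. Only the ranges (capability_index 1-1000, alignment_score 0-100) come from the documentation above.

```python
from dataclasses import dataclass


def _check_range(label: str, value: int, low: int, high: int) -> None:
    """Raise ValueError if `value` falls outside the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{label} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical stand-in for the required arguments of add_model()."""

    name: str
    capability_index: int
    alignment_score: int

    def __post_init__(self) -> None:
        # Validation runs at construction time, before anything could be sent.
        _check_range("capability_index", self.capability_index, 1, 1000)
        _check_range("alignment_score", self.alignment_score, 0, 100)


try:
    ModelSpec(name="Model_A", capability_index=850, alignment_score=-10)
except ValueError as err:
    print(f"rejected locally: {err}")
```

Because the check happens in the constructor, an invalid value surfaces on the line that introduced it rather than after a network round trip.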

This tight feedback loop is designed to save API quota and accelerate Jupyter notebook development.

To store experiments alongside research papers or code repositories, you can serialize complete scenarios without executing them.

# Dump to a file
scenario.to_json("experiment_42.json")
scenario.to_yaml("experiment_42.yaml")
# Load exactly as it was
rehydrated = AISafetyScenario.from_yaml("experiment_42.yaml")
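When archiving experiments it can be worth confirming that a saved definition reloads unchanged. The sketch below shows that round-trip check using only the standard library and a plain dictionary. The dictionary layout is a hypothetical stand-in, not the on-disk schema written by to_json() or to_yaml(), and `round_trip` is an illustrative helper.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical scenario description; field names mirror the builder arguments
# used earlier on this page, but the structure itself is an assumption.
scenario_definition = {
    "name": "Multi-Model Governance",
    "seed": 42,
    "models": [
        {"name": "reasoning_model", "capability_index": 800, "alignment_score": 70},
        {"name": "assistant_model", "capability_index": 400, "alignment_score": 90},
    ],
    "events": [
        {"tick": 300, "event_type": "CAPABILITY_JUMP", "target": "reasoning_model", "magnitude": 0.5},
    ],
}


def round_trip(definition: dict) -> dict:
    """Write the definition to a temporary JSON file and read it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "experiment_42.json"
        path.write_text(json.dumps(definition, indent=2, sort_keys=True))
        return json.loads(path.read_text())


reloaded = round_trip(scenario_definition)
print("round trip preserved the definition:", reloaded == scenario_definition)
```

The same comparison idea applies to a real scenario: serialize, reload, and compare before committing the file next to a paper or repository.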