SLM Backend

The SLM backend uses a locally-running small language model (SmolLM2-1.7B-Instruct) to generate richer, more contextual telemetry narratives. All inference runs on-device — no network calls, no data leaves the machine.

Using the SLM backend requires:

  • The slm feature flag enabled at build time
  • A downloaded and verified model bundle
# Build with SLM support
cargo build -p kairos-cli --features tui --profile release-native
# (kairos-sentinel/slm is pulled in transitively)

The SLM backend uses SmolLM2-1.7B-Instruct in GGUF Q4_K_M quantization, running inference via the candle framework. A model bundle is laid out as follows:

models/smollm2-1.7b-instruct-q4km/
├── manifest.json # Bundle metadata and model configuration
├── generation.json # Generation parameters (temperature, top_p, etc.)
├── tokenizer.json # HuggingFace tokenizer
└── smollm2-1.7b-instruct-q4km.gguf # Quantized model weights
An example manifest.json:

{
  "manifestVersion": "1",
  "modelId": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "modelFamily": "SmolLM2",
  "weightFormat": "GGUF",
  "quantization": "Q4_K_M",
  "promptVersion": "sentinel.phase_b.v1",
  "expectedMaxContextLength": 2048,
  "expectedMaxOutputTokens": 120,
  "evaluationCorpusVersion": "testbed-v1"
}
| Field | Description |
| --- | --- |
| modelId | HuggingFace model identifier |
| modelFamily | Model architecture family |
| weightFormat | Weight serialization format |
| quantization | Quantization method (Q4_K_M = 4-bit with K-quant medium) |
| promptVersion | Versioned prompt template identifier |
| expectedMaxContextLength | Maximum context window in tokens |
| expectedMaxOutputTokens | Maximum output length in tokens |
| evaluationCorpusVersion | Corpus version the model was evaluated against |
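As an illustration, loader-side validation of these manifest fields might look like the sketch below. The `ModelManifest` struct and `check_manifest` function are hypothetical, not the actual kairos-sentinel API; the field values mirror the example manifest above.

```rust
// Hypothetical sketch: validating manifest metadata before loading weights.
// Field names mirror manifest.json; the real kairos-sentinel types may differ.

struct ModelManifest {
    manifest_version: String,
    model_id: String,
    weight_format: String,
    quantization: String,
    prompt_version: String,
    expected_max_context_length: usize,
    expected_max_output_tokens: usize,
}

/// Returns Err with a reason if the bundle does not match what this build expects.
fn check_manifest(m: &ModelManifest) -> Result<(), String> {
    if m.manifest_version != "1" {
        return Err(format!("unsupported manifest version: {}", m.manifest_version));
    }
    if m.weight_format != "GGUF" || m.quantization != "Q4_K_M" {
        return Err(format!("unexpected weights: {} {}", m.weight_format, m.quantization));
    }
    if m.prompt_version != "sentinel.phase_b.v1" {
        return Err(format!("prompt template mismatch: {}", m.prompt_version));
    }
    if m.expected_max_output_tokens >= m.expected_max_context_length {
        return Err("output budget must fit inside the context window".into());
    }
    Ok(())
}

fn main() {
    let m = ModelManifest {
        manifest_version: "1".into(),
        model_id: "HuggingFaceTB/SmolLM2-1.7B-Instruct".into(),
        weight_format: "GGUF".into(),
        quantization: "Q4_K_M".into(),
        prompt_version: "sentinel.phase_b.v1".into(),
        expected_max_context_length: 2048,
        expected_max_output_tokens: 120,
    };
    println!("manifest ok: {}", check_manifest(&m).is_ok());
    let _ = m.model_id; // identifier carried for logging in a real loader
}
```

Failing fast on a manifest mismatch keeps a stale or truncated bundle from producing silently degraded summaries.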

The testbed includes download and verification scripts:

# From the kairos-testbed directory
./scripts/download-model.sh # Downloads GGUF weights from HuggingFace
./scripts/verify-bundle.sh # Verifies bundle integrity

Or as part of the full testbed setup:

./scripts/setup.sh # Includes model download as step 3
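The exact contents of verify-bundle.sh are not shown here, but a typical integrity check compares a file's SHA-256 digest against a pinned value. The following is a self-contained sketch of that pattern (it fabricates its own stand-in weights file and digest; it assumes a POSIX shell with `sha256sum` available, as on most Linux systems):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a bundle integrity check; the real verify-bundle.sh
# may pin different files and digests.
set -eu

# Create a stand-in "weights" file so the sketch is self-contained.
bundle_dir=$(mktemp -d)
printf 'fake-gguf-bytes' > "$bundle_dir/model.gguf"

# In a real script the expected digest is a known constant committed to the
# repo; here we compute it from the stand-in file.
expected=$(sha256sum "$bundle_dir/model.gguf" | awk '{print $1}')

actual=$(sha256sum "$bundle_dir/model.gguf" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "bundle ok"
else
    echo "checksum mismatch" >&2
    exit 1
fi
```

Verifying after download guards against both truncated transfers and a HuggingFace artifact changing out from under a pinned revision.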

The SLM backend generates each summary in five steps:

  1. Renders a prompt from the SentinelTelemetryFeed using a versioned prompt template
  2. Tokenizes the prompt using the bundled HuggingFace tokenizer
  3. Runs inference through candle’s GGUF loader with the quantized weights
  4. Parses the output to extract narrative, risk level, and highlights
  5. Validates the response against expected structure before returning

The prompt template is versioned (sentinel.phase_b.v1) so that every output can be traced to the exact template that produced it, and results remain reproducible when the template changes.
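The five steps above can be sketched as a pipeline. Every type and function below is illustrative only: real inference goes through candle and the bundled HuggingFace tokenizer, and the actual `SentinelTelemetryFeed` and `SentinelSummary` shapes in kairos-sentinel may differ.

```rust
// Illustrative skeleton of the SLM backend's five steps. Each stage is a
// stub so the end-to-end control flow is visible without a model on disk.

const PROMPT_VERSION: &str = "sentinel.phase_b.v1";

struct SentinelTelemetryFeed { events: Vec<String> }

#[derive(Debug, PartialEq)]
enum RiskLevel { Low, Elevated, High }

#[derive(Debug)]
struct SentinelSummary {
    backend: &'static str,
    narrative: String,
    risk: RiskLevel,
    highlights: Vec<String>,
}

// Step 1: render the versioned prompt template from the telemetry feed.
fn render_prompt(feed: &SentinelTelemetryFeed) -> String {
    format!("[{}] Summarize {} telemetry events.", PROMPT_VERSION, feed.events.len())
}

// Step 2: tokenize (whitespace split standing in for the HF tokenizer).
fn tokenize(prompt: &str) -> Vec<String> {
    prompt.split_whitespace().map(str::to_owned).collect()
}

// Step 3: run inference (stub echoing a canned completion; really candle).
fn infer(_tokens: &[String]) -> String {
    "risk=low|Nothing unusual observed.|highlight:steady baseline".to_owned()
}

// Steps 4-5: parse the raw output, then validate its structure.
fn parse_and_validate(raw: &str) -> Result<SentinelSummary, String> {
    let mut parts = raw.split('|');
    let risk = match parts.next() {
        Some("risk=low") => RiskLevel::Low,
        Some("risk=elevated") => RiskLevel::Elevated,
        Some("risk=high") => RiskLevel::High,
        other => return Err(format!("unparseable risk field: {:?}", other)),
    };
    let narrative = parts.next().ok_or("missing narrative")?.to_owned();
    let highlights: Vec<String> = parts
        .filter_map(|p| p.strip_prefix("highlight:"))
        .map(str::to_owned)
        .collect();
    if narrative.is_empty() {
        return Err("empty narrative".into());
    }
    Ok(SentinelSummary { backend: "slm", narrative, risk, highlights })
}

fn summarize(feed: &SentinelTelemetryFeed) -> Result<SentinelSummary, String> {
    let prompt = render_prompt(feed);
    let tokens = tokenize(&prompt);
    let raw = infer(&tokens);
    parse_and_validate(&raw)
}

fn main() {
    let feed = SentinelTelemetryFeed { events: vec!["cpu spike".into()] };
    let summary = summarize(&feed).expect("valid summary");
    println!("{} -> {:?}", summary.backend, summary.risk);
}
```

Note that parsing and validation are separate, fallible stages: because sampling is non-deterministic, a production backend must be prepared for output that does not match the expected structure and fail cleanly rather than return a malformed summary.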

| Property | Value |
| --- | --- |
| Model | SmolLM2-1.7B-Instruct |
| Quantization | GGUF Q4_K_M (~1 GB on disk) |
| Inference framework | candle (Rust-native) |
| Latency | ~100ms per summary (CPU) |
| Dependencies | candle-core, candle-transformers, tokenizers |
| Feature flag | kairos-sentinel/slm |
| Backend identifier | slm in SentinelSummary.backend |
| Network | None — fully offline inference |

The SLM backend is the right choice when:

  • Narrative quality matters — richer, more contextual descriptions than templates
  • ~100ms latency is acceptable — not suitable for sub-millisecond requirements
  • Model management is feasible — you can deploy and verify the ~1 GB model bundle
  • Offline operation is required — all inference runs locally, no network calls

For deterministic, zero-dependency operation, see the template backend.

| Aspect | Template | SLM |
| --- | --- | --- |
| Latency | < 1ms | ~100ms |
| Narrative quality | Structured, formulaic | Contextual, natural |
| Determinism | Fully deterministic | Non-deterministic (sampling) |
| Disk footprint | None | ~1 GB model bundle |
| Dependencies | None | candle, tokenizers |
| Feature flag | Always on | slm |
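To make the trade-off concrete, backend selection could be expressed as in the hypothetical sketch below, where both backends sit behind one trait and the caller picks based on latency budget and bundle availability. None of these types exist in kairos-sentinel under these names; this only illustrates the decision criteria listed above.

```rust
// Hypothetical sketch: both backends behind one trait, chosen by caller needs.
trait SummaryBackend {
    fn id(&self) -> &'static str;
    fn summarize(&self, events: &[String]) -> String;
}

struct TemplateBackend;
struct SlmBackend;

impl SummaryBackend for TemplateBackend {
    fn id(&self) -> &'static str { "template" }
    fn summarize(&self, events: &[String]) -> String {
        // Deterministic, formulaic narrative.
        format!("{} events observed.", events.len())
    }
}

impl SummaryBackend for SlmBackend {
    fn id(&self) -> &'static str { "slm" }
    fn summarize(&self, events: &[String]) -> String {
        // Stand-in for candle inference; real output is model-generated text.
        format!("Telemetry shows {} notable events against a steady baseline.", events.len())
    }
}

/// Pick the SLM backend only when ~100ms latency is acceptable and the
/// ~1 GB model bundle has been downloaded and verified.
fn choose_backend(latency_budget_ms: u64, bundle_present: bool) -> Box<dyn SummaryBackend> {
    if latency_budget_ms >= 100 && bundle_present {
        Box::new(SlmBackend)
    } else {
        Box::new(TemplateBackend)
    }
}

fn main() {
    let backend = choose_backend(250, true);
    println!("chosen backend: {}", backend.id());
}
```

Falling back to the template backend whenever the bundle is missing preserves the zero-dependency path without requiring callers to special-case a failed model load.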