# SLM Backend
The SLM backend uses a locally-running small language model (SmolLM2-1.7B-Instruct) to generate richer, more contextual telemetry narratives. All inference runs on-device — no network calls, no data leaves the machine.
## Requirements

- The `slm` feature flag must be enabled at build time
- A downloaded and verified model bundle
```sh
# Build with SLM support
cargo build -p kairos-cli --features tui --profile release-native
# (kairos-sentinel/slm is pulled in transitively)
```
## Model Bundle

The SLM backend uses SmolLM2-1.7B-Instruct in GGUF Q4_K_M quantization, running inference via the candle framework.
### Bundle Structure

```
models/smollm2-1.7b-instruct-q4km/
├── manifest.json                     # Bundle metadata and model configuration
├── generation.json                   # Generation parameters (temperature, top_p, etc.)
├── tokenizer.json                    # HuggingFace tokenizer
└── smollm2-1.7b-instruct-q4km.gguf   # Quantized model weights
```
### Manifest

```json
{
  "manifestVersion": "1",
  "modelId": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "modelFamily": "SmolLM2",
  "weightFormat": "GGUF",
  "quantization": "Q4_K_M",
  "promptVersion": "sentinel.phase_b.v1",
  "expectedMaxContextLength": 2048,
  "expectedMaxOutputTokens": 120,
  "evaluationCorpusVersion": "testbed-v1"
}
```

| Field | Description |
|---|---|
| `modelId` | HuggingFace model identifier |
| `modelFamily` | Model architecture family |
| `weightFormat` | Weight serialization format |
| `quantization` | Quantization method (Q4_K_M = 4-bit with K-quant medium) |
| `promptVersion` | Versioned prompt template identifier |
| `expectedMaxContextLength` | Maximum context window in tokens |
| `expectedMaxOutputTokens` | Maximum output length in tokens |
| `evaluationCorpusVersion` | Corpus version the model was evaluated against |
## Model Download

The testbed includes download and verification scripts:
```sh
# From the kairos-testbed directory
./scripts/download-model.sh   # Downloads GGUF weights from HuggingFace
./scripts/verify-bundle.sh    # Verifies bundle integrity
```

Or as part of the full testbed setup:
```sh
./scripts/setup.sh   # Includes model download as step 3
```
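A minimal sketch of one part of what bundle verification has to cover: confirming that every expected file is present. The `missing_bundle_files` function is illustrative only and is not the logic of `verify-bundle.sh`, which checks bundle integrity beyond mere presence:

```rust
use std::path::Path;

// Report which of the four expected bundle files are missing under `dir`.
// Illustrative sketch; the real verification also checks file integrity.
fn missing_bundle_files(dir: &Path) -> Vec<&'static str> {
    const EXPECTED: [&str; 4] = [
        "manifest.json",
        "generation.json",
        "tokenizer.json",
        "smollm2-1.7b-instruct-q4km.gguf",
    ];
    EXPECTED
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}

fn main() {
    // Against a directory with no bundle, every file is reported missing.
    let missing = missing_bundle_files(Path::new("models/smollm2-1.7b-instruct-q4km"));
    println!("missing files: {:?}", missing);
}
```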
## Inference Pipeline

The SLM backend:
1. Renders a prompt from the `SentinelTelemetryFeed` using a versioned prompt template
2. Tokenizes the prompt using the bundled HuggingFace tokenizer
3. Runs inference through candle's GGUF loader with the quantized weights
4. Parses the output to extract narrative, risk level, and highlights
5. Validates the response against expected structure before returning
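The five steps above can be sketched as one function chain. Every name here, including the `Summary` struct and the helper functions, is hypothetical; the actual kairos-sentinel API differs, and the tokenizer/inference stubs stand in for the real tokenizers and candle calls:

```rust
// Illustrative end-to-end shape of the pipeline; all names are hypothetical.
#[derive(Debug)]
struct Summary {
    narrative: String,
    risk_level: String,
    highlights: Vec<String>,
}

fn render_prompt(feed: &str) -> String {
    // Step 1: versioned prompt template (sentinel.phase_b.v1).
    format!("[telemetry]\n{feed}\n[summarize]")
}

fn tokenize(prompt: &str) -> Vec<u32> {
    // Step 2: stand-in for the bundled HuggingFace tokenizer.
    prompt.bytes().map(u32::from).collect()
}

fn infer(_tokens: &[u32]) -> String {
    // Step 3: stand-in for candle running the quantized GGUF weights.
    "narrative: all quiet | risk: low | highlights: none".to_string()
}

fn parse(output: &str) -> Option<Summary> {
    // Step 4: extract narrative, risk level, and highlights.
    let mut parts = output.split(" | ");
    Some(Summary {
        narrative: parts.next()?.strip_prefix("narrative: ")?.to_string(),
        risk_level: parts.next()?.strip_prefix("risk: ")?.to_string(),
        highlights: parts
            .next()?
            .strip_prefix("highlights: ")?
            .split(", ")
            .map(str::to_string)
            .collect(),
    })
}

fn summarize(feed: &str) -> Option<Summary> {
    let summary = parse(&infer(&tokenize(&render_prompt(feed))))?;
    // Step 5: validate structure before returning.
    (!summary.narrative.is_empty()).then_some(summary)
}

fn main() {
    let summary = summarize("cpu=12% mem=40%").expect("well-formed output");
    println!("{summary:?}");
}
```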
The prompt template is versioned (`sentinel.phase_b.v1`) so that results stay reproducible across template changes.
## Characteristics

| Property | Value |
|---|---|
| Model | SmolLM2-1.7B-Instruct |
| Quantization | GGUF Q4_K_M (~1 GB on disk) |
| Inference framework | candle (Rust-native) |
| Latency | ~100ms per summary (CPU) |
| Dependencies | candle-core, candle-transformers, tokenizers |
| Feature flag | kairos-sentinel/slm |
| Backend identifier | slm in SentinelSummary.backend |
| Network | None — fully offline inference |
## When to Use SLM

The SLM backend is the right choice when:
- Narrative quality matters — richer, more contextual descriptions than templates
- ~100ms latency is acceptable — not suitable for sub-millisecond requirements
- Model management is feasible — you can deploy and verify the ~1 GB model bundle
- Offline operation is required — all inference runs locally, no network calls
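The criteria above reduce to a small rule of thumb. The `choose_backend` function below is purely illustrative, not an API the crate provides:

```rust
// Illustrative rule of thumb for the selection criteria listed above.
#[derive(Debug, PartialEq)]
enum Backend {
    Template,
    Slm,
}

fn choose_backend(latency_budget_ms: u64, bundle_deployed: bool) -> Backend {
    // SLM needs roughly 100 ms per summary and a deployed ~1 GB bundle;
    // otherwise fall back to the deterministic template backend.
    if latency_budget_ms >= 100 && bundle_deployed {
        Backend::Slm
    } else {
        Backend::Template
    }
}

fn main() {
    assert_eq!(choose_backend(250, true), Backend::Slm);
    assert_eq!(choose_backend(1, true), Backend::Template);
    println!("backend selection ok");
}
```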
For deterministic, zero-dependency operation, see the template backend.
## Comparison

| Aspect | Template | SLM |
|---|---|---|
| Latency | < 1ms | ~100ms |
| Narrative quality | Structured, formulaic | Contextual, natural |
| Determinism | Fully deterministic | Non-deterministic (sampling) |
| Disk footprint | None | ~1 GB model bundle |
| Dependencies | None | candle, tokenizers |
| Feature flag | Always on | slm |