SLM Backend

The SLM backend uses a locally-running small language model (SmolLM2-1.7B-Instruct) to generate richer, more contextual telemetry narratives. All inference runs on-device — no network calls, no data leaves the machine.

Using the SLM backend requires:

  • The slm feature flag enabled at build time
  • A downloaded and verified model bundle
# Build with SLM support
cargo build -p kairos-cli --features tui --profile release-native
# (kairos-sentinel/slm is pulled in transitively)

The SLM backend uses SmolLM2-1.7B-Instruct in GGUF Q4_K_M quantization, running inference via the candle framework. A model bundle is laid out as follows:

models/smollm2-1.7b-instruct-q4km/
├── manifest.json # Bundle metadata and model configuration
├── generation.json # Generation parameters (temperature, top_p, etc.)
├── tokenizer.json # HuggingFace tokenizer
└── smollm2-1.7b-instruct-q4km.gguf # Quantized model weights
An example manifest.json:

{
  "manifestVersion": "1",
  "modelId": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  "modelFamily": "SmolLM2",
  "weightFormat": "GGUF",
  "quantization": "Q4_K_M",
  "promptVersion": "sentinel.phase_b.v1",
  "expectedMaxContextLength": 2048,
  "expectedMaxOutputTokens": 120,
  "evaluationCorpusVersion": "testbed-v1"
}
| Field | Description |
| --- | --- |
| modelId | HuggingFace model identifier |
| modelFamily | Model architecture family |
| weightFormat | Weight serialization format |
| quantization | Quantization method (Q4_K_M = 4-bit with K-quant medium) |
| promptVersion | Versioned prompt template identifier |
| expectedMaxContextLength | Maximum context window in tokens |
| expectedMaxOutputTokens | Maximum output length in tokens |
| evaluationCorpusVersion | Corpus version the model was evaluated against |
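As an illustration, loader-side validation of these manifest fields might look like the sketch below. The `ModelManifest` struct and `check_manifest` function are hypothetical, not the actual kairos-sentinel API; the field values mirror the example manifest above.

```rust
// Hypothetical sketch: validating manifest metadata before loading weights.
// Field names mirror manifest.json; the real kairos-sentinel types may differ.

struct ModelManifest {
    manifest_version: String,
    model_id: String,
    weight_format: String,
    quantization: String,
    prompt_version: String,
    expected_max_context_length: usize,
    expected_max_output_tokens: usize,
}

/// Returns Err with a reason if the bundle does not match what this build expects.
fn check_manifest(m: &ModelManifest) -> Result<(), String> {
    if m.manifest_version != "1" {
        return Err(format!("unsupported manifest version: {}", m.manifest_version));
    }
    if m.weight_format != "GGUF" || m.quantization != "Q4_K_M" {
        return Err(format!("unexpected weights: {} {}", m.weight_format, m.quantization));
    }
    if m.prompt_version != "sentinel.phase_b.v1" {
        return Err(format!("prompt template mismatch: {}", m.prompt_version));
    }
    if m.expected_max_output_tokens >= m.expected_max_context_length {
        return Err("output budget must fit inside the context window".into());
    }
    Ok(())
}

fn main() {
    let m = ModelManifest {
        manifest_version: "1".into(),
        model_id: "HuggingFaceTB/SmolLM2-1.7B-Instruct".into(),
        weight_format: "GGUF".into(),
        quantization: "Q4_K_M".into(),
        prompt_version: "sentinel.phase_b.v1".into(),
        expected_max_context_length: 2048,
        expected_max_output_tokens: 120,
    };
    println!("manifest ok: {}", check_manifest(&m).is_ok());
    let _ = m.model_id; // identifier carried for logging in a real loader
}
```

Failing fast on a manifest mismatch keeps a stale or truncated bundle from producing silently degraded summaries.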

The testbed includes download and verification scripts:

# From the kairos-testbed directory
./scripts/download-model.sh # Downloads GGUF weights from HuggingFace
./scripts/verify-bundle.sh # Verifies bundle integrity

Or as part of the full testbed setup:

./scripts/setup.sh # Includes model download as step 3
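The exact contents of verify-bundle.sh are not shown here, but a typical integrity check compares a file's SHA-256 digest against a pinned value. The following is a self-contained sketch of that pattern (it fabricates its own stand-in weights file and digest; it assumes a POSIX shell with `sha256sum` available, as on most Linux systems):

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a bundle integrity check; the real verify-bundle.sh
# may pin different files and digests.
set -eu

# Create a stand-in "weights" file so the sketch is self-contained.
bundle_dir=$(mktemp -d)
printf 'fake-gguf-bytes' > "$bundle_dir/model.gguf"

# In a real script the expected digest is a known constant committed to the
# repo; here we compute it from the stand-in file.
expected=$(sha256sum "$bundle_dir/model.gguf" | awk '{print $1}')

actual=$(sha256sum "$bundle_dir/model.gguf" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "bundle ok"
else
    echo "checksum mismatch" >&2
    exit 1
fi
```

Verifying after download guards against both truncated transfers and a HuggingFace artifact changing out from under a pinned revision.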

The SLM backend generates each summary in five steps:

  1. Renders a prompt from the SentinelTelemetryFeed using a versioned prompt template
  2. Tokenizes the prompt using the bundled HuggingFace tokenizer
  3. Runs inference through candle’s GGUF loader with the quantized weights
  4. Parses the output to extract narrative, risk level, and highlights
  5. Validates the response against expected structure before returning

The prompt template is versioned (sentinel.phase_b.v1) so that every output can be traced to the exact template that produced it, and results remain reproducible when the template changes.
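The five steps above can be sketched as a pipeline. Every type and function below is illustrative only: real inference goes through candle and the bundled HuggingFace tokenizer, and the actual `SentinelTelemetryFeed` and `SentinelSummary` shapes in kairos-sentinel may differ.

```rust
// Illustrative skeleton of the SLM backend's five steps. Each stage is a
// stub so the end-to-end control flow is visible without a model on disk.

const PROMPT_VERSION: &str = "sentinel.phase_b.v1";

struct SentinelTelemetryFeed { events: Vec<String> }

#[derive(Debug, PartialEq)]
enum RiskLevel { Low, Elevated, High }

#[derive(Debug)]
struct SentinelSummary {
    backend: &'static str,
    narrative: String,
    risk: RiskLevel,
    highlights: Vec<String>,
}

// Step 1: render the versioned prompt template from the telemetry feed.
fn render_prompt(feed: &SentinelTelemetryFeed) -> String {
    format!("[{}] Summarize {} telemetry events.", PROMPT_VERSION, feed.events.len())
}

// Step 2: tokenize (whitespace split standing in for the HF tokenizer).
fn tokenize(prompt: &str) -> Vec<String> {
    prompt.split_whitespace().map(str::to_owned).collect()
}

// Step 3: run inference (stub echoing a canned completion; really candle).
fn infer(_tokens: &[String]) -> String {
    "risk=low|Nothing unusual observed.|highlight:steady baseline".to_owned()
}

// Steps 4-5: parse the raw output, then validate its structure.
fn parse_and_validate(raw: &str) -> Result<SentinelSummary, String> {
    let mut parts = raw.split('|');
    let risk = match parts.next() {
        Some("risk=low") => RiskLevel::Low,
        Some("risk=elevated") => RiskLevel::Elevated,
        Some("risk=high") => RiskLevel::High,
        other => return Err(format!("unparseable risk field: {:?}", other)),
    };
    let narrative = parts.next().ok_or("missing narrative")?.to_owned();
    let highlights: Vec<String> = parts
        .filter_map(|p| p.strip_prefix("highlight:"))
        .map(str::to_owned)
        .collect();
    if narrative.is_empty() {
        return Err("empty narrative".into());
    }
    Ok(SentinelSummary { backend: "slm", narrative, risk, highlights })
}

fn summarize(feed: &SentinelTelemetryFeed) -> Result<SentinelSummary, String> {
    let prompt = render_prompt(feed);
    let tokens = tokenize(&prompt);
    let raw = infer(&tokens);
    parse_and_validate(&raw)
}

fn main() {
    let feed = SentinelTelemetryFeed { events: vec!["cpu spike".into()] };
    let summary = summarize(&feed).expect("valid summary");
    println!("{} -> {:?}", summary.backend, summary.risk);
}
```

Note that parsing and validation are separate, fallible stages: because sampling is non-deterministic, a production backend must be prepared for output that does not match the expected structure and fail cleanly rather than return a malformed summary.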

| Property | Value |
| --- | --- |
| Model | SmolLM2-1.7B-Instruct |
| Quantization | GGUF Q4_K_M (~1 GB on disk) |
| Inference framework | candle (Rust-native) |
| Latency | ~100ms per summary (CPU) |
| Dependencies | candle-core, candle-transformers, tokenizers |
| Feature flag | kairos-sentinel/slm |
| Backend identifier | slm in SentinelSummary.backend |
| Network | None — fully offline inference |

The SLM backend is the right choice when:

  • Narrative quality matters — richer, more contextual descriptions than templates
  • ~100ms latency is acceptable — not suitable for sub-millisecond requirements
  • Model management is feasible — you can deploy and verify the ~1 GB model bundle
  • Offline operation is required — all inference runs locally, no network calls

For deterministic, zero-dependency operation, see the template backend.

| Aspect | Template | SLM |
| --- | --- | --- |
| Latency | < 1ms | ~100ms |
| Narrative quality | Structured, formulaic | Contextual, natural |
| Determinism | Fully deterministic | Non-deterministic (sampling) |
| Disk footprint | None | ~1 GB model bundle |
| Dependencies | None | candle, tokenizers |
| Feature flag | Always on | slm |
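To make the trade-off concrete, backend selection could be expressed as in the hypothetical sketch below, where both backends sit behind one trait and the caller picks based on latency budget and bundle availability. None of these types exist in kairos-sentinel under these names; this only illustrates the decision criteria listed above.

```rust
// Hypothetical sketch: both backends behind one trait, chosen by caller needs.
trait SummaryBackend {
    fn id(&self) -> &'static str;
    fn summarize(&self, events: &[String]) -> String;
}

struct TemplateBackend;
struct SlmBackend;

impl SummaryBackend for TemplateBackend {
    fn id(&self) -> &'static str { "template" }
    fn summarize(&self, events: &[String]) -> String {
        // Deterministic, formulaic narrative.
        format!("{} events observed.", events.len())
    }
}

impl SummaryBackend for SlmBackend {
    fn id(&self) -> &'static str { "slm" }
    fn summarize(&self, events: &[String]) -> String {
        // Stand-in for candle inference; real output is model-generated text.
        format!("Telemetry shows {} notable events against a steady baseline.", events.len())
    }
}

/// Pick the SLM backend only when ~100ms latency is acceptable and the
/// ~1 GB model bundle has been downloaded and verified.
fn choose_backend(latency_budget_ms: u64, bundle_present: bool) -> Box<dyn SummaryBackend> {
    if latency_budget_ms >= 100 && bundle_present {
        Box::new(SlmBackend)
    } else {
        Box::new(TemplateBackend)
    }
}

fn main() {
    let backend = choose_backend(250, true);
    println!("chosen backend: {}", backend.id());
}
```

Falling back to the template backend whenever the bundle is missing preserves the zero-dependency path without requiring callers to special-case a failed model load.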