CALM — Catastrophically Abridged Language Models

Overview

CALM is Lilush's compact language model system: a family of small Mamba SSM models that serve as domain-specific completion sources, embedding models for semantic search, and similar lightweight roles.

CALM uses a Mamba selective state space model (SSM) as its sequence mixer. CALM is included in the lilush build only; it is not part of the lilu minimal runtime.

CALM is still in the experimental phase.

Related documents:

Design Goals

Architecture Overview

The model is a decoder-only language model following the pre-norm residual pattern with a Mamba selective SSM sequence mixer.

Input token IDs (L)
        │
        ▼
┌─────────────────┐
│ Token Embedding │  (vocab_size × d_model)
└────────┬────────┘
         │
         ▼
┌─────────────────┐ ─┐
│ LayerNorm       │  │
│ Mixer (Mamba    │  │  × n_layers
│  SSM)           │  │
│ + Residual      │  │
│ LayerNorm       │  │
│ FFN (GELU)      │  │
│ + Residual      │  │
└────────┬────────┘ ─┘
         │
         ▼
┌─────────────────┐
│ Final LayerNorm │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LM Head         │  (d_model × vocab_size, tied with embedding)
└────────┬────────┘
         │
         ▼
   Logits (L, vocab_size)

Block structure

Each block follows pre-norm residual convention:

x → LayerNorm → Mixer (Mamba) → (+x) → LayerNorm → FFN → (+x)

The FFN is a standard two-layer MLP with GELU activation:

FFN(x) = W2 · GELU(W1 · x)

Where W1 projects from d_model to d_model × ffn_expand and W2 projects back. No bias on either linear layer (following modern practice for small models).

Weight tying

The LM head (output projection) shares weights with the token embedding matrix. This halves the parameter cost of the vocabulary — significant when the embedding table is a large fraction of total parameters (as it is for our small models with 320-token vocabulary).
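A small sketch of what tying means in practice: the same [vocab_size × d_model] matrix is read row-wise for the embedding lookup and used as the output projection. The function names and the row-major layout are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Embedding lookup: returns a pointer to row `token` of the shared matrix. */
static const float *embed_lookup(const float *emb, size_t d_model, size_t token) {
    return emb + token * d_model;
}

/* Tied LM head: logits[v] = dot(emb[v], h). No separate output matrix exists. */
static void lm_head_tied(const float *emb, const float *h, float *logits,
                         size_t vocab_size, size_t d_model) {
    assert(vocab_size > 0 && d_model > 0);
    for (size_t v = 0; v < vocab_size; v++) {
        float acc = 0.0f;
        for (size_t i = 0; i < d_model; i++)
            acc += emb[v * d_model + i] * h[i];
        logits[v] = acc;
    }
}
```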

No positional embeddings

The Mamba operator's SSM recurrence implicitly encodes positional information through its sequential state evolution. Explicit positional embeddings (sinusoidal or learned) are not used. This simplifies the architecture and removes a parameter/computation cost.

Mamba Operator

The Mamba operator is the sequence mixer, based on selective state space models (Mamba-1, Gu & Dao 2023, arXiv:2312.00752). It uses a sequential recurrence whose per-token decode cost is O(1) in the sequence length.

Mamba operator computation

Given input x of shape (L, d_model):

1. Project:    z, x' = split(W_in(x))         W_in: [d_model → 2 × d_inner]
2. Conv1d:     x' = SiLU(DepthwiseConv1D(x', k=d_conv))   causal, per-channel
3. SSM proj:   Δ_raw, B, C = split(W_x(x'))   W_x: [d_inner → dt_rank + 2×d_state]
4. Δ project:  Δ = softplus(W_dt(Δ_raw) + bias)  W_dt: [dt_rank → d_inner]
5. Discretize: A_bar = exp(Δ ⊙ A)             A: (d_inner × d_state), log-space
               B_bar = Δ ⊙ B
6. SSM scan:   h[t] = A_bar[t] ⊙ h[t-1] + B_bar[t] ⊙ x'[t]
               y[t] = C[t] · h[t] + D · x'[t]
7. Gate:       y = y ⊙ SiLU(z)
8. Out proj:   out = W_out(y)                  W_out: [d_inner → d_model]

Mamba parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| expand | Inner expansion: d_inner = d_model × expand | 2 |
| d_state | SSM state dimensions per channel | 16 |
| d_conv | Short convolution kernel size | 4 |
| dt_rank | Δ projection bottleneck | ceil(d_model / 16) |
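The derived dimensions follow directly from these parameters. A short sketch with hypothetical helper names, using integer arithmetic for the ceiling:

```c
#include <assert.h>

/* d_inner = d_model × expand */
static unsigned mamba_d_inner(unsigned d_model, unsigned expand) {
    assert(d_model > 0 && expand > 0);
    return d_model * expand;
}

/* dt_rank = ceil(d_model / 16), computed without floating point. */
static unsigned mamba_dt_rank(unsigned d_model) {
    assert(d_model > 0);
    return (d_model + 15u) / 16u;
}

/* Output width of W_x: dt_rank + 2 × d_state (Δ_raw, B and C side by side). */
static unsigned mamba_x_proj_width(unsigned d_model, unsigned d_state) {
    return mamba_dt_rank(d_model) + 2u * d_state;
}
```

For d_model = 128 and d_state = 16 this gives dt_rank = 8 and an x_proj width of 40.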

A parameterization

The transition matrix A uses the full (d_inner, d_state) parameterization from Mamba-1. Each (channel, state) pair has its own learned decay rate. A is stored in log-space as A_log and negated during discretization: A = -exp(A_log). This ensures stability (all eigenvalues are negative, producing decaying dynamics).

Initialization follows the S4D convention: A_log[c][n] = log(n + 1), giving each state dimension a different initial timescale.

D skip connection

Each channel has a learned skip parameter D[c] that adds a direct path from the SSM input to output: y[t][c] += D[c] × x'[t][c]. Initialized to ones. A_log and D are excluded from weight decay during training.

State-cached generation

Mamba's SSM state enables efficient autoregressive generation:

  1. Prompt processing: Full forward pass over the prompt populates ssm_state and conv_state in each block.

  2. Token generation: Each new token runs calm_mamba_step() per block, updating the persistent state in O(d_model²) time — independent of prompt length.

  3. Multi-candidate: The state after prompt processing is snapshotted and restored for each candidate (~180 KB for Mini).
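The ~180 KB figure can be reproduced with a back-of-the-envelope calculation, assuming each block caches d_inner × d_state floats of SSM state plus d_inner × d_conv floats of convolution state. That layout is an assumption of this sketch; the real snapshot may hold additional or differently sized fields.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated bytes needed to snapshot the per-block decode state (fp32).
 * Assumes conv_state keeps d_conv samples per channel. */
static size_t decode_state_bytes(size_t n_layers, size_t d_inner,
                                 size_t d_state, size_t d_conv) {
    assert(n_layers > 0 && d_inner > 0);
    size_t per_block = d_inner * d_state + d_inner * d_conv;
    return n_layers * per_block * sizeof(float);
}
```

For Mini (n_layers = 6, d_inner = 384, d_state = 16, d_conv = 4) this evaluates to 184,320 bytes, which is 180 KB.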

Model Configurations

All configurations use:

expand scales with model size: 2 (Nano/Micro), 3 (Mini), 4 (Small).

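| Name | d_model | n_layers | expand | ffn_expand | Params | Weights (fp32) |
|------|---------|----------|--------|------------|--------|----------------|
| Nano | 64 | 3 | 2 | 2 | ~283K | ~1.1 MB |
| Micro | 96 | 5 | 2 | 3 | ~741K | ~2.9 MB |
| Mini | 128 | 6 | 3 | 4 | ~2.16M | ~8.4 MB |
| Small | 192 | 8 | 4 | 4 | ~7.29M | ~28.4 MB |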

Use calm benchmark to measure inference latency for each configuration on your hardware.

Scaling constraints

All dimension limits are enforced either by header field sizes or by explicit validation in calm_model_new(). The current limits are well above what the four model configs require.

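| Dimension | Type | Hard limit | Code limit | Current max use |
|-----------|------|------------|------------|-----------------|
| d_model | uint16_t | 65,535 | none | 192 (Small) |
| n_layers | uint8_t | 255 | 16 (CALM_MAX_LAYERS) | 8 (Small) |
| ffn_expand | uint8_t | 255 | > 0 | 4 (Mini/Small) |
| l_max | uint16_t | 65,535 | none | 768 |
| vocab_size | uint16_t | 65,535 | none (default 320) | 320 |
| param_count | uint32_t | ~4.3B | none | ~7.29M (Small) |
| d_inner | = d_model × expand | | | 768 (Small) |
| d_state | uint8_t (header) | 255 | none | 16 |
| d_conv | uint8_t (header) | 255 | none | 4 |
| expand | uint8_t (header) | 255 | none | 4 (Small) |
| dt_rank | uint8_t (header) | 255 | none | 12 (Small) |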

The CALM_MAX_LAYERS constant is in calm.h.

Inference

Forward pass

For a single completion request:

  1. Build input sequence: Context frames + current input, tokenized per the tokenizer spec. Result: token ID array of length L ≤ l_max (768 by default).

  2. Embedding lookup: Map token IDs to d_model-dimensional vectors.

  3. Blocks: For each block, apply LayerNorm → Mamba → residual → LayerNorm → FFN → residual.

  4. Output: Apply final LayerNorm, then LM head (= embedding matrix transposed) to get logits over vocabulary.

  5. Sample/select: Apply temperature, top-k, or greedy selection to get the next token.

  6. Repeat: Append predicted token to input, run forward pass again. Stop at <EOS>, structural boundary, or token limit.

Autoregressive generation

After the initial prompt forward pass, each generated token uses calm_mamba_step(), which runs in O(d_model²) time per block regardless of prompt length. The SSM state and conv state persist across tokens. For multi-candidate generation, the state after prompt processing is snapshotted (~180 KB for Mini) and restored per candidate.

Sampling

CALM supports greedy decoding and a multi-stage sampling pipeline, configurable at inference time. Shell commands have low entropy at most positions — after git c, there are only a few plausible continuations.

When any sampler is enabled, token selection follows a sequential filtering pipeline:

  1. Softmax — logits are converted to probabilities (with temperature scaling)

  2. Min-p filtering (min_p) — removes tokens whose probability is below max_prob × min_p. Adapts to the distribution shape: keeps more tokens when the model is uncertain, fewer when confident.

  3. Top-k filtering (top_k) — keeps only the k highest-probability tokens.

  4. Top-p filtering (top_p, nucleus sampling) — keeps tokens in descending probability order until cumulative probability exceeds the threshold. Adapts the candidate set size to the distribution.

  5. Re-normalize and sample — the surviving probabilities are re-normalized and a token is drawn. If all probabilities were zeroed by the filters, falls back to argmax over the original logits.

All three filters compose and each can be independently disabled by setting its value to 0. Greedy decoding (argmax) is used when all three are disabled (top_k=0, top_p=0, min_p=0).

The default completion source uses top_k=5 with temperature=0.8.

Shell Integration

Shell command completion is CALM's primary interactive domain. The shell domain model uses session context (working directory, git status, command history, traditional completions) to predict command continuations.

The calm builtin

CALM is managed through the calm shell builtin:

| Command | Description |
|---------|-------------|
| calm status | Show model status, architecture info, sampler defaults |
| calm enable | Enable CALM predictions |
| calm disable | Disable CALM predictions |
| calm init [--size S] | Initialize a new model with random weights |
| calm dataset [-n N] | Build CTDS dataset from shell history |
| calm dataset --from FILE [-o PATH] | Convert text training data to CTDS |
| calm dataset --view [-i N] [-c N] | View CTDS dataset contents |
| calm train [options] | Train the model (foreground) |
| calm evaluate [-m MODEL] [-d DATASET] | Evaluate model loss on a dataset |
| calm benchmark | Measure inference latency across model sizes |
| calm generate [-m MODEL] [-i INPUT] [options] | One-shot completion generation (default --max-tokens 256) |
| calm model [name\|auto] | List available models or switch the active model |
| calm meta [options] | View or modify model metadata and sampler defaults |
| calm reset | Delete model files |

CALM completions are disabled by default. Run calm enable to activate after initializing and training a model.

calm init creates a model with random weights at ~/.local/share/lilush/calm/<domain>.cwgt (e.g. shell.cwgt). The --size option selects the model configuration: nano (default), micro, mini, or small. It does not download a pretrained model.

calm generate runs one-shot completion on the given input. Input is provided via --input / -i or read from stdin (pipe-friendly). The command accepts the --top-k, --top-p / -p, --min-p, --temperature, --max-tokens, --candidates, and --raw / -r options. In raw mode, <NAME> patterns in the input are replaced with the corresponding special token IDs (e.g. <BOS>, <CMD>, <ATN>), giving full control over the token sequence.

By default, output includes model info, scored candidates, and generation stats (tokens, time, tok/s). Use --quiet / -q to output only the completion text, --full to prepend the prompt to each completion, and --special-tokens / -s to render special tokens as <BOS> etc. instead of stripping them.

For calm dataset and calm train CLI details, see CALM Training and CALM Dataset.

Completion source

CALM acts as a completion source (src/shell/shell/completion/source/calm.lua) that provides ghost text predictions alongside traditional completions.

Lazy loading: The model is loaded on the first completion request, not at shell startup, to avoid startup latency from reading the weight file.

Hot-reload: On each completion request, the weight file's mtime is checked. If the file has changed (e.g., after calm train), the model is automatically reloaded.
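The mtime check itself is a small amount of logic. The completion source is written in Lua; the sketch below expresses the same idea in C with POSIX stat(), using a hypothetical helper name, purely to illustrate the comparison.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Returns 1 if the file's mtime differs from *last_mtime (and updates it),
 * 0 if it is unchanged, and -1 if the file cannot be stat'ed. */
static int weights_changed(const char *path, time_t *last_mtime) {
    struct stat st;
    assert(path != NULL && last_mtime != NULL);
    if (stat(path, &st) != 0)
        return -1;
    if (st.st_mtime == *last_mtime)
        return 0;
    *last_mtime = st.st_mtime;
    return 1;
}
```

A caller would reload the model only when the function returns 1, and keep the already loaded weights otherwise.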

Model resolution: The active model is resolved via the model registry (~/.config/lilush/calm/registry.json). The default model name is shell. Override with the LILUSH_CALM_MODEL environment variable (takes an absolute path). Use calm model <name> to switch manually, or calm model auto to re-enable mode-based auto-switching.

Template-driven prompt construction: The completion source inspects the loaded model's template to decide how to build the prompt. Context gathering is driven by the template's field names, not the domain name:

This design means any domain model that includes context fields in its template (e.g. CWD:cwd;GIT:git) automatically gets that context without domain-specific code in the completion source.

Sampler configuration: Inference parameters are resolved from two sources (highest priority first): model sampler defaults (stored in the CWGT header), and hardcoded fallbacks.

| Parameter | Hardcoded fallback |
|-----------|--------------------|
| top_k | 5 |
| top_p | 0 (disabled) |
| min_p | 0 (disabled) |
| temperature | 0.7 |
| max_tokens | 20 |
| num_candidates | 3 |

File locations

| Path | Purpose |
|------|---------|
| ~/.local/share/lilush/calm/*.cwgt | Model weight files (one per domain) |
| ~/.local/share/lilush/calm/train.ctds | Default training dataset |
| ~/.config/lilush/calm/registry.json | Model registry (mode→model mapping) |

MNEME storage

CALM training data is sourced from ~/.local/share/lilush/shell.mneme, the shell's main MNEME database. The calm dataset command reads history from the shell keyspace (sorted set entries) and completion candidates from the completions keyspace.

Lua API

Tokenizer operations

For detailed tokenizer documentation, see CALM_TOKENIZER.md.

local calm = require("calm")

-- Tokenize text (byte-level identity mapping)
local tokens = calm.tokenize("git commit -m 'fix'")
-- Returns byte values: {103, 105, 116, 32, 99, 111, ...}

-- Decode tokens back to text
local text = calm.detokenize(tokens)

-- Build a full input sequence with context
-- First argument is a model userdata or template spec string
local seq = calm.build_sequence(model, {
    cwd = "/home/user/lilush",
    git = "main+3",
    history = {
        { cmd = "git diff --stat", exit = 0 },
        { cmd = "git status",      exit = 0 },
    },
    completions = { "commit", "checkout", "cherry-pick", "clone" },
    input = "git com",
})

Model operations

local calm = require("calm")

-- Load model from weight file
local model, err = calm.load_model("/path/to/weights.cwgt")

-- Initialize a new model with random weights
local model = calm.new_model({
    d_model = 128, n_layers = 6, ffn_expand = 4,
    expand = 2, d_state = 16, d_conv = 4,
    l_max = 768,
})

-- Get model info
local info = model:info()
-- info.d_state, info.d_conv, info.expand, info.dt_rank, info.d_inner = d_model*expand

-- Run inference: generate completions
-- seq is a token ID array from calm.build_sequence()
local completions = model:complete(seq, {
    max_tokens = 20,        -- max tokens to generate
    top_k = 5,              -- top-k sampling (0 = disabled)
    top_p = 0.9,            -- nucleus (top-p) sampling (0.0 = disabled)
    min_p = 0.05,           -- min-p filtering (0.0 = disabled)
    temperature = 0.8,      -- sampling temperature (1.0 = neutral)
    use_stop_conditions = true, -- apply model's stop conditions
    num_candidates = 3,     -- number of independent completions to generate
})
-- completions = {
--   { tokens = {...}, text = "commit -m \"", score = -2.3 },
--   { tokens = {...}, text = "checkout ", score = -3.1 },
--   { tokens = {...}, text = "clone ", score = -4.7 },
-- }

-- Single forward pass — returns logits for last position only [vocab_size]
local logits = model:forward(token_ids)

-- Forward pass with loss — returns loss value
local loss = model:forward_loss(token_ids)

-- Generate embeddings — returns d_model-dimensional vector
local embedding = model:embed(token_ids, {
    pool = "mean",        -- pooling: "mean" or "last" (default: "mean")
    normalize = true,     -- L2-normalize output (default: false)
})

-- Save model weights
model:save("/path/to/weights.cwgt")

-- Unload model, free memory
model:close()

Training and dataset operations

See CALM Training — Lua API and CALM Dataset — Lua API.

C Implementation

Source layout

src/calm/
  calm.h                 -- Public C API (model + tokenizer + mixer types)
  calm_lua.c             -- Lua bindings
  tokenizer.c            -- Byte-level tokenizer (see CALM_TOKENIZER.md)
  tensor.c               -- Tensor operations: matmul, elementwise, SiLU, softplus
  layernorm.c            -- LayerNorm forward + backward
  mamba.c                -- Mamba operator: SSM scan, state-cached decode,
                            forward + backward + weight init
  model.c                -- Full model: embedding, blocks, mixer dispatch,
                            LM head, forward pass, state-cached generation
  train.c                -- Backward pass, loss, optimizer (Adam/SGD),
                            EWC, contrastive training, selective weight decay
  serialize.c            -- Weight save/load (CWGT v5 format)
  Makefile

Memory management

The model uses three memory regions:

Weights: Allocated once at load time (or init time for training from scratch). Size = param_count × 4 bytes (fp32). Laid out as a single contiguous block in a fixed order matching the serialization format.

Activations: Allocated once at init time, sized for the maximum sequence length. Reused across forward passes. For inference, this is the working memory for intermediate results. For training, this includes activation storage needed for the backward pass.

Optimizer state (training only): Allocated when training starts. Adam requires 2 × param_count × 4 bytes for moment estimates. Freed after training completes.

No per-call heap allocation. All buffers are pre-allocated based on model config.

Activation memory estimate (Mini config, inference)

Mini (d_model=128, d_inner=384, d_ffn=512):

Shared buffers (in calm_activations_t, reused across blocks):

| Buffer | Shape | Size |
|--------|-------|------|
| residual | 768 × 128 | 384 KB |
| ln_out | 768 × 128 | 384 KB |
| ffn_mid | 768 × 512 | 1.5 MB |
| logits | 768 × 320 | 960 KB |

Mamba buffers (in calm_mamba_activations_t, reused across blocks):

| Buffer | Shape | Size |
|--------|-------|------|
| z (gate branch) | 768 × 384 | 1.1 MB |
| x (SSM branch) | 768 × 384 | 1.1 MB |
| x_conv (transposed) | 384 × 768 | 1.1 MB |
| conv_out | 384 × 768 | 1.1 MB |
| x_post (after conv+SiLU) | 768 × 384 | 1.1 MB |
| ssm_proj | 768 × 40 | 120 KB |
| dt (discretized Δ) | 768 × 384 | 1.1 MB |
| y (SSM output) | 768 × 384 | 1.1 MB |
| mixer_out (after gating) | 768 × 384 | 1.1 MB |
| in_proj_out | 768 × 768 | 2.3 MB |
| A (precomputed) | 384 × 16 | 24 KB |
| Total (approx) | | ~13 MB |

All buffers are reused across blocks — only one block's activations are live at a time during inference.
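Each size in the tables above is rows × columns × 4 bytes (fp32). A one-line helper, with a hypothetical name, makes the arithmetic explicit:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a rows x cols fp32 buffer. */
static size_t buffer_bytes(size_t rows, size_t cols) {
    assert(rows > 0 && cols > 0);
    return rows * cols * sizeof(float);
}
```

For example, residual is 768 × 128 × 4 = 393,216 bytes (384 KB) and ffn_mid is 768 × 512 × 4 = 1,572,864 bytes (1.5 MB).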

Core tensor operations

The model requires a small set of operations, all operating on contiguous fp32 arrays:

| Operation | Usage | Notes |
|-----------|-------|-------|
| MatMul (A × B) | Projections, FFN, LM head | Inner loop: fused multiply-accumulate |
| Element-wise multiply | Gating | Auto-vectorizes |
| Element-wise add | Residual connections | |
| GELU | FFN activation | Approximate: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))) |
| SiLU | Mamba conv activation + gating | x × sigmoid(x) |
| Softplus | Mamba Δ projection | log(1 + exp(x)) with overflow guard |
| LayerNorm | Pre-norm in each block, final norm | Mean + variance + normalize + scale/shift |
| Softmax | Output sampling | Applied to final logits, 320-wide |
| Depthwise Conv1D | Short filter (Mamba) | Kernel size 4, causal |
| Cross-entropy loss | Training | Log-softmax + NLL |

Weight serialization order

Weights are serialized in a fixed order matching the model structure. See CWGT v5 format for the binary file specification.

token_emb.weight              [vocab_size × d_model]

for each block i = 0..n_layers-1:
  block[i].ln1.weight         [d_model]
  block[i].ln1.bias           [d_model]
  block[i].mixer.in_proj      [d_model × (2 × d_inner)]
  block[i].mixer.conv1d       [d_inner × d_conv]
  block[i].mixer.x_proj       [d_inner × (dt_rank + 2 × d_state)]
  block[i].mixer.dt_proj_w    [dt_rank × d_inner]
  block[i].mixer.dt_proj_b    [d_inner]
  block[i].mixer.A_log        [d_inner × d_state]
  block[i].mixer.D            [d_inner]
  block[i].mixer.out_proj     [d_inner × d_model]
  block[i].ln2.weight         [d_model]
  block[i].ln2.bias           [d_model]
  block[i].ffn_fc1.weight     [d_model × (d_model × ffn_expand)]
  block[i].ffn_fc2.weight     [(d_model × ffn_expand) × d_model]

ln_f.weight                   [d_model]
ln_f.bias                     [d_model]

(lm_head.weight tied to token_emb.weight — not serialized)