CALM — Catastrophically Abridged Language Models

Overview

CALM is Lilush's compact language model system: a family of small Mamba SSM models that serve as domain-specific completion sources, embedding models for semantic search, and similar tasks.

CALM uses a Mamba selective state space model (SSM) as its sequence mixer. CALM is included in the lilush build only; it is not part of the lilu minimal runtime.

CALM is still in the experimental phase.

Design Goals

Smallest viable model
Fast CPU inference
Multi-domain
Fully self-contained, no external dependencies

The same C code supports both inference and training. All training, including base model training from scratch, can run through the C implementation in the lilush binary, with Lua driving the training loop. No external toolchain (Python, PyTorch) is required at any stage.

A companion Python implementation of CALM also exists for GPU training; it is optional.

Architecture Overview

The model is a decoder-only language model following the pre-norm residual pattern with a Mamba selective SSM sequence mixer.

Input token IDs (L)
        │
        ▼
┌─────────────────┐
│ Token Embedding │  (vocab_size × d_model)
└────────┬────────┘
         │
         ▼
┌─────────────────┐ ─┐
│ LayerNorm       │  │
│ Mixer (Mamba    │  │  × n_layers
│  SSM)           │  │
│ + Residual      │  │
│ LayerNorm       │  │
│ FFN (GELU)      │  │
│ + Residual      │  │
└────────┬────────┘ ─┘
         │
         ▼
┌─────────────────┐
│ Final LayerNorm │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LM Head         │  (d_model × vocab_size, tied with embedding)
└────────┬────────┘
         │
         ▼
   Logits (L, vocab_size)

Block structure

Each block follows the pre-norm residual convention:

x → LayerNorm → Mixer (Mamba) → (+x) → LayerNorm → FFN → (+x)

The FFN is a standard two-layer MLP with GELU activation:

FFN(x) = W2 · GELU(W1 · x)

Where W1 projects from d_model to d_model × ffn_expand and W2 projects back. No bias on either linear layer (following modern practice for small models).

Weight tying

The LM head (output projection) shares weights with the token embedding matrix. This halves the parameter cost of the vocabulary, which matters when the embedding table is a sizeable fraction of total parameters, as it is for our small models with a 320-token vocabulary.
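
To make the tying concrete, the sketch below reads one embedding table for both the input lookup and the output projection. The names `embed_lookup` and `lm_head_logits` and the row-major layout are hypothetical, chosen only for this example.

```c
#include <stddef.h>

/* emb: [vocab_size x d_model] row-major token embedding table (assumed layout).
 * Input side: return a pointer to the embedding row for one token. */
const float *embed_lookup(const float *emb, size_t token, size_t d_model) {
    return emb + token * d_model;
}

/* Output side: logits[v] = dot(emb[v], h). This is h multiplied by the
 * transposed embedding matrix, so no separate LM head weights are needed. */
void lm_head_logits(const float *emb, const float *h, float *logits,
                    size_t vocab_size, size_t d_model) {
    for (size_t v = 0; v < vocab_size; v++) {
        float acc = 0.0f;
        for (size_t i = 0; i < d_model; i++)
            acc += emb[v * d_model + i] * h[i];
        logits[v] = acc;
    }
}
```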

No positional embeddings

The Mamba operator's SSM recurrence implicitly encodes positional information through its sequential state evolution. Explicit positional embeddings (sinusoidal or learned) are not used. This simplifies the architecture and removes a parameter/computation cost.

Mamba Operator

The Mamba operator is the sequence mixer based on selective state space models (Mamba-1, Gu & Dao 2023, arXiv:2312.00752). Its sequential recurrence gives a per-token decode cost that does not grow with sequence length (O(1) in L).

Mamba operator computation

Given input x of shape (L, d_model):

1. Project:    z, x' = split(W_in(x))         W_in: [d_model → 2 × d_inner]
2. Conv1d:     x' = SiLU(DepthwiseConv1D(x', k=d_conv))   causal, per-channel
3. SSM proj:   Δ_raw, B, C = split(W_x(x'))   W_x: [d_inner → dt_rank + 2×d_state]
4. Δ project:  Δ = softplus(W_dt(Δ_raw) + bias)  W_dt: [dt_rank → d_inner]
5. Discretize: A_bar = exp(Δ ⊙ A)             A = -exp(A_log), shape (d_inner × d_state)
               B_bar = Δ ⊙ B
6. SSM scan:   h[t] = A_bar[t] ⊙ h[t-1] + B_bar[t] ⊙ x'[t]
               y[t] = C[t] · h[t] + D · x'[t]
7. Gate:       y = y ⊙ SiLU(z)
8. Out proj:   out = W_out(y)                  W_out: [d_inner → d_model]

Mamba parameters

Parameter	Description	Default
`expand`	Inner expansion: `d_inner = d_model × expand`	2
`d_state`	SSM state dimensions per channel	16
`d_conv`	Short convolution kernel size	4
`dt_rank`	Δ projection bottleneck	`ceil(d_model / 16)`
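
The derived dimensions follow directly from the defaults in the table. The helper below is a small sketch of that arithmetic; the struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t d_inner;    /* d_model × expand */
    uint32_t dt_rank;    /* ceil(d_model / 16) */
    uint32_t x_proj_out; /* dt_rank + 2 × d_state, the output width of W_x */
} mamba_dims;

mamba_dims mamba_derive_dims(uint32_t d_model, uint32_t expand, uint32_t d_state) {
    mamba_dims d;
    d.d_inner = d_model * expand;
    d.dt_rank = (d_model + 15u) / 16u; /* integer ceiling division */
    d.x_proj_out = d.dt_rank + 2u * d_state;
    return d;
}
```

For example, d_model = 128 with expand = 3 and d_state = 16 gives d_inner = 384, dt_rank = 8, and an x_proj output width of 40.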

A parameterization

The transition matrix A uses the full (d_inner, d_state) parameterization from Mamba-1. Each (channel, state) pair has its own learned decay rate. A is stored in log-space as A_log and negated during discretization: A = -exp(A_log). This ensures stability (all eigenvalues are negative, producing decaying dynamics).

Initialization follows the S4D convention: A_log[c][n] = log(n + 1), giving each state dimension a different initial timescale.

D skip connection

Each channel has a learned skip parameter D[c] that adds a direct path from the SSM input to output: y[t][c] += D[c] × x'[t][c]. Initialized to ones. A_log and D are excluded from weight decay during training.

State-cached generation

Mamba's SSM state enables efficient autoregressive generation:

Prompt processing: Full forward pass over the prompt populates ssm_state and conv_state in each block.
Token generation: Each new token runs calm_mamba_step() per block, updating the persistent state in O(d_model²) time — independent of prompt length.
Multi-candidate: The state after prompt processing is snapshotted once and restored for each candidate (~180 KB for Mini).
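
The snapshot size can be estimated from the model dimensions. The sketch below assumes each block keeps an fp32 ssm_state of d_inner × d_state values and an fp32 conv_state of d_inner × d_conv values; the exact conv_state shape is an assumption, and the function name is hypothetical.

```c
#include <stddef.h>

/* Bytes needed to snapshot the recurrent state of every block (fp32).
 * Assumes ssm_state is [d_inner x d_state] and conv_state is [d_inner x d_conv]. */
size_t state_snapshot_bytes(size_t n_layers, size_t d_inner,
                            size_t d_state, size_t d_conv) {
    size_t per_block = d_inner * d_state + d_inner * d_conv;
    return n_layers * per_block * sizeof(float);
}
```

Under these assumptions, Mini (n_layers = 6, d_inner = 384, d_state = 16, d_conv = 4) comes to 184,320 bytes, which is 180 KB.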

Model Configurations

All configurations use:

vocab_size: 320 (byte-level tokenizer: 256 byte + 22 special + 42 reserved). See CALM_TOKENIZER.md for details.
l_max: 768 (context window)
tie_weights: true
d_state: 16, d_conv: 4

expand scales with model size: 2 (Nano/Micro), 3 (Mini), 4 (Small).

Name	d_model	n_layers	expand	ffn_expand	Params	Weights (fp32)
Nano	64	3	2	2	~283K	~1.1 MB
Micro	96	5	2	3	~741K	~2.9 MB
Mini	128	6	3	4	~2.16M	~8.4 MB
Small	192	8	4	4	~7.29M	~28.4 MB

Use calm benchmark to measure inference latency for each configuration on your hardware.

Scaling constraints

All dimension limits are enforced either by header field sizes or by explicit validation in calm_model_new(). The current limits are well above what the four model configs require.

Dimension	Type	Hard limit	Code limit	Current max use
`d_model`	uint16_t	65,535	none	192 (Small)
`n_layers`	uint8_t	255	16 (`CALM_MAX_LAYERS`)	8 (Small)
`ffn_expand`	uint8_t	255	> 0	4 (Mini/Small)
`l_max`	uint16_t	65,535	none	768
`vocab_size`	uint16_t	65,535	none (default 320)	320
`param_count`	uint32_t	~4.3B	none	~7.29M (Small)
`d_inner`	—	—	= `d_model × expand`	768 (Small)
`d_state`	uint8_t (header)	255	none	16
`d_conv`	uint8_t (header)	255	none	4
`expand`	uint8_t (header)	255	none	4 (Small)
`dt_rank`	uint8_t (header)	255	none	12 (Small)

The CALM_MAX_LAYERS constant is in calm.h.
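
The limits in the table can be expressed as a single range check. The sketch below is not the validation in calm_model_new(); the function name, the argument list, and `SKETCH_MAX_LAYERS` (standing in for CALM_MAX_LAYERS) are assumptions, and dt_rank is taken to follow its default rule.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LAYERS 16u /* stands in for CALM_MAX_LAYERS */

/* Returns true when every dimension fits its header field and code limit. */
bool config_in_limits(uint32_t d_model, uint32_t n_layers, uint32_t ffn_expand,
                      uint32_t l_max, uint32_t vocab_size, uint32_t expand,
                      uint32_t d_state, uint32_t d_conv) {
    /* uint16_t header fields */
    if (d_model > UINT16_MAX || l_max > UINT16_MAX || vocab_size > UINT16_MAX)
        return false;
    /* explicit code limits */
    if (n_layers > SKETCH_MAX_LAYERS) return false;
    if (ffn_expand == 0 || ffn_expand > UINT8_MAX) return false;
    /* uint8_t header fields */
    if (expand > UINT8_MAX || d_state > UINT8_MAX || d_conv > UINT8_MAX)
        return false;
    /* default dt_rank = ceil(d_model / 16) must also fit in a uint8_t */
    uint32_t dt_rank = (d_model + 15u) / 16u;
    return dt_rank <= UINT8_MAX;
}
```

If dt_rank keeps its default rule, its 8-bit field is reached at d_model = 4080, well before the 16-bit d_model field itself.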

Inference

Forward pass

For a single completion request:

Build input sequence: Context frames + current input, tokenized per the tokenizer spec. Result: token ID array of length L ≤ l_max (768 by default).
Embedding lookup: Map token IDs to d_model-dimensional vectors.
Blocks: For each block, apply LayerNorm → Mamba → residual → LayerNorm → FFN → residual.
Output: Apply final LayerNorm, then LM head (= embedding matrix transposed) to get logits over vocabulary.
Sample/select: Apply temperature, top-k, or greedy selection to get the next token.
Repeat: Append the predicted token and advance the model by one step (a state-cached step rather than a full forward pass; see Autoregressive generation). Stop at <EOS>, a structural boundary, or the token limit.

Autoregressive generation

After the initial prompt forward pass, each generated token uses calm_mamba_step(), which runs in O(d_model²) time per block, independent of prompt length. The SSM state and conv state persist across tokens. For multi-candidate generation, the state after prompt processing is snapshotted (~180 KB for Mini) and restored per candidate.

Sampling

CALM supports greedy decoding and a multi-stage sampling pipeline, configurable at inference time. Shell commands have low entropy at most positions: after git c, for example, there are only a few plausible continuations, so a small candidate set is usually enough.

When any sampler is enabled, token selection follows a sequential filtering pipeline:

Softmax — logits are converted to probabilities (with temperature scaling)
Min-p filtering (min_p) — removes tokens whose probability is below max_prob × min_p. Adapts to the distribution shape: keeps more tokens when the model is uncertain, fewer when confident.
Top-k filtering (top_k) — keeps only the k highest-probability tokens.
Top-p filtering (top_p, nucleus sampling) — keeps tokens in descending probability order until cumulative probability exceeds the threshold. Adapts the candidate set size to the distribution.
Re-normalize and sample — the surviving probabilities are re-normalized and a token is drawn. If all probabilities were zeroed by the filters, falls back to argmax over the original logits.

All three filters compose and each can be independently disabled by setting its value to 0. Greedy decoding (argmax) is used when all three are disabled (top_k=0, top_p=0, min_p=0).

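The default completion source uses top_k=5; its other fallback values, including temperature, are listed in the hardcoded fallback table under Sampler configuration.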

Shell Integration

Shell command completion is CALM's primary interactive domain. The shell domain model uses session context (working directory, git status, command history, traditional completions) to predict command continuations.

The `calm` builtin

CALM is managed through the calm shell builtin:

Command	Description
`calm status`	Show model status, architecture info, sampler defaults
`calm enable`	Enable CALM predictions
`calm disable`	Disable CALM predictions
`calm init [--size S]`	Initialize a new model with random weights
`calm dataset [-n N]`	Build CTDS dataset from shell history
`calm dataset --from FILE [-o PATH]`	Convert text training data to CTDS
`calm dataset --view [-i N] [-c N]`	View CTDS dataset contents
`calm train [options]`	Train the model (foreground)
`calm evaluate [-m MODEL] [-d DATASET]`	Evaluate model loss on a dataset
`calm benchmark`	Measure inference latency across model sizes
`calm generate [-m MODEL] [-i INPUT] [options]`	One-shot completion generation (default `--max-tokens 256`)
`calm model [name|auto]`	List available models or switch the active model
`calm meta [options]`	View or modify model metadata and sampler defaults
`calm reset`	Delete model files

CALM completions are disabled by default. Run calm enable to activate after initializing and training a model.

calm init creates a model with random weights at ~/.local/share/lilush/calm/<domain>.cwgt (e.g. shell.cwgt). The --size option selects the model configuration: nano (default), micro, mini, or small. It does not download a pretrained model.

calm generate runs one-shot completion on the given input. Input is provided via --input / -i or read from stdin (pipe-friendly). It accepts the --top-k, --top-p / -p, --min-p, --temperature, --max-tokens, --candidates, and --raw / -r options. In raw mode, <NAME> patterns in the input (e.g. <BOS>, <CMD>, <ATN>) are replaced with the corresponding special token IDs, giving full control over the token sequence.

By default, output includes model info, scored candidates, and generation stats (tokens, time, tok/s). Use --quiet / -q to output only the completion text, --full to prepend the prompt to each completion, and --special-tokens / -s to render special tokens as <BOS> etc. instead of stripping them.

For calm dataset and calm train CLI details, see CALM Training and CALM Dataset.

Completion source

CALM acts as a completion source (src/shell/shell/completion/source/calm.lua) that provides ghost text predictions alongside traditional completions.

Lazy loading: The model is loaded on the first completion request, not at shell startup, to avoid startup latency from reading the weight file.

Hot-reload: On each completion request, the weight file's mtime is checked. If the file has changed (e.g., after calm train), the model is automatically reloaded.

Model resolution: The active model is resolved via the model registry (~/.config/lilush/calm/registry.json). The default model name is shell. Override with the LILUSH_CALM_MODEL environment variable (takes an absolute path). Use calm model <name> to switch manually, or calm model auto to re-enable mode-based auto-switching.

Template-driven prompt construction: The completion source inspects the loaded model's template to decide how to build the prompt. Context gathering is driven by the template's field names, not the domain name:

If the template includes cwd, git, history, completions, or env fields, the corresponding shell context is gathered automatically (working directory, git status, command history, traditional completion candidates, environment hints).
The input field receives the raw user input.
For templates with no named fields (bare tokens only, e.g. BOS;ATN), the raw input text is appended after <ATN>.
For templates with unrecognized field names, the completion source parses the input for inline field:value patterns. For example, with template BOS;WORD:headword/POS:pos;ATN and input pos:n. headword:anything you want, the fields are parsed as pos="n." and headword="anything you want". If no field patterns are found in the input, the entire input is mapped to the input field (if present in the template) or the first named field.

This design means any domain model that includes context fields in its template (e.g. CWD:cwd;GIT:git) automatically gets that context without domain-specific code in the completion source.

Sampler configuration: Inference parameters are resolved from two sources (highest priority first): model sampler defaults (stored in the CWGT header), and hardcoded fallbacks.

Parameter	Hardcoded fallback
`top_k`	`5`
`top_p`	`0` (disabled)
`min_p`	`0` (disabled)
`temperature`	`0.7`
`max_tokens`	`20`
`num_candidates`	`3`

File locations

Path	Purpose
`~/.local/share/lilush/calm/*.cwgt`	Model weight files (one per domain)
`~/.local/share/lilush/calm/train.ctds`	Default training dataset
`~/.config/lilush/calm/registry.json`	Model registry (mode→model mapping)

MNEME storage

CALM training data is sourced from ~/.local/share/lilush/shell.mneme, the shell's main MNEME database. The calm dataset command reads history from the shell keyspace (sorted set entries) and completion candidates from the completions keyspace.

Lua API

Tokenizer operations

For detailed tokenizer documentation, see CALM_TOKENIZER.md.

local calm = require("calm")

-- Tokenize text (byte-level identity mapping)
local tokens = calm.tokenize("git commit -m 'fix'")
-- Returns byte values: {103, 105, 116, 32, 99, 111, ...}

-- Decode tokens back to text
local text = calm.detokenize(tokens)

-- Build a full input sequence with context
-- First argument is a model userdata or template spec string
local seq = calm.build_sequence(model, {
    cwd = "/home/user/lilush",
    git = "main+3",
    history = {
        { cmd = "git diff --stat", exit = 0 },
        { cmd = "git status",      exit = 0 },
    },
    completions = { "commit", "checkout", "cherry-pick", "clone" },
    input = "git com",
})

Model operations

local calm = require("calm")

-- Load model from weight file; err carries the message when loading fails
local model, err = calm.load_model("/path/to/weights.cwgt")
if not model then
    error(err)
end

-- Initialize a new model with random weights (Mini configuration)
local fresh = calm.new_model({
    d_model = 128, n_layers = 6, ffn_expand = 4,
    expand = 3, d_state = 16, d_conv = 4,
    l_max = 768,
})

-- Get model info
local info = model:info()
-- Fields include info.d_state, info.d_conv, info.expand, info.dt_rank,
-- and info.d_inner (= d_model * expand)

-- Run inference: generate completions
-- seq is a token ID array from calm.build_sequence()
local completions = model:complete(seq, {
    max_tokens = 20,        -- max tokens to generate
    top_k = 5,              -- top-k sampling (0 = disabled)
    top_p = 0.9,            -- nucleus (top-p) sampling (0.0 = disabled)
    min_p = 0.05,           -- min-p filtering (0.0 = disabled)
    temperature = 0.8,      -- sampling temperature (1.0 = neutral)
    use_stop_conditions = true, -- apply model's stop conditions
    num_candidates = 3,     -- number of independent completions to generate
})
-- completions = {
--   { tokens = {...}, text = "commit -m \"", score = -2.3 },
--   { tokens = {...}, text = "checkout ", score = -3.1 },
--   { tokens = {...}, text = "clone ", score = -4.7 },
-- }

-- Single forward pass — returns logits for last position only [vocab_size]
local logits = model:forward(token_ids)

-- Forward pass with loss — returns loss value
local loss = model:forward_loss(token_ids)

-- Generate embeddings — returns d_model-dimensional vector
local embedding = model:embed(token_ids, {
    pool = "mean",        -- pooling: "mean" or "last" (default: "mean")
    normalize = true,     -- L2-normalize output (default: false)
})

-- Save model weights
model:save("/path/to/weights.cwgt")

-- Unload model, free memory
model:close()

Training and dataset operations

See CALM Training — Lua API and CALM Dataset — Lua API.

C Implementation

Source layout

src/calm/
  calm.h                 -- Public C API (model + tokenizer + mixer types)
  calm_lua.c             -- Lua bindings
  tokenizer.c            -- Byte-level tokenizer (see CALM_TOKENIZER.md)
  tensor.c               -- Tensor operations: matmul, elementwise, SiLU, softplus
  layernorm.c            -- LayerNorm forward + backward
  mamba.c                -- Mamba operator: SSM scan, state-cached decode,
                            forward + backward + weight init
  model.c                -- Full model: embedding, blocks, mixer dispatch,
                            LM head, forward pass, state-cached generation
  train.c                -- Backward pass, loss, optimizer (Adam/SGD),
                            EWC, contrastive training, selective weight decay
  serialize.c            -- Weight save/load (CWGT v5 format)
  Makefile

Memory management

The model uses three memory regions:

Weights: Allocated once at load time (or init time for training from scratch). Size = param_count × 4 bytes (fp32). Laid out as a single contiguous block in a fixed order matching the serialization format.

Activations: Allocated once at init time, sized for the maximum sequence length. Reused across forward passes. For inference, this is the working memory for intermediate results. For training, this includes activation storage needed for the backward pass.

Optimizer state (training only): Allocated when training starts. Adam requires 2 × param_count × 4 bytes for moment estimates. Freed after training completes.

No per-call heap allocation. All buffers are pre-allocated based on model config.

Activation memory estimate (Mini config, inference)

Mini (d_model=128, d_inner=384, d_ffn=512):

Shared buffers (in calm_activations_t, reused across blocks):

Buffer	Shape	Size
residual	768 × 128	384 KB
ln_out	768 × 128	384 KB
ffn_mid	768 × 512	1.5 MB
logits	768 × 320	960 KB

Mamba buffers (in calm_mamba_activations_t, reused across blocks):

Buffer	Shape	Size
z (gate branch)	768 × 384	1.1 MB
x (SSM branch)	768 × 384	1.1 MB
x_conv (transposed)	384 × 768	1.1 MB
conv_out	384 × 768	1.1 MB
x_post (after conv+SiLU)	768 × 384	1.1 MB
ssm_proj	768 × 40	120 KB
dt (discretized Δ)	768 × 384	1.1 MB
y (SSM output)	768 × 384	1.1 MB
mixer_out (after gating)	768 × 384	1.1 MB
in_proj_out	768 × 768	2.3 MB
A (precomputed)	384 × 16	24 KB
Total (approx)		~13 MB

All buffers are reused across blocks — only one block's activations are live at a time during inference.

Core tensor operations

The model requires a small set of operations, all operating on contiguous fp32 arrays:

Operation	Usage	Notes
MatMul (A × B)	Projections, FFN, LM head	Inner loop: fused multiply-accumulate
Element-wise multiply	Gating	Auto-vectorizes
Element-wise add	Residual connections
GELU	FFN activation	Approximate: `0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))`
SiLU	Mamba conv activation + gating	`x × sigmoid(x)`
Softplus	Mamba Δ projection	`log(1 + exp(x))` with overflow guard
LayerNorm	Pre-norm in each block, final norm	Mean + variance + normalize + scale/shift
Softmax	Output sampling	Applied to final logits, 320-wide
Depthwise Conv1D	Short filter (Mamba)	Kernel size 4, causal
Cross-entropy loss	Training	Log-softmax + NLL

Weight serialization order

Weights are serialized in a fixed order matching the model structure. See CWGT v5 format for the binary file specification.

token_emb.weight              [vocab_size × d_model]

for each block i = 0..n_layers-1:
  block[i].ln1.weight         [d_model]
  block[i].ln1.bias           [d_model]
  block[i].mixer.in_proj      [d_model × (2 × d_inner)]
  block[i].mixer.conv1d       [d_inner × d_conv]
  block[i].mixer.x_proj       [d_inner × (dt_rank + 2 × d_state)]
  block[i].mixer.dt_proj_w    [dt_rank × d_inner]
  block[i].mixer.dt_proj_b    [d_inner]
  block[i].mixer.A_log        [d_inner × d_state]
  block[i].mixer.D            [d_inner]
  block[i].mixer.out_proj     [d_inner × d_model]
  block[i].ln2.weight         [d_model]
  block[i].ln2.bias           [d_model]
  block[i].ffn_fc1.weight     [d_model × (d_model × ffn_expand)]
  block[i].ffn_fc2.weight     [(d_model × ffn_expand) × d_model]

ln_f.weight                   [d_model]
ln_f.bias                     [d_model]

(lm_head.weight tied to token_emb.weight — not serialized)