
CALM is Lilush's compact language model system: a family of small Mamba SSM models that serve as domain-specific completion sources, embedding models for semantic search, and similar roles.
CALM uses a Mamba selective state space model (SSM) as its sequence mixer.
CALM is included in the lilush build only; it is not part of the lilu
minimal runtime.
CALM is still in the experimental phase.
Related documents:
- CALM Tokenizer: tokenizer specification and CWGT weight file format.
- CALM Training: training, fine-tuning, EWC, weight initialization.
- CALM Dataset: CTDS dataset format, training data, pipeline utilities.
Design goals:
- Smallest viable model
- Fast CPU inference
- Multi-domain
- Fully self-contained, no external dependencies
The same C code supports both inference and
training. All training — including base model training from scratch —
runs through the C implementation in the lilush binary, with Lua
driving the training loop. No external toolchain (Python, PyTorch) is
required at any stage.
There is, however, a companion Python CALM implementation for GPU training.
The model is a decoder-only language model following the pre-norm residual pattern with a Mamba selective SSM sequence mixer.
Input token IDs (L)
         │
         ▼
┌─────────────────┐
│ Token Embedding │  (vocab_size × d_model)
└────────┬────────┘
         │
         ▼
┌─────────────────┐ ─┐
│ LayerNorm       │  │
│ Mixer (Mamba    │  │  × n_layers
│   SSM)          │  │
│ + Residual      │  │
│ LayerNorm       │  │
│ FFN (GELU)      │  │
│ + Residual      │  │
└────────┬────────┘ ─┘
         │
         ▼
┌─────────────────┐
│ Final LayerNorm │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LM Head         │  (d_model × vocab_size, tied with embedding)
└────────┬────────┘
         │
         ▼
Logits (L, vocab_size)
Each block follows pre-norm residual convention:
x → LayerNorm → Mixer (Mamba) → (+x) → LayerNorm → FFN → (+x)
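The data flow above can be sketched for a single position as follows. This is only an illustration, not the C implementation: `layernorm`, `add`, `block_forward`, the parameter table `p`, and the `eps` value are names and choices made up for this sketch, and the real Mamba mixer operates over the whole sequence rather than one vector at a time.

```lua
-- LayerNorm over one d_model-sized vector (eps is an assumed value).
local function layernorm(x, weight, bias, eps)
  eps = eps or 1e-5
  local n = #x
  local mean = 0
  for i = 1, n do mean = mean + x[i] end
  mean = mean / n
  local var = 0
  for i = 1, n do var = var + (x[i] - mean) ^ 2 end
  var = var / n
  local inv = 1 / math.sqrt(var + eps)
  local out = {}
  for i = 1, n do out[i] = (x[i] - mean) * inv * weight[i] + bias[i] end
  return out
end

-- Element-wise sum of two vectors (the residual connection).
local function add(a, b)
  local out = {}
  for i = 1, #a do out[i] = a[i] + b[i] end
  return out
end

-- One pre-norm block: x + mixer(LN1(x)), then x + ffn(LN2(x)).
-- `mixer` and `ffn` are callables standing in for the Mamba operator and FFN.
local function block_forward(x, p, mixer, ffn)
  x = add(x, mixer(layernorm(x, p.ln1_w, p.ln1_b)))
  x = add(x, ffn(layernorm(x, p.ln2_w, p.ln2_b)))
  return x
end
```

Because the normalization sits inside each residual branch, the skip path carries `x` through unchanged, which is what "pre-norm" refers to.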
The FFN is a standard two-layer MLP with GELU activation:
FFN(x) = W2 · GELU(W1 · x)
Where W1 projects from d_model to d_model × ffn_expand and W2
projects back. No bias on either linear layer (following modern practice
for small models).
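A minimal pure-Lua sketch of this bias-free FFN, using the tanh approximation of GELU listed in the operations table. The helper names (`tanh`, `gelu`, `matvec`, `ffn`) and the row-major table layout are assumptions made for the sketch; the C code works on contiguous fp32 arrays.

```lua
-- tanh via exp, so the sketch does not depend on math.tanh
-- (which is missing from newer Lua versions).
local function tanh(x)
  if x > 20 then return 1 elseif x < -20 then return -1 end
  local e = math.exp(2 * x)
  return (e - 1) / (e + 1)
end

-- Approximate GELU: 0.5x(1 + tanh(sqrt(2/pi)(x + 0.044715x^3)))
local function gelu(x)
  local c = math.sqrt(2 / math.pi)
  return 0.5 * x * (1 + tanh(c * (x + 0.044715 * x ^ 3)))
end

-- y = W x for a matrix W given as a list of rows and a vector x.
local function matvec(W, x)
  local y = {}
  for r = 1, #W do
    local acc = 0
    for c = 1, #x do acc = acc + W[r][c] * x[c] end
    y[r] = acc
  end
  return y
end

-- FFN(x) = W2 * GELU(W1 * x), no bias terms.
-- W1 has d_model * ffn_expand rows, W2 has d_model rows.
local function ffn(x, W1, W2)
  local h = matvec(W1, x)
  for i = 1, #h do h[i] = gelu(h[i]) end
  return matvec(W2, h)
end
```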
The LM head (output projection) shares weights with the token embedding matrix. This halves the parameter cost of the vocabulary, which matters when the embedding table is a large fraction of total parameters (as it is for our small models with a 320-token vocabulary).
The Mamba operator's SSM recurrence implicitly encodes positional information through its sequential state evolution. Explicit positional embeddings (sinusoidal or learned) are not used. This simplifies the architecture and removes a parameter/computation cost.
The Mamba operator is the sequence mixer based on selective state space models (Mamba-1, Gu & Dao 2023, arXiv:2312.00752). It uses a sequential recurrence whose per-token decode cost is O(1) in sequence length.
Given input x of shape (L, d_model):
1. Project:    z, x' = split(W_in(x))                    (W_in: [d_model → 2 × d_inner])
2. Conv1d:     x' = SiLU(DepthwiseConv1D(x', k=d_conv))  (causal, per-channel)
3. SSM proj:   Δ_raw, B, C = split(W_x(x'))              (W_x: [d_inner → dt_rank + 2 × d_state])
4. Δ project:  Δ = softplus(W_dt(Δ_raw) + bias)          (W_dt: [dt_rank → d_inner])
5. Discretize: A_bar = exp(Δ ⊙ A)                        (A: (d_inner × d_state), log-space)
               B_bar = Δ ⊙ B
6. SSM scan:   h[t] = A_bar[t] ⊙ h[t-1] + B_bar[t] ⊙ x'[t]
               y[t] = C[t] · h[t] + D · x'[t]
7. Gate:       y = y ⊙ SiLU(z)
8. Out proj:   out = W_out(y)                            (W_out: [d_inner → d_model])
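Steps 5 and 6 are the core of the operator. The sketch below runs them for a single channel in plain Lua, assuming Δ, B and C have already been computed for every position. The function name `ssm_scan_channel` and the table-based argument layout are inventions of this sketch, not the C API.

```lua
-- Sequential selective scan for one channel with d_state states.
--   x[t]    post-conv input x'[t] for this channel
--   dt[t]   step size Δ[t] (already passed through softplus)
--   A[n]    negative decay rates for this channel
--   B[t][n], C[t][n]  input-dependent projections
--   D       skip weight for this channel
local function ssm_scan_channel(x, dt, A, B, C, D)
  local d_state = #A
  local h = {}
  for n = 1, d_state do h[n] = 0 end
  local y = {}
  for t = 1, #x do
    local acc = 0
    for n = 1, d_state do
      local a_bar = math.exp(dt[t] * A[n]) -- step 5: A_bar = exp(Δ ⊙ A)
      local b_bar = dt[t] * B[t][n]        -- step 5: B_bar = Δ ⊙ B
      h[n] = a_bar * h[n] + b_bar * x[t]   -- step 6: state update
      acc = acc + C[t][n] * h[n]
    end
    y[t] = acc + D * x[t]                  -- step 6: output plus skip path
  end
  return y, h
end
```

The returned `h` is the state after the last position. Carrying it forward is what lets later tokens be processed one at a time without revisiting the prompt.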
| Parameter | Description | Default |
|---|---|---|
| expand | Inner expansion: d_inner = d_model × expand | 2 |
| d_state | SSM state dimensions per channel | 16 |
| d_conv | Short convolution kernel size | 4 |
| dt_rank | Δ projection bottleneck | ceil(d_model / 16) |
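As a quick check of the two derived quantities in the table, the following sketch computes d_inner and the default dt_rank from d_model and expand. The function name `derived_dims` is made up for this example.

```lua
-- Derived Mamba dimensions: d_inner = d_model * expand, dt_rank = ceil(d_model / 16).
local function derived_dims(d_model, expand)
  return {
    d_inner = d_model * expand,
    dt_rank = math.ceil(d_model / 16),
  }
end

local mini = derived_dims(128, 3)  -- d_inner = 384, dt_rank = 8
local small = derived_dims(192, 4) -- d_inner = 768, dt_rank = 12
```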
The transition matrix A uses the full (d_inner, d_state) parameterization
from Mamba-1. Each (channel, state) pair has its own learned decay rate.
A is stored in log-space as A_log and negated during discretization:
A = -exp(A_log). This ensures stability (all eigenvalues are negative,
producing decaying dynamics).
Initialization follows the S4D convention: A_log[c][n] = log(n + 1),
giving each state dimension a different initial timescale.
Each channel has a learned skip parameter D[c] that adds a direct path
from the SSM input to output: y[t][c] += D[c] × x'[t][c]. Initialized
to ones. A_log and D are excluded from weight decay during training.
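The initialization described above can be sketched as follows. It assumes the state index n in log(n + 1) is 0-based, so with Lua's 1-based tables the value becomes log(n). The names `init_ssm_params` and `decay_rate` are illustrative only.

```lua
-- S4D-style init: A_log[c][n] = log(n + 1) for 0-based n, D[c] = 1.
local function init_ssm_params(d_inner, d_state)
  local A_log, D = {}, {}
  for c = 1, d_inner do
    A_log[c] = {}
    for n = 1, d_state do
      A_log[c][n] = math.log(n) -- 1-based n here equals (0-based n) + 1
    end
    D[c] = 1.0
  end
  return A_log, D
end

-- A = -exp(A_log): always negative, so exp(Δ * A) lies in (0, 1) for Δ > 0.
local function decay_rate(A_log, c, n)
  return -math.exp(A_log[c][n])
end
```

Under this assumption the initial decay rates for each channel are -1, -2, ..., -d_state, so higher state indices forget faster.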
Mamba's SSM state enables efficient autoregressive generation:
Prompt processing: Full forward pass over the prompt populates
ssm_state and conv_state in each block.
Token generation: Each new token runs calm_mamba_step() per block,
updating the persistent state in O(d_model²) time — independent of
prompt length.
Multi-candidate: The state after prompt processing is snapshotted and restored for each candidate (~180 KB for Mini).
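The snapshot size can be estimated from the model configuration. The sketch below assumes each block keeps an fp32 ssm_state of d_inner × d_state values and an fp32 conv_state of d_inner × d_conv values. The exact conv_state layout is an assumption, although under it the result matches the ~180 KB figure for Mini. `state_snapshot_bytes` is a made-up helper name.

```lua
-- Estimated bytes needed to snapshot the recurrent state of all blocks.
local function state_snapshot_bytes(cfg)
  local d_inner = cfg.d_model * cfg.expand
  local ssm = d_inner * cfg.d_state  -- SSM hidden state per block
  local conv = d_inner * cfg.d_conv  -- assumed: d_conv cached inputs per channel
  return cfg.n_layers * (ssm + conv) * 4 -- fp32
end

local mini = { d_model = 128, n_layers = 6, expand = 3, d_state = 16, d_conv = 4 }
local kib = state_snapshot_bytes(mini) / 1024 -- 180
```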
All configurations use:
vocab_size: 320 (byte-level tokenizer: 256 byte + 22 special + 42 reserved).
See CALM_TOKENIZER.md for details.
l_max: 768 (context window)
tie_weights: true
d_state: 16, d_conv: 4
expand scales with model size: 2 (Nano/Micro), 3 (Mini), 4 (Small).
| Name | d_model | n_layers | expand | ffn_expand | Params | Weights (fp32) |
|---|---|---|---|---|---|---|
| Nano | 64 | 3 | 2 | 2 | ~283K | ~1.1 MB |
| Micro | 96 | 5 | 2 | 3 | ~741K | ~2.9 MB |
| Mini | 128 | 6 | 3 | 4 | ~2.16M | ~8.4 MB |
| Small | 192 | 8 | 4 | 4 | ~7.29M | ~28.4 MB |
Use calm benchmark to measure inference latency for each configuration
on your hardware.
All dimension limits are enforced either by header field sizes or by
explicit validation in calm_model_new(). The current limits are
well above what the four model configs require.
| Dimension | Type | Hard limit | Code limit | Current max use |
|---|---|---|---|---|
| d_model | uint16_t | 65,535 | none | 192 (Small) |
| n_layers | uint8_t | 255 | 16 (CALM_MAX_LAYERS) | 8 (Small) |
| ffn_expand | uint8_t | 255 | > 0 | 4 (Mini/Small) |
| l_max | uint16_t | 65,535 | none | 768 |
| vocab_size | uint16_t | 65,535 | none (default 320) | 320 |
| param_count | uint32_t | ~4.3B | none | ~7.29M (Small) |
| d_inner | — | — | = d_model × expand | 768 (Small) |
| d_state | uint8_t (header) | 255 | none | 16 |
| d_conv | uint8_t (header) | 255 | none | 4 |
| expand | uint8_t (header) | 255 | none | 4 (Small) |
| dt_rank | uint8_t (header) | 255 | none | 12 (Small) |
The CALM_MAX_LAYERS constant is in calm.h.
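For scripts that generate configurations, a pre-flight check mirroring the table can catch out-of-range values before a model is created. The sketch below is hypothetical and is not the validation code in calm_model_new(): the function name `check_config`, the lower bound of 1 on every field, and the dt_rank check (derived from the default ceil(d_model / 16) and the uint8_t header field) are assumptions.

```lua
local CALM_MAX_LAYERS = 16 -- value taken from the limits table

-- Returns true, or nil plus an error message.
local function check_config(cfg)
  local u16, u8 = 65535, 255
  local limits = {
    { "d_model", u16 }, { "l_max", u16 }, { "vocab_size", u16 },
    { "n_layers", CALM_MAX_LAYERS }, { "ffn_expand", u8 },
    { "d_state", u8 }, { "d_conv", u8 }, { "expand", u8 },
  }
  for _, limit in ipairs(limits) do
    local name, max = limit[1], limit[2]
    local v = cfg[name]
    if type(v) ~= "number" or v % 1 ~= 0 or v < 1 or v > max then
      return nil, name .. " must be an integer in 1.." .. max
    end
  end
  -- With the default dt_rank, d_model values above 4080 would overflow
  -- the one-byte dt_rank header field.
  if math.ceil(cfg.d_model / 16) > u8 then
    return nil, "dt_rank does not fit the uint8_t header field"
  end
  return true
end
```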
For a single completion request:
Build input sequence: Context frames + current input, tokenized per the tokenizer spec. Result: token ID array of length L ≤ l_max (768 by default).
Embedding lookup: Map token IDs to d_model-dimensional vectors.
Blocks: For each block, apply LayerNorm → Mamba → residual → LayerNorm → FFN → residual.
Output: Apply final LayerNorm, then LM head (= embedding matrix transposed) to get logits over vocabulary.
Sample/select: Apply temperature, top-k, or greedy selection to get the next token.
Repeat: Append predicted token to input, run forward pass again.
Stop at <EOS>, structural boundary, or token limit.
After the initial prompt forward pass, each generated token uses
calm_mamba_step() which runs in O(d_model²) time per block —
independent of prompt length. The SSM state and conv state persist across
tokens. For multi-candidate generation, the state after prompt processing
is snapshotted (~180 KB for Mini) and restored per candidate.
CALM supports greedy decoding and a multi-stage sampling pipeline,
configurable at inference time. Shell commands have low entropy at
most positions — after git c, there are only a few plausible
continuations.
When any sampler is enabled, token selection follows a sequential filtering pipeline:
1. Softmax: logits are converted to probabilities (with temperature scaling).
2. Min-p filtering (min_p): removes tokens whose probability is below max_prob × min_p. Adapts to the distribution shape: keeps more tokens when the model is uncertain, fewer when confident.
3. Top-k filtering (top_k): keeps only the k highest-probability tokens.
4. Top-p filtering (top_p, nucleus sampling): keeps tokens in descending probability order until cumulative probability exceeds the threshold. Adapts the candidate set size to the distribution.
5. Re-normalize and sample: the surviving probabilities are re-normalized and a token is drawn. If all probabilities were zeroed by the filters, selection falls back to argmax over the original logits.
All three filters compose and each can be independently disabled by
setting its value to 0. Greedy decoding (argmax) is used when all
three are disabled (top_k=0, top_p=0, min_p=0).
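The pipeline can be sketched in pure Lua as below. This is an illustration under stated assumptions, not the C sampler: the function names (`argmax`, `filter_probs`, `sample`) are invented, token indices are 1-based, and the top-p cumulative mass is measured on the softmax probabilities before re-normalization.

```lua
-- Index of the largest element (1-based).
local function argmax(v)
  local best = 1
  for i = 2, #v do
    if v[i] > v[best] then best = i end
  end
  return best
end

-- Softmax with temperature, then min-p, top-k, top-p and re-normalization.
-- Returns a probability table, or nil if every token was filtered out.
local function filter_probs(logits, opts)
  local n = #logits
  local temperature = opts.temperature or 1.0
  local top = logits[argmax(logits)]
  local probs, sum = {}, 0
  for i = 1, n do
    probs[i] = math.exp((logits[i] - top) / temperature) -- max-subtracted for stability
    sum = sum + probs[i]
  end
  for i = 1, n do probs[i] = probs[i] / sum end

  -- Token indices in descending probability order (ties broken by index).
  local order = {}
  for i = 1, n do order[i] = i end
  table.sort(order, function(a, b)
    if probs[a] ~= probs[b] then return probs[a] > probs[b] end
    return a < b
  end)

  local keep = {}
  for i = 1, n do keep[i] = true end
  local min_p, top_k, top_p = opts.min_p or 0, opts.top_k or 0, opts.top_p or 0
  if min_p > 0 then -- drop tokens below max_prob * min_p
    local cutoff = probs[order[1]] * min_p
    for i = 1, n do
      if probs[i] < cutoff then keep[i] = false end
    end
  end
  if top_k > 0 then -- keep only the k most probable tokens
    for rank = top_k + 1, n do keep[order[rank]] = false end
  end
  if top_p > 0 then -- keep tokens until the cumulative mass exceeds top_p
    local cum = 0
    for rank = 1, n do
      if cum > top_p then keep[order[rank]] = false end
      cum = cum + probs[order[rank]]
    end
  end

  local total = 0
  for i = 1, n do
    if not keep[i] then probs[i] = 0 end
    total = total + probs[i]
  end
  if total <= 0 then return nil end
  for i = 1, n do probs[i] = probs[i] / total end
  return probs
end

-- Pick a token index; `rand` is an injectable source of uniform [0, 1) numbers.
local function sample(logits, opts, rand)
  local min_p, top_k, top_p = opts.min_p or 0, opts.top_k or 0, opts.top_p or 0
  if min_p == 0 and top_k == 0 and top_p == 0 then return argmax(logits) end
  local probs = filter_probs(logits, opts)
  if not probs then return argmax(logits) end -- fallback to argmax over logits
  local r, cum = (rand or math.random)(), 0
  for i = 1, #probs do
    cum = cum + probs[i]
    if r < cum then return i end
  end
  return argmax(probs) -- guards against rounding when r is close to 1
end
```

Passing `rand` explicitly keeps the sketch deterministic in tests; leaving it out uses `math.random`.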
The default completion source uses top_k=5 with temperature=0.8.
Shell command completion is CALM's primary interactive domain. The shell domain model uses session context (working directory, git status, command history, traditional completions) to predict command continuations.
CALM is managed through the calm shell builtin:
| Command | Description |
|---|---|
| calm status | Show model status, architecture info, sampler defaults |
| calm enable | Enable CALM predictions |
| calm disable | Disable CALM predictions |
| calm init [--size S] | Initialize a new model with random weights |
| calm dataset [-n N] | Build CTDS dataset from shell history |
| calm dataset --from FILE [-o PATH] | Convert text training data to CTDS |
| calm dataset --view [-i N] [-c N] | View CTDS dataset contents |
| calm train [options] | Train the model (foreground) |
| calm evaluate [-m MODEL] [-d DATASET] | Evaluate model loss on a dataset |
| calm benchmark | Measure inference latency across model sizes |
| calm generate [-m MODEL] [-i INPUT] [options] | One-shot completion generation (default --max-tokens 256) |
| calm model [name\|auto] | List available models or switch the active model |
| calm meta [options] | View or modify model metadata and sampler defaults |
| calm reset | Delete model files |
CALM completions are disabled by default. Run calm enable to activate
after initializing and training a model.
calm init creates a model with random weights at
~/.local/share/lilush/calm/<domain>.cwgt (e.g. shell.cwgt). The
--size option selects the model configuration: nano (default),
micro, mini, or small. It does not download a pretrained model.
calm generate runs one-shot completion on the given input. Input is
provided via --input / -i or read from stdin (pipe-friendly). Accepts
--top-k, --top-p / -p, --min-p, --temperature, --max-tokens,
--candidates, and --raw / -r options. In raw mode, <NAME> patterns
in the input are replaced with the corresponding special token IDs
(e.g. <BOS>, <CMD>, <ATN>), giving full control over the token sequence.
By default, output includes model info, scored candidates, and generation
stats (tokens, time, tok/s). Use --quiet / -q to output only the
completion text. Use --full to prepend the prompt to each completion.
Use --special-tokens / -s to render special tokens as <BOS> etc.
instead of stripping them.
For calm dataset and calm train CLI details, see
CALM Training and CALM Dataset.
CALM acts as a completion source (src/shell/shell/completion/source/calm.lua)
that provides ghost text predictions alongside traditional completions.
Lazy loading: The model is loaded on the first completion request, not at shell startup, to avoid startup latency from reading the weight file.
Hot-reload: On each completion request, the weight file's mtime is
checked. If the file has changed (e.g., after calm train), the model
is automatically reloaded.
Model resolution: The active model is resolved via the model
registry (~/.config/lilush/calm/registry.json). The default model
name is shell. Override with the LILUSH_CALM_MODEL environment
variable (takes an absolute path). Use calm model <name> to switch
manually, or calm model auto to re-enable mode-based auto-switching.
Template-driven prompt construction: The completion source inspects the loaded model's template to decide how to build the prompt. Context gathering is driven by the template's field names, not the domain name:
If the template includes cwd, git, history, completions, or
env fields, the corresponding shell context is gathered automatically
(working directory, git status, command history, traditional completion
candidates, environment hints).
The input field receives the raw user input.
For templates with no named fields (bare tokens only, e.g. BOS;ATN),
the raw input text is appended after <ATN>.
For templates with unrecognized field names, the completion source
parses the input for inline field:value patterns. For example,
with template BOS;WORD:headword/POS:pos;ATN and input
pos:n. headword:anything you want, the fields are parsed as
pos="n." and headword="anything you want". If no field patterns
are found in the input, the entire input is mapped to the input
field (if present in the template) or the first named field.
This design means any domain model that includes context fields in its
template (e.g. CWD:cwd;GIT:git) automatically gets that context
without domain-specific code in the completion source.
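The inline field:value parsing described above can be approximated as follows. This is a sketch of the behaviour as documented, not the code in calm.lua: the function names, the rule that a marker must start the input or follow whitespace, and the trimming of values are all assumptions.

```lua
-- Trim leading and trailing whitespace.
local function trim(s)
  return (s:match("^%s*(.-)%s*$"))
end

-- Split `input` into values for the given template field names.
local function parse_fields(input, field_names)
  -- Locate every "name:" marker for a known field name.
  local markers = {}
  for _, name in ipairs(field_names) do
    local init = 1
    while true do
      local s, e = input:find(name .. ":", init, true) -- plain-text search
      if not s then break end
      -- Assumed rule: a marker must start the input or follow whitespace.
      if s == 1 or input:sub(s - 1, s - 1):find("%s") then
        markers[#markers + 1] = { name = name, s = s, e = e }
      end
      init = e + 1
    end
  end
  table.sort(markers, function(a, b) return a.s < b.s end)

  -- Each value runs from its marker to the start of the next marker.
  local fields = {}
  for i, m in ipairs(markers) do
    local next_marker = markers[i + 1]
    local stop = next_marker and (next_marker.s - 1) or #input
    fields[m.name] = trim(input:sub(m.e + 1, stop))
  end

  -- No markers: map everything to "input" if present, else the first field.
  if #markers == 0 then
    local target = field_names[1]
    for _, name in ipairs(field_names) do
      if name == "input" then target = name end
    end
    if target then fields[target] = input end
  end
  return fields
end

local parsed = parse_fields("pos:n. headword:anything you want", { "headword", "pos" })
-- parsed.pos == "n.", parsed.headword == "anything you want"
```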
Sampler configuration: Inference parameters are resolved from two sources (highest priority first): model sampler defaults (stored in the CWGT header), and hardcoded fallbacks.
| Parameter | Hardcoded fallback |
|---|---|
| top_k | 5 |
| top_p | 0 (disabled) |
| min_p | 0 (disabled) |
| temperature | 0.7 |
| max_tokens | 20 |
| num_candidates | 3 |
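The two-level resolution can be sketched as a simple merge. The assumption here is that a sampler default missing from the CWGT header surfaces as nil in Lua. The `FALLBACKS` table copies the values above, and `resolve_sampler` is a made-up name.

```lua
local FALLBACKS = {
  top_k = 5, top_p = 0, min_p = 0,
  temperature = 0.7, max_tokens = 20, num_candidates = 3,
}

-- Model sampler defaults win over the hardcoded fallbacks.
local function resolve_sampler(model_defaults)
  local resolved = {}
  for name, fallback in pairs(FALLBACKS) do
    local v = model_defaults and model_defaults[name]
    -- Compare against nil rather than truthiness in case a value is ever false;
    -- note that 0 is a valid setting meaning "filter disabled".
    if v ~= nil then
      resolved[name] = v
    else
      resolved[name] = fallback
    end
  end
  return resolved
end
```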
| Path | Purpose |
|---|---|
| ~/.local/share/lilush/calm/*.cwgt | Model weight files (one per domain) |
| ~/.local/share/lilush/calm/train.ctds | Default training dataset |
| ~/.config/lilush/calm/registry.json | Model registry (mode→model mapping) |
CALM training data is sourced from ~/.local/share/lilush/shell.mneme,
the shell's main MNEME database. The calm dataset command reads
history from the shell keyspace (sorted set entries) and completion
candidates from the completions keyspace.
For detailed tokenizer documentation, see CALM_TOKENIZER.md.
local calm = require("calm")
-- Tokenize text (byte-level identity mapping)
local tokens = calm.tokenize("git commit -m 'fix'")
-- Returns byte values: {103, 105, 116, 32, 99, 111, ...}
-- Decode tokens back to text
local text = calm.detokenize(tokens)
-- Build a full input sequence with context
-- First argument is a model userdata or template spec string
local seq = calm.build_sequence(model, {
  cwd = "/home/user/lilush",
  git = "main+3",
  history = {
    { cmd = "git diff --stat", exit = 0 },
    { cmd = "git status", exit = 0 },
  },
  completions = { "commit", "checkout", "cherry-pick", "clone" },
  input = "git com",
})
local calm = require("calm")
-- Load model from weight file
local model, err = calm.load_model("/path/to/weights.cwgt")
if not model then
  error(err) -- loading failed; err describes why
end
-- Initialize a new model with random weights
local model = calm.new_model({
  d_model = 128, n_layers = 6, ffn_expand = 4,
  expand = 2, d_state = 16, d_conv = 4,
  l_max = 768,
})
-- Get model info
local info = model:info()
-- info.d_state, info.d_conv, info.expand, info.dt_rank, info.d_inner = d_model*expand
-- Run inference: generate completions
-- seq is a token ID array from calm.build_sequence()
local completions = model:complete(seq, {
  max_tokens = 20,            -- max tokens to generate
  top_k = 5,                  -- top-k sampling (0 = disabled)
  top_p = 0.9,                -- nucleus (top-p) sampling (0.0 = disabled)
  min_p = 0.05,               -- min-p filtering (0.0 = disabled)
  temperature = 0.8,          -- sampling temperature (1.0 = neutral)
  use_stop_conditions = true, -- apply model's stop conditions
  num_candidates = 3,         -- number of independent completions to generate
})
-- completions = {
-- { tokens = {...}, text = "commit -m \"", score = -2.3 },
-- { tokens = {...}, text = "checkout ", score = -3.1 },
-- { tokens = {...}, text = "clone ", score = -4.7 },
-- }
-- Single forward pass — returns logits for last position only [vocab_size]
local logits = model:forward(token_ids)
-- Forward pass with loss — returns loss value
local loss = model:forward_loss(token_ids)
-- Generate embeddings — returns d_model-dimensional vector
local embedding = model:embed(token_ids, {
  pool = "mean",    -- pooling: "mean" or "last" (default: "mean")
  normalize = true, -- L2-normalize output (default: false)
})
-- Save model weights
model:save("/path/to/weights.cwgt")
-- Unload model, free memory
model:close()
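For semantic search, the vectors returned by model:embed() would typically be compared by cosine similarity. The helper below is a self-contained sketch that works on plain Lua arrays. `dot`, `cosine` and `rank` are not part of the calm module, and the candidate table shape ({ id = ..., vec = ... }) is an assumption of this example.

```lua
local function dot(a, b)
  local acc = 0
  for i = 1, #a do acc = acc + a[i] * b[i] end
  return acc
end

-- Cosine similarity; returns 0 if either vector has zero length.
local function cosine(a, b)
  local denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
  if denom == 0 then return 0 end
  return dot(a, b) / denom
end

-- Rank candidates ({ id = ..., vec = ... }) by similarity to the query vector.
local function rank(query, candidates)
  local scored = {}
  for i, c in ipairs(candidates) do
    scored[i] = { id = c.id, score = cosine(query, c.vec) }
  end
  table.sort(scored, function(a, b) return a.score > b.score end)
  return scored
end
```

When embeddings are produced with normalize = true, both vectors already have unit length, so `dot(a, b)` alone gives the same ranking.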
src/calm/
  calm.h        -- Public C API (model + tokenizer + mixer types)
  calm_lua.c    -- Lua bindings
  tokenizer.c   -- Byte-level tokenizer (see CALM_TOKENIZER.md)
  tensor.c      -- Tensor operations: matmul, elementwise, SiLU, softplus
  layernorm.c   -- LayerNorm forward + backward
  mamba.c       -- Mamba operator: SSM scan, state-cached decode,
                   forward + backward + weight init
  model.c       -- Full model: embedding, blocks, mixer dispatch,
                   LM head, forward pass, state-cached generation
  train.c       -- Backward pass, loss, optimizer (Adam/SGD),
                   EWC, contrastive training, selective weight decay
  serialize.c   -- Weight save/load (CWGT v5 format)
  Makefile
The model uses three memory regions:
Weights: Allocated once at load time (or init time for training
from scratch). Size = param_count × 4 bytes (fp32). Laid out as a
single contiguous block in a fixed order matching the serialization
format.
Activations: Allocated once at init time, sized for the maximum sequence length. Reused across forward passes. For inference, this is the working memory for intermediate results. For training, this includes activation storage needed for the backward pass.
Optimizer state (training only): Allocated when training starts.
Adam requires 2 × param_count × 4 bytes for moment estimates. Freed
after training completes.
No per-call heap allocation. All buffers are pre-allocated based on model config.
Mini (d_model=128, d_inner=384, d_ffn=512):
Shared buffers (in calm_activations_t, reused across blocks):
| Buffer | Shape | Size |
|---|---|---|
| residual | 768 × 128 | 384 KB |
| ln_out | 768 × 128 | 384 KB |
| ffn_mid | 768 × 512 | 1.5 MB |
| logits | 768 × 320 | 960 KB |
Mamba buffers (in calm_mamba_activations_t, reused across blocks):
| Buffer | Shape | Size |
|---|---|---|
| z (gate branch) | 768 × 384 | 1.1 MB |
| x (SSM branch) | 768 × 384 | 1.1 MB |
| x_conv (transposed) | 384 × 768 | 1.1 MB |
| conv_out | 384 × 768 | 1.1 MB |
| x_post (after conv+SiLU) | 768 × 384 | 1.1 MB |
| ssm_proj | 768 × 40 | 120 KB |
| dt (discretized Δ) | 768 × 384 | 1.1 MB |
| y (SSM output) | 768 × 384 | 1.1 MB |
| mixer_out (after gating) | 768 × 384 | 1.1 MB |
| in_proj_out | 768 × 768 | 2.3 MB |
| A (precomputed) | 384 × 16 | 24 KB |
| Total (approx) | | ~13 MB |
All buffers are reused across blocks — only one block's activations are live at a time during inference.
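Individual buffer sizes follow directly from shape × 4 bytes (fp32). The sketch below recomputes a few rows of the tables above from a configuration. `activation_buffers` is a hypothetical helper, and only buffers whose shapes are stated in the tables are included.

```lua
-- Sizes in bytes of selected fp32 activation buffers, sized for l_max positions.
local function activation_buffers(cfg)
  local d_inner = cfg.d_model * cfg.expand
  local d_ffn = cfg.d_model * cfg.ffn_expand
  local dt_rank = math.ceil(cfg.d_model / 16)
  local L = cfg.l_max
  return {
    residual = L * cfg.d_model * 4,
    ffn_mid = L * d_ffn * 4,
    logits = L * cfg.vocab_size * 4,
    z = L * d_inner * 4,
    ssm_proj = L * (dt_rank + 2 * cfg.d_state) * 4,
    in_proj_out = L * 2 * d_inner * 4,
  }
end

local mini = { d_model = 128, expand = 3, ffn_expand = 4, l_max = 768,
  vocab_size = 320, d_state = 16 }
local sizes = activation_buffers(mini)
-- sizes.residual / 1024 == 384 (KB), sizes.ffn_mid / 1024^2 == 1.5 (MB)
```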
The model requires a small set of operations, all operating on contiguous fp32 arrays:
| Operation | Usage | Notes |
|---|---|---|
| MatMul (A × B) | Projections, FFN, LM head | Inner loop: fused multiply-accumulate |
| Element-wise multiply | Gating | Auto-vectorizes |
| Element-wise add | Residual connections | |
| GELU | FFN activation | Approximate: 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))) |
| SiLU | Mamba conv activation + gating | x × sigmoid(x) |
| Softplus | Mamba Δ projection | log(1 + exp(x)) with overflow guard |
| LayerNorm | Pre-norm in each block, final norm | Mean + variance + normalize + scale/shift |
| Softmax | Output sampling | Applied to final logits, 320-wide |
| Depthwise Conv1D | Short filter (Mamba) | Kernel size 4, causal |
| Cross-entropy loss | Training | Log-softmax + NLL |
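Two of these operations need care with large inputs. The sketch below shows one common way to write SiLU and softplus so that exp() never overflows. The threshold of 20 and the function names are assumptions of this sketch, and the guard used in tensor.c may differ.

```lua
-- Sigmoid written so that exp() is only ever called on non-positive values.
local function sigmoid(x)
  if x >= 0 then
    return 1 / (1 + math.exp(-x))
  end
  local e = math.exp(x)
  return e / (1 + e)
end

-- SiLU(x) = x * sigmoid(x)
local function silu(x)
  return x * sigmoid(x)
end

-- softplus(x) = log(1 + exp(x)); for large x this is x to within fp32 precision,
-- so return x directly instead of evaluating exp(x).
local function softplus(x)
  if x > 20 then return x end
  return math.log(1 + math.exp(x))
end
```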
Weights are serialized in a fixed order matching the model structure. See CWGT v5 format for the binary file specification.
token_emb.weight                 [vocab_size × d_model]
for each block i = 0..n_layers-1:
  block[i].ln1.weight            [d_model]
  block[i].ln1.bias              [d_model]
  block[i].mixer.in_proj         [d_model × (2 × d_inner)]
  block[i].mixer.conv1d          [d_inner × d_conv]
  block[i].mixer.x_proj          [d_inner × (dt_rank + 2 × d_state)]
  block[i].mixer.dt_proj_w       [dt_rank × d_inner]
  block[i].mixer.dt_proj_b       [d_inner]
  block[i].mixer.A_log           [d_inner × d_state]
  block[i].mixer.D               [d_inner]
  block[i].mixer.out_proj        [d_inner × d_model]
  block[i].ln2.weight            [d_model]
  block[i].ln2.bias              [d_model]
  block[i].ffn_fc1.weight        [d_model × (d_model × ffn_expand)]
  block[i].ffn_fc2.weight        [(d_model × ffn_expand) × d_model]
ln_f.weight                      [d_model]
ln_f.bias                        [d_model]
(lm_head.weight tied to token_emb.weight, not serialized)