CALM Training

Model Creation

From scratch (random weights):

local model = calm.new_model({
    d_model    = 128,
    n_layers   = 6,
    ffn_expand = 4,
    expand     = 2,        -- d_inner = d_model × expand (default 2)
    d_state    = 16,       -- SSM state dimensions (default 16)
    d_conv     = 4,        -- short conv kernel size (default 4)
    l_max      = 768,
    seed       = 42,
    domain     = "shell",
    template   = "BOS;CWD:cwd;GIT:git;...;ATN;CMD:input",
    stop_conditions = "| ; && ||",
    sampler_defaults = {
        temperature = 0.8, top_k = 5,
        max_tokens = 20, num_candidates = 5,
    },
})

From existing weights:

local model = calm.load_model("/path/to/weights.cwgt")

This loads the weight file, validates the header, allocates activations, and prepares the model. The model is immediately ready for inference or further training.

Trainer Configuration

local trainer = calm.trainer(model, {
    lr           = 1e-4,   -- learning rate
    optimizer    = "adam",  -- "adam" or "sgd"
    beta1        = 0.9,    -- Adam first moment decay
    beta2        = 0.999,  -- Adam second moment decay
    eps          = 1e-8,   -- Adam epsilon
    weight_decay = 0.01,   -- decoupled weight decay (AdamW)
    grad_clip    = 1.0,    -- max gradient norm (0 = no clipping)
    ewc_lambda   = 0.0,    -- EWC regularization strength (0 = disabled)
})

The trainer allocates all required buffers on creation:

Buffer	Size (Mini config, L=768)
Gradient buffer	~7 MB
Adam m + v	~14 MB
Training activations	~15 MB
Backward scratch	~3 MB

All buffers are freed when trainer:close() is called.

Adam (default) converges faster and is recommended for most training. It keeps two moment buffers (m and v), so optimizer state takes ~2x the weight memory, or ~3x once the gradient buffer is counted.

SGD is simpler and keeps no state beyond the gradient buffer. It requires higher learning rates (typically 10-100x Adam's LR) and more steps to converge, and is useful for quick fine-tuning under tight memory constraints.
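To make the memory and learning-rate trade-off concrete, here is a minimal pure-Lua sketch of one SGD step next to one AdamW step. It assumes the textbook update rules; the function names are hypothetical and are not part of the calm API, and the actual implementation may differ in detail.

```lua
-- Hypothetical helpers for illustration; not part of the calm API.

-- One SGD step: no state beyond the gradient itself.
local function sgd_step(theta, grad, lr)
    for i = 1, #theta do
        theta[i] = theta[i] - lr * grad[i]
    end
end

-- Adam state: two extra buffers (m, v), each the size of the weights.
local function adam_state(n)
    local s = { m = {}, v = {}, t = 0 }
    for i = 1, n do s.m[i] = 0; s.v[i] = 0 end
    return s
end

-- One AdamW step (textbook form, decoupled weight decay).
local function adamw_step(theta, grad, s, cfg)
    s.t = s.t + 1
    local c1 = 1 - cfg.beta1 ^ s.t   -- bias correction, first moment
    local c2 = 1 - cfg.beta2 ^ s.t   -- bias correction, second moment
    for i = 1, #theta do
        s.m[i] = cfg.beta1 * s.m[i] + (1 - cfg.beta1) * grad[i]
        s.v[i] = cfg.beta2 * s.v[i] + (1 - cfg.beta2) * grad[i] * grad[i]
        local m_hat = s.m[i] / c1
        local v_hat = s.v[i] / c2
        local update = m_hat / (math.sqrt(v_hat) + cfg.eps)
        -- Decay acts on the weight directly, not through the gradient.
        theta[i] = theta[i] - cfg.lr * (update + cfg.weight_decay * theta[i])
    end
end

local cfg = { lr = 1e-4, beta1 = 0.9, beta2 = 0.999, eps = 1e-8, weight_decay = 0.0 }
local theta_adam, theta_sgd = { 1.0 }, { 1.0 }
local state = adam_state(1)
adamw_step(theta_adam, { 0.001 }, state, cfg)   -- moves by roughly lr
sgd_step(theta_sgd, { 0.001 }, cfg.lr)          -- moves by lr * grad
```

With the same learning rate and a small gradient, the Adam step is close to lr in size while the SGD step is lr times the gradient, which is one way to see why SGD usually needs a much larger learning rate.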

Loss Function

Standard cross-entropy loss over next-token prediction:

loss = CrossEntropy(logits[:-1, :], target_ids[1:])

Loss is computed on all tokens after the <ATN> boundary. The cmd_pos value stores the 0-indexed position of <ATN> in the sequence. The loss loop starts at this position and predicts tokens[cmd_pos + 1] onward, so every token after <ATN> is a prediction target. If <ATN> is absent, cmd_pos is 0 and loss covers the entire sequence (full-sequence mode).

For shell datasets, <ATN> appears immediately before <CMD>, so <CMD> itself is the first prediction target, followed by the command bytes. For other domains (e.g. dictionary), the first content token after <ATN> is the first target. The mechanism is the same in both cases: <ATN> marks where loss begins.
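The masking rule can be sketched in pure Lua as follows. This is an illustration only: the layout of the logits table, the 0-based token IDs, and the per-token averaging are assumptions made for the example, and the function names are hypothetical rather than part of the calm API.

```lua
-- Negative log-probability of one target, via a numerically stable log-softmax.
local function neg_log_prob(row, target_idx)
    local max = -math.huge
    for i = 1, #row do if row[i] > max then max = row[i] end end
    local sum = 0
    for i = 1, #row do sum = sum + math.exp(row[i] - max) end
    return -(row[target_idx] - max - math.log(sum))
end

-- tokens:  Lua array of token IDs (IDs assumed 0-based).
-- logits:  logits[p + 1] is the vocabulary row emitted at 0-indexed position p.
-- cmd_pos: 0-indexed position of <ATN>, or 0 for full-sequence loss.
local function masked_loss(logits, tokens, cmd_pos)
    local total, count = 0, 0
    for p = cmd_pos, #tokens - 2 do
        local target = tokens[p + 2]   -- token at 0-indexed position p + 1
        total = total + neg_log_prob(logits[p + 1], target + 1)
        count = count + 1
    end
    if count == 0 then return 0, 0 end
    return total / count, count
end

-- Uniform logits over a 4-token vocabulary: every target costs ln(4).
local tokens = { 1, 3, 0, 2, 1 }
local logits = {}
for p = 1, #tokens do logits[p] = { 0, 0, 0, 0 } end

local loss, n_targets = masked_loss(logits, tokens, 2)         -- <ATN> at position 2
local full_loss, n_full = masked_loss(logits, tokens, 0)       -- full-sequence mode
```

With cmd_pos = 2 only the two tokens after position 2 contribute; with cmd_pos = 0 all four next-token predictions do.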

Training Loop

Single batch step:

local loss = trainer:step(sequences, cmd_positions)

sequences: table of token tables, e.g. { {1,3,8,20,2}, {1,3,8,30,40,2} }
cmd_positions: table of 0-indexed <ATN> positions (or 0 for full-sequence loss), e.g. { 1, 1 }
Returns: average cross-entropy loss over the batch

Each step() call:

1. Zeros all gradients
2. Runs forward + backward for each sequence, accumulating gradients
3. Averages gradients over the batch
4. Adds EWC penalty gradient (if ewc_lambda > 0 and Fisher is computed)
5. Clips gradients to grad_clip norm
6. Applies optimizer update (Adam or SGD)
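The averaging and clipping stages of a step() call can be sketched in pure Lua as below. This assumes clipping by the global L2 norm across all parameters, which is the usual meaning of a "max gradient norm" option; the function names are hypothetical and are not calm API calls.

```lua
-- Average per-sequence gradients into one batch gradient.
local function average_grads(per_seq)
    local avg = {}
    for i = 1, #per_seq[1] do
        local s = 0
        for b = 1, #per_seq do s = s + per_seq[b][i] end
        avg[i] = s / #per_seq
    end
    return avg
end

-- Scale grad in place so its L2 norm does not exceed max_norm.
-- max_norm = 0 disables clipping, mirroring the grad_clip option.
-- Returns the norm measured before clipping.
local function clip_grad_norm(grad, max_norm)
    local sq = 0
    for i = 1, #grad do sq = sq + grad[i] * grad[i] end
    local norm = math.sqrt(sq)
    if max_norm > 0 and norm > max_norm then
        local scale = max_norm / norm
        for i = 1, #grad do grad[i] = grad[i] * scale end
    end
    return norm
end

local g = average_grads({ { 2, 4 }, { 4, 4 } })   -- averaged gradient {3, 4}
local norm_before = clip_grad_norm(g, 1.0)        -- rescaled to unit norm

local untouched = { 3, 4 }
clip_grad_norm(untouched, 0)                      -- 0 means no clipping
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length changes.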

Training from scratch:

local calm = require("calm")

local model = calm.new_model({
    d_model = 128, n_layers = 6, ffn_expand = 4,
    expand = 2, d_state = 16, d_conv = 4,
    l_max = 768, seed = 42,
})

local trainer = calm.trainer(model, {
    lr = 1e-3, optimizer = "adam",
    weight_decay = 0.01, grad_clip = 1.0,
})

local ds = calm.load_dataset("/path/to/train.ctds")

local num_epochs = 10
local batch_size = 32
local batches_per_epoch = math.ceil(ds:count() / batch_size)

for epoch = 1, num_epochs do
    ds:shuffle()
    local epoch_loss, n_batches = 0, 0
    for b = 0, batches_per_epoch - 1 do
        local seqs, cmds = ds:batch(b, batch_size)
        if #seqs == 0 then break end
        local loss = trainer:step(seqs, cmds)
        epoch_loss = epoch_loss + loss
        n_batches = n_batches + 1
    end
    -- Average over the batches that actually ran, in case the loop broke early.
    print(string.format("epoch %d  avg_loss=%.4f", epoch, epoch_loss / math.max(n_batches, 1)))
end

trainer:save("/path/to/trained.cwgt")
trainer:close()
model:close()
ds:close()

Fine-tuning existing weights:

Same as above, but start from calm.load_model() instead of calm.new_model(), and use a lower learning rate:

local model = calm.load_model("/path/to/base_model.cwgt")
local trainer = calm.trainer(model, {
    lr = 5e-5, optimizer = "adam",
    weight_decay = 0.01, grad_clip = 1.0,
})
-- ... training loop ...

EWC Regularization

Elastic Weight Consolidation prevents catastrophic forgetting during fine-tuning. When the model fine-tunes on new user history, EWC penalizes large changes to parameters that were important for previously learned patterns.

loss_ewc = (λ / 2) × Σ_i F_i × (θ_i - θ*_i)²

Where F_i is the Fisher information (diagonal approximation) for parameter i, θ*_i is the parameter value after previous training, and λ controls the regularization strength.

Workflow:

-- 1. Train on initial data
local model = calm.load_model("base.cwgt")
local trainer = calm.trainer(model, { lr = 1e-4, optimizer = "adam" })

for epoch = 1, 5 do
    -- ... training loop on dataset A ...
end

-- 2. Compute Fisher information and save anchor
trainer:compute_fisher(validation_seqs, validation_cmd_pos)

-- 3. Save model (includes EWC data: Fisher diagonal + anchor weights)
trainer:save("model_with_ewc.cwgt")
trainer:close()

-- 4. Later: fine-tune on new data with EWC protection
local model2 = calm.load_model("model_with_ewc.cwgt")
local trainer2 = calm.trainer(model2, {
    lr = 1e-4, optimizer = "adam",
    ewc_lambda = 10.0,  -- regularization strength
})

for epoch = 1, 5 do
    -- ... training loop on dataset B ...
    -- EWC penalty automatically prevents drift from anchor
end

trainer2:save("model_updated.cwgt")
trainer2:close()
model2:close()

After compute_fisher(), the Fisher information diagonal F[i] is stored per parameter, approximating how important each parameter is for the current task. The current weights are saved as the "anchor" theta*. Both are written into the weight file when saved.
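How compute_fisher() estimates F[i] is not spelled out here. A common choice is the empirical diagonal Fisher, the mean of squared per-sequence gradients, and the sketch below assumes that choice. The input table of per-sequence gradients and the function names are hypothetical; calm computes these internally.

```lua
-- per_seq_grads[k][i]: gradient of the loss on validation sequence k
-- with respect to parameter i (hypothetical input for illustration).
local function fisher_diagonal(per_seq_grads)
    local n = #per_seq_grads
    local fisher = {}
    for i = 1, #per_seq_grads[1] do
        local acc = 0
        for k = 1, n do
            local g = per_seq_grads[k][i]
            acc = acc + g * g
        end
        fisher[i] = acc / n   -- mean squared gradient
    end
    return fisher
end

-- Snapshot of the current weights, used as the anchor theta*.
local function copy_anchor(theta)
    local anchor = {}
    for i = 1, #theta do anchor[i] = theta[i] end
    return anchor
end

local fisher = fisher_diagonal({ { 1.0, 0.0 }, { 3.0, 0.0 } })
local theta = { 0.5, -0.25 }
local anchor = copy_anchor(theta)
theta[1] = 0.75   -- later training moves theta; the anchor stays fixed
```

A parameter whose gradient is consistently large on the validation data gets a large F[i] and is held close to its anchor; a parameter with zero gradient gets F[i] = 0 and stays free to move.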

During subsequent training with ewc_lambda > 0, each optimizer step adds a penalty gradient:

grad[i] += ewc_lambda * F[i] * (theta[i] - theta*[i])
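The penalty gradient above translates directly into a loop over parameters. The following pure-Lua sketch uses flat tables for the gradient, weights, anchor, and Fisher diagonal; the function name and the flat layout are assumptions for illustration.

```lua
-- Add the EWC penalty gradient in place:
--   grad[i] += lambda * F[i] * (theta[i] - anchor[i])
local function ewc_add_grad(grad, theta, anchor, fisher, lambda)
    if lambda <= 0 then return end   -- ewc_lambda = 0 disables EWC
    for i = 1, #grad do
        grad[i] = grad[i] + lambda * fisher[i] * (theta[i] - anchor[i])
    end
end

local grad   = { 0.1, 0.1 }
local theta  = { 1.5, 2.0 }
local anchor = { 1.0, 1.0 }
local fisher = { 4.0, 0.0 }   -- parameter 1 is important, parameter 2 is not

ewc_add_grad(grad, theta, anchor, fisher, 10.0)
```

Parameter 1 has drifted 0.5 from its anchor with F = 4, so it receives an extra gradient of 10 × 4 × 0.5 = 20 pulling it back. Parameter 2 has drifted further but has F = 0, so its gradient is unchanged.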

Choosing ewc_lambda:

Value	Behavior
0	No EWC (free adaptation)
0.1 - 1.0	Mild: allows substantial adaptation
1.0 - 10.0	Moderate: balances old and new
10.0 - 100.0	Strong: heavily preserves old behavior
> 100.0	Very strong: new data barely changes the model

Start with ewc_lambda = 10.0 and adjust based on whether the model retains enough of its original capability.

Recommended optimizer and learning rate by scenario:

Scenario	Optimizer	LR	Notes
From scratch (Nano/Micro)	Adam	1e-3	Small models tolerate higher LR
From scratch (Mini/Small)	Adam	3e-4	Larger models need lower LR
Fine-tuning	Adam	1e-4 to 5e-5	Lower to avoid forgetting
Quick adaptation	SGD	0.01	Few steps, coarse updates

Weight Initialization

Embedding weights: Normal distribution, std = initializer_range (0.02)
Linear layer weights: Normal distribution, std = initializer_range
LayerNorm weights: initialized to 1.0
LayerNorm biases: initialized to 0.0
Residual-path outputs (mixer out_proj, ffn_fc2): scaled init, std = initializer_range / √(2 × n_layers) (GPT-2 style)
dt_proj_weight: Uniform(-σ, σ), σ = dt_rank^(-0.5)
dt_proj_bias: inverse softplus of log-uniform in [0.001, 0.1] (critical for training stability — controls initial Δ timescales)
A_log: S4D convention: log(arange(1, d_state+1)) repeated per channel
D: initialized to ones (identity skip connection)
A_log and D are excluded from weight decay during training
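Two of the less obvious initializers, dt_proj_bias and A_log, can be sketched in pure Lua. The function names and the rng argument are hypothetical; the sketch only follows the rules listed above and is not the library's own code.

```lua
local function softplus(x)
    return math.log(1 + math.exp(x))
end

-- Inverse of softplus: returns x such that softplus(x) == dt (for dt > 0).
local function inverse_softplus(dt)
    return dt + math.log(1 - math.exp(-dt))
end

-- Sample dt log-uniformly in [dt_min, dt_max], then store the pre-activation
-- value so that softplus(bias) reproduces the sampled timescale.
local function init_dt_bias(n, dt_min, dt_max, rng)
    local bias = {}
    local lo, hi = math.log(dt_min), math.log(dt_max)
    for i = 1, n do
        local u = rng()                       -- uniform in [0, 1)
        local dt = math.exp(lo + u * (hi - lo))
        bias[i] = inverse_softplus(dt)
    end
    return bias
end

-- One A_log row, shared by every channel: log(1), log(2), ..., log(d_state).
local function init_a_log_row(d_state)
    local row = {}
    for n = 1, d_state do row[n] = math.log(n) end
    return row
end

math.randomseed(42)
local bias = init_dt_bias(8, 0.001, 0.1, math.random)
local a_row = init_a_log_row(16)
```

Storing the inverse-softplus value means the initial Δ after the softplus activation lands in [0.001, 0.1], which is the property the stability note above depends on.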

Practical Tips

Batch size: Batch size 1 works but is noisy. Batch sizes of 8-32 provide smoother gradients. The training step averages gradients over the batch, so larger batches give more stable updates at the cost of more computation per step.

Monitoring convergence: Watch the loss value returned by trainer:step(). For memorization (overfitting to a small dataset), loss should approach 0. For generalization, track loss on a held-out validation set separately using model:forward().

Memory budget:

Component	Nano (d=64)	Mini (d=128)	Small (d=192)
Weights	~1 MB	~8 MB	~28 MB
Gradients	~1 MB	~8 MB	~28 MB
Adam state	~2 MB	~16 MB	~56 MB
Train activations	~2 MB	~15 MB	~40 MB
Scratch buffers	~1 MB	~3 MB	~6 MB
Total	~7 MB	~50 MB	~158 MB

All training memory is freed when trainer:close() is called. The model retains only its weights and inference activation buffers.
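The totals in the table follow a simple rule: gradients match the weights in size, Adam adds two more weight-sized buffers, and activations plus scratch come on top. A small sketch of that arithmetic (the function name is hypothetical, and the inputs are the approximate MB figures from the table):

```lua
-- Estimate total training memory in MB from the table's components.
local function training_memory_mb(weights_mb, activations_mb, scratch_mb, optimizer)
    local grads = weights_mb                                   -- one gradient per weight
    local opt_state = (optimizer == "adam") and 2 * weights_mb or 0   -- m and v
    return weights_mb + grads + opt_state + activations_mb + scratch_mb
end

local nano_adam  = training_memory_mb(1, 2, 1, "adam")
local mini_adam  = training_memory_mb(8, 15, 3, "adam")
local small_adam = training_memory_mb(28, 40, 6, "adam")
local mini_sgd   = training_memory_mb(8, 15, 3, "sgd")
```

Under this rule, switching the Mini config from Adam to SGD would drop the estimate from about 50 MB to about 34 MB, since the two moment buffers are no longer needed.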

Shell Builtin: `calm train`

calm train accepts options:

--model / -m — model path or "new" to create from scratch (default: user model)
--dataset / -d — CTDS dataset path (default: ~/.local/share/lilush/calm/train.ctds)
--output / -o — output weight file path (default: same as model source)
--epochs / -e — training epochs (default: 3)
--batch-size / -b — batch size (default: 8)
--lr — learning rate (default: 5e-5)
--optimizer — adam or sgd (default: adam)
--ewc-lambda — EWC regularization strength (default: 0 = disabled)
--grad-clip — gradient clipping norm (default: 1.0)
--weight-decay — weight decay (default: 0.01)
--warmup-steps — LR warmup steps; when >0, enables cosine annealing schedule (default: 0 = flat LR)
--min-lr — minimum learning rate for cosine schedule (default: 1e-6)
--size / -s — model size preset for --model new (nano/micro/mini/small)
--d-model, --n-layers, --ffn-expand — explicit dimensions for --model new

When training a model from scratch (--model new), use a higher learning rate with a cosine schedule (enabled by setting --warmup-steps above 0) for best results:

calm train --model new --size nano --lr 1e-3 --warmup-steps 100 --epochs 10 --batch-size 32
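The exact schedule used by calm train is not given here. A common formulation is linear warmup followed by cosine annealing down to the minimum learning rate, and the sketch below assumes that shape. The function name, the 0-indexed step convention, and the total step count are assumptions for illustration.

```lua
-- Learning rate at a given 0-indexed step.
-- warmup_steps = 0 means a flat learning rate, as with the --warmup-steps default.
local function lr_at(step, total_steps, base_lr, warmup_steps, min_lr)
    if warmup_steps <= 0 then return base_lr end
    if step < warmup_steps then
        return base_lr * (step + 1) / warmup_steps        -- linear warmup
    end
    local span = math.max(total_steps - warmup_steps, 1)
    local progress = math.min((step - warmup_steps) / span, 1)
    -- Cosine annealing from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
end

local first   = lr_at(0,    1000, 1e-3, 100, 1e-6)   -- start of warmup
local peak    = lr_at(100,  1000, 1e-3, 100, 1e-6)   -- warmup finished
local halfway = lr_at(550,  1000, 1e-3, 100, 1e-6)   -- middle of the cosine
local final   = lr_at(1000, 1000, 1e-3, 100, 1e-6)   -- end of training
local flat    = lr_at(5,    1000, 1e-3, 0,   1e-6)   -- no schedule
```

In a hand-written Lua training loop, the same function could feed trainer:set_lr() before each trainer:step() call.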

calm train runs in the foreground. Use job start calm train ... for background execution.

calm evaluate computes average, min, and max per-sequence loss on a CTDS dataset. It is useful for checking model quality on a held-out test set.

calm benchmark creates temporary models at each size (Nano through Small) and measures forward pass and completion latency.

Lua API

Training operations

local calm = require("calm")

-- Start a training session (from existing or freshly initialized model)
local trainer = calm.trainer(model, {
    lr = 1e-4,              -- learning rate
    optimizer = "adam",      -- "adam" or "sgd"
    weight_decay = 0.01,    -- AdamW weight decay
    ewc_lambda = 0.5,       -- EWC regularization strength (0 = disabled)
    grad_clip = 1.0,        -- gradient clipping norm
})

-- Feed a batch of training sequences
-- sequences is an array of token ID arrays
-- cmd_positions is an array of ATN positions (for loss masking)
local loss = trainer:step(sequences, cmd_positions)

-- Compute and store Fisher information for EWC
-- (call after training, before saving weights)
trainer:compute_fisher(validation_sequences, validation_cmd_positions)

-- Save updated weights (includes EWC data if computed)
trainer:save("/path/to/weights.cwgt")

-- Adjust learning rate dynamically
trainer:set_lr(5e-5)

-- Contrastive training step (for embedding models)
-- queries and positives are arrays of token ID arrays
local loss = trainer:contrastive_step(queries, positives, {
    temperature = 0.07,  -- InfoNCE temperature (default: 0.07)
    pool = "mean",       -- pooling: "mean" or "last" (default: "mean")
})

-- Manual gradient accumulation (forward+backward without optimizer step)
trainer:accumulate(tokens, cmd_pos)

-- Access/clear gradients
local grad = trainer:get_grad(param_index)
trainer:zero_grad()

-- Free training state (optimizer moments, etc.)
trainer:close()
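For readers unfamiliar with InfoNCE, the objective behind contrastive_step() can be sketched in pure Lua. The sketch assumes in-batch negatives (query i should match positive i and no other), cosine similarity, and mean pooling; whether calm uses exactly these choices is not stated here, and all function names are hypothetical.

```lua
-- Mean-pool a list of per-token vectors into one embedding.
local function mean_pool(vectors)
    local out = {}
    for d = 1, #vectors[1] do
        local s = 0
        for t = 1, #vectors do s = s + vectors[t][d] end
        out[d] = s / #vectors
    end
    return out
end

local function cosine(a, b)
    local dot, na, nb = 0, 0, 0
    for d = 1, #a do
        dot = dot + a[d] * b[d]
        na = na + a[d] * a[d]
        nb = nb + b[d] * b[d]
    end
    return dot / (math.sqrt(na) * math.sqrt(nb))
end

-- InfoNCE with in-batch negatives: cross-entropy over similarity scores,
-- where the correct "class" for query i is positive i.
local function info_nce(queries, positives, temperature)
    local total = 0
    for i = 1, #queries do
        local sims, max = {}, -math.huge
        for j = 1, #positives do
            sims[j] = cosine(queries[i], positives[j]) / temperature
            if sims[j] > max then max = sims[j] end
        end
        local sum = 0
        for j = 1, #positives do sum = sum + math.exp(sims[j] - max) end
        total = total - (sims[i] - max - math.log(sum))
    end
    return total / #queries
end

local q = { mean_pool({ { 1, 0 }, { 1, 0 } }), mean_pool({ { 0, 2 } }) }
local p = { { 2, 0 }, { 0, 1 } }
local aligned = info_nce(q, p, 0.07)                  -- pairs match: loss near 0
local swapped = info_nce(q, { p[2], p[1] }, 0.07)     -- pairs mismatched: large loss
```

A low temperature such as 0.07 sharpens the softmax, so a mismatched pair is penalized heavily while a well-aligned batch yields a loss close to zero.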
