This document covers the CTDS binary dataset format, training data
source format, the continual learning workflow, and the calm.pipeline
utility module.
Related documents:

- CALM: model architecture, inference, shell integration
- CALM Training: training, fine-tuning, EWC
- CALM Tokenizer: tokenizer specification
Training data is stored in a binary format (CTDS — CALM Training Data Set) for efficient loading without runtime tokenization overhead.
Header (14 bytes):
magic: 4 bytes "CTDS"
vocab_version: uint32 (unused, set to 0 for byte-level tokenizer)
count: uint32 number of sequences
max_len: uint16 maximum sequence length
Per-sequence metadata:
lengths: count × uint16 token count per sequence
cmd_positions: count × uint16 0-indexed ATN position per sequence
Token data:
tokens: (sum of lengths) × uint16 all tokens contiguous
All integers are little-endian.
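
The layout above is enough to locate every section of a file from the header alone. The following is a minimal sketch of a header parser in plain Lua, not the loader used by calm.load_dataset(). The helper names (read_u16le, read_u32le, parse_ctds_header) and the offset field names are hypothetical and chosen for illustration.

```lua
-- Hypothetical helpers: decode little-endian integers from a binary string.
-- `pos` is a 1-based byte position, as in string.byte.
local function read_u16le(s, pos)
  local lo, hi = s:byte(pos, pos + 1)
  return lo + hi * 256
end

local function read_u32le(s, pos)
  local b1, b2, b3, b4 = s:byte(pos, pos + 3)
  return b1 + b2 * 256 + b3 * 65536 + b4 * 16777216
end

local HEADER_SIZE = 14 -- 4 (magic) + 4 (vocab_version) + 4 (count) + 2 (max_len)

-- Parse the 14-byte CTDS header and derive where each section starts.
-- Returns a table on success, or nil plus an error message.
local function parse_ctds_header(bytes)
  if #bytes < HEADER_SIZE then
    return nil, "truncated header"
  end
  if bytes:sub(1, 4) ~= "CTDS" then
    return nil, "bad magic"
  end
  local count = read_u32le(bytes, 9)
  return {
    vocab_version = read_u32le(bytes, 5),
    count = count,
    max_len = read_u16le(bytes, 13),
    -- 0-based byte offsets from the start of the file
    lengths_offset = HEADER_SIZE,
    cmd_positions_offset = HEADER_SIZE + 2 * count,
    tokens_offset = HEADER_SIZE + 4 * count,
  }
end
```

Because lengths and cmd_positions each occupy 2 bytes per sequence, the token data always starts at byte offset 14 + 4 × count.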
local calm = require("calm")
local ds = calm.load_dataset("/path/to/train.ctds")
print(ds:count()) -- number of sequences
ds:shuffle() -- randomize iteration order
-- Get a batch (0-indexed batch number, batch size)
local seqs, cmd_pos = ds:batch(0, 32)
-- seqs = { {tok1, tok2, ...}, {tok1, tok2, ...}, ... }
-- cmd_pos = { 5, 8, 3, ... }
ds:close()
Shuffling permutes an index array, not the underlying data. Calling
shuffle() before each epoch provides different iteration orders.
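
To illustrate what permuting an index array means in practice, here is a small sketch in plain Lua. It is not the library's implementation: new_index, shuffle_index, and batch_indices are hypothetical names, and the Fisher-Yates shuffle is an assumption about how such a permutation could be produced.

```lua
-- Build the identity index: position i refers to stored sequence i.
local function new_index(n)
  local idx = {}
  for i = 1, n do idx[i] = i end
  return idx
end

-- Fisher-Yates shuffle, in place. Only the index array moves;
-- the stored sequences are never copied or reordered.
local function shuffle_index(idx)
  for i = #idx, 2, -1 do
    local j = math.random(i)
    idx[i], idx[j] = idx[j], idx[i]
  end
end

-- Storage positions covered by a 0-indexed batch number.
-- The final batch may be shorter than batch_size.
local function batch_indices(idx, batch_no, batch_size)
  local out = {}
  local first = batch_no * batch_size + 1
  for i = first, math.min(first + batch_size - 1, #idx) do
    out[#out + 1] = idx[i]
  end
  return out
end
```

A batch lookup then reads sequences at the positions returned by batch_indices, so reshuffling costs time proportional to the number of sequences rather than the number of tokens.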
Use pipeline.write_ctds() for the simplest approach, or write
manually from Lua using calm.build_sequence():
local calm = require("calm")
local pipeline = require("calm.pipeline")
local examples = {
{ cwd = "/home/user", input = "git status", history = {} },
{ cwd = "/var/log", input = "tail -f syslog", history = {} },
}
-- assert() surfaces the error message if the file cannot be opened
local f = assert(io.open("/tmp/train.ctds", "wb"))
-- Tokenize all sequences first to get counts
local all_seqs = {}
local all_cmds = {}
local max_len = 0
for _, ex in ipairs(examples) do
local seq = calm.build_sequence(pipeline.TEMPLATES.shell, ex)
-- Find ATN position
local cmd_pos = 0
for i, tok in ipairs(seq) do
if tok == calm.ATN then cmd_pos = i - 1; break end -- 0-indexed
end
all_seqs[#all_seqs + 1] = seq
all_cmds[#all_cmds + 1] = cmd_pos
if #seq > max_len then max_len = #seq end
end
-- Write header
f:write("CTDS")
-- vocab_version (uint32 LE, unused — set to 0)
f:write(string.char(0, 0, 0, 0))
-- count (uint32 LE)
local c = #all_seqs
f:write(string.char(c % 256, math.floor(c/256) % 256,
math.floor(c/65536) % 256, math.floor(c/16777216) % 256))
-- max_len (uint16 LE)
f:write(string.char(max_len % 256, math.floor(max_len/256)))
-- Write lengths
for _, seq in ipairs(all_seqs) do
local len = #seq
f:write(string.char(len % 256, math.floor(len/256)))
end
-- Write cmd_positions
for _, cp in ipairs(all_cmds) do
f:write(string.char(cp % 256, math.floor(cp/256)))
end
-- Write tokens
for _, seq in ipairs(all_seqs) do
for _, tok in ipairs(seq) do
f:write(string.char(tok % 256, math.floor(tok/256)))
end
end
f:close()
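
The manual writer repeats the same byte arithmetic for every field. One way to tidy it, sketched here with hypothetical helper names (u16le, u32le, ctds_header) that are not part of calm or calm.pipeline, is to factor the encoding into small functions that also reject values that do not fit the field width.

```lua
-- Encode a non-negative integer as 2 little-endian bytes.
local function u16le(n)
  assert(n >= 0 and n <= 65535 and n % 1 == 0, "value does not fit in uint16")
  return string.char(n % 256, math.floor(n / 256))
end

-- Encode a non-negative integer as 4 little-endian bytes.
local function u32le(n)
  assert(n >= 0 and n <= 4294967295 and n % 1 == 0, "value does not fit in uint32")
  return string.char(
    n % 256,
    math.floor(n / 256) % 256,
    math.floor(n / 65536) % 256,
    math.floor(n / 16777216) % 256
  )
end

-- Build the 14-byte header: magic, vocab_version (0), count, max_len.
local function ctds_header(count, max_len)
  return "CTDS" .. u32le(0) .. u32le(count) .. u16le(max_len)
end
```

With these helpers the header write becomes f:write(ctds_header(#all_seqs, max_len)), and a sequence longer than 65535 tokens fails loudly instead of producing a corrupt file.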
The raw training examples (before tokenization) are plain text files with one example per block, separated by blank lines:
<CWD>/home/user/lilush
<GIT>main
<HIST>git diff --stat<EXIT>0
<HIST>git status<EXIT>0
<COMP>commit<NEXT>checkout<NEXT>cherry-pick<NEXT>clone
<CMD>git commit -m "fix tokenizer"

<CWD>/var/log
<CMD>files_matching log kat -m

<CWD>/home/user
<HIST>make -C src/litls<EXIT>0
<HIST>make -C src/litls<EXIT>2
<CMD>make -C src/litls clean && make -C src/litls
Lines are interpreted by their leading special token name (e.g. <CWD>, <HIST>, <CMD>):

- Frame lines: <FRAME>content<SUBTOKEN>content... (<END> is added automatically during encoding)
- Subtokens: <EXIT> within <HIST> lines, <NEXT> within <COMP> lines
- <ATN> marks an explicit attention boundary (the rest of the line is ignored)
- Lines without a special token prefix are treated as raw text content
The calm dataset --from FILE command parses these via
pipeline.parse_text_dataset(), builds token sequences per the
tokenizer spec, and writes the binary CTDS dataset.
This format is simple enough to generate programmatically (for synthetic data) or write by hand (for curated examples).
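
As an illustration of programmatic generation, the sketch below renders Lua tables into the text format shown above. The function names (format_example, format_dataset) and the table field names (cwd, git, hist, comp, cmd) are hypothetical; only the tag names are taken from the format itself.

```lua
-- Render one example as a block of tagged lines.
-- ex.cwd and ex.cmd are required; ex.git, ex.hist and ex.comp are optional.
local function format_example(ex)
  local lines = {}
  lines[#lines + 1] = "<CWD>" .. ex.cwd
  if ex.git then
    lines[#lines + 1] = "<GIT>" .. ex.git
  end
  for _, h in ipairs(ex.hist or {}) do
    lines[#lines + 1] = "<HIST>" .. h.cmd .. "<EXIT>" .. tostring(h.exit)
  end
  if ex.comp and #ex.comp > 0 then
    lines[#lines + 1] = "<COMP>" .. table.concat(ex.comp, "<NEXT>")
  end
  lines[#lines + 1] = "<CMD>" .. ex.cmd
  return table.concat(lines, "\n")
end

-- Join examples with a blank line between blocks.
local function format_dataset(examples)
  local blocks = {}
  for i, ex in ipairs(examples) do
    blocks[i] = format_example(ex)
  end
  return table.concat(blocks, "\n\n") .. "\n"
end
```

The resulting string can be written to a file and passed to calm dataset --from FILE.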
The typical workflow for training CALM on your own shell history:
# 1. Initialize a model with random weights (first time only)
calm init # Nano (default)
calm init --size micro # Micro
calm init --size mini # Mini
# 2. Enable training metadata collection
export CALM_SAVE_TRAIN_DATA=1
# 3. Collect history by using the shell normally (automatic)
# Completion candidates are stored in MNEME by calm_store
# 4. Build dataset from shell history
calm dataset
# 5. Train (runs in foreground; use 'job start calm train' for background)
calm train
# 6. Enable predictions (first time only)
calm enable
# 7. The shell automatically hot-reloads the updated weights
# (mtime-based detection, no restart needed)
For more control over training:
calm train --epochs 5 --lr 5e-5 --optimizer adam --weight-decay 0.01
For offline training from curated text data:
# Convert text file to CTDS
calm dataset --from training_data.txt --output /tmp/train.ctds
# Train a new model from scratch
calm train --model new --size mini -d /tmp/train.ctds -o /tmp/model.cwgt --epochs 10
# Evaluate
calm evaluate -m /tmp/model.cwgt -d /tmp/train.ctds
calm dataset accepts these options:
| Flag | Description | Default |
|---|---|---|
| --max-entries / -n | Max shell history entries | 0 (all) |
| --from / -f | Plain-text training data file | (shell history) |
| --output / -o | Output CTDS path | ~/.local/share/lilush/calm/train.ctds |
| --view / -V | View dataset contents | |
| --ds | CTDS file path (for --view) | |
| --index / -i | Start index for --view (1-based) | 1 |
| --count / -c | Number of sequences for --view | |
| --hist-frames / -H | History frames per sequence | 5 |
| --comp-frames | Completion candidates per sequence | 3 |
| --max-dup | Max duplicate commands | 3 |
| --min-cmd-len | Min command length in chars | 2 |
| --include-failed | Include non-zero exit commands | false |
| --include-trivial | Include trivial commands | false |
calm dataset runs in the foreground. Use job start calm dataset ...
for background execution.
The calm.pipeline module (src/calm/calm/pipeline.lua) provides
shared utilities used by both tool scripts and shell builtins:

- resolve_paths(): standard file paths under ~/.local/share/lilush/calm/
- ensure_calm_dir(): recursive mkdir for the CALM data directory
- write_ctds(path, sequences, cmd_positions): write CTDS binary files
- find_cmd_pos(seq): find ATN position in a sequence (0-indexed, or 0 if absent)
- atomic_save(tmp, final): rename-based atomic file replacement
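
The internals of atomic_save are not shown here, but the general rename-based pattern can be sketched in plain Lua. The function below (save_atomically, a hypothetical name with its own signature) writes to a temporary sibling file and renames it over the target. It assumes a POSIX filesystem, where a rename within one directory replaces the target in a single step.

```lua
-- Write `data` to `final_path` so that readers see either the old file
-- or the complete new file, never a partial write.
-- Returns true on success, or nil plus an error message.
local function save_atomically(final_path, data)
  local tmp_path = final_path .. ".tmp"
  local f, err = io.open(tmp_path, "wb")
  if not f then
    return nil, err
  end
  local ok, werr = f:write(data)
  f:close()
  if not ok then
    os.remove(tmp_path)
    return nil, werr
  end
  local renamed, rerr = os.rename(tmp_path, final_path)
  if not renamed then
    os.remove(tmp_path)
    return nil, rerr
  end
  return true
end
```

This pattern pairs naturally with mtime-based hot reload: a reader that notices a new modification time will always find a fully written file.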
local calm = require("calm")
-- Load a pre-tokenized binary dataset
local dataset, err = calm.load_dataset("/path/to/train.ctds")
-- Get count
print(dataset:count())
-- Get a batch of sequences (0-indexed batch number)
local batch, cmd_positions = dataset:batch(batch_idx, batch_size)
-- batch = { {token_ids...}, {token_ids...}, ... }
-- cmd_positions = { 15, 22, 18, ... }
-- Shuffle dataset (in-place, for epoch randomization)
dataset:shuffle()
dataset:close()
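
Putting these calls together, one pass over the data could be driven by a loop like the sketch below. run_epoch and its on_batch callback are hypothetical names. The sketch also assumes that the last batch is returned short when the sequence count is not a multiple of the batch size, which the API description above does not confirm.

```lua
-- Visit every batch of `dataset` once, in a freshly shuffled order.
-- `dataset` is any object exposing count(), shuffle() and batch(),
-- such as the value returned by calm.load_dataset().
-- Returns the number of batches visited.
local function run_epoch(dataset, batch_size, on_batch)
  dataset:shuffle()
  local n_batches = math.ceil(dataset:count() / batch_size)
  for b = 0, n_batches - 1 do -- batch numbers are 0-indexed
    local seqs, cmd_positions = dataset:batch(b, batch_size)
    on_batch(seqs, cmd_positions, b)
  end
  return n_batches
end
```

Calling run_epoch once per epoch gives each epoch a different iteration order without reloading the file.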