CALM Dataset

CTDS Dataset Format

Training data is stored in a binary format (CTDS — CALM Training Data Set) for efficient loading without runtime tokenization overhead.

File layout

Header (14 bytes):
  magic:          4 bytes    "CTDS"
  vocab_version:  uint32     (unused, set to 0 for byte-level tokenizer)
  count:          uint32     number of sequences
  max_len:        uint16     maximum sequence length

Per-sequence metadata:
  lengths:        count × uint16    token count per sequence
  cmd_positions:  count × uint16    0-indexed ATN position per sequence

Token data:
  tokens:         (sum of lengths) × uint16    all tokens contiguous

All integers are little-endian.
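
As a rough illustration of the layout above, the following sketch decodes the 14-byte header from a Lua string using plain byte arithmetic, so it does not depend on `string.unpack` and should also run on Lua 5.1 and LuaJIT. The helper names (`read_u16le`, `read_u32le`, `parse_header`) are hypothetical and not part of the calm module, and the section offsets are derived from the layout listing rather than from the loader's source.

```lua
-- Decode a little-endian uint16 starting at 1-based position i of string s.
local function read_u16le(s, i)
    local a, b = s:byte(i, i + 1)
    return a + b * 256
end

-- Decode a little-endian uint32 starting at 1-based position i of string s.
local function read_u32le(s, i)
    local a, b, c, d = s:byte(i, i + 3)
    return a + b * 256 + c * 65536 + d * 16777216
end

-- Parse the 14-byte CTDS header. Returns a table, or nil plus an error message.
local function parse_header(bytes)
    if #bytes < 14 then
        return nil, "truncated header"
    end
    if bytes:sub(1, 4) ~= "CTDS" then
        return nil, "bad magic"
    end
    local count = read_u32le(bytes, 9)
    return {
        vocab_version = read_u32le(bytes, 5),
        count = count,
        max_len = read_u16le(bytes, 13),
        -- 0-based byte offsets of the sections that follow the header
        lengths_offset = 14,
        cmd_positions_offset = 14 + 2 * count,
        tokens_offset = 14 + 4 * count,
    }
end
```

The offsets follow from the fact that both metadata arrays hold `count` uint16 values each, so the token data begins 4 × count bytes after the header.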

Loading and iterating

local calm = require("calm")
local ds = calm.load_dataset("/path/to/train.ctds")

print(ds:count())        -- number of sequences

ds:shuffle()             -- randomize iteration order

-- Get a batch (0-indexed batch number, batch size)
local seqs, cmd_pos = ds:batch(0, 32)
-- seqs = { {tok1, tok2, ...}, {tok1, tok2, ...}, ... }
-- cmd_pos = { 5, 8, 3, ... }

ds:close()

Shuffling permutes an index array rather than the underlying data. Call shuffle() before each epoch to iterate in a different order.
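
To make the index-array idea concrete, here is a minimal standalone sketch. It is not the calm implementation: `make_index`, `shuffle_index` and `take_batch` are hypothetical names, and the Fisher-Yates shuffle is an assumption about how such a permutation might be produced. The data table is never reordered; only the index is.

```lua
-- Build the identity index 1..n.
local function make_index(n)
    local index = {}
    for i = 1, n do index[i] = i end
    return index
end

-- Fisher-Yates shuffle of the index array, in place.
local function shuffle_index(index)
    for i = #index, 2, -1 do
        local j = math.random(i)
        index[i], index[j] = index[j], index[i]
    end
end

-- Fetch batch number batch_no (0-indexed) of up to batch_size items,
-- reading the data through the index. The final batch may be shorter.
local function take_batch(data, index, batch_no, batch_size)
    local out = {}
    local first = batch_no * batch_size + 1
    local last = math.min(first + batch_size - 1, #index)
    for k = first, last do
        out[#out + 1] = data[index[k]]
    end
    return out
end
```

Because only small integers are swapped, reshuffling between epochs leaves the token data untouched.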

Creating CTDS files

The simplest approach is pipeline.write_ctds(). Alternatively, build sequences with calm.build_sequence() and write the file manually from Lua:

local calm = require("calm")

local pipeline = require("calm.pipeline")

local examples = {
    { cwd = "/home/user", input = "git status", history = {} },
    { cwd = "/var/log",   input = "tail -f syslog", history = {} },
}

-- assert() surfaces the error message if the file cannot be opened
local f = assert(io.open("/tmp/train.ctds", "wb"))

-- Tokenize all sequences first to get counts
local all_seqs = {}
local all_cmds = {}
local max_len = 0

for _, ex in ipairs(examples) do
    local seq = calm.build_sequence(pipeline.TEMPLATES.shell, ex)
    -- Find ATN position
    local cmd_pos = 0
    for i, tok in ipairs(seq) do
        if tok == calm.ATN then cmd_pos = i - 1; break end  -- 0-indexed
    end
    all_seqs[#all_seqs + 1] = seq
    all_cmds[#all_cmds + 1] = cmd_pos
    if #seq > max_len then max_len = #seq end
end

-- Write header
f:write("CTDS")
-- vocab_version (uint32 LE, unused — set to 0)
f:write(string.char(0, 0, 0, 0))
-- count (uint32 LE)
local c = #all_seqs
f:write(string.char(c % 256, math.floor(c/256) % 256,
                    math.floor(c/65536) % 256, math.floor(c/16777216) % 256))
-- max_len (uint16 LE)
f:write(string.char(max_len % 256, math.floor(max_len/256)))

-- Write lengths
for _, seq in ipairs(all_seqs) do
    local len = #seq
    f:write(string.char(len % 256, math.floor(len/256)))
end

-- Write cmd_positions
for _, cp in ipairs(all_cmds) do
    f:write(string.char(cp % 256, math.floor(cp/256)))
end

-- Write tokens
for _, seq in ipairs(all_seqs) do
    for _, tok in ipairs(seq) do
        f:write(string.char(tok % 256, math.floor(tok/256)))
    end
end

f:close()
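
The manual writer above repeats the same byte arithmetic several times. One way to tidy it, sketched here under the assumption that the sequences are already tokenized into plain arrays of integers, is to factor the packing into two helpers and build the whole file image in memory. `pack_u16`, `pack_u32` and `encode_ctds` are hypothetical names, not calm APIs; the range checks are an addition that catches values too large for the field width.

```lua
-- Encode n as a little-endian uint16.
local function pack_u16(n)
    assert(n >= 0 and n <= 0xFFFF, "value does not fit in uint16")
    return string.char(n % 256, math.floor(n / 256))
end

-- Encode n as a little-endian uint32.
local function pack_u32(n)
    assert(n >= 0 and n <= 0xFFFFFFFF, "value does not fit in uint32")
    return string.char(n % 256, math.floor(n / 256) % 256,
                       math.floor(n / 65536) % 256, math.floor(n / 16777216) % 256)
end

-- Build a CTDS file image from token sequences and their 0-indexed ATN positions.
local function encode_ctds(sequences, cmd_positions)
    local max_len = 0
    for _, seq in ipairs(sequences) do
        if #seq > max_len then max_len = #seq end
    end

    local parts = {
        "CTDS",               -- magic
        pack_u32(0),          -- vocab_version (unused)
        pack_u32(#sequences), -- count
        pack_u16(max_len),    -- max_len
    }
    for _, seq in ipairs(sequences) do
        parts[#parts + 1] = pack_u16(#seq)
    end
    for _, cp in ipairs(cmd_positions) do
        parts[#parts + 1] = pack_u16(cp)
    end
    for _, seq in ipairs(sequences) do
        for _, tok in ipairs(seq) do
            parts[#parts + 1] = pack_u16(tok)
        end
    end
    return table.concat(parts)
end
```

The returned string can then be written with a single `f:write(...)` call on a file opened in "wb" mode.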

Training Data Source Format

The raw training examples (before tokenization) are plain text files with one example per block, separated by blank lines:

<CWD>/home/user/lilush
<GIT>main
<HIST>git diff --stat<EXIT>0
<HIST>git status<EXIT>0
<COMP>commit<NEXT>checkout<NEXT>cherry-pick<NEXT>clone
<CMD>git commit -m "fix tokenizer"

<CWD>/var/log
<CMD>files_matching log kat -m

<CWD>/home/user
<HIST>make -C src/litls<EXIT>0
<HIST>make -C src/litls<EXIT>2
<CMD>make -C src/litls clean && make -C src/litls

Lines are interpreted by their leading special token (e.g. <CWD>, <HIST>, <CMD>):

Frame lines: <FRAME>content<SUBTOKEN>content... (<END> is added automatically during encoding)
Subtokens: <EXIT> within <HIST> lines, <NEXT> within <COMP> lines
<ATN>: explicit attention boundary (the rest of the line is ignored)
Lines without a special token prefix: treated as raw text content
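
The line rules above can be illustrated with a small classifier. This is only a sketch and not the parser used by calm: `parse_line` is a hypothetical name, and it assumes that every special token is written as uppercase letters in angle brackets. Under that assumption, command text that itself contains something like `<FOO>` would be misread as a subtoken.

```lua
-- Classify one line of the text format.
-- Returns { raw = line } for lines without a special token prefix,
-- { frame = "ATN" } for an attention boundary, or
-- { frame = NAME, content = TEXT, subtokens = { { name = ..., text = ... }, ... } }.
local function parse_line(line)
    local frame, rest = line:match("^<(%u+)>(.*)$")
    if not frame then
        return { raw = line }
    end
    if frame == "ATN" then
        return { frame = "ATN" } -- the rest of the line is ignored
    end

    local parsed = { frame = frame, subtokens = {} }
    local pos, current = 1, nil
    while true do
        -- Find the next <SUBTOKEN>; the text before it belongs to the previous piece.
        local s, e, name = rest:find("<(%u+)>", pos)
        local text = rest:sub(pos, (s or #rest + 1) - 1)
        if current then
            parsed.subtokens[#parsed.subtokens + 1] = { name = current, text = text }
        else
            parsed.content = text
        end
        if not s then break end
        current, pos = name, e + 1
    end
    return parsed
end
```

For example, `parse_line("<HIST>git status<EXIT>0")` yields the frame `HIST` with content `git status` and a single `EXIT` subtoken whose text is `0`.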

The calm dataset --from FILE command parses these via pipeline.parse_text_dataset(), builds token sequences per the tokenizer spec, and writes the binary CTDS dataset.

This format is simple enough to generate programmatically (for synthetic data) or write by hand (for curated examples).
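
As an example of programmatic generation, the sketch below renders example tables into the text format. The function names and the field names (`cwd`, `git`, `history`, `completions`, `cmd`) are hypothetical choices for this illustration and are not taken from the calm module.

```lua
-- Render one example as a block of lines in the text format.
local function format_example(ex)
    local lines = { "<CWD>" .. ex.cwd }
    if ex.git then
        lines[#lines + 1] = "<GIT>" .. ex.git
    end
    for _, h in ipairs(ex.history or {}) do
        lines[#lines + 1] = "<HIST>" .. h.cmd .. "<EXIT>" .. tostring(h.exit)
    end
    if ex.completions and #ex.completions > 0 then
        lines[#lines + 1] = "<COMP>" .. table.concat(ex.completions, "<NEXT>")
    end
    lines[#lines + 1] = "<CMD>" .. ex.cmd
    return table.concat(lines, "\n")
end

-- Render a list of examples, separating blocks with a blank line.
local function format_dataset(examples)
    local blocks = {}
    for i, ex in ipairs(examples) do
        blocks[i] = format_example(ex)
    end
    return table.concat(blocks, "\n\n") .. "\n"
end
```

The resulting string can be saved to a file and passed to `calm dataset --from`.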

Continual Learning Workflow

The typical workflow for training CALM on your own shell history:

# 1. Initialize a model with random weights (first time only)
calm init                             # Nano (default)
calm init --size micro                # Micro
calm init --size mini                 # Mini

# 2. Enable training metadata collection
export CALM_SAVE_TRAIN_DATA=1

# 3. Collect history by using the shell normally (automatic)
#    Completion candidates are stored in MNEME by calm_store

# 4. Build dataset from shell history
calm dataset

# 5. Train (runs in foreground; use 'job start calm train' for background)
calm train

# 6. Enable predictions (first time only)
calm enable

# 7. The shell automatically hot-reloads the updated weights
#    (mtime-based detection, no restart needed)

For more control over training:

calm train --epochs 5 --lr 5e-5 --optimizer adam --weight-decay 0.01

For offline training from curated text data:

# Convert text file to CTDS
calm dataset --from training_data.txt --output /tmp/train.ctds

# Train a new model from scratch
calm train --model new --size mini -d /tmp/train.ctds -o /tmp/model.cwgt --epochs 10

# Evaluate
calm evaluate -m /tmp/model.cwgt -d /tmp/train.ctds

Shell Builtin: `calm dataset`

calm dataset accepts these options:

Flag	Description	Default
`--max-entries` / `-n`	Max shell history entries	0 (all)
`--from` / `-f`	Plain-text training data file	(shell history)
`--output` / `-o`	Output CTDS path	`~/.local/share/lilush/calm/train.ctds`
`--view` / `-V`	View dataset contents
`--ds`	CTDS file path (for `--view`)
`--index` / `-i`	Start index for `--view` (1-based)	1
`--count` / `-c`	Number of sequences for `--view`
`--hist-frames` / `-H`	History frames per sequence	5
`--comp-frames`	Completion candidates per sequence	3
`--max-dup`	Max duplicate commands	3
`--min-cmd-len`	Min command length in chars	2
`--include-failed`	Include non-zero exit commands	false
`--include-trivial`	Include trivial commands	false

calm dataset runs in the foreground. Use job start calm dataset ... for background execution.

Pipeline Module

The calm.pipeline module (src/calm/calm/pipeline.lua) provides shared utilities used by both tool scripts and shell builtins:

resolve_paths() — standard file paths under ~/.local/share/lilush/calm/
ensure_calm_dir() — recursive mkdir for the CALM data directory
write_ctds(path, sequences, cmd_positions) — write CTDS binary files
find_cmd_pos(seq) — find ATN position in a sequence (0-indexed, or 0 if absent)
atomic_save(tmp, final) — rename-based atomic file replacement
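
For readers unfamiliar with rename-based replacement, the general pattern is to write the new contents to a temporary file next to the target and then rename it over the final path, so readers see either the old file or the new one. The sketch below shows that pattern with the standard library only. It is not the implementation of atomic_save: `atomic_write` is a hypothetical name, the `.tmp` suffix is an assumption, and plain Lua offers no fsync, so durability across a crash is not guaranteed.

```lua
-- Write data to final_path by first writing a sibling temp file and then
-- renaming it into place. Returns true, or nil plus an error message.
local function atomic_write(final_path, data)
    local tmp_path = final_path .. ".tmp"
    local f, err = io.open(tmp_path, "wb")
    if not f then
        return nil, err
    end
    local ok, werr = f:write(data)
    f:close()
    if not ok then
        os.remove(tmp_path)
        return nil, werr
    end
    -- On POSIX systems, rename within one filesystem replaces the target in one step.
    local renamed, rerr = os.rename(tmp_path, final_path)
    if not renamed then
        os.remove(tmp_path)
        return nil, rerr
    end
    return true
end
```

Keeping the temporary file in the same directory as the target matters, since a rename across filesystems is not atomic and may fail outright.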
