CALM Dataset

Overview

This document covers the CTDS binary dataset format, the plain-text training data source format, the continual learning workflow, the calm dataset shell builtin, and the calm.pipeline utility module.

CTDS Dataset Format

Training data is stored in a binary format (CTDS — CALM Training Data Set) for efficient loading without runtime tokenization overhead.

File layout

Header (14 bytes):
  magic:          4 bytes    "CTDS"
  vocab_version:  uint32     (unused, set to 0 for byte-level tokenizer)
  count:          uint32     number of sequences
  max_len:        uint16     maximum sequence length

Per-sequence metadata:
  lengths:        count × uint16    token count per sequence
  cmd_positions:  count × uint16    0-indexed ATN position per sequence

Token data:
  tokens:         (sum of lengths) × uint16    all tokens contiguous

All integers are little-endian.
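
The layout above fixes where each section starts, so a reader can compute every offset from the header alone. The following is a minimal sketch of that arithmetic, not the loader used by calm.load_dataset(); the names read_uint_le and parse_ctds_header are hypothetical helpers introduced for illustration.

```lua
-- Decode an unsigned little-endian integer of `n` bytes that starts at
-- 1-based position `pos` in string `s`.
local function read_uint_le(s, pos, n)
    local value = 0
    for i = n, 1, -1 do
        value = value * 256 + s:byte(pos + i - 1)
    end
    return value
end

-- Parse the 14-byte CTDS header and derive the byte offset of each section.
-- Returns a table on success, or nil plus an error message.
local function parse_ctds_header(bytes)
    if #bytes < 14 then
        return nil, "header too short"
    end
    if bytes:sub(1, 4) ~= "CTDS" then
        return nil, "bad magic"
    end
    local count = read_uint_le(bytes, 9, 4)
    return {
        vocab_version = read_uint_le(bytes, 5, 4),
        count = count,
        max_len = read_uint_le(bytes, 13, 2),
        -- 0-based file offsets, suitable for file:seek("set", offset)
        lengths_offset = 14,
        cmd_positions_offset = 14 + 2 * count,
        tokens_offset = 14 + 4 * count,
    }
end

-- Example: a header describing 3 sequences with max_len 300 (0x012C).
local header = "CTDS"
    .. string.char(0, 0, 0, 0)      -- vocab_version
    .. string.char(3, 0, 0, 0)      -- count
    .. string.char(0x2C, 0x01)      -- max_len
local info = assert(parse_ctds_header(header))
print(info.count, info.max_len, info.tokens_offset)
```

Because lengths and cmd_positions are both count × uint16, token data always begins 14 + 4 × count bytes into the file.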

Loading and iterating

local calm = require("calm")
local ds = calm.load_dataset("/path/to/train.ctds")

print(ds:count())        -- number of sequences

ds:shuffle()             -- randomize iteration order

-- Get a batch (0-indexed batch number, batch size)
local seqs, cmd_pos = ds:batch(0, 32)
-- seqs = { {tok1, tok2, ...}, {tok1, tok2, ...}, ... }
-- cmd_pos = { 5, 8, 3, ... }

ds:close()

Shuffling permutes an index array, not the underlying data. Call shuffle() before each epoch to get a different iteration order every time.
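
To illustrate what permuting an index array means, here is a small Fisher-Yates sketch. It is an assumption about the general technique, not the actual implementation behind ds:shuffle(); shuffle_indices and the stand-in data are hypothetical.

```lua
-- Fisher-Yates shuffle over an index array. Only the indices move;
-- the sequence data they point to stays where it is.
local function shuffle_indices(order, random)
    random = random or math.random
    for i = #order, 2, -1 do
        local j = random(1, i)  -- integer in [1, i]
        order[i], order[j] = order[j], order[i]
    end
    return order
end

local sequences = { "seq A", "seq B", "seq C", "seq D" }  -- stand-in data

-- Identity order: iterate sequences 1..N as stored.
local order = {}
for i = 1, #sequences do
    order[i] = i
end

shuffle_indices(order)

-- Iterate in shuffled order without touching `sequences`.
for _, idx in ipairs(order) do
    print(idx, sequences[idx])
end
```

Shuffling indices rather than data keeps the cost proportional to the number of sequences, regardless of how many tokens each one holds.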

Creating CTDS files

Use pipeline.write_ctds() for the simplest approach, or write manually from Lua using calm.build_sequence():

local calm = require("calm")

local pipeline = require("calm.pipeline")

local examples = {
    { cwd = "/home/user", input = "git status", history = {} },
    { cwd = "/var/log",   input = "tail -f syslog", history = {} },
}

-- assert() surfaces the OS error message if the file cannot be opened
local f = assert(io.open("/tmp/train.ctds", "wb"))

-- Tokenize all sequences first to get counts
local all_seqs = {}
local all_cmds = {}
local max_len = 0

for _, ex in ipairs(examples) do
    local seq = calm.build_sequence(pipeline.TEMPLATES.shell, ex)
    -- Find ATN position
    local cmd_pos = 0
    for i, tok in ipairs(seq) do
        if tok == calm.ATN then cmd_pos = i - 1; break end  -- 0-indexed
    end
    all_seqs[#all_seqs + 1] = seq
    all_cmds[#all_cmds + 1] = cmd_pos
    if #seq > max_len then max_len = #seq end
end

-- Write header
f:write("CTDS")
-- vocab_version (uint32 LE, unused — set to 0)
f:write(string.char(0, 0, 0, 0))
-- count (uint32 LE)
local c = #all_seqs
f:write(string.char(c % 256, math.floor(c/256) % 256,
                    math.floor(c/65536) % 256, math.floor(c/16777216) % 256))
-- max_len (uint16 LE)
f:write(string.char(max_len % 256, math.floor(max_len/256)))

-- Write lengths
for _, seq in ipairs(all_seqs) do
    local len = #seq
    f:write(string.char(len % 256, math.floor(len/256)))
end

-- Write cmd_positions
for _, cp in ipairs(all_cmds) do
    f:write(string.char(cp % 256, math.floor(cp/256)))
end

-- Write tokens
for _, seq in ipairs(all_seqs) do
    for _, tok in ipairs(seq) do
        f:write(string.char(tok % 256, math.floor(tok/256)))
    end
end

f:close()
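
The repeated string.char arithmetic in the manual writer can be factored into one little-endian helper. The sketch below builds the whole file image in memory from already-tokenized sequences; uint_le and encode_ctds are hypothetical names, and the token values are made up for the example.

```lua
-- Encode a non-negative integer as `nbytes` little-endian bytes.
local function uint_le(value, nbytes)
    local out = {}
    for i = 1, nbytes do
        out[i] = string.char(value % 256)
        value = math.floor(value / 256)
    end
    assert(value == 0, "value does not fit in " .. nbytes .. " bytes")
    return table.concat(out)
end

-- Serialize token sequences and their 0-indexed ATN positions as CTDS bytes.
local function encode_ctds(seqs, cmd_positions)
    local max_len = 0
    for _, seq in ipairs(seqs) do
        if #seq > max_len then max_len = #seq end
    end

    local parts = { "CTDS", uint_le(0, 4), uint_le(#seqs, 4), uint_le(max_len, 2) }
    for _, seq in ipairs(seqs) do
        parts[#parts + 1] = uint_le(#seq, 2)          -- lengths
    end
    for _, cp in ipairs(cmd_positions) do
        parts[#parts + 1] = uint_le(cp, 2)            -- cmd_positions
    end
    for _, seq in ipairs(seqs) do
        for _, tok in ipairs(seq) do
            parts[#parts + 1] = uint_le(tok, 2)       -- tokens, contiguous
        end
    end
    return table.concat(parts)
end

-- Two tiny sequences: 14-byte header + 2*2 + 2*2 metadata + 5*2 token bytes.
local blob = encode_ctds({ { 1, 2, 300 }, { 4, 5 } }, { 2, 1 })
print(#blob)
```

Writing the result is then a single f:write(blob), and the helper's assert catches any length or token that overflows its field.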

Training Data Source Format

The raw training examples (before tokenization) are plain text files with one example per block, separated by blank lines:

<CWD>/home/user/lilush
<GIT>main
<HIST>git diff --stat<EXIT>0
<HIST>git status<EXIT>0
<COMP>commit<NEXT>checkout<NEXT>cherry-pick<NEXT>clone
<CMD>git commit -m "fix tokenizer"

<CWD>/var/log
<CMD>files_matching log kat -m

<CWD>/home/user
<HIST>make -C src/litls<EXIT>0
<HIST>make -C src/litls<EXIT>2
<CMD>make -C src/litls clean && make -C src/litls

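Each line starts with a special token name (e.g. <CWD>, <HIST>, <CMD>) followed by that token's content. Inline tokens separate fields within a line: <EXIT> precedes a history command's exit code, and <NEXT> separates completion candidates.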

The calm dataset --from FILE command parses these via pipeline.parse_text_dataset(), builds token sequences per the tokenizer spec, and writes the binary CTDS dataset.
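
As a rough illustration of the parsing step, the sketch below splits source text into blocks and fields. It is not the implementation of pipeline.parse_text_dataset(); the function name parse_blocks and the result field names (cwd, git, history, completions, cmd) are assumptions made for this example.

```lua
-- Parse blank-line-separated example blocks into Lua tables.
local function parse_blocks(text)
    local examples, current = {}, nil
    -- Append a newline so the last line is matched even without a trailing one.
    for line in (text .. "\n"):gmatch("(.-)\r?\n") do
        local tag, rest = line:match("^<(%u+)>(.*)$")
        if line:match("^%s*$") then
            current = nil  -- a blank line ends the current block
        elseif tag then
            if not current then
                current = { history = {}, completions = {} }
                examples[#examples + 1] = current
            end
            if tag == "CWD" then
                current.cwd = rest
            elseif tag == "GIT" then
                current.git = rest
            elseif tag == "CMD" then
                current.cmd = rest
            elseif tag == "HIST" then
                local cmd, code = rest:match("^(.*)<EXIT>(%-?%d+)$")
                current.history[#current.history + 1] =
                    { cmd = cmd or rest, exit = code and tonumber(code) }
            elseif tag == "COMP" then
                for cand in (rest .. "<NEXT>"):gmatch("(.-)<NEXT>") do
                    if cand ~= "" then
                        current.completions[#current.completions + 1] = cand
                    end
                end
            end
        end
    end
    return examples
end

local sample = [[
<CWD>/home/user/lilush
<GIT>main
<HIST>git diff --stat<EXIT>0
<HIST>git status<EXIT>0
<COMP>commit<NEXT>checkout<NEXT>cherry-pick<NEXT>clone
<CMD>git commit -m "fix tokenizer"

<CWD>/var/log
<CMD>files_matching log kat -m
]]

local parsed = parse_blocks(sample)
print(#parsed, parsed[1].cmd)
```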

This format is simple enough to generate programmatically (for synthetic data) or write by hand (for curated examples).
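
Generating the format programmatically takes only a few lines. The following sketch emits blocks in the line order seen in the samples above (CWD, GIT, HIST, COMP, CMD); whether that order is required is not stated here, and format_example, format_dataset, and the input field names are hypothetical.

```lua
-- Render one example table as a text block.
local function format_example(ex)
    local lines = { "<CWD>" .. ex.cwd }
    if ex.git then
        lines[#lines + 1] = "<GIT>" .. ex.git
    end
    for _, h in ipairs(ex.history or {}) do
        lines[#lines + 1] = "<HIST>" .. h.cmd .. "<EXIT>" .. string.format("%d", h.exit)
    end
    if ex.completions and #ex.completions > 0 then
        lines[#lines + 1] = "<COMP>" .. table.concat(ex.completions, "<NEXT>")
    end
    lines[#lines + 1] = "<CMD>" .. ex.cmd
    return table.concat(lines, "\n")
end

-- Join blocks with one blank line between them.
local function format_dataset(examples)
    local blocks = {}
    for i, ex in ipairs(examples) do
        blocks[i] = format_example(ex)
    end
    return table.concat(blocks, "\n\n") .. "\n"
end

local text = format_dataset({
    {
        cwd = "/home/user",
        history = { { cmd = "make -C src/litls", exit = 2 } },
        cmd = "make -C src/litls clean",
    },
    { cwd = "/var/log", cmd = "tail -f syslog" },
})
io.write(text)
```

The resulting text can be saved to a file and passed to calm dataset --from.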

Continual Learning Workflow

The typical workflow for training CALM on your own shell history:

# 1. Initialize a model with random weights (first time only)
calm init                             # Nano (default)
calm init --size micro                # Micro
calm init --size mini                 # Mini

# 2. Enable training metadata collection
export CALM_SAVE_TRAIN_DATA=1

# 3. Collect history by using the shell normally (automatic)
#    Completion candidates are stored in MNEME by calm_store

# 4. Build dataset from shell history
calm dataset

# 5. Train (runs in foreground; use 'job start calm train' for background)
calm train

# 6. Enable predictions (first time only)
calm enable

# 7. The shell automatically hot-reloads the updated weights
#    (mtime-based detection, no restart needed)

For more control over training:

calm train --epochs 5 --lr 5e-5 --optimizer adam --weight-decay 0.01

For offline training from curated text data:

# Convert text file to CTDS
calm dataset --from training_data.txt --output /tmp/train.ctds

# Train a new model from scratch
calm train --model new --size mini -d /tmp/train.ctds -o /tmp/model.cwgt --epochs 10

# Evaluate
calm evaluate -m /tmp/model.cwgt -d /tmp/train.ctds

Shell Builtin: calm dataset

calm dataset accepts these options:

Flag                Description                         Default
--max-entries / -n  Max shell history entries           0 (all)
--from / -f         Plain-text training data file       (shell history)
--output / -o       Output CTDS path                    ~/.local/share/lilush/calm/train.ctds
--view / -V         View dataset contents
--ds                CTDS file path (for --view)
--index / -i        Start index for --view (1-based)    1
--count / -c        Number of sequences for --view
--hist-frames / -H  History frames per sequence         5
--comp-frames       Completion candidates per sequence  3
--max-dup           Max duplicate commands              3
--min-cmd-len       Min command length in chars         2
--include-failed    Include non-zero exit commands      false
--include-trivial   Include trivial commands            false

calm dataset runs in the foreground. Use job start calm dataset ... for background execution.

Pipeline Module

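The calm.pipeline module (src/calm/calm/pipeline.lua) provides shared utilities used by both tool scripts and shell builtins, including the write_ctds() and parse_text_dataset() functions and the TEMPLATES table referenced earlier in this document.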

Lua API

Dataset operations

local calm = require("calm")

-- Load a pre-tokenized binary dataset
local dataset, err = calm.load_dataset("/path/to/train.ctds")
if not dataset then error(err) end

-- Get count
print(dataset:count())

-- Get a batch of sequences (0-indexed batch number)
local batch_idx, batch_size = 0, 32
local batch, cmd_positions = dataset:batch(batch_idx, batch_size)
-- batch = { {token_ids...}, {token_ids...}, ... }
-- cmd_positions = { 15, 22, 18, ... }

-- Shuffle dataset (in-place, for epoch randomization)
dataset:shuffle()

dataset:close()
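
The calls above combine naturally into a per-epoch loop. The sketch below assumes that batch numbers run from 0 to ceil(count / batch_size) - 1 and that the final batch may be shorter than batch_size; neither is confirmed here. run_epoch is a hypothetical helper, and the mock object only mimics the count/shuffle/batch interface so the example runs without a CTDS file.

```lua
-- Run one epoch over any object exposing count(), shuffle() and batch().
-- `step` is called once per sequence with (tokens, cmd_position).
local function run_epoch(dataset, batch_size, step)
    dataset:shuffle()
    local n_batches = math.ceil(dataset:count() / batch_size)  -- assumption
    local seen = 0
    for b = 0, n_batches - 1 do
        local seqs, cmd_pos = dataset:batch(b, batch_size)
        for i = 1, #seqs do
            step(seqs[i], cmd_pos[i])
            seen = seen + 1
        end
    end
    return seen
end

-- Hypothetical in-memory stand-in for a loaded dataset.
local mock = { data = { { 1, 2 }, { 3 }, { 4, 5, 6 }, { 7 }, { 8, 9 } } }
function mock:count() return #self.data end
function mock:shuffle() end  -- no-op for the example
function mock:batch(idx, size)
    local seqs, cmd_pos = {}, {}
    for i = idx * size + 1, math.min((idx + 1) * size, #self.data) do
        seqs[#seqs + 1] = self.data[i]
        cmd_pos[#cmd_pos + 1] = 0
    end
    return seqs, cmd_pos
end

local total_tokens = 0
local seen = run_epoch(mock, 2, function(seq)
    total_tokens = total_tokens + #seq
end)
print(seen, total_tokens)
```

With a real dataset, the object returned by calm.load_dataset() would take the place of mock, and step would feed each sequence to the training code.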