CALM Tokenizer

Overview

This document specifies the byte-level tokenizer for CALM, Lilush's compact language model system. The tokenizer converts raw input and session context into token sequences for domain-specific Mamba SSM models.

Related: CALM for the model architecture.

Design Principles

Vocabulary

Layout

+------------------------------------------------------+
|  Byte Tokens (IDs 0 - 255)                           |
|  Raw byte values 0x00 - 0xFF.                        |
|  Token ID = byte value (identity mapping).           |
+------------------------------------------------------+
|  Special Tokens (IDs 256 - 277)                      |
|  Model operation and context framing.                |
|  Never produced by raw byte encoding.                |
+------------------------------------------------------+
|  Reserved (IDs 278 - 319)                            |
|  Headroom for future domain tokens                   |
|  (e.g. <LANG>, <CURSOR> for editor domain).          |
+------------------------------------------------------+

Total vocabulary size: 320.

Byte tokens (0-255)

Identity mapping. Token ID = byte value. Encoding is type widening (uint8 -> uint16). Decoding is truncation (uint16 -> uint8). No lookup tables, no trie.

UTF-8 codepoints are naturally represented as sequences of byte tokens. A multi-byte codepoint is emitted as consecutive byte tokens, one per encoded byte. For example, "é" (U+00E9, UTF-8 bytes 0xC3 0xA9) becomes tokens 195 and 169.

All shell structure is represented as raw bytes: operators (|, &&, ||, ;), quoting delimiters, flag prefixes (-, --), whitespace (space = token 32, tab = token 9), path components, and all command words. The model learns these as byte patterns from training data.

Special tokens (256-277)

22 tokens for model operation and context framing:

ID    Token    Role
256   <PAD>    Padding
257   <BOS>    Beginning of sequence
258   <EOS>    End of sequence
259   <ATN>    Attention boundary
260   <CWD>    Current working directory frame
261   <GIT>    Git branch/status frame
262   <HIST>   Historical command frame
263   <EXIT>   Exit code subtoken (within HIST frame)
264   <CMD>    Current partial input / prediction target
265   <ENV>    Environment hint frame
266   <COMP>   Command completion candidates frame
267   <QUERY>  Input query frame (freeform question)
268   <NEXT>   Within-frame item separator
269   <END>    End of frame
270   <WORD>   Headword frame (dictionary domain)
271   <POS>    Part-of-speech subtoken (within WORD)
272   <NOTE>   Etymology/grammar notes (within WORD)
273   <IPA>    Pronunciation (within WORD)
274   <DEF>    Definition frame
275   <QUOTE>  Quotation frame
276   <BY>     Attribution subtoken (within QUOTE)
277   <REF>    Reference frame (RFC, standard, citation)

Token categories:

<NEXT> separates items within list frames (<COMP>). It replaces the space-separated encoding used in the previous tokenizer, making candidate boundaries unambiguous even when items contain spaces.

<ATN> marks the structural boundary between context frames and command input. It appears exactly once per sequence.
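
The exactly-once rule for <ATN> is easy to check mechanically. The following is a minimal sketch, not part of the CALM API: the function name sketch_find_atn and the convention of returning -1 for an invalid sequence are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_ATN = 259 };  /* <ATN> ID from the special token table */

/* Hypothetical helper: returns the index of the single <ATN> token,
 * or -1 if the sequence contains no <ATN> or more than one. */
long sketch_find_atn(const uint16_t *tokens, size_t n)
{
    long found = -1;
    for (size_t i = 0; i < n; i++) {
        if (tokens[i] != SKETCH_ATN)
            continue;
        if (found >= 0)
            return -1;          /* second <ATN>: sequence is malformed */
        found = (long)i;
    }
    return found;
}
```

Everything before the returned index is context framing; everything after it is command input.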

Encoding and Decoding

Encoding (text -> tokens)

function encode(text):
    tokens = []
    for pos = 0..len(text)-1:
        tokens.append(text[pos])   // byte value IS the token ID
    return tokens

That is the entire encoder. Special tokens are never produced by encoding raw text -- they are inserted programmatically by build_sequence() and calm_tokenize_frame().
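
In C, the pseudocode above reduces to a widening loop. This is a minimal sketch rather than the actual calm_tokenize() implementation: the name sketch_encode and the -1 return on a too-small buffer are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widens each input byte to a uint16_t token ID.
 * Returns the number of tokens written, or -1 if out_tokens is too small. */
int sketch_encode(const char *text, size_t text_len,
                  uint16_t *out_tokens, size_t max_tokens)
{
    if (text_len > max_tokens)
        return -1;
    for (size_t i = 0; i < text_len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended
         * on platforms where plain char is signed. */
        out_tokens[i] = (uint16_t)(unsigned char)text[i];
    }
    return (int)text_len;
}
```

The cast through unsigned char is the one detail worth care: without it, a UTF-8 continuation byte such as 0xA9 could widen to 0xFFA9 instead of 169.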

Decoding (tokens -> text)

function decode(tokens):
    text = []
    for each token in tokens:
        if token < 256:
            text.append(byte(token))    // identity
        else:
            skip                        // special/frame token
    return text
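
A C rendering of the decoder might look like the sketch below. The name sketch_decode, the -1 error return, and the choice not to NUL-terminate the output are assumptions, not the behavior of calm_detokenize().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies byte tokens (IDs 0-255) to out_text and
 * skips every ID >= 256. Returns the number of bytes written, or -1 if
 * out_text is too small. The output is not NUL-terminated. */
int sketch_decode(const uint16_t *tokens, size_t num_tokens,
                  char *out_text, size_t max_text)
{
    unsigned char *out = (unsigned char *)out_text;
    size_t n = 0;
    for (size_t i = 0; i < num_tokens; i++) {
        if (tokens[i] >= 256)
            continue;                   /* special/frame token: no text */
        if (n >= max_text)
            return -1;
        out[n++] = (unsigned char)tokens[i];   /* identity mapping */
    }
    return (int)n;
}
```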

Frame encoding

Simple frames (CWD, GIT, CMD, ENV) use calm_tokenize_frame():

calm_tokenize_frame(CWD, "/home/user") ->
    [CWD, '/', 'h', 'o', 'm', 'e', '/', 'u', 's', 'e', 'r', END]
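
The simple-frame layout can be sketched in a few lines of C. This is an illustration of the frame_token, content bytes, <END> shape only; sketch_tokenize_frame, its error convention, and the SKETCH_ constants are hypothetical and do not describe the real calm_tokenize_frame() internals.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_CWD = 260, SKETCH_END = 269 };  /* IDs from the token table */

/* Hypothetical helper: writes [frame_token] [content bytes] [END].
 * Returns the token count, or -1 if out_tokens is too small. */
int sketch_tokenize_frame(uint16_t frame_token,
                          const char *content, size_t content_len,
                          uint16_t *out_tokens, size_t max_tokens)
{
    /* Two extra slots are needed for the frame token and <END>. */
    if (max_tokens < 2 || content_len > max_tokens - 2)
        return -1;
    size_t n = 0;
    out_tokens[n++] = frame_token;
    for (size_t i = 0; i < content_len; i++)
        out_tokens[n++] = (uint16_t)(unsigned char)content[i];
    out_tokens[n++] = SKETCH_END;
    return (int)n;
}
```

For the "/home/user" example above, the sketch produces 12 tokens: one frame token, ten byte tokens, and <END>.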

Frames with subtokens (HIST with EXIT, COMP with NEXT, WORD with POS/NOTE/IPA, QUOTE with BY, REF with NOTE) embed the subtoken inline before the frame's <END>. Subtokens within a frame are optional — absent subtokens are omitted entirely:

<HIST> git diff --stat <EXIT> 0 <END>
<WORD> window <POS> n. <NOTE> OE. windowe <IPA> ˈwɪndoʊ <END>
<WORD> cat <POS> n. <END>
<DEF> An opening in the wall for the admission of light and air <END>
<QUOTE> Never say never. <END>
<QUOTE> Then to come, in spite of sorrow <BY> Milton <END>
<QUERY> what port does redis use <END>
<REF> RFC 8446, TLS 1.3 <NOTE> obsoletes RFC 5246 <END>
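
As an illustration of an optional subtoken, the sketch below builds a HIST frame with or without <EXIT>. It assumes the exit code is written as ASCII decimal digits, which matches the examples above but is not stated explicitly; sketch_hist_frame and its has_exit parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { SKETCH_HIST = 262, SKETCH_EXIT = 263, SKETCH_END = 269 };

/* Hypothetical helper: writes <HIST> cmd [<EXIT> digits] <END>.
 * When has_exit is 0 the <EXIT> subtoken is omitted entirely.
 * Returns the token count, or -1 if out is too small. */
int sketch_hist_frame(const char *cmd, int exit_code, int has_exit,
                      uint16_t *out, size_t max_tokens)
{
    char digits[16];
    /* Assumption: exit codes are encoded as ASCII decimal digits. */
    int dlen = has_exit ? snprintf(digits, sizeof digits, "%d", exit_code) : 0;
    size_t cmd_len = strlen(cmd);
    size_t need = 1 + cmd_len + (has_exit ? 1 + (size_t)dlen : 0) + 1;
    if (need > max_tokens)
        return -1;

    size_t n = 0;
    out[n++] = SKETCH_HIST;
    for (size_t i = 0; i < cmd_len; i++)
        out[n++] = (uint16_t)(unsigned char)cmd[i];
    if (has_exit) {
        out[n++] = SKETCH_EXIT;
        for (int i = 0; i < dlen; i++)
            out[n++] = (uint16_t)(unsigned char)digits[i];
    }
    out[n++] = SKETCH_END;
    return (int)n;
}
```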

List frames (COMP) are built by the Lua build_sequence() function, which places a <NEXT> separator between items. Repeated frames (such as HIST) and list frames are capped at 15 items; excess entries are silently dropped, and history frames keep the newest entries. The core of the COMP construction:

seq[#seq + 1] = COMP
for i, c in ipairs(candidates) do
    if i > 1 then seq[#seq + 1] = NEXT end
    local ctoks = core.tokenize(c)
    for _, t in ipairs(ctoks) do seq[#seq + 1] = t end
end
seq[#seq + 1] = END
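
The same construction, including the 15-item cap, can be sketched in C. The text does not say which COMP entries are dropped when the cap is exceeded, so keeping the first 15 is an assumption here; sketch_comp_frame and the SKETCH_ constants are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SKETCH_COMP = 266, SKETCH_NEXT = 268, SKETCH_END = 269,
       SKETCH_MAX_ITEMS = 15 };

/* Hypothetical helper: writes <COMP> item <NEXT> item ... <END>.
 * Returns the token count, or -1 if out is too small. */
int sketch_comp_frame(const char *const *items, size_t count,
                      uint16_t *out, size_t max_tokens)
{
    /* Assumption: excess candidates beyond the cap are dropped from the end. */
    if (count > SKETCH_MAX_ITEMS)
        count = SKETCH_MAX_ITEMS;
    if (max_tokens < 2)
        return -1;

    size_t n = 0;
    out[n++] = SKETCH_COMP;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(items[i]);
        size_t sep = (i > 0) ? 1 : 0;
        /* Keep one slot free for the closing <END>. */
        if (len + sep + 1 > max_tokens - n)
            return -1;
        if (sep)
            out[n++] = SKETCH_NEXT;
        for (size_t j = 0; j < len; j++)
            out[n++] = (uint16_t)(unsigned char)items[i][j];
    }
    out[n++] = SKETCH_END;
    return (int)n;
}
```

Because <NEXT> is a token outside the byte range, a candidate such as "b c" keeps its embedded space without any quoting.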

Shell Context Window Format

<BOS> <CWD> /home/user/lilush <END>
      <GIT> main <END>
      <HIST> git diff --stat <EXIT> 0 <END>
      <HIST> git status <EXIT> 0 <END>
      <COMP> commit <NEXT> checkout <NEXT> cherry-pick <NEXT> clone <END>
      <ENV> venv:myproject <END>
      <ATN>
      <CMD> git com...

All content between frame tokens is byte-encoded. Each frame is: frame_token [byte content] <END>.

For COMP frames, individual candidates are separated by <NEXT> tokens.

During training, the full sequence ends with <EOS>:

<BOS> [context frames] <ATN> <CMD> [full command bytes] <EOS>

Context budget

With l_max = 768, byte-level encoding has comfortable headroom:

Component              Typical tokens   Notes
CWD frame              15-35            Path bytes + frame tokens
Git frame              8-20
History (2-3 cmds)     100-180          Biggest consumer
COMP frame             30-60            With NEXT separators
ENV frame              15-25            Optional
ATN + CMD + overhead   5
Context total          ~193-365
Remaining              ~403-575         For input + generation

Stop Conditions

Stop conditions are stored in the model's CWGT metadata as a space-separated string of patterns. EOS and PAD always stop generation (hardcoded). The metadata specifies additional domain-specific stops.

For example, the shell domain uses: | ; && ||

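Patterns can be single-byte or multi-byte. Single-byte patterns match directly against the current generated byte. Multi-byte patterns (like &&) match when the previously generated byte and the current byte together form the pattern; the first byte, which has already been emitted, is then rolled back so that no part of the pattern remains in the output.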

Stop conditions are parsed at model load time into a calm_stop_conditions_t struct and checked during autoregressive generation when use_stop_conditions is enabled (default). Raw mode (calm generate --raw) disables stop condition checking.
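
The parse-then-check flow can be sketched as follows. This is not the layout of calm_stop_conditions_t, which is not shown in this document: sketch_stops_t, the 16-pattern limit, and the decision to ignore patterns longer than two bytes are all assumptions. The model's output is simulated by feeding bytes from a string.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_STOPS 16

typedef struct {
    char   pat[SKETCH_MAX_STOPS][3];  /* 1- or 2-byte patterns, NUL-terminated */
    size_t count;
} sketch_stops_t;                     /* hypothetical structure */

/* Splits a space-separated spec such as "| ; && ||" into patterns. */
void sketch_parse_stops(sketch_stops_t *s, const char *spec)
{
    s->count = 0;
    while (*spec && s->count < SKETCH_MAX_STOPS) {
        while (*spec == ' ')
            spec++;
        size_t len = strcspn(spec, " ");
        if (len == 1 || len == 2) {   /* assumption: longer patterns ignored */
            memcpy(s->pat[s->count], spec, len);
            s->pat[s->count][len] = '\0';
            s->count++;
        }
        spec += len;
    }
}

/* Returns 1 if byte b ends generation. On a two-byte match the previously
 * emitted byte is removed from out by decrementing *len (rollback). */
int sketch_should_stop(const sketch_stops_t *s, const char *out,
                       size_t *len, char b)
{
    for (size_t i = 0; i < s->count; i++) {
        const char *p = s->pat[i];
        if (p[1] == '\0') {
            if (p[0] == b)
                return 1;                       /* single-byte match */
        } else if (*len > 0 && out[*len - 1] == p[0] && p[1] == b) {
            (*len)--;                           /* roll back first byte */
            return 1;
        }
    }
    return 0;
}

/* Feeds simulated model output through the stop check; returns kept length. */
size_t sketch_filter(const sketch_stops_t *s, const char *generated,
                     char *out, size_t cap)
{
    size_t len = 0;
    for (size_t i = 0; generated[i] != '\0' && len < cap; i++) {
        if (sketch_should_stop(s, out, &len, generated[i]))
            break;
        out[len++] = generated[i];
    }
    return len;
}
```

With the shell stops, simulated output "ls -la && pwd" is cut to "ls -la ": the first & is emitted, the second completes the && pattern, and the rollback removes the first.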

C Implementation

Data structures

#define CALM_VOCAB_SIZE   320
#define CALM_BYTE_TOKENS  256
#define CALM_SPECIAL_BASE 256
#define CALM_MAX_TOKENS   1024

typedef struct {
    const char *special_names[22]; /* decode table for IDs 256-277 */
} calm_tokenizer_t;
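
One plausible way to populate and query such a decode table is sketched below. The names follow the special token table in this document; the array sketch_special_names and the lookup function sketch_token_name are hypothetical and are not part of the C API.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SPECIAL_BASE  256
#define SKETCH_SPECIAL_COUNT 22

/* Index i holds the name of token ID 256 + i. */
static const char *const sketch_special_names[SKETCH_SPECIAL_COUNT] = {
    "<PAD>",  "<BOS>",  "<EOS>",   "<ATN>",  "<CWD>",  "<GIT>",
    "<HIST>", "<EXIT>", "<CMD>",   "<ENV>",  "<COMP>", "<QUERY>",
    "<NEXT>", "<END>",  "<WORD>",  "<POS>",  "<NOTE>", "<IPA>",
    "<DEF>",  "<QUOTE>", "<BY>",   "<REF>",
};

/* Hypothetical lookup: returns the display name of a special token,
 * or NULL for byte tokens (0-255) and reserved IDs (278 and above). */
const char *sketch_token_name(uint16_t id)
{
    if (id < SKETCH_SPECIAL_BASE ||
        id >= SKETCH_SPECIAL_BASE + SKETCH_SPECIAL_COUNT)
        return NULL;
    return sketch_special_names[id - SKETCH_SPECIAL_BASE];
}
```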

C API

// Initialize tokenizer (populate special_names decode table)
int calm_tokenizer_init(calm_tokenizer_t *tok);

// Encode text to byte token IDs (identity mapping)
int calm_tokenize(const calm_tokenizer_t *tok,
                  const char *text, size_t text_len,
                  uint16_t *out_tokens, size_t max_tokens);

// Encode a context frame: [frame_token] [byte content] [END]
int calm_tokenize_frame(const calm_tokenizer_t *tok,
                        uint16_t frame_token,
                        const char *content, size_t content_len,
                        uint16_t *out_tokens, size_t max_tokens);

// Decode token IDs to text (bytes emitted, specials skipped)
int calm_detokenize(const calm_tokenizer_t *tok,
                    const uint16_t *tokens, size_t num_tokens,
                    char *out_text, size_t max_text);

Lua API

local calm = require("calm")

-- Tokenize raw input (byte-level identity mapping)
local tokens = calm.tokenize("git commit -m fix")
-- Returns {103, 105, 116, 32, 99, 111, ...} (byte values)

-- Build input sequence using a model's embedded template
local model = calm.load_model("~/.local/share/lilush/calm/shell.cwgt")
local seq = calm.build_sequence(model, {
    cwd = "/home/user/lilush",
    git = "main+3",
    history = {
        { cmd = "git diff --stat", exit = 0 },
        { cmd = "git status",      exit = 0 },
    },
    completions = { "commit", "checkout", "cherry-pick", "clone" },
    input = "git com",
})

-- Or build using an explicit template spec string
local pipeline = require("calm.pipeline")
local seq2 = calm.build_sequence(pipeline.TEMPLATES.shell, {
    input = "git com",
})

-- Detokenize
local text = calm.detokenize(tokens)

-- Constants
calm.PAD       -- 256
calm.BOS       -- 257
calm.EOS       -- 258
calm.ATN       -- 259
calm.CWD       -- 260
calm.GIT       -- 261
calm.HIST      -- 262
calm.EXIT      -- 263
calm.CMD       -- 264
calm.ENV       -- 265
calm.COMP      -- 266
calm.QUERY     -- 267
calm.NEXT      -- 268
calm.END_TOKEN -- 269
calm.WORD      -- 270
calm.POS       -- 271
calm.NOTE      -- 272
calm.IPA       -- 273
calm.DEF       -- 274
calm.QUOTE     -- 275
calm.BY        -- 276
calm.REF       -- 277

Weight File Format (CWGT v5)

For shell builtins, file locations, and other shell integration details, see CALM — Shell Integration.

Fixed header (48 bytes, packed, little-endian):
  magic:            4 bytes    "CWGT"
  arch_version:     uint16     5
  flags:            uint16     bit 0: tied, bit 1: has_ewc
  vocab_size:       uint16     vocabulary size (320)
  d_model:          uint16     embedding dimension
  n_layers:         uint8      number of blocks
  ffn_expand:       uint8      FFN expansion factor
  expand:           uint8      Mamba expand factor (offset 14)
  d_state:          uint8      Mamba d_state (offset 15)
  l_max:            uint16     maximum sequence length (768)
  param_count:      uint32     total parameter count
  def_temperature:  uint16     default temperature x 1000 (0 = unset)
  def_top_k:        uint16     default top-k (0 = unset)
  def_top_p:        uint16     default top-p x 1000 (0 = unset)
  def_min_p:        uint16     default min-p x 1000 (0 = unset)
  def_max_tokens:   uint16     default max generation tokens
  def_candidates:   uint8      default candidate count
  d_conv:           uint8      Mamba d_conv (offset 33)
  meta_size:        uint32     byte count of metadata blob (0 = none)
  dt_rank:          uint8      Mamba dt_rank (offset 38)
  reserved:         9 bytes    zero-filled

Metadata blob (meta_size bytes, 3 newline-terminated lines):
  <domain_name>\n              e.g. "shell"
  <template_spec>\n            e.g. "BOS;CWD:cwd;GIT:git;...;ATN;CMD:input"
  <stop_conditions>\n          e.g. "| ; && ||"

EWC data (if has_ewc flag set):
  fisher_diagonal:  [param_count x float32]
  anchor_weights:   [param_count x float32]

Weights:
  [param_count x float32]

The metadata blob embeds the model's prompt template, stop conditions, and domain name directly in the weight file, enabling self-describing models that carry their inference configuration.
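
Reading the three-line metadata blob amounts to splitting on newlines. The sketch below assumes fixed-size destination buffers whose sizes are arbitrary; sketch_meta_t and sketch_parse_meta are hypothetical names, not the loader's actual interface.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char domain[64];          /* buffer sizes are arbitrary for this sketch */
    char template_spec[256];
    char stops[64];
} sketch_meta_t;

/* Hypothetical helper: splits a blob of three newline-terminated lines
 * into domain name, template spec, and stop conditions.
 * Returns 0 on success, -1 if a line is missing or too long. */
int sketch_parse_meta(const char *blob, size_t size, sketch_meta_t *m)
{
    char  *dst[3] = { m->domain, m->template_spec, m->stops };
    size_t cap[3] = { sizeof m->domain, sizeof m->template_spec,
                      sizeof m->stops };
    size_t pos = 0;
    for (int i = 0; i < 3; i++) {
        const char *start = blob + pos;
        const char *nl = memchr(start, '\n', size - pos);
        if (nl == NULL)
            return -1;                    /* line is not newline-terminated */
        size_t len = (size_t)(nl - start);
        if (len >= cap[i])
            return -1;
        memcpy(dst[i], start, len);
        dst[i][len] = '\0';
        pos += len + 1;                   /* skip past the newline */
    }
    return 0;
}
```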

Weight files with arch_version < 5 are rejected.
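
The fixed header layout and the version check can be expressed as a packed struct with compile-time offset checks. This is a sketch under stated assumptions: the type and function names are hypothetical, #pragma pack is a compiler extension (supported by GCC, Clang, and MSVC), and multi-byte fields are read byte by byte so the check does not depend on host endianness. Returning -1.0 for an unset temperature is likewise an illustrative choice.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
typedef struct {
    char     magic[4];          /* "CWGT" */
    uint16_t arch_version;
    uint16_t flags;
    uint16_t vocab_size;
    uint16_t d_model;
    uint8_t  n_layers;
    uint8_t  ffn_expand;
    uint8_t  expand;
    uint8_t  d_state;
    uint16_t l_max;
    uint32_t param_count;
    uint16_t def_temperature;   /* x 1000, 0 = unset */
    uint16_t def_top_k;
    uint16_t def_top_p;
    uint16_t def_min_p;
    uint16_t def_max_tokens;
    uint8_t  def_candidates;
    uint8_t  d_conv;
    uint32_t meta_size;
    uint8_t  dt_rank;
    uint8_t  reserved[9];
} sketch_cwgt_header_t;         /* hypothetical mirror of the on-disk header */
#pragma pack(pop)

/* Offsets quoted in the format description, verified at compile time. */
_Static_assert(sizeof(sketch_cwgt_header_t) == 48, "header must be 48 bytes");
_Static_assert(offsetof(sketch_cwgt_header_t, expand) == 14, "expand offset");
_Static_assert(offsetof(sketch_cwgt_header_t, d_state) == 15, "d_state offset");
_Static_assert(offsetof(sketch_cwgt_header_t, d_conv) == 33, "d_conv offset");
_Static_assert(offsetof(sketch_cwgt_header_t, dt_rank) == 38, "dt_rank offset");

/* Reads a little-endian uint16 regardless of host byte order. */
uint16_t sketch_rd_u16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 1 if buf starts with a plausible CWGT header, 0 otherwise. */
int sketch_header_ok(const uint8_t *buf, size_t len)
{
    if (len < sizeof(sketch_cwgt_header_t))
        return 0;
    if (memcmp(buf, "CWGT", 4) != 0)
        return 0;
    if (sketch_rd_u16le(buf + offsetof(sketch_cwgt_header_t, arch_version)) < 5)
        return 0;               /* arch_version < 5 is rejected */
    return 1;
}

/* Decodes the x1000 fixed-point default temperature; -1.0 means unset. */
double sketch_default_temperature(const uint8_t *buf)
{
    uint16_t raw = sketch_rd_u16le(
        buf + offsetof(sketch_cwgt_header_t, def_temperature));
    return raw == 0 ? -1.0 : raw / 1000.0;
}
```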