This document specifies the byte-level tokenizer for CALM, Lilush's compact language model system. The tokenizer converts raw input and session context into token sequences for domain-specific Mamba SSM models.
Related: CALM for the model architecture.
Byte-level identity mapping: Token ID = byte value. No trie, no lookup tables, no multi-tier priority. Encoding is a trivial byte-emit loop.
Fixed vocabulary: The vocabulary never changes. No dynamic tier, no vocabulary versioning, no retokenize/retrain cycles.
Pre-expansion: The tokenizer operates on raw user input before any shell processing. It sees exactly what the user has typed.
Special tokens are structural: Special tokens (256-277) frame the
context window. They are inserted programmatically by
build_sequence(), never produced by encoding raw text.
+------------------------------------------------------+
| Byte Tokens (IDs 0 - 255) |
| Raw byte values 0x00 - 0xFF. |
| Token ID = byte value (identity mapping). |
+------------------------------------------------------+
| Special Tokens (IDs 256 - 277) |
| Model operation and context framing. |
| Never produced by raw byte encoding. |
+------------------------------------------------------+
| Reserved (IDs 278 - 319) |
| Headroom for future domain tokens |
| (e.g. <LANG>, <CURSOR> for editor domain). |
+------------------------------------------------------+
Total vocabulary size: 320.
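The three regions of the layout can be told apart with plain range checks. The following is a minimal sketch, not part of the CALM API; `token_class_t` and `classify_token` are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Range boundaries taken from the vocabulary layout above. */
enum {
    VOCAB_BYTE_END    = 256, /* IDs 0-255: byte tokens */
    VOCAB_SPECIAL_END = 278, /* IDs 256-277: special tokens */
    VOCAB_TOTAL       = 320  /* IDs 278-319: reserved */
};

typedef enum {
    TOKEN_BYTE,
    TOKEN_SPECIAL,
    TOKEN_RESERVED,
    TOKEN_INVALID
} token_class_t;

/* Hypothetical helper: map a token ID to its vocabulary region. */
static token_class_t classify_token(uint16_t id)
{
    if (id < VOCAB_BYTE_END)    return TOKEN_BYTE;
    if (id < VOCAB_SPECIAL_END) return TOKEN_SPECIAL;
    if (id < VOCAB_TOTAL)       return TOKEN_RESERVED;
    return TOKEN_INVALID;       /* outside the 320-entry vocabulary */
}
```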
Identity mapping. Token ID = byte value. Encoding is type widening
(uint8 -> uint16). Decoding is truncation (uint16 -> uint8). No
lookup tables, no trie.
UTF-8 codepoints are naturally represented as sequences of byte tokens. Multi-byte codepoints are emitted as consecutive byte tokens:
ASCII a (0x61) -> 1 byte token
Cyrillic д (U+0434, bytes 0xD0 0xB4) -> 2 byte tokens
CJK 中 (U+4E2D, bytes 0xE4 0xB8 0xAD) -> 3 byte tokens
Emoji 🦊 (U+1F98A, bytes 0xF0 0x9F 0xA6 0x8A) -> 4 byte tokens
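The tokenizer itself never inspects UTF-8 structure, but the token counts listed above follow directly from the lead byte of each codepoint. A small sketch, assuming well-formed UTF-8 input; `utf8_token_count` is a hypothetical helper and not a CALM function.

```c
/* Number of byte tokens a codepoint occupies, derived from its UTF-8
 * lead byte. Returns 0 for a continuation byte or an invalid lead byte. */
static int utf8_token_count(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```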
All shell structure is represented as raw bytes: operators (|, &&,
||, ;), quoting delimiters, flag prefixes (-, --), whitespace
(space = token 32, tab = token 9), path components, and all command words.
The model learns these as byte patterns from training data.
22 tokens for model operation and context framing:
| ID | Token | Role |
|---|---|---|
| 256 | <PAD> | Padding |
| 257 | <BOS> | Beginning of sequence |
| 258 | <EOS> | End of sequence |
| 259 | <ATN> | Attention boundary |
| 260 | <CWD> | Current working directory frame |
| 261 | <GIT> | Git branch/status frame |
| 262 | <HIST> | Historical command frame |
| 263 | <EXIT> | Exit code subtoken (within HIST frame) |
| 264 | <CMD> | Current partial input / prediction target |
| 265 | <ENV> | Environment hint frame |
| 266 | <COMP> | Command completion candidates frame |
| 267 | <QUERY> | Input query frame (freeform question) |
| 268 | <NEXT> | Within-frame item separator |
| 269 | <END> | End of frame |
| 270 | <WORD> | Headword frame (dictionary domain) |
| 271 | <POS> | Part-of-speech subtoken (within WORD) |
| 272 | <NOTE> | Notes subtoken (etymology/grammar within WORD, annotation within REF) |
| 273 | <IPA> | Pronunciation (within WORD) |
| 274 | <DEF> | Definition frame |
| 275 | <QUOTE> | Quotation frame |
| 276 | <BY> | Attribution subtoken (within QUOTE) |
| 277 | <REF> | Reference frame (RFC, standard, citation) |
Token categories:
Core tokens: <PAD>, <BOS>, <EOS>, <ATN> — sequence-level
control tokens
Frame start tokens: <CWD>, <GIT>, <HIST>, <CMD>, <ENV>,
<COMP>, <QUERY>, <WORD>, <DEF>, <QUOTE>, <REF> — begin a
context frame
Subtokens (within frames): <EXIT>, <NEXT>, <POS>, <NOTE>,
<IPA>, <BY> — mark substructure inside a frame
Frame end token: <END> — terminates a frame
<NEXT> separates items within list frames (<COMP>). It
replaces the space-separated encoding used in the previous tokenizer,
making candidate boundaries unambiguous even when items contain spaces.
<ATN> marks the structural boundary between context frames and command
input. It appears exactly once per sequence.
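The "exactly once" rule for `<ATN>` is easy to check mechanically. The sketch below is illustrative only: `sketch_sequence_ok` is a hypothetical name, and the requirement that a sequence start with `<BOS>` is an assumption based on the example sequences in this document rather than a stated rule.

```c
#include <stddef.h>
#include <stdint.h>

enum { SEQ_BOS = 257, SEQ_ATN = 259 };

/* Returns 1 if the sequence starts with <BOS> (assumed) and contains
 * exactly one <ATN>, otherwise 0. */
static int sketch_sequence_ok(const uint16_t *seq, size_t len)
{
    size_t atn_count = 0;

    if (len == 0 || seq[0] != SEQ_BOS) return 0;
    for (size_t i = 0; i < len; i++) {
        if (seq[i] == SEQ_ATN) atn_count++;
    }
    return atn_count == 1;
}
```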
function encode(text):
    tokens = []
    for pos = 0..len(text)-1:
        tokens.append(text[pos])  // byte value IS the token ID
    return tokens
That is the entire encoder. Special tokens are never produced by
encoding raw text -- they are inserted programmatically by
build_sequence() and calm_tokenize_frame().
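In C, the encoder pseudocode reduces to a widening copy. This is a minimal sketch with the same parameter shape as `calm_tokenize()`, minus the tokenizer handle; `sketch_encode` is a hypothetical name, and returning -1 when the output buffer is too small is an assumption, since the actual overflow behavior is not specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen each input byte to a 16-bit token ID.
 * Returns the number of tokens written, or -1 (assumed convention)
 * if out_tokens cannot hold text_len tokens. */
static int sketch_encode(const char *text, size_t text_len,
                         uint16_t *out_tokens, size_t max_tokens)
{
    if (text_len > max_tokens) return -1;
    for (size_t i = 0; i < text_len; i++) {
        /* Go through unsigned char so bytes >= 0x80 do not sign-extend. */
        out_tokens[i] = (uint16_t)(unsigned char)text[i];
    }
    return (int)text_len;
}
```

The cast through `unsigned char` matters on platforms where `char` is signed: without it, byte 0xD0 would widen to 0xFFD0 instead of token 208.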
function decode(tokens):
    text = []
    for each token in tokens:
        if token < 256:
            text.append(byte(token))  // identity
        else:
            skip                      // special/frame token
    return text
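A C rendering of the decoder might look as follows. It is a sketch under assumptions: `sketch_decode` is a hypothetical name, the output is not NUL-terminated, and -1 on a full buffer is an assumed convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy byte tokens to out_text and skip every token >= 256.
 * Returns the number of bytes written, or -1 (assumed convention)
 * if out_text is too small. */
static int sketch_decode(const uint16_t *tokens, size_t num_tokens,
                         char *out_text, size_t max_text)
{
    size_t n = 0;

    for (size_t i = 0; i < num_tokens; i++) {
        if (tokens[i] >= 256) continue;  /* special/frame token */
        if (n >= max_text) return -1;
        out_text[n++] = (char)(unsigned char)tokens[i];
    }
    return (int)n;
}
```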
Simple frames (CWD, GIT, CMD, ENV) use calm_tokenize_frame():
calm_tokenize_frame(CWD, "/home/user") ->
[CWD, '/', 'h', 'o', 'm', 'e', '/', 'u', 's', 'e', 'r', END]
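The framing shown above amounts to one frame token, the byte-encoded content, and `<END>`. The following sketch mirrors the shape of `calm_tokenize_frame()` without the tokenizer handle; `sketch_tokenize_frame` is a hypothetical name and the -1 error return is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

enum { FRAME_CWD = 260, FRAME_END = 269 };

/* Emit [frame_token] [content bytes] [END].
 * Returns the token count, or -1 (assumed convention) on overflow. */
static int sketch_tokenize_frame(uint16_t frame_token,
                                 const char *content, size_t content_len,
                                 uint16_t *out_tokens, size_t max_tokens)
{
    size_t n = 0;

    /* Two extra slots are needed for the frame token and <END>. */
    if (max_tokens < 2 || content_len > max_tokens - 2) return -1;

    out_tokens[n++] = frame_token;
    for (size_t i = 0; i < content_len; i++) {
        out_tokens[n++] = (uint16_t)(unsigned char)content[i];
    }
    out_tokens[n++] = FRAME_END;
    return (int)n;
}
```

A simple frame therefore always costs its content length plus two tokens.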
Frames with subtokens (HIST with EXIT, COMP with NEXT, WORD with
POS/NOTE/IPA, QUOTE with BY, REF with NOTE) embed the subtoken inline
before the frame's <END>. Subtokens within a frame are optional —
absent subtokens are omitted entirely:
<HIST> git diff --stat <EXIT> 0 <END>
<WORD> window <POS> n. <NOTE> OE. windowe <IPA> ˈwɪndoʊ <END>
<WORD> cat <POS> n. <END>
<DEF> An opening in the wall for the admission of light and air <END>
<QUOTE> Never say never. <END>
<QUOTE> Then to come, in spite of sorrow <BY> Milton <END>
<QUERY> what port does redis use <END>
<REF> RFC 8446, TLS 1.3 <NOTE> obsoletes RFC 5246 <END>
List frames (COMP) are constructed in Lua by build_sequence(),
with <NEXT> separators between items. Repeated frames and list frames
are each capped at 15 items; excess entries are silently dropped
(for history frames, the newest items are kept).
seq[#seq + 1] = COMP
for i, c in ipairs(candidates) do
    if i > 1 then seq[#seq + 1] = NEXT end
    local ctoks = core.tokenize(c)
    for _, t in ipairs(ctoks) do seq[#seq + 1] = t end
end
seq[#seq + 1] = END
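For readers working on the C side, the same list-frame construction can be sketched in C, including the 15-item cap. This is illustrative only: `sketch_build_comp` is a hypothetical name, and keeping the first 15 candidates is an assumption, since the text only states which items survive for history frames.

```c
#include <stddef.h>
#include <stdint.h>

enum { LIST_COMP = 266, LIST_NEXT = 268, LIST_END = 269, LIST_MAX_ITEMS = 15 };

/* Build <COMP> item <NEXT> item ... <END> from NUL-terminated strings.
 * Returns the token count, or -1 (assumed convention) on overflow. */
static int sketch_build_comp(const char *const *items, size_t num_items,
                             uint16_t *out, size_t max_tokens)
{
    size_t n = 0;

    /* Assumption: excess candidates are dropped from the tail. */
    if (num_items > (size_t)LIST_MAX_ITEMS) num_items = LIST_MAX_ITEMS;

    if (n >= max_tokens) return -1;
    out[n++] = LIST_COMP;
    for (size_t i = 0; i < num_items; i++) {
        if (i > 0) {
            if (n >= max_tokens) return -1;
            out[n++] = LIST_NEXT;          /* separator between items */
        }
        for (const char *p = items[i]; *p != '\0'; p++) {
            if (n >= max_tokens) return -1;
            out[n++] = (uint16_t)(unsigned char)*p;
        }
    }
    if (n >= max_tokens) return -1;
    out[n++] = LIST_END;
    return (int)n;
}
```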
<BOS> <CWD> /home/user/lilush <END>
<GIT> main <END>
<HIST> git diff --stat <EXIT> 0 <END>
<HIST> git status <EXIT> 0 <END>
<COMP> commit <NEXT> checkout <NEXT> cherry-pick <NEXT> clone <END>
<ENV> venv:myproject <END>
<ATN>
<CMD> git com...
All content between frame tokens is byte-encoded. Each frame is:
frame_token [byte content] <END>.
For COMP frames, individual candidates are separated by
<NEXT> tokens.
During training, the full sequence ends with <EOS>:
<BOS> [context frames] <ATN> <CMD> [full command bytes] <EOS>
With l_max = 768, byte-level encoding has comfortable headroom:
| Component | Typical tokens | Notes |
|---|---|---|
| CWD frame | 15-35 | Path bytes + frame tokens |
| Git frame | 8-20 | |
| History (2-3 cmds) | 100-180 | Biggest consumer |
| COMP frame | 30-60 | With NEXT separators |
| ENV frame | 15-25 | Optional |
| ATN + CMD + overhead | 5 | |
| Context total | ~173-325 | Sum of the rows above |
| Remaining | ~443-595 | For input + generation |
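The per-frame figures follow from byte-level encoding: a simple frame costs its content length in bytes plus two tokens (the frame token and `<END>`). A small sketch of that arithmetic, using hypothetical helper names and the `l_max = 768` value given above:

```c
#include <stddef.h>
#include <string.h>

enum { BUDGET_L_MAX = 768 };

/* Token cost of a simple frame: frame token + content bytes + <END>. */
static size_t sketch_simple_frame_cost(const char *content)
{
    return strlen(content) + 2;
}

/* Tokens left for input and generation after `used` context tokens.
 * Negative when the context alone exceeds l_max. */
static long sketch_remaining(size_t used)
{
    return (long)BUDGET_L_MAX - (long)used;
}
```

For instance, a CWD frame for `/home/user/lilush` costs 17 + 2 = 19 tokens, inside the 15-35 range listed for that frame.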
Stop conditions are stored in the model's CWGT metadata as a space-separated string of patterns. EOS and PAD always stop generation (hardcoded). The metadata specifies additional domain-specific stops.
For example, the shell domain uses: | ; && ||
Patterns can be:
Byte patterns: literal byte sequences (e.g., |, ;, &&)
Special token names: <END>, <ATN>, etc. (angle-bracket syntax)
Single-byte patterns match directly against the current generated byte.
Multi-byte patterns (like &&) match when the previously generated byte
followed by the current byte forms the pattern; on a match, the
already-emitted first byte is rolled back.
Stop conditions are parsed at model load time into a
calm_stop_conditions_t struct and checked during autoregressive
generation when use_stop_conditions is enabled (default). Raw mode
(calm generate --raw) disables stop condition checking.
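The parsing and matching rules above can be sketched as follows. This is not the actual `calm_stop_conditions_t` implementation: all names are hypothetical, only one- and two-byte patterns are handled, special-token names such as `<END>` are skipped, and checking two-byte patterns before single-byte ones is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_STOP_MAX 16

typedef struct {
    char   bytes[SKETCH_STOP_MAX][2]; /* byte patterns of length 1 or 2 */
    size_t lens[SKETCH_STOP_MAX];
    size_t count;
} sketch_stops_t;

/* Parse a space-separated spec such as "| ; && ||". */
static void sketch_parse_stops(sketch_stops_t *s, const char *spec)
{
    s->count = 0;
    while (*spec != '\0') {
        while (*spec == ' ') spec++;
        size_t len = strcspn(spec, " ");
        if (len >= 1 && len <= 2 && s->count < SKETCH_STOP_MAX) {
            memcpy(s->bytes[s->count], spec, len);
            s->lens[s->count] = len;
            s->count++;
        }
        spec += len;
    }
}

/* prev is the previously generated byte, or -1 if there is none.
 * Returns 0 to continue, 1 to stop on the current byte, and 2 to stop
 * and roll back the previously emitted byte (two-byte match). */
static int sketch_check_stop(const sketch_stops_t *s, int prev, uint16_t cur)
{
    if (cur >= 256) return 0; /* special tokens are not modeled here */
    for (size_t i = 0; i < s->count; i++) {
        if (s->lens[i] == 2 && prev >= 0 &&
            (unsigned char)s->bytes[i][0] == (unsigned char)prev &&
            (unsigned char)s->bytes[i][1] == (unsigned char)cur) {
            return 2;
        }
    }
    for (size_t i = 0; i < s->count; i++) {
        if (s->lens[i] == 1 &&
            (unsigned char)s->bytes[i][0] == (unsigned char)cur) {
            return 1;
        }
    }
    return 0;
}
```

With the shell spec, a lone `&` continues generation, while the second `&` of `&&` stops it and signals that the first `&` should be removed from the output.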
#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint16_t */

#define CALM_VOCAB_SIZE 320
#define CALM_BYTE_TOKENS 256
#define CALM_SPECIAL_BASE 256
#define CALM_MAX_TOKENS 1024

typedef struct {
    const char *special_names[22]; /* decode table for IDs 256-277 */
} calm_tokenizer_t;

// Initialize tokenizer (populate special_names decode table)
int calm_tokenizer_init(calm_tokenizer_t *tok);

// Encode text to byte token IDs (identity mapping)
int calm_tokenize(const calm_tokenizer_t *tok,
                  const char *text, size_t text_len,
                  uint16_t *out_tokens, size_t max_tokens);

// Encode a context frame: [frame_token] [byte content] [END]
int calm_tokenize_frame(const calm_tokenizer_t *tok,
                        uint16_t frame_token,
                        const char *content, size_t content_len,
                        uint16_t *out_tokens, size_t max_tokens);

// Decode token IDs to text (bytes emitted, specials skipped)
int calm_detokenize(const calm_tokenizer_t *tok,
                    const uint16_t *tokens, size_t num_tokens,
                    char *out_text, size_t max_text);
local calm = require("calm")
-- Tokenize raw input (byte-level identity mapping)
local tokens = calm.tokenize("git commit -m fix")
-- Returns {103, 105, 116, 32, 99, 111, ...} (byte values)
-- Build input sequence using a model's embedded template
local model = calm.load_model("~/.local/share/lilush/calm/shell.cwgt")
local seq = calm.build_sequence(model, {
    cwd = "/home/user/lilush",
    git = "main+3",
    history = {
        { cmd = "git diff --stat", exit = 0 },
        { cmd = "git status", exit = 0 },
    },
    completions = { "commit", "checkout", "cherry-pick", "clone" },
    input = "git com",
})
-- Or build using an explicit template spec string
local pipeline = require("calm.pipeline")
local seq2 = calm.build_sequence(pipeline.TEMPLATES.shell, {
    input = "git com",
})
-- Detokenize
local text = calm.detokenize(tokens)
-- Constants
calm.PAD -- 256
calm.BOS -- 257
calm.EOS -- 258
calm.ATN -- 259
calm.CWD -- 260
calm.GIT -- 261
calm.HIST -- 262
calm.EXIT -- 263
calm.CMD -- 264
calm.ENV -- 265
calm.COMP -- 266
calm.QUERY -- 267
calm.NEXT -- 268
calm.END_TOKEN -- 269
calm.WORD -- 270
calm.POS -- 271
calm.NOTE -- 272
calm.IPA -- 273
calm.DEF -- 274
calm.QUOTE -- 275
calm.BY -- 276
calm.REF -- 277
For shell builtins, file locations, and other shell integration details, see CALM — Shell Integration.
Fixed header (48 bytes, packed, little-endian):
  magic:            4 bytes   "CWGT"
  arch_version:     uint16    5
  flags:            uint16    bit 0: tied, bit 1: has_ewc
  vocab_size:       uint16    vocabulary size (320)
  d_model:          uint16    embedding dimension
  n_layers:         uint8     number of blocks
  ffn_expand:       uint8     FFN expansion factor
  expand:           uint8     Mamba expand factor (offset 14)
  d_state:          uint8     Mamba d_state (offset 15)
  l_max:            uint16    maximum sequence length (768)
  param_count:      uint32    total parameter count
  def_temperature:  uint16    default temperature x 1000 (0 = unset)
  def_top_k:        uint16    default top-k (0 = unset)
  def_top_p:        uint16    default top-p x 1000 (0 = unset)
  def_min_p:        uint16    default min-p x 1000 (0 = unset)
  def_max_tokens:   uint16    default max generation tokens
  def_candidates:   uint8     default candidate count
  d_conv:           uint8     Mamba d_conv (offset 33)
  meta_size:        uint32    byte count of metadata blob (0 = none)
  dt_rank:          uint8     Mamba dt_rank (offset 38)
  reserved:         9 bytes   zero-filled
Metadata blob (meta_size bytes, 3 newline-terminated lines):
  <domain_name>\n       e.g. "shell"
  <template_spec>\n     e.g. "BOS;CWD:cwd;GIT:git;...;ATN;CMD:input"
  <stop_conditions>\n   e.g. "| ; && ||"
EWC data (if has_ewc flag set):
  fisher_diagonal:  [param_count x float32]
  anchor_weights:   [param_count x float32]
Weights:
  [param_count x float32]
The metadata blob embeds the model's prompt template, stop conditions, and domain name directly in the weight file, enabling self-describing models that carry their inference configuration.
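Splitting the metadata blob into its three lines could be done as in the sketch below. The struct, the function names, and the buffer sizes are hypothetical; the only facts taken from the format are the line order and the newline terminators.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char domain[64];          /* buffer sizes are illustrative assumptions */
    char template_spec[256];
    char stops[128];
} sketch_meta_t;

/* Copy one newline-terminated line into dst and advance *p past it.
 * Returns 0 on success, -1 if no newline is found or dst is too small. */
static int sketch_copy_line(const char **p, const char *end,
                            char *dst, size_t cap)
{
    const char *nl = memchr(*p, '\n', (size_t)(end - *p));
    size_t len;

    if (nl == NULL) return -1;
    len = (size_t)(nl - *p);
    if (len >= cap) return -1;
    memcpy(dst, *p, len);
    dst[len] = '\0';
    *p = nl + 1;
    return 0;
}

/* Parse domain name, template spec, and stop conditions, in that order. */
static int sketch_parse_meta(const char *blob, size_t meta_size,
                             sketch_meta_t *m)
{
    const char *p = blob;
    const char *end = blob + meta_size;

    if (sketch_copy_line(&p, end, m->domain, sizeof m->domain)) return -1;
    if (sketch_copy_line(&p, end, m->template_spec,
                         sizeof m->template_spec)) return -1;
    if (sketch_copy_line(&p, end, m->stops, sizeof m->stops)) return -1;
    return 0;
}
```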
Weight files with arch_version < 5 are rejected.
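A loader might validate the fixed header along these lines. This is a sketch, not the CALM loader: the names and error codes are hypothetical, only a subset of fields is read, and the byte offsets are derived from the field order and sizes listed above (they agree with the offsets annotated there).

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t arch_version, flags, vocab_size, d_model, l_max;
    uint8_t  n_layers, d_state, d_conv, dt_rank;
    uint32_t param_count, meta_size;
} sketch_header_t;

/* Little-endian readers that do not depend on host byte order. */
static uint16_t sketch_rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t sketch_rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 on bad magic, -2 if arch_version < 5. */
static int sketch_parse_header(const uint8_t buf[48], sketch_header_t *h)
{
    if (memcmp(buf, "CWGT", 4) != 0) return -1;
    h->arch_version = sketch_rd16(buf + 4);
    if (h->arch_version < 5) return -2;  /* older files are rejected */
    h->flags       = sketch_rd16(buf + 6);
    h->vocab_size  = sketch_rd16(buf + 8);
    h->d_model     = sketch_rd16(buf + 10);
    h->n_layers    = buf[12];
    h->d_state     = buf[15];
    h->l_max       = sketch_rd16(buf + 16);
    h->param_count = sketch_rd32(buf + 18);
    h->d_conv      = buf[33];
    h->meta_size   = sketch_rd32(buf + 34);
    h->dt_rank     = buf[38];
    return 0;
}
```

Reading fields byte by byte avoids relying on struct packing or on the host being little-endian.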