TekkenTokenizer
Mistral Tekken tokenizer implementation.
Tekken is a tiktoken-based BPE tokenizer used by Mistral models (Mistral, Mixtral, Codestral, Voxtral, etc.). Unlike HuggingFace's tokenizer.json format, tekken.json uses:
- Base64-encoded byte sequences for vocab tokens
- Implicit merge ordering from vocab rank (lower rank = higher priority)
- A separate special-token list with reserved ID space at [0, numSpecialTokens)
- A tiktoken-style pre-tokenization regex pattern
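To make the format concrete, here is a minimal loading sketch. The field names (`vocab`, `rank`, `token_bytes`) are assumptions about the tekken.json schema made for illustration, not confirmed by this document:

```python
import base64
import json

NUM_SPECIAL_TOKENS = 1000  # default size of the reserved special-token ID space


def load_vocab(raw: str) -> list[bytes]:
    """Decode a tekken.json-style vocab into rank-ordered byte sequences.

    Assumed schema: {"vocab": [{"rank": int, "token_bytes": base64 str}, ...]}.
    """
    entries = json.loads(raw)["vocab"]
    # Lower rank = higher merge priority, so sorting by rank recovers both
    # the vocab order and the implicit BPE merge order in one pass.
    entries.sort(key=lambda e: e["rank"])
    return [base64.b64decode(e["token_bytes"]) for e in entries]


def token_id(rank: int, num_special: int = NUM_SPECIAL_TOKENS) -> int:
    """Vocab ranks are shifted past the reserved special-token ID space."""
    return num_special + rank


# Hypothetical two-entry file for illustration (ranks deliberately unordered):
sample = json.dumps({"vocab": [
    {"rank": 1, "token_bytes": base64.b64encode(b"b").decode()},
    {"rank": 0, "token_bytes": base64.b64encode(b"a").decode()},
]})
vocab = load_vocab(sample)  # [b"a", b"b"]
```

With this layout, `token_id(0)` is 1000 under the default reserved space, matching the ID scheme described below.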
Token ID layout:

- IDs [0, numSpecialTokens) → special tokens (<unk>, <s>, </s>, [INST], ...)
- IDs [numSpecialTokens, ...) → vocab tokens (rank 0..N, offset by numSpecialTokens)

Parameters
- List of byte arrays, indexed by rank (the first 256 ranks are the single bytes 0-255)
- List of optional string representations, indexed by rank
- Map of special token string → token ID
- Map of token ID → special token string (for decoding)
- Number of reserved special-token IDs (default: 1000)
- Pre-tokenization regex pattern (tiktoken-style)
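To illustrate the last parameter, a sketch of how a pre-tokenization pattern splits text before BPE. The real tiktoken-style pattern uses Unicode property classes (\p{L}, \p{N}) that require the third-party `regex` module, so this uses a simplified ASCII stand-in rather than the actual tekken.json pattern:

```python
import re

# Simplified ASCII stand-in for a tiktoken-style pre-tokenization pattern:
# a word (with optional leading space), a number run, a punctuation run,
# or remaining whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

pieces = PRETOKENIZE.findall("Hello, world 123")
# pieces == ['Hello', ',', ' world', ' 123']
```

Each piece is then BPE-encoded independently, so merges never cross pre-token boundaries.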