TekkenTokenizer
Mistral Tekken tokenizer implementation.
Tekken is a tiktoken-based BPE tokenizer used by Mistral models (Mistral, Mixtral, Codestral, Voxtral, etc.). Unlike HuggingFace's tokenizer.json format, tekken.json uses:
- Base64-encoded byte sequences for vocab tokens
- Implicit merge ordering from vocab rank (lower rank = higher priority)
- A separate special-token list with reserved ID space at [0, numSpecialTokens)
- A tiktoken-style pre-tokenization regex pattern
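To make the format concrete, here is a minimal loading sketch. The field names (`vocab`, `rank`, `token_bytes`) are assumptions about the tekken.json schema made for illustration, not confirmed by this document:

```python
import base64
import json

NUM_SPECIAL_TOKENS = 1000  # default size of the reserved special-token ID space


def load_vocab(raw: str) -> list[bytes]:
    """Decode a tekken.json-style vocab into rank-ordered byte sequences.

    Assumed schema: {"vocab": [{"rank": int, "token_bytes": base64 str}, ...]}.
    """
    entries = json.loads(raw)["vocab"]
    # Lower rank = higher merge priority, so sorting by rank recovers both
    # the vocab order and the implicit BPE merge order in one pass.
    entries.sort(key=lambda e: e["rank"])
    return [base64.b64decode(e["token_bytes"]) for e in entries]


def token_id(rank: int, num_special: int = NUM_SPECIAL_TOKENS) -> int:
    """Vocab ranks are shifted past the reserved special-token ID space."""
    return num_special + rank


# Hypothetical two-entry file for illustration (ranks deliberately unordered):
sample = json.dumps({"vocab": [
    {"rank": 1, "token_bytes": base64.b64encode(b"b").decode()},
    {"rank": 0, "token_bytes": base64.b64encode(b"a").decode()},
]})
vocab = load_vocab(sample)  # [b"a", b"b"]
```

With this layout, `token_id(0)` is 1000 under the default reserved space, matching the ID scheme described below.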
Token ID layout:

- IDs [0, numSpecialTokens) → special tokens (<unk>, <s>, </s>, [INST], ...)
- IDs [numSpecialTokens, ...) → vocab tokens (rank 0..N, offset by numSpecialTokens)

Parameters
- List of byte arrays, indexed by rank (the first 256 ranks are the single bytes 0-255)
- List of optional string representations, indexed by rank
- Map of special token string → token ID
- Map of token ID → special token string (for decoding)
- Number of reserved special-token IDs (default: 1000)
- Pre-tokenization regex pattern (tiktoken-style)
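To illustrate the last parameter, a sketch of how a pre-tokenization pattern splits text before BPE. The real tiktoken-style pattern uses Unicode property classes (\p{L}, \p{N}) that require the third-party `regex` module, so this uses a simplified ASCII stand-in rather than the actual tekken.json pattern:

```python
import re

# Simplified ASCII stand-in for a tiktoken-style pre-tokenization pattern:
# a word (with optional leading space), a number run, a punctuation run,
# or remaining whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

pieces = PRETOKENIZE.findall("Hello, world 123")
# pieces == ['Hello', ',', ' world', ' 123']
```

Each piece is then BPE-encoded independently, so merges never cross pre-token boundaries.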