TekkenTokenizer

class TekkenTokenizer(vocabTokenBytes: List<ByteArray>, vocabTokenStrings: List<String?>, specialTokens: Map<String, Int>, specialTokensById: Map<Int, String>, numSpecialTokens: Int = 1000, pattern: Regex) : Tokenizer(source)

Mistral Tekken tokenizer implementation.

Tekken is a tiktoken-based BPE tokenizer used by Mistral models (Mistral, Mixtral, Codestral, Voxtral, etc.). Unlike HuggingFace tokenizer.json, tekken.json uses:

  • Base64-encoded byte sequences for vocab tokens

  • Implicit merge ordering from vocab rank (lower rank = higher priority)

  • Separate special token list with reserved ID space at [0, numSpecialTokens)

  • tiktoken-style pre-tokenization regex pattern

Token ID layout:

IDs [0, numSpecialTokens)      → special tokens (<unk>, <s>, </s>, [INST], ...)
IDs [numSpecialTokens, ...] → vocab tokens (rank 0..N offset by numSpecialTokens)

Parameters

vocabTokenBytes

List of byte arrays, indexed by rank (rank 0 = first 256 are single bytes)

vocabTokenStrings

List of optional string representations, indexed by rank

specialTokens

Map of special token string → token ID

specialTokensById

Map of token ID → special token string (for decoding)

numSpecialTokens

Number of reserved special token IDs (default: 1000)

pattern

Pre-tokenization regex pattern (tiktoken-style)

Constructors

Link copied to clipboard
constructor(vocabTokenBytes: List<ByteArray>, vocabTokenStrings: List<String?>, specialTokens: Map<String, Int>, specialTokensById: Map<Int, String>, numSpecialTokens: Int = 1000, pattern: Regex)

Types

Link copied to clipboard
object Companion

Properties

Link copied to clipboard
open override val bosTokenId: Int

BOS token ID.

Link copied to clipboard
open override val eosTokenId: Int

EOS token ID.

Link copied to clipboard

Total token count (vocab + special tokens).

Link copied to clipboard
open override val vocabSize: Int

Number of vocab tokens (excluding special tokens).

Functions

Link copied to clipboard
fun decode(token: Int): String

Decode a single token ID to text.

open override fun decode(ids: IntArray): String

Decode token IDs to text.

Link copied to clipboard
open override fun encode(text: String): IntArray

Encode text to token IDs.