TekkenTokenizer

constructor(vocabTokenBytes: List<ByteArray>, vocabTokenStrings: List<String?>, specialTokens: Map<String, Int>, specialTokensById: Map<Int, String>, numSpecialTokens: Int = 1000, pattern: Regex)

Parameters

vocabTokenBytes

List of byte arrays, indexed by rank; the first 256 ranks (0–255) are the single-byte tokens

vocabTokenStrings

List of optional string representations, indexed by rank (may be null for ranks whose bytes are not valid text)

specialTokens

Map of special token string → token ID

specialTokensById

Map of token ID → special token string (for decoding)

numSpecialTokens

Number of reserved special token IDs (default: 1000)

pattern

Pre-tokenization regex pattern (tiktoken-style)
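
A minimal construction sketch, assuming only the signature documented above. The merged token, the special-token names, and the simplified pre-tokenization pattern are illustrative placeholders, not values shipped with any real vocabulary:

```kotlin
// Hypothetical vocabulary: the first 256 ranks must be the single bytes 0..255,
// followed by merged multi-byte tokens.
val singleBytes: List<ByteArray> = (0..255).map { byteArrayOf(it.toByte()) }
val merged = listOf("he".toByteArray(Charsets.UTF_8)) // rank 256 (illustrative)
val vocabTokenBytes = singleBytes + merged

// Optional display strings, one per rank; null is allowed where the bytes
// do not decode to meaningful text.
val vocabTokenStrings: List<String?> = vocabTokenBytes.map { it.toString(Charsets.UTF_8) }

// Special tokens and the reverse map used for decoding.
val specialTokens = mapOf("<s>" to 0, "</s>" to 1)          // names are assumptions
val specialTokensById = specialTokens.entries.associate { (name, id) -> id to name }

// A simplified tiktoken-style pre-tokenization pattern (illustrative only;
// real patterns are considerably more involved).
val pattern = Regex("""\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+""")

val tokenizer = TekkenTokenizer(
    vocabTokenBytes = vocabTokenBytes,
    vocabTokenStrings = vocabTokenStrings,
    specialTokens = specialTokens,
    specialTokensById = specialTokensById,
    numSpecialTokens = 1000,
    pattern = pattern,
)
```

Since `numSpecialTokens` reserves a block of IDs for special tokens, named arguments are used here so the default can be kept or overridden explicitly.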