TekkenTokenizer
constructor(vocabTokenBytes: List<ByteArray>, vocabTokenStrings: List<String?>, specialTokens: Map<String, Int>, specialTokensById: Map<Int, String>, numSpecialTokens: Int = 1000, pattern: Regex)
Parameters
vocabTokenBytes
List of byte arrays, indexed by rank (ranks 0–255 are the 256 single bytes; later ranks hold merged byte sequences)
vocabTokenStrings
List of optional string representations, indexed by rank (null where the token's bytes are not valid UTF-8 on their own)
specialTokens
Map of special token string → token ID
specialTokensById
Map of token ID → special token string (for decoding)
numSpecialTokens
Number of reserved special token IDs (default: 1000)
pattern
Pre-tokenization regex pattern (tiktoken-style)
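A minimal sketch of assembling the constructor arguments. The special token strings, the two merged-token entries, and the regex are illustrative assumptions, not the tokenizer's real vocabulary or pattern; the `TekkenTokenizer` class itself is assumed to come from the library, so its construction is shown as a comment.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CodingErrorAction

// Ranks 0..255 are the 256 single bytes; later ranks hold merged byte sequences.
val vocabTokenBytes: List<ByteArray> = buildList {
    for (b in 0..255) add(byteArrayOf(b.toByte()))
    add("he".toByteArray())   // rank 256 (illustrative merge)
    add("llo".toByteArray())  // rank 257 (illustrative merge)
}

// Optional string forms: null wherever the bytes are not valid UTF-8 on their own.
val vocabTokenStrings: List<String?> = vocabTokenBytes.map { bytes ->
    val decoder = Charsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)      // reject, don't replace
        .onUnmappableCharacter(CodingErrorAction.REPORT)
    runCatching { decoder.decode(ByteBuffer.wrap(bytes)).toString() }.getOrNull()
}

// Hypothetical special tokens; specialTokensById is the inverse map, used for decoding.
val specialTokens: Map<String, Int> = mapOf("<s>" to 0, "</s>" to 1, "<unk>" to 2)
val specialTokensById: Map<Int, String> =
    specialTokens.entries.associate { (s, id) -> id to s }

// A simplified tiktoken-style pre-tokenization pattern (illustrative, not the real one).
val pattern = Regex("""\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+""")

fun main() {
    // Constructing the tokenizer itself requires the library class:
    // val tokenizer = TekkenTokenizer(
    //     vocabTokenBytes, vocabTokenStrings,
    //     specialTokens, specialTokensById,
    //     numSpecialTokens = 1000, pattern = pattern,
    // )
    println(pattern.findAll("Hello, world!").map { it.value }.toList())
}
```

With the default `numSpecialTokens = 1000`, IDs 0–999 are reserved for special tokens, so vocabulary ranks would map to IDs offset by 1000 (an assumption consistent with the reserved-ID parameter above).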