# Tokenizer API

## Interface
`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/Tokenizer.kt`

```kotlin
interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
    fun decode(token: Int): String
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
}
```
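To make the contract concrete, here is a minimal toy implementation: a word-level tokenizer that maps each known word to an integer id. It is purely illustrative and not part of the library (real implementations use subword algorithms such as BPE or SentencePiece); the interface is re-declared so the sketch compiles standalone.

```kotlin
// Re-declared here only so the sketch compiles on its own.
interface Tokenizer {
    fun encode(text: String): IntArray
    fun decode(tokens: IntArray): String
    fun decode(token: Int): String
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
}

// Toy word-level tokenizer illustrating the contract. Throws on
// out-of-vocabulary words; a real tokenizer falls back to subword units.
class WordTokenizer(words: List<String>) : Tokenizer {
    // Reserve ids 0 and 1 for the special BOS/EOS tokens.
    override val bosTokenId = 0
    override val eosTokenId = 1
    private val idToWord = listOf("<bos>", "<eos>") + words
    private val wordToId = idToWord.withIndex().associate { (i, w) -> w to i }
    override val vocabSize = idToWord.size

    override fun encode(text: String): IntArray =
        text.split(" ").map { wordToId.getValue(it) }.toIntArray()

    override fun decode(tokens: IntArray): String =
        tokens.joinToString(" ") { decode(it) }

    override fun decode(token: Int): String = idToWord[token]
}

fun main() {
    val tok = WordTokenizer(listOf("hello", "world"))
    val ids = tok.encode("hello world")
    println(ids.toList())    // [2, 3] — ids 0 and 1 are taken by <bos>/<eos>
    println(tok.decode(ids)) // hello world
    println(tok.vocabSize)   // 4
}
```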
## Implementations

### GGUFTokenizer
`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/GGUFTokenizer.kt`

Auto-detects the tokenizer type from GGUF metadata:

- BPE (GPT-2 style): used by Qwen and Mistral
- SentencePiece: used by LLaMA
- WordPiece: used by BERT
Factory methods:

```kotlin
// From a GGUF file (streaming, memory-efficient)
val tokenizer = GGUFTokenizer.fromRandomAccessSource(source)

// From a HuggingFace tokenizer.json
val tokenizer = GGUFTokenizer.fromTokenizerJson(jsonString)
```
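The auto-detection step can be sketched as a dispatch on the tokenizer-model field in the GGUF metadata. The key name (`tokenizer.ggml.model`) and its values below follow common GGUF conventions and are assumptions, not taken from this library's source:

```kotlin
// Hedged sketch: how detection might route on GGUF metadata. The key
// "tokenizer.ggml.model" and its values are assumed GGUF conventions.
enum class TokenizerKind { BPE, SENTENCEPIECE, WORDPIECE }

fun detectTokenizerKind(metadata: Map<String, String>): TokenizerKind =
    when (metadata["tokenizer.ggml.model"]) {
        "gpt2" -> TokenizerKind.BPE             // Qwen, Mistral
        "llama" -> TokenizerKind.SENTENCEPIECE  // LLaMA
        "bert" -> TokenizerKind.WORDPIECE       // BERT
        else -> error("Unsupported or missing tokenizer model in metadata")
    }

fun main() {
    val meta = mapOf("tokenizer.ggml.model" to "gpt2")
    println(detectTokenizerKind(meta)) // BPE
}
```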
### TokenizerFactory

`llm-core/src/commonMain/kotlin/sk/ainet/apps/llm/tokenizer/TokenizerFactory.kt`

Unified factory for creating tokenizers:

```kotlin
// From a GGUF file
val tokenizer = TokenizerFactory.fromGGUF(randomAccessSource)

// From a tokenizer.json string
val tokenizer = TokenizerFactory.fromTokenizerJson(jsonString)

// From HuggingFace format with an optional config
val tokenizer = TokenizerFactory.fromHuggingFace(tokenizerJson, configJson)
```
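For the `tokenizer.json` path, the factory has to identify the tokenizer family from the file itself. A minimal sketch of that step, using the public HuggingFace `tokenizer.json` schema (the `"model"."type"` field); how the real `TokenizerFactory` parses this is not shown in the source, and the crude regex below only grabs the first `"type"` field it finds:

```kotlin
// Hedged sketch: read the model type ("BPE", "WordPiece", "Unigram", ...)
// out of a HuggingFace tokenizer.json string. A real implementation would
// use a proper JSON parser (e.g. kotlinx.serialization) instead of a regex.
fun detectFromTokenizerJson(json: String): String {
    val regex = Regex("\"type\"\\s*:\\s*\"(\\w+)\"")
    return regex.find(json)?.groupValues?.get(1) ?: "unknown"
}

fun main() {
    val sample = """{"model": {"type": "BPE", "vocab": {}}}"""
    println(detectFromTokenizerJson(sample)) // BPE
}
```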