TokenizerFactory

Selects the right Tokenizer implementation for a model.

Tokenizer selection is per architecture, not per file format. A Qwen model needs byte-level BPE whether its weights come from .gguf or .safetensors; a LLaMA model needs SentencePiece regardless of format. Callers pass either a GGUF metadata field map or a HuggingFace tokenizer.json string, and this factory inspects the tokenizer type (tokenizer.ggml.model or model.type) to dispatch.

Currently supported:

  • Byte-level BPE (Qwen, GPT-2, Mistral-Nemo) — via QwenByteLevelBpeTokenizer. Dispatched when tokenizer.ggml.model == "gpt2" or model.type == "BPE".

  • SentencePiece (LLaMA, Gemma, TinyLlama, Mistral v0.1) — via SentencePieceTokenizer. Dispatched when tokenizer.ggml.model == "llama" or model.type == "Unigram".

WordPiece (BERT) remains unsupported and throws UnsupportedTokenizerException.
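The dispatch rule above can be sketched as a self-contained Kotlin snippet. Note that TokenizerKind and selectTokenizer are illustrative names for this sketch, not part of the real API; the actual factory returns Tokenizer instances rather than an enum.

```kotlin
// Illustrative sketch of the dispatch rule; names are hypothetical.
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCEPIECE }

class UnsupportedTokenizerException(msg: String) : Exception(msg)

// ggufModel: value of the GGUF tokenizer.ggml.model field, if present.
// hfModelType: value of model.type from a HuggingFace tokenizer.json, if present.
fun selectTokenizer(ggufModel: String? = null, hfModelType: String? = null): TokenizerKind =
    when {
        // Byte-level BPE: Qwen, GPT-2, Mistral-Nemo
        ggufModel == "gpt2" || hfModelType == "BPE" -> TokenizerKind.BYTE_LEVEL_BPE
        // SentencePiece: LLaMA, Gemma, TinyLlama, Mistral v0.1
        ggufModel == "llama" || hfModelType == "Unigram" -> TokenizerKind.SENTENCEPIECE
        // WordPiece (BERT) and anything else is rejected
        else -> throw UnsupportedTokenizerException(
            "No tokenizer for tokenizer.ggml.model=$ggufModel / model.type=$hfModelType"
        )
    }
```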

Functions


Build a tokenizer from a GGUF metadata field map.


Build a tokenizer from a HuggingFace tokenizer.json string.
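For the tokenizer.json path, the factory only needs the model.type value to dispatch. A minimal sketch of extracting it from the JSON string is below; this uses a crude regex purely for illustration, and the real implementation presumably uses a proper JSON parser.

```kotlin
// Hypothetical helper, not the real API: pulls model.type out of a
// HuggingFace tokenizer.json string so the factory can dispatch on it.
fun modelType(tokenizerJson: String): String? =
    Regex("\"model\"\\s*:\\s*\\{[^}]*?\"type\"\\s*:\\s*\"([^\"]+)\"")
        .find(tokenizerJson)?.groupValues?.get(1)
```

For example, a tokenizer.json beginning {"model":{"type":"Unigram", ...}} would yield "Unigram" and dispatch to SentencePieceTokenizer.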