TokenizerFactory

Selects the right Tokenizer implementation for a model.

Tokenizer selection is per architecture, not per file format. A Qwen model needs byte-level BPE whether its weights come from .gguf or .safetensors; a LLaMA model needs SentencePiece regardless of format. Callers pass either a GGUF metadata field map or a HuggingFace tokenizer.json string, and this factory inspects the tokenizer type (tokenizer.ggml.model or model.type) to dispatch.

Currently supported:

  • Byte-level BPE (Qwen, GPT-2, Mistral-Nemo) — via QwenByteLevelBpeTokenizer. Dispatched when tokenizer.ggml.model == "gpt2" or model.type == "BPE".

  • SentencePiece (LLaMA, Gemma, TinyLlama, Mistral v0.1) — via SentencePieceTokenizer. Dispatched when tokenizer.ggml.model == "llama" or model.type == "Unigram".

WordPiece (BERT) remains unsupported and throws UnsupportedTokenizerException.
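The dispatch rule above can be sketched as a self-contained Kotlin snippet. Note that TokenizerKind and selectTokenizer are illustrative names for this sketch, not part of the real API; the actual factory returns Tokenizer instances rather than an enum.

```kotlin
// Illustrative sketch of the dispatch rule; names are hypothetical.
enum class TokenizerKind { BYTE_LEVEL_BPE, SENTENCEPIECE }

class UnsupportedTokenizerException(msg: String) : Exception(msg)

// ggufModel: value of the GGUF tokenizer.ggml.model field, if present.
// hfModelType: value of model.type from a HuggingFace tokenizer.json, if present.
fun selectTokenizer(ggufModel: String? = null, hfModelType: String? = null): TokenizerKind =
    when {
        // Byte-level BPE: Qwen, GPT-2, Mistral-Nemo
        ggufModel == "gpt2" || hfModelType == "BPE" -> TokenizerKind.BYTE_LEVEL_BPE
        // SentencePiece: LLaMA, Gemma, TinyLlama, Mistral v0.1
        ggufModel == "llama" || hfModelType == "Unigram" -> TokenizerKind.SENTENCEPIECE
        // WordPiece (BERT) and anything else is rejected
        else -> throw UnsupportedTokenizerException(
            "No tokenizer for tokenizer.ggml.model=$ggufModel / model.type=$hfModelType"
        )
    }
```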

Functions


Build a tokenizer from a GGUF metadata field map.


Build a tokenizer from a HuggingFace tokenizer.json string.
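For the tokenizer.json path, the factory only needs the model.type value to dispatch. A minimal sketch of extracting it from the JSON string is below; this uses a crude regex purely for illustration, and the real implementation presumably uses a proper JSON parser.

```kotlin
// Hypothetical helper, not the real API: pulls model.type out of a
// HuggingFace tokenizer.json string so the factory can dispatch on it.
fun modelType(tokenizerJson: String): String? =
    Regex("\"model\"\\s*:\\s*\\{[^}]*?\"type\"\\s*:\\s*\"([^\"]+)\"")
        .find(tokenizerJson)?.groupValues?.get(1)
```

For example, a tokenizer.json beginning {"model":{"type":"Unigram", ...}} would yield "Unigram" and dispatch to SentencePieceTokenizer.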