# Tokenizer Internals

## Tokenizer Types
Different model families use different tokenization strategies.
The GGUFTokenizer auto-detects the type from GGUF metadata.
### SentencePiece (LLaMA, Gemma)

- Space is encoded as `\u2581` (`▁`, LOWER ONE EIGHTH BLOCK)
- Subword units are learned from training data
- GGUF metadata field: `tokenizer.ggml.model = "llama"` or `"sentencepiece"`
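The space-to-`▁` convention above can be sketched as a tiny pre-tokenization step. This is an illustrative helper, not part of the GGUFTokenizer API; the leading-space behavior varies slightly between models.

```python
def sp_pretokenize(text: str) -> str:
    """SentencePiece-style pre-tokenization: make spaces visible.

    A space is prepended (the common SentencePiece convention) and every
    space is replaced with U+2581, so word boundaries survive as ordinary
    vocabulary characters during subword segmentation.
    """
    return ("\u2581" + text).replace(" ", "\u2581")

print(sp_pretokenize("hello world"))  # ▁hello▁world
```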
### BPE (Qwen, Mistral, GPT-2)

- Byte-level BPE: text is converted to UTF-8 bytes, and each byte is mapped to a printable Unicode character
- The byte-to-Unicode mapping avoids control characters (non-printable bytes, including 0–32, are remapped starting at U+0100)
- Space is represented as `\u0120` (`Ġ`, LATIN CAPITAL LETTER G WITH DOT ABOVE)
- GGUF metadata field: `tokenizer.ggml.model = "gpt2"` or `"bpe"`
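The byte-to-Unicode mapping above can be reconstructed in a few lines. This sketch mirrors the mapping introduced by the original GPT-2 encoder; it is illustrative and not tied to the GGUFTokenizer implementation.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style byte-to-Unicode mapping.

    Printable bytes keep their own code point; all other bytes
    (controls 0-32, DEL and the 127-160 range, soft hyphen 173)
    are shifted up in order, starting at U+0100.
    """
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("\xa1"), ord("\xac") + 1))
            + list(range(ord("\xae"), ord("\xff") + 1)))
    mapping: dict[int, str] = {}
    n = 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + n)  # U+0100 onward
            n += 1
    return mapping

m = bytes_to_unicode()
print(hex(ord(m[32])))  # space (byte 32) -> U+0120, i.e. 'Ġ'
```

Byte 32 (space) is the 33rd remapped byte, which is why it lands exactly on U+0100 + 0x20 = U+0120 (`Ġ`).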
## Special Token Handling
Chat templates use special tokens like <|im_start|> and <|im_end|> to delimit messages.
These must be encoded as single tokens, not character-split.
The GGUFTokenizer collects special tokens from the vocabulary (tokens matching <|…|>) and splits text around them before applying BPE.
This ensures <|im_start|>system encodes as [151644, 8948] (two tokens), not as individual characters.
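The split-around-special-tokens step can be sketched with a regular expression. The `SPECIAL_TOKENS` list here is illustrative; a real tokenizer collects these from the GGUF vocabulary, and the token IDs come from the model's vocab.

```python
import re

# Hypothetical subset of specials; in practice these are gathered from
# the vocabulary (tokens matching <|...|>).
SPECIAL_TOKENS = ["<|im_start|>", "<|im_end|>"]

def split_on_specials(text: str) -> list[str]:
    """Split text around special tokens, keeping the tokens themselves.

    The capturing group in re.split keeps each matched delimiter in the
    output, so specials can be looked up as single token IDs while BPE
    runs only on the remaining plain-text pieces.
    """
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_on_specials("<|im_start|>system"))
# ['<|im_start|>', 'system']
```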
## GGUF Tokenizer Fields

| Field | Description |
|---|---|
| `tokenizer.ggml.model` | Tokenizer type: `"llama"`/`"sentencepiece"` or `"gpt2"`/`"bpe"` |
| `tokenizer.ggml.tokens` | Vocabulary as string array |
| `tokenizer.ggml.scores` | Token merge scores (SentencePiece) |
| `tokenizer.ggml.merges` | BPE merge pairs (GPT-2 style) |
| `tokenizer.ggml.bos_token_id` | Beginning-of-sequence token ID |
| `tokenizer.ggml.eos_token_id` | End-of-sequence token ID |
| `tokenizer.ggml.token_type` | Per-token type (normal, control, unknown, byte) |