SentencePieceTokenizer
SentencePiece tokenizer for LLaMA, Gemma, TinyLlama, Mistral-v0.1 and other models whose GGUF tokenizer.ggml.model is "llama" and whose HuggingFace tokenizer.json has model.type == "Unigram".
This matches the algorithm used by llm_tokenizer_spm in llama.cpp:
Whitespace escape: every space (' ') is replaced with ▁ (U+2581), and, when addSpacePrefix is true, a leading ▁ is prepended so the first word can still match merged vocab entries like ▁Hello.
Symbol split: the escaped input is broken into code-point-sized symbols held in a linked list.
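The escape and split steps above can be sketched in a few lines. This is an illustrative sketch, not the library's actual API; the function names and the add_space_prefix parameter name are assumptions, and a plain Python list stands in for the linked list:

```python
# Hypothetical sketch of the whitespace-escape and symbol-split steps.
SPIECE = "\u2581"  # ▁ (LOWER ONE EIGHTH BLOCK), SentencePiece's space marker

def escape(text: str, add_space_prefix: bool = True) -> str:
    """Replace every space with ▁ and optionally prepend a leading ▁."""
    escaped = text.replace(" ", SPIECE)
    if add_space_prefix:
        escaped = SPIECE + escaped
    return escaped

def split_symbols(escaped: str) -> list[str]:
    """Break the escaped input into code-point-sized symbols.
    llama.cpp keeps these in a linked list so merges are O(1);
    a plain list is enough for a sketch."""
    return list(escaped)

symbols = split_symbols(escape("Hello world"))
# symbols begins ['▁', 'H', 'e', 'l', 'l', 'o', '▁', 'w', ...]
```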
Score-priority BPE: at each step we scan all adjacent symbol pairs, pick the pair whose concatenated string is in the vocab with the highest score, and merge it. Repeat until no adjacent pair's concatenation exists in the vocab. This is the score-wins rule, the opposite of the merge-rank rule used by GPT-2 byte-level BPE in QwenByteLevelBpeTokenizer.
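The score-wins merge loop can be sketched as follows. This is a minimal sketch, assuming a vocab that maps token strings to (id, score) pairs; the names are illustrative, and the O(n²) rescan here replaces the priority queue llama.cpp uses for efficiency:

```python
# Hypothetical sketch of the score-priority merge loop.
def spm_merge(symbols: list[str], vocab: dict[str, tuple[int, float]]) -> list[str]:
    """Repeatedly merge the adjacent pair whose concatenation has the
    highest score in the vocab, until no mergeable pair remains."""
    symbols = list(symbols)
    while True:
        best_i, best_score = -1, float("-inf")
        for i in range(len(symbols) - 1):
            entry = vocab.get(symbols[i] + symbols[i + 1])
            if entry is not None and entry[1] > best_score:
                best_i, best_score = i, entry[1]
        if best_i < 0:
            break  # no adjacent pair's concatenation is in the vocab
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols

# ▁H outscores Hi, so ▁+H merges first, then ▁H+i merges into ▁Hi.
toy_vocab = {"\u2581H": (10, -1.0), "\u2581Hi": (11, -0.5), "Hi": (12, -2.0)}
print(spm_merge(["\u2581", "H", "i"], toy_vocab))  # ['▁Hi']
```

Note how the -1.0 score of ▁H beats the -2.0 score of Hi even if Hi had been learned first; under the merge-rank rule the earlier merge would win instead.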
Byte fallback: any symbol left over that isn't in the vocab is re-emitted one UTF-8 byte at a time as the hex-byte tokens <0x00>..<0xFF> (GGUF token_type == 6). If those aren't present in the vocab either, the tokenizer falls back to unknownTokenId.
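The byte-fallback step might look like the following. A sketch only, assuming a vocab mapping token strings to ids that contains the <0x00>..<0xFF> entries; encode_symbol and unknown_token_id are illustrative names, not the library's API:

```python
# Hypothetical sketch of byte fallback for a single unmatched symbol.
def encode_symbol(sym: str, vocab: dict[str, int], unknown_token_id: int = 0) -> list[int]:
    """Return the symbol's id if known, otherwise its UTF-8 bytes as
    <0xNN> token ids, otherwise the unknown token id."""
    if sym in vocab:
        return [vocab[sym]]
    ids = []
    for b in sym.encode("utf-8"):
        byte_tok = f"<0x{b:02X}>"
        if byte_tok not in vocab:
            return [unknown_token_id]  # no byte tokens in this vocab either
        ids.append(vocab[byte_tok])
    return ids

# 'é' is b'\xc3\xa9' in UTF-8, so it becomes the <0xC3> and <0xA9> tokens.
byte_vocab = {f"<0x{i:02X}>": i + 3 for i in range(256)}
print(encode_symbol("\u00e9", byte_vocab))  # [198, 172]
```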
Decode is the inverse: <0xNN> tokens are accumulated back into raw bytes and UTF-8-decoded, the rest are concatenated, ▁ is turned back into a space, and a leading space is stripped if addSpacePrefix is set.
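The decode direction can be sketched like this, assuming the ids have already been mapped back to their token strings; the function name, the regex, and the add_space_prefix parameter are assumptions for illustration:

```python
# Hypothetical sketch of decode: gather <0xNN> runs into raw bytes,
# UTF-8-decode them, then undo the whitespace escape.
import re

BYTE_RE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")
SPIECE = "\u2581"

def decode(tokens: list[str], add_space_prefix: bool = True) -> str:
    out: list[str] = []
    pending = bytearray()
    for tok in tokens:
        m = BYTE_RE.match(tok)
        if m:
            pending.append(int(m.group(1), 16))  # accumulate raw bytes
            continue
        if pending:
            out.append(pending.decode("utf-8"))  # flush a completed byte run
            pending.clear()
        out.append(tok)
    if pending:
        out.append(pending.decode("utf-8"))
    text = "".join(out).replace(SPIECE, " ")
    if add_space_prefix and text.startswith(" "):
        text = text[1:]  # strip the space that addSpacePrefix prepended
    return text

print(decode(["\u2581Hi", "<0xC3>", "<0xA9>"]))  # Hié
```

Flushing the byte buffer only at a non-byte token (or at the end) matters because a single multi-byte character spans several <0xNN> tokens and must be decoded as one unit.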