QwenByteLevelBpeTokenizer
GPT-2-style byte-level BPE tokenizer (Qwen, GPT-2, Mistral-Nemo, …).
Implements the byte-level BPE encoding pipeline required for compatibility with HuggingFace transformers / tokenizers and llama.cpp:
1. Split the input on the longest-match special token at each position (<|im_start|>, <|endoftext|>, …); these map to atomic token IDs.
2. For each non-special segment, apply the GPT-2 pretokenization regex.
3. UTF-8-encode each regex chunk.
4. Map bytes → unicode via ByteToUnicode (so "Hello" becomes Hello, " is" becomes Ġis, and "\n" becomes Ċ).
5. Apply BPE merges to the resulting character sequence, always picking the adjacent pair with the lowest merge rank (not the highest vocab score; that is the SentencePiece rule, not GPT-2 BPE).
6. Look up each resulting symbol in the vocab → token id.
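The longest-match special-token split in the first step can be sketched as follows; the function name and token list are illustrative, not this class's actual API:

```python
def split_on_special_tokens(text, special_tokens):
    """Split text into (is_special, segment) pieces, preferring the
    longest special token that matches at each position."""
    # Try longer tokens first so a long special token wins over any
    # shorter one that happens to share a prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pieces, plain_start, i = [], 0, 0
    while i < len(text):
        match = next((t for t in ordered if text.startswith(t, i)), None)
        if match is None:
            i += 1
            continue
        if plain_start < i:
            pieces.append((False, text[plain_start:i]))  # plain run before the special
        pieces.append((True, match))  # special token: atomic, never pretokenized
        i += len(match)
        plain_start = i
    if plain_start < len(text):
        pieces.append((False, text[plain_start:]))
    return pieces
```

Only the `False` segments continue through the regex/BPE steps; the `True` ones map directly to their reserved token IDs.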
Decode is the reverse: concatenate the token strings, reverse the byte-to-unicode mapping, then UTF-8-decode.
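The byte-to-unicode mapping, the lowest-rank merge loop, and the decode path can be sketched in Python. The merge table in the test is a toy example (real models ship many thousands of ranked merges); `bytes_to_unicode` follows the well-known GPT-2 construction:

```python
def bytes_to_unicode():
    """GPT-2's reversible byte -> printable-unicode-character table."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # remap non-printable bytes to chars above U+0100
            n += 1
    return dict(zip(bs, map(chr, cs)))

def bpe(symbols, ranks):
    """Repeatedly merge the adjacent pair with the LOWEST merge rank
    (the GPT-2 rule) until no mergeable pair remains."""
    symbols = list(symbols)
    while len(symbols) > 1:
        candidates = [(ranks.get((symbols[i], symbols[i + 1]), float("inf")), i)
                      for i in range(len(symbols) - 1)]
        rank, i = min(candidates)
        if rank == float("inf"):
            break  # no known pair left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

B2U = bytes_to_unicode()
U2B = {c: b for b, c in B2U.items()}

def encode_chunk(chunk, ranks):
    """UTF-8 encode, map bytes to unicode chars, then apply BPE (steps 3-5)."""
    mapped = "".join(B2U[b] for b in chunk.encode("utf-8"))
    return bpe(mapped, ranks)

def decode_tokens(token_strings):
    """Reverse path: concatenate, unmap chars to bytes, UTF-8 decode."""
    joined = "".join(token_strings)
    return bytes(U2B[ch] for ch in joined).decode("utf-8")
```

With merges {("Ġ", "i"): 0, ("Ġi", "s"): 1}, encoding " is" yields the single symbol Ġis, and decoding it recovers " is", illustrating why the round trip is lossless for arbitrary bytes.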
Constructors
Properties
Optional BOS id (not emitted automatically by encode; callers add it themselves if they want one).
Optional EOS id.
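Since encode never emits BOS/EOS on its own, callers who want them wrap the encoded ids themselves. A minimal sketch, where the function name and id values are illustrative:

```python
def wrap_with_specials(ids, bos_id=None, eos_id=None):
    """Prepend BOS and/or append EOS to already-encoded token ids,
    when the model's config provides them."""
    out = list(ids)
    if bos_id is not None:
        out.insert(0, bos_id)
    if eos_id is not None:
        out.append(eos_id)
    return out
```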