QwenByteLevelBpeTokenizer

class QwenByteLevelBpeTokenizer(tokens: List<String>, merges: List<Pair<String, String>>, specialTokens: Map<String, Int>, val bosTokenId: Int? = null, val eosTokenId: Int? = null) : Tokenizer

GPT-2-style byte-level BPE tokenizer (Qwen, GPT-2, Mistral-Nemo, …).

Implements the byte-level BPE pipeline required by HuggingFace transformers / tokenizers and llama.cpp (steps 1–6 encode; step 7 decodes):

  1. Split input on the longest-match special token at each position (<|im_start|>, <|endoftext|>, …) — these are atomic token IDs.

  2. For each non-special segment, apply the GPT-2 pretokenization regex.

  3. UTF-8-encode each regex chunk.

  4. Map bytes → unicode via ByteToUnicode (so "Hello" becomes Hello, " is" becomes Ġis, "\n" becomes Ċ).

  5. Apply BPE merges to the resulting char sequence, always picking the pair with the lowest merge rank (not highest vocab score — that's the SentencePiece rule, not GPT-2 BPE).

  6. Look up each resulting symbol in the vocab → token id.

  7. Decode is the reverse: concat token strings, reverse byte-to-unicode, UTF-8-decode.
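Step 4's byte-to-unicode table is the standard GPT-2 construction: printable Latin-1 bytes map to themselves, and every other byte gets a fresh codepoint starting at 256. A minimal sketch (the function name is illustrative, not part of this API):

```kotlin
// GPT-2 byte-to-unicode table (step 4): printable bytes keep their own
// codepoint; the remaining bytes are assigned 256, 257, … in order.
// This is why byte 0x20 (space) renders as 'Ġ' (U+0120) and '\n' as 'Ċ' (U+010A).
fun byteToUnicodeTable(): Map<Int, Char> {
    val printable = ((33..126) + (161..172) + (174..255)).toMutableList()
    val chars = printable.map { it.toChar() }.toMutableList()
    var n = 0
    for (b in 0..255) {
        if (b !in printable) {
            printable.add(b)
            chars.add((256 + n).toChar())
            n++
        }
    }
    return printable.zip(chars).toMap()
}
```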

Constructors

constructor(tokens: List<String>, merges: List<Pair<String, String>>, specialTokens: Map<String, Int>, bosTokenId: Int? = null, eosTokenId: Int? = null)

Types

object Companion

Properties

open override val bosTokenId: Int?

optional BOS id (not emitted automatically by encode; callers add it if they want one).
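Since encode does not emit a BOS automatically, a caller that wants one prepends it. A minimal hypothetical helper (not part of this API) that handles the nullable id:

```kotlin
// Hypothetical helper: prepend a BOS id to encoded ids, if one is configured.
// Pass tokenizer.bosTokenId as the first argument.
fun withBos(bosTokenId: Int?, ids: IntArray): IntArray =
    if (bosTokenId == null) ids else intArrayOf(bosTokenId) + ids
```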

open override val eosTokenId: Int?

optional EOS id.

open override val vocabSize: Int

Functions

open override fun decode(ids: IntArray): String
open override fun encode(text: String): IntArray
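The lowest-merge-rank loop at the heart of encode (step 5 of the pipeline) can be sketched as a simplified standalone version that merges one occurrence per iteration (names and the rank map are illustrative):

```kotlin
// Greedy BPE (step 5): repeatedly merge the adjacent pair with the lowest
// merge rank until no mergeable pair remains. Lower rank = earlier merge.
fun bpeMerge(symbols: List<String>, ranks: Map<Pair<String, String>, Int>): List<String> {
    val parts = symbols.toMutableList()
    while (parts.size > 1) {
        // Find the adjacent pair with the lowest rank, if any pair is mergeable.
        var best = -1
        var bestRank = Int.MAX_VALUE
        for (i in 0 until parts.size - 1) {
            val r = ranks[parts[i] to parts[i + 1]] ?: continue
            if (r < bestRank) { bestRank = r; best = i }
        }
        if (best < 0) break            // no pair left in the merge table
        parts[best] = parts[best] + parts[best + 1]
        parts.removeAt(best + 1)
    }
    return parts
}
```

Note the selection rule: the pair with the lowest merge rank wins, not the highest vocab score (that is the SentencePiece rule, as the class doc points out).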