QwenByteLevelBpeTokenizer

class QwenByteLevelBpeTokenizer(tokens: List<String>, merges: List<Pair<String, String>>, specialTokens: Map<String, Int>, val bosTokenId: Int? = null, val eosTokenId: Int? = null) : Tokenizer

GPT-2-style byte-level BPE tokenizer (Qwen, GPT-2, Mistral-Nemo, …).

Implements the byte-level BPE pipeline required by HuggingFace transformers / tokenizers and llama.cpp (steps 1–6 encode; step 7 decodes):

  1. Split input on the longest-match special token at each position (<|im_start|>, <|endoftext|>, …) — these are atomic token IDs.

  2. For each non-special segment, apply the GPT-2 pretokenization regex.

  3. UTF-8-encode each regex chunk.

  4. Map bytes → unicode via ByteToUnicode (so "Hello" becomes Hello, " is" becomes Ġis, "\n" becomes Ċ).

  5. Apply BPE merges to the resulting char sequence, always picking the pair with the lowest merge rank (not highest vocab score — that's the SentencePiece rule, not GPT-2 BPE).

  6. Look up each resulting symbol in the vocab → token id.

  7. Decode is the reverse: concat token strings, reverse byte-to-unicode, UTF-8-decode.
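Step 4's byte-to-unicode table is the standard GPT-2 construction: printable Latin-1 bytes map to themselves, and every other byte gets a fresh codepoint starting at 256. A minimal sketch (the function name is illustrative, not part of this API):

```kotlin
// GPT-2 byte-to-unicode table (step 4): printable bytes keep their own
// codepoint; the remaining bytes are assigned 256, 257, … in order.
// This is why byte 0x20 (space) renders as 'Ġ' (U+0120) and '\n' as 'Ċ' (U+010A).
fun byteToUnicodeTable(): Map<Int, Char> {
    val printable = ((33..126) + (161..172) + (174..255)).toMutableList()
    val chars = printable.map { it.toChar() }.toMutableList()
    var n = 0
    for (b in 0..255) {
        if (b !in printable) {
            printable.add(b)
            chars.add((256 + n).toChar())
            n++
        }
    }
    return printable.zip(chars).toMap()
}
```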

Constructors

constructor(tokens: List<String>, merges: List<Pair<String, String>>, specialTokens: Map<String, Int>, bosTokenId: Int? = null, eosTokenId: Int? = null)

Types

object Companion

Properties

open override val bosTokenId: Int?

optional BOS id (not emitted automatically by encode; callers add it if they want one).
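Since encode does not emit a BOS automatically, a caller that wants one prepends it. A minimal hypothetical helper (not part of this API) that handles the nullable id:

```kotlin
// Hypothetical helper: prepend a BOS id to encoded ids, if one is configured.
// Pass tokenizer.bosTokenId as the first argument.
fun withBos(bosTokenId: Int?, ids: IntArray): IntArray =
    if (bosTokenId == null) ids else intArrayOf(bosTokenId) + ids
```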

open override val eosTokenId: Int?

optional EOS id.

open override val vocabSize: Int

Functions

open override fun decode(ids: IntArray): String
open override fun encode(text: String): IntArray
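The lowest-merge-rank loop at the heart of encode (step 5 of the pipeline) can be sketched as a simplified standalone version that merges one occurrence per iteration (names and the rank map are illustrative):

```kotlin
// Greedy BPE (step 5): repeatedly merge the adjacent pair with the lowest
// merge rank until no mergeable pair remains. Lower rank = earlier merge.
fun bpeMerge(symbols: List<String>, ranks: Map<Pair<String, String>, Int>): List<String> {
    val parts = symbols.toMutableList()
    while (parts.size > 1) {
        // Find the adjacent pair with the lowest rank, if any pair is mergeable.
        var best = -1
        var bestRank = Int.MAX_VALUE
        for (i in 0 until parts.size - 1) {
            val r = ranks[parts[i] to parts[i + 1]] ?: continue
            if (r < bestRank) { bestRank = r; best = i }
        }
        if (best < 0) break            // no pair left in the merge table
        parts[best] = parts[best] + parts[best + 1]
        parts.removeAt(best + 1)
    }
    return parts
}
```

Note the selection rule: the pair with the lowest merge rank wins, not the highest vocab score (that is the SentencePiece rule, as the class doc points out).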