Package-level declarations

Types

class QwenByteLevelBpeTokenizer(tokens: List<String>, merges: List<Pair<String, String>>, specialTokens: Map<String, Int>, val bosTokenId: Int? = null, val eosTokenId: Int? = null) : Tokenizer

GPT-2-style byte-level BPE tokenizer (Qwen, GPT-2, Mistral-Nemo, …).
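The `tokens` and `merges` constructor parameters suggest the usual ranked-merge scheme of GPT-2-style BPE. The following is a minimal sketch of that merge loop only, not this class's actual implementation: `applyBpeMerges` is a hypothetical helper, it assumes that earlier entries in `merges` take priority, and it leaves out the byte-to-unicode mapping, the pre-tokenization regex, and special-token handling.

```kotlin
// Hypothetical helper (not part of the library): applies ranked BPE merges
// to the symbols of a single pre-tokenized word.
fun applyBpeMerges(
    symbols: List<String>,
    merges: List<Pair<String, String>>,
): List<String> {
    // Assumption: a merge's priority is its index in the list (lower wins).
    val rank = HashMap<Pair<String, String>, Int>()
    merges.forEachIndexed { i, pair -> if (pair !in rank) rank[pair] = i }

    val parts = symbols.toMutableList()
    while (parts.size > 1) {
        // Find the leftmost adjacent pair with the lowest rank.
        var bestIndex = -1
        var bestRank = Int.MAX_VALUE
        for (i in 0 until parts.size - 1) {
            val r = rank[parts[i] to parts[i + 1]] ?: continue
            if (r < bestRank) {
                bestRank = r
                bestIndex = i
            }
        }
        if (bestIndex < 0) break // no mergeable pair is left

        // Merge that pair into a single symbol.
        parts[bestIndex] = parts[bestIndex] + parts[bestIndex + 1]
        parts.removeAt(bestIndex + 1)
    }
    return parts
}
```

With the merges `("l", "o")`, `("lo", "w")`, `("e", "r")`, the symbols `l, o, w, e, r` reduce to `low, er`. Each resulting symbol would then be looked up in the vocabulary to obtain its token id.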

class SentencePieceTokenizer(tokens: List<String>, scores: List<Float>, val unknownTokenId: Int? = null, val bosTokenId: Int? = null, val eosTokenId: Int? = null, val addSpacePrefix: Boolean = true) : Tokenizer

SentencePiece tokenizer for LLaMA, Gemma, TinyLlama, Mistral-v0.1 and other models whose GGUF tokenizer.ggml.model is "llama" and whose HuggingFace tokenizer.json has model.type == "Unigram".
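A Unigram model pairs each token with a score (a log-probability), which matches the parallel `tokens` and `scores` parameters. One common way to segment text with such a model is a Viterbi search for the highest-scoring split. The sketch below illustrates that idea only; `unigramSegment` is a hypothetical helper, and this page does not state whether the class works this way. It ignores the space prefix, unknown-token fallback, and byte fallback, and it measures length in UTF-16 code units.

```kotlin
// Hypothetical helper (not part of the library): returns the highest-scoring
// segmentation of `text` into vocabulary tokens, or null if none exists.
fun unigramSegment(
    text: String,
    tokens: List<String>,
    scores: List<Float>,
): List<String>? {
    require(tokens.size == scores.size) { "tokens and scores must align" }
    val scoreOf = HashMap<String, Float>()
    for (i in tokens.indices) scoreOf[tokens[i]] = scores[i]
    val maxLen = tokens.maxOfOrNull { it.length } ?: 0

    val n = text.length
    // best[i] = best total score of any segmentation of text[0, i).
    val best = FloatArray(n + 1) { Float.NEGATIVE_INFINITY }
    // prev[i] = start index of the last token in that segmentation.
    val prev = IntArray(n + 1) { -1 }
    best[0] = 0f

    for (end in 1..n) {
        for (start in maxOf(0, end - maxLen) until end) {
            if (best[start] == Float.NEGATIVE_INFINITY) continue
            val s = scoreOf[text.substring(start, end)] ?: continue
            val candidate = best[start] + s
            if (candidate > best[end]) {
                best[end] = candidate
                prev[end] = start
            }
        }
    }
    if (n > 0 && prev[n] < 0) return null // text cannot be covered

    // Walk the back-pointers from the end to recover the tokens.
    val out = ArrayList<String>()
    var end = n
    while (end > 0) {
        val start = prev[end]
        out.add(text.substring(start, end))
        end = start
    }
    out.reverse()
    return out
}
```

For the vocabulary `a`, `b`, `ab`, `c` with scores `-1`, `-1`, `-1.5`, `-1`, the text `abc` segments as `ab, c`, because that split scores `-2.5` against `-3` for `a, b, c`.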

class TekkenTokenizer(vocabTokenBytes: List<ByteArray>, vocabTokenStrings: List<String?>, specialTokens: Map<String, Int>, specialTokensById: Map<Int, String>, numSpecialTokens: Int = 1000, pattern: Regex) : Tokenizer

Mistral Tekken tokenizer implementation.

interface Tokenizer

Common surface for all tokenizer implementations.

Selects the right Tokenizer implementation for a model.
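As a rough illustration of metadata-driven selection, the sketch below maps model metadata to a tokenizer family. Only the SentencePiece rule (`tokenizer.ggml.model` equal to `"llama"`, or `model.type` equal to `"Unigram"`) comes from the descriptions on this page. The `"gpt2"` rule is an assumption based on the GGUF convention for BPE vocabularies, and `TokenizerKind` and `pickTokenizerKind` are hypothetical names, not the library's API. How a Tekken model is detected is not stated here, so it is left out.

```kotlin
// Hypothetical types and names for illustration only.
enum class TokenizerKind { SENTENCE_PIECE, BYTE_LEVEL_BPE, UNKNOWN }

fun pickTokenizerKind(
    ggufModel: String?,   // value of GGUF tokenizer.ggml.model, if present
    hfModelType: String?, // value of tokenizer.json model.type, if present
): TokenizerKind = when {
    // Rule taken from the SentencePieceTokenizer description.
    ggufModel == "llama" || hfModelType == "Unigram" -> TokenizerKind.SENTENCE_PIECE
    // Assumption: GGUF files with a byte-level BPE vocabulary report "gpt2".
    ggufModel == "gpt2" -> TokenizerKind.BYTE_LEVEL_BPE
    else -> TokenizerKind.UNKNOWN
}
```

Returning an explicit `UNKNOWN` rather than a default family keeps an unrecognized model from being tokenized silently with the wrong vocabulary rules.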