Encode text to token IDs.
Split the text into chunks using the pre-tokenization regex pattern.
For each chunk, convert it to UTF-8 bytes and apply the BPE merges.
Offset the resulting ranks by numSpecialTokens to get the final IDs.
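
The three steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the pattern `PRETOKENIZE_PATTERN`, the helper `bpe_merge`, and the toy vocabulary `TOY_RANKS` are hypothetical names, and the sketch assumes that "offset" means adding `num_special_tokens` (standing in for numSpecialTokens) to each rank, with special tokens occupying the lowest IDs.

```python
import re

# Hypothetical, simplified pre-tokenization pattern (an assumption, not the
# real one): an optional leading whitespace character followed by a run of
# non-whitespace characters, or a run of whitespace.
PRETOKENIZE_PATTERN = re.compile(r"\s?\S+|\s+")


def bpe_merge(chunk: bytes, ranks: dict) -> list:
    """Greedily merge the adjacent pair with the lowest rank until none is left."""
    parts = [bytes([b]) for b in chunk]  # start from single bytes
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            # Strict "<" keeps the leftmost pair when ranks tie.
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no mergeable pair remains
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return [ranks[p] for p in parts]


def encode(text: str, ranks: dict, num_special_tokens: int) -> list:
    """Encode text to token IDs following the three steps described above."""
    ids = []
    for chunk in PRETOKENIZE_PATTERN.findall(text):           # step 1: split
        for rank in bpe_merge(chunk.encode("utf-8"), ranks):  # step 2: merge
            ids.append(rank + num_special_tokens)             # step 3: offset
    return ids


# Toy vocabulary (assumption): every single byte, plus two learned merges.
TOY_RANKS = {bytes([i]): i for i in range(256)}
TOY_RANKS[b"he"] = 256
TOY_RANKS[b"ll"] = 257

print(encode("hello", TOY_RANKS, 2))  # [258, 259, 113]
```

With this toy vocabulary, "hello" merges into `he`, `ll`, `o` (ranks 256, 257, 111), and each rank is then shifted by 2. Scanning for the lowest-ranked pair on every pass is quadratic in chunk length, which is acceptable here only because pre-tokenized chunks are short.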