Byte Pair Encoding: Explained

While building a language model from scratch, I was struck by the fact that how you turn text into numbers is just as critical as the architecture you train on top of those numbers. So in this post I want to share my deep dive into Byte Pair Encoding (BPE): why it became the de-facto standard, how to implement it correctly, and the practical traps that can still sink your project.


Vocabulary vs. Compression: Understanding the Difference

Before diving into BPE, it’s worth clarifying an important distinction: vocabulary construction is not the same as compression goals, though the two are related.

Compression algorithms want to minimize the total number of bits needed to store or transmit a corpus. They can be lossy and need not care about downstream model performance.

Vocabulary construction, on the other hand, is about balancing coverage and statistical efficiency for the model. A vocabulary that compresses text to 0.8 bits per byte is useless if it leaves your model unable to represent common sub-word patterns.

Some tokenizers can double as compression schemes (like BPE), but many useful tokenization tricks (special tokens, language-specific rules) make no sense for pure compression.


Sub-word Tokenization: Beyond Simple Word or Character Splits

When I started in NLP, I thought tokenization was a solved problem: split on whitespace, lowercase, done. Need something more efficient? Just apply some Huffman-like compression algorithm and that's it. I was wrong lol. Word-level vocabularies explode in size with morphology and proper nouns, while character-level vocabularies make the meaning and variations of words harder to learn, because we destroy the semantics of words by reducing them to mere characters.

This realization led me to sub-word tokenization, and in particular to Byte Pair Encoding (BPE) and its close cousin WordPiece.

Technical Confession
March 02, 2025

Well, I'm not gonna lie, I already knew about the need for tokenization, and specifically about BPE, before even starting this journey, but I decided to take a blank-slate approach and rediscover for myself why the simple methods above don't work well for language learning.

Merge Operations: The Fundamental Trade-off

BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of symbols into a new symbol.

  • Merge Rule:
    At each step, count all adjacent symbol pairs (A, B) and replace every occurrence of A B with the new merged symbol AB.

  • Stopping Criterion:
    Continue until you reach a target vocabulary size (say 32k, 50k, or 100k merges).
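
To make the merge rule concrete, here is a minimal, self-contained training-loop sketch in Python. The toy word-frequency corpus and the helper names are mine; it is meant to illustrate the algorithm, not to be an efficient implementation:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair (A, B) with the merged symbol AB."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# Toy corpus: each word split into characters, mapped to its frequency.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(10):  # 10 merges stands in for the target vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(vocab, best)
    merges.append(best)

print(merges)  # the first merge is ('e', 'r'), the most frequent pair in this toy corpus
```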

The key is that each merge is a precision-recall trade-off:

  • More merges → higher recall (fewer <unk> tokens) but lower precision (longer, rarer tokens that may not generalize).
  • Fewer merges → higher precision (shorter, reusable tokens) but lower recall (more <unk> tokens on rare words).

Vocabulary Size: Finding Balance

When you need a single hyper-parameter that balances coverage and granularity, target vocabulary size is often the go-to choice:

  • Too small: Out-of-vocabulary explosion.
  • Too large: GPU memory bloat and slower inference.

I’ve found 10k to 50k is the sweet spot for English-centric models, while multilingual setups often need 100k+ to cover scripts like Chinese, Arabic, or emoji.
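
As a concrete reference point, this is roughly how that single knob is set when training a BPE tokenizer with the Hugging Face tokenizers library (corpus.txt and the 32k target are placeholder choices of mine):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# BPE model with an explicit unknown token and a simple whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# vocab_size is the single hyper-parameter balancing coverage and granularity.
trainer = BpeTrainer(vocab_size=32_000, special_tokens=["<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

print(tokenizer.encode("Tokenization is not a solved problem.").tokens)
```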

Frequency Thresholds and Pre-tokenization: Controlling What Gets Merged

One limitation of vanilla BPE is that it depends on absolute corpus counts. But what if you want to bias the vocabulary toward morphological regularity instead of raw frequency?

This is where pre-tokenization rules (whitespace, punctuation, language-specific regexes) and frequency thresholds come in. They let you:

  • Prevent merges across punctuation (don’t vs. don + ' + t).
  • Force case handling (The vs. the) without exploding the vocabulary.
  • Down-weight noisy web text so that rare but linguistically important patterns still get merged.
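
As an illustration of the pre-tokenization side of this, here is a small regex pre-tokenizer in the spirit of the GPT-2 pattern. The simplified pattern below is my own stand-in for this post, not the production regex of any real tokenizer; it keeps contractions together and isolates punctuation so BPE never merges across it:

```python
import re

# Split text into: contractions, words (with optional leading space),
# digit runs, punctuation runs, and leftover whitespace.
PRETOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # keep common English contractions whole
    r"| ?[A-Za-z]+"            # words, optionally with one leading space
    r"| ?\d+"                  # digit runs
    r"| ?[^\sA-Za-z\d]+"       # punctuation and symbols
    r"|\s+"                    # any remaining whitespace
)

def pretokenize(text):
    """BPE merges then happen only inside these chunks, never across them."""
    return PRETOKENIZE.findall(text)

print(pretokenize("Don't merge across punctuation!"))
# ['Don', "'t", ' merge', ' across', ' punctuation', '!']
```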

Evaluating Tokenizers: Quantifying Tokenization Error

For text generation tasks, where the model is predicting the next token, the tokenizer needs evaluation metrics of its own. These are the three I rely on.

Compression Ratio: Measuring Efficiency

Compression ratio measures how many raw bytes are represented by one token on average:

\[\text{Compression Ratio} = \frac{\text{Bytes in original text}}{\text{Number of tokens}}\]

Higher ratio → fewer tokens per sentence → faster inference.
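
A few lines of Python are enough to track this; the tokenize argument below is a placeholder for whatever tokenizer you are evaluating (str.split is used only as a trivial stand-in):

```python
def compression_ratio(text, tokenize):
    """Average number of UTF-8 bytes represented by one token."""
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(tokenize(text))
    return n_bytes / max(n_tokens, 1)

# Trivial whitespace "tokenizer" standing in for a real one.
print(compression_ratio("Byte Pair Encoding merges frequent symbol pairs.", str.split))
```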

Reconstruction Error: Interpretable Accuracy

Reconstruct the original text from the token sequence and compute:

\[\text{Reconstruction Error} = \frac{\text{Number of characters that fail to round-trip}}{\text{Total characters}}\]

It seems preferable to keep this below 0.1 %; anything higher usually signals that the tokenizer is corrupting bytes.
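
Here is one way to sketch that round-trip check, with encode and decode as placeholders for the tokenizer under test; counting surviving characters via difflib is my own choice, not a standard definition:

```python
import difflib

def reconstruction_error(text, encode, decode):
    """Fraction of characters that fail to survive an encode/decode round trip."""
    roundtrip = decode(encode(text))
    matcher = difflib.SequenceMatcher(a=text, b=roundtrip)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(text), 1)

# With a lossless character-level "tokenizer" the error is exactly 0.0.
print(reconstruction_error("naïve café 🙂", list, "".join))
```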

OOV Rate: Robustness to New Domains

Out-of-vocabulary rate counts how many word types (or byte sequences) are split into <unk>:

\[\text{OOV Rate} = \frac{\text{Unique tokens mapped to <unk>}}{\text{Total unique tokens}}\]

Unlike compression ratio, this is sensitive to domain shift. A tokenizer that works fine on Wikipedia can suddenly produce 13 % <unk> tokens on medical or legal text, for example.
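
And a sketch of the OOV computation over word types, assuming a closed vocabulary represented as a Python set (the toy vocabulary and the medical sentence are invented for illustration):

```python
def oov_rate(text, vocab):
    """Fraction of unique word types that would map to <unk> under a closed vocabulary."""
    types = set(text.split())
    unknown = {t for t in types if t not in vocab}
    return len(unknown) / max(len(types), 1)

# Toy example: a general-domain vocabulary meets a medical sentence.
vocab = {"the", "patient", "was", "given", "a", "dose", "of"}
print(oov_rate("the patient was given a small dose of enoxaparin", vocab))
```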


Language-Specific Extensions: Beyond Plain BPE

Vanilla BPE treats every byte as equal, but real text has structure. Here are extensions I’ve found useful:

SentencePiece: Unicode-Aware BPE

SentencePiece treats the raw Unicode string, whitespace included, as the sequence to segment, eliminating the need for language-specific pre-tokenizers. It also offers unigram language-model training with vocabulary pruning as an alternative to BPE’s greedy merges.
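
For reference, training and loading a SentencePiece BPE model looks roughly like this (corpus.txt, the bpe_demo prefix, and the 8k vocabulary size are placeholder choices of mine):

```python
import sentencepiece as spm

# Train a BPE model directly on raw text; no external pre-tokenizer needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path to raw training text
    model_prefix="bpe_demo",   # writes bpe_demo.model and bpe_demo.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_demo.model")
print(sp.encode("Byte Pair Encoding, explained.", out_type=str))
```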

Tiktoken: Fast, Pre-computed Merges

OpenAI’s tiktoken ships with frozen BPE vocabularies optimized for GPT models. Thanks Karpathy, btw: I first learned about it in one of his videos. These vocabularies already encode common English patterns, emojis, and code snippets.
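
Using it takes a couple of lines; cl100k_base is one of the encodings tiktoken ships, so pick whichever matches your target model:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Byte Pair Encoding: explained 🙂")
print(tokens)              # token ids
print(enc.decode(tokens))  # lossless round trip back to the original string
```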

Character-Aware BPE: Morphology First

Some teams add morphological segmentation before BPE (e.g., separating prefixes and suffixes). This can reduce OOV rates in morphologically rich languages like Turkish or Finnish.
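
As a purely illustrative sketch of the idea (the affix lists and the split_affixes helper are hypothetical, not taken from any real morphological analyzer), the pre-split could look like this before the BPE step:

```python
# Hypothetical affix-aware pre-split applied before BPE training/encoding.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "ly")

def split_affixes(word):
    """Peel off one known prefix and one known suffix, if present."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[: -len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

print(split_affixes("unhelpfulness"))  # ['un', 'helpful', 'ness']
```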


Conclusion: Tokenizers as the First Key to Unlocking LLM Potential

While I’ve covered some BPE variants in this post, the process also pushed me to look at more recent approaches to this pre-processing step, which, judging by some recent papers, may become unnecessary altogether.

For the moment, from what I’ve seen, the best approach to tokenization combines quantitative diagnostics (compression ratio, OOV rate) with qualitative spot-checks (can humans still read the detokenized text?). Your tokenizer might achieve state-of-the-art compression on Wikipedia, but what ultimately matters is how it behaves on your specific application.

For now, I’ve focused on BPE and its variants, but in the near future I plan to dive into the newer tokenizer-less methods (like byte-level or continuous approaches) that are starting to make traditional pre-processing steps unnecessary. I’ll explore these techniques more thoroughly and write a dedicated post on them soon.

This post is licensed under CC BY 4.0 by the author.