Subword Tokenization
Byte Pair Encoding (BPE) is a data compression algorithm that is commonly adapted for subword tokenization in neural language models. BPE tokenizes text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, keep the vocabulary size manageable, and improve the efficiency of language models.
Algorithm
The following describes the steps of BPE by analogy to the Expectation-Maximization (EM) algorithm:
Initialization: Given a dictionary consisting of all words and their counts in a corpus, the symbol vocabulary is initialized by tokenizing each word into its most basic subword units, such as characters.
Expectation: With the (updated) symbol vocabulary, the algorithm counts the frequency of every pair of adjacent symbols across all words in the vocabulary.
Maximization: Given all symbol pairs and their frequencies, the algorithm merges the top-k most frequent pairs into new symbols in the vocabulary.
The expectation and maximization steps are repeated until meaningful sets of subwords are found for all words in the corpus.
The EM algorithm is a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with it?
Implementation
Let us consider a toy vocabulary:
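The actual toy example lives in the source file; as a stand-in, here is a hypothetical word-count dictionary in the same spirit, mapping each word to its frequency in a corpus:

```python
# Hypothetical toy vocabulary (a stand-in, not the exact example from
# src/byte_pair_encoding.py): words mapped to their counts in a corpus.
word_counts = {
    "low": 5,
    "lower": 2,
    "newest": 6,
    "widest": 3,
}
```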
First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the End of the Word:
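A minimal sketch of this initialization step, assuming the word_counts dictionary above (the helper name init_vocab and its signature are assumptions, not the source's exact interface):

```python
EOW = "[EoW]"  # special end-of-word symbol

def init_vocab(word_counts):
    """Split each word into space-separated characters and append the [EoW] symbol."""
    return {" ".join(word) + " " + EOW: count for word, count in word_counts.items()}

# init_vocab(word_counts) ->
# {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w e s t [EoW]': 6, 'w i d e s t [EoW]': 3}
```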
Next, we count the frequencies of all symbol pairs in the vocabulary:
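One way expect() could be implemented, assuming it takes the symbol vocabulary and returns the counts of adjacent symbol pairs weighted by word frequency (the exact signature is an assumption):

```python
from collections import Counter

def expect(vocab):
    """Count every adjacent symbol pair, weighted by the count of the word it occurs in."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs
```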
Finally, we update the vocabulary by merging the most frequent symbol pair across all words:
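A corresponding sketch of maximize(); for simplicity it merges only the single most frequent pair per iteration, whereas the source may merge the top-k pairs:

```python
import re

def maximize(vocab, pairs):
    """Merge the most frequent symbol pair wherever it appears as adjacent symbols."""
    best = max(pairs, key=pairs.get)                       # e.g. ('e', 's')
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
    merged = "".join(best)                                 # e.g. 'es'
    return {pattern.sub(merged, word): count for word, count in vocab.items()}
```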
The expect() and maximize() functions can be repeated for multiple iterations until the tokenization becomes reasonable:
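Putting the pieces together, bpe_vocab() might look roughly like the following; the argument names and the fixed iteration count are assumptions rather than the source's exact interface:

```python
def bpe_vocab(word_counts, iterations=10):
    """Run the expectation/maximization loop for a fixed number of merges."""
    vocab = init_vocab(word_counts)
    for _ in range(iterations):
        pairs = expect(vocab)
        if not pairs:
            break
        vocab = maximize(vocab, pairs)
        # print(vocab)  # uncomment to trace how symbols are merged in each iteration
    return vocab

final_vocab = bpe_vocab(word_counts, iterations=10)
```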
When you uncomment L7 in bpe_vocab(), you can see how the symbols are merged in each iteration:
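With the hypothetical toy vocabulary above, and ties between equally frequent pairs broken by first occurrence, the trace would begin roughly like this:

```python
# Iteration 1: {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w es t [EoW]': 6, 'w i d es t [EoW]': 3}
# Iteration 2: {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w est [EoW]': 6, 'w i d est [EoW]': 3}
# Iteration 3: {'l o w [EoW]': 5, 'l o w e r [EoW]': 2, 'n e w est[EoW]': 6, 'w i d est[EoW]': 3}
```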
What are the disadvantages of using BPE-based tokenization instead of rule-based tokenization? What are the potential issues with the implementation of BPE above?
References
Source code: src/byte_pair_encoding.py
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.
A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.