Subword Tokenization
Byte Pair Encoding (BPE) is a data compression algorithm that is commonly used for subword tokenization in natural language processing. BPE splits text into smaller units, such as subword pieces or characters, to handle out-of-vocabulary words, reduce the vocabulary size, and improve the efficiency of language models.
The following describes the steps of BPE in terms of the EM (Expectation-Maximization) algorithm:
1. Initialization: Given a dictionary consisting of all words and their counts in a corpus, initialize the symbol vocabulary by tokenizing each word into its most basic subword units, such as characters.
2. Expectation: Given the (updated) symbol vocabulary, calculate the frequency of every symbol pair within the vocabulary.
3. Maximization: Given all symbol pairs and their frequencies, merge the top-k most frequent symbol pairs in the vocabulary.
Steps 2 and 3 are repeated until meaningful sets of subwords are found for all words in the corpus.
Q4: The EM algorithm stands as a classic method in unsupervised learning. What are the advantages of unsupervised learning over supervised learning, and which tasks align well with unsupervised learning?
Let us consider a toy vocabulary:
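As a concrete starting point, the sketch below assumes a small dictionary of word counts in the style of the classic example from Sennrich et al.; the specific words and counts are illustrative only and may differ from the original example:

```python
# A hypothetical toy vocabulary mapping each word to its count in the corpus.
word_counts = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}
```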
First, we create the symbol vocabulary by inserting a space between every pair of adjacent characters and adding a special symbol [EoW] at the end to indicate the end of the word:
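A minimal sketch of this step, assuming the word_counts dictionary above; the helper name init_vocab() is hypothetical and not taken from the original code:

```python
EOW = '[EoW]'  # special symbol marking the end of a word

def init_vocab(word_counts):
    """Splits every word into space-separated characters and appends [EoW]."""
    return {' '.join(word) + ' ' + EOW: count for word, count in word_counts.items()}

vocab = init_vocab(word_counts)
# e.g., 'low' -> 'l o w [EoW]'
```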
Next, we count the frequencies of all symbol pairs in the vocabulary:
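A sketch of the pair-counting step, written here as the expect() function mentioned later on this page; the body is an assumption, not the original implementation:

```python
from collections import Counter

def expect(vocab):
    """Counts the frequency of every adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += count
    return pairs

pairs = expect(vocab)
# e.g., ('e', 's') -> 9 if 'e s' occurs in words whose counts sum to 9
```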
Finally, we update the vocabulary by merging the most frequent symbol pair across all words:
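A sketch of the merge step as a maximize() function, handling only the single most frequent pair (the k = 1 case of the Maximization step); the regular expression ensures that whole symbols, not substrings of longer symbols, are merged:

```python
import re

def maximize(vocab, pairs):
    """Merges the most frequent symbol pair in every word of the vocabulary."""
    best = max(pairs, key=pairs.get)  # e.g., ('e', 's')
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
    return {pattern.sub(''.join(best), word): count for word, count in vocab.items()}

vocab = maximize(vocab, pairs)
# e.g., 'n e w e s t [EoW]' -> 'n e w es t [EoW]'
```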
The expect() and maximize() functions can be repeated for multiple iterations until the tokenization becomes reasonable:
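A sketch of a driver function named bpe_vocab() as referenced below; the iteration count and the commented-out trace line are assumptions, and the line numbering of the original source (e.g., L7) may not match this sketch:

```python
def bpe_vocab(word_counts, num_iterations=10):
    """Alternates expect() and maximize() for a fixed number of iterations."""
    vocab = init_vocab(word_counts)
    for i in range(num_iterations):
        pairs = expect(vocab)
        if not pairs:
            break
        # print(i, max(pairs, key=pairs.get))  # trace which pair is merged per iteration
        vocab = maximize(vocab, pairs)
    return vocab
```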
When you uncomment L7 in bpe_vocab(), you can see how the symbols are merged in each iteration:
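A minimal usage sketch with the assumed toy counts; uncommenting the trace line in the sketch above prints the pair merged at each iteration (with these counts, suffix-like symbols such as es, est, and est[EoW] tend to form within the first few merges):

```python
vocab = bpe_vocab(word_counts, num_iterations=10)
for word, count in vocab.items():
    print(f'{word}\t{count}')
```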
Q5: What are the disadvantages of using BPE-based tokenization instead of word-level tokenization? What are the potential issues with the implementation of BPE above?
Source code:
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016.
A New Algorithm for Data Compression, Gage, The C Users Journal, 1994.
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (WordPiece), Wu et al., arXiv, 2016.