Tokenization is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. Tokens serve as the basic building blocks used for a specific task.
Q3: What is the difference between a word and a token?
When examining dat/word_types.txt from the previous section, you notice several words that need further tokenization, many of which can be resolved by leveraging punctuation.
Depending on the task, you may want to tokenize "[26]" into ['[', '26', ']'] for better generalization. In this case, however, we consider "[26]" a unique identifier for the corresponding reference rather than the number "26" surrounded by square brackets. Thus, we aim to recognize it as a single token.
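To make the two options concrete, the sketch below contrasts them on the string "[26]". It uses Python's re module purely for illustration, which is an assumption of this example and not the delimiter-based approach developed in the rest of this section; the names ref, split_up, and single are hypothetical.

```python
import re

ref = '[26]'

# Option 1: treat the brackets and the number as separate tokens.
split_up = re.findall(r'\[|\]|\d+', ref)

# Option 2: treat the whole bracketed reference as one token.
single = re.findall(r'\[\d+\]', ref)

print(split_up)  # ['[', '26', ']']
print(single)    # ['[26]']
```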
Delimiters
Let us write the delimit() function that takes a word and a set of delimiters, and returns a list of tokens by splitting the word using the delimiters:
L2: Find the index of the first character in word that is in the set of delimiters, using enumerate() and next(). If no delimiter is found in word, return -1, the default value passed to next().
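The lookup described in L2 can be tried in isolation. The sketch below wraps the same next()/enumerate() idiom in a helper; the name first_delimiter_index and the two sample words are assumptions made for illustration and are not part of the original code.

```python
def first_delimiter_index(word: str, delimiters: set[str]) -> int:
    """Return the index of the first delimiter in word, or -1 if none occurs."""
    # The generator yields indices lazily, so next() stops at the first match;
    # -1 is the default returned when the generator is exhausted.
    return next((i for i, c in enumerate(word) if c in delimiters), -1)

print(first_delimiter_index('Atlanta,', {',', '.'}))  # 7
print(first_delimiter_index('Emory', {',', '.'}))     # -1
```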
Let us define a set of delimiters and test delimit() using various inputs:
L17: Iterate over the two lists, input and output, in parallel using the zip() function.
Q4: All delimiters used in our implementation are punctuation marks. What types of tokens should not be split by such delimiters?
Post-Processing
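When reviewing the output of delimit(), the first four test cases yield accurate results, while the last five are not handled properly. They should have been tokenized as follows:
Department's     -> ['Department', "'s"]
activity"[26]    -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000          -> ['149,000']
U.S.             -> ['U.S.']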
To handle these special cases, let us post-process the tokens generated by delimit():
L2: Initialize variables i for the current position and new_tokens for the resulting tokens.
L4: Iterate through the input tokens.
L5: Case 1: Handling apostrophes for contractions like "'s" (e.g., it's).
Once the post-processing is applied, all outputs are now handled properly:
Q5: Our tokenizer uses hard-coded rules to handle specific cases. What would be a scalable approach to handling more diverse cases?
Tokenizing
Finally, let us write tokenize() that takes a path to a corpus and a set of delimiters, and returns a list of tokens from the corpus:
L2: Read the corpus file.
L3: Split the text into words.
L4: Tokenize each word in the corpus using the specified delimiters. postprocess() is used to process the special cases further. The resulting tokens are collected in a list using a nested list comprehension and returned.
Given the new tokenizer, let us recount word types in the corpus, dat/emory-wiki.txt, and save them:
L2: Import save_output() from the src.frequency_analysis module.
L13: Save the tokenized output to dat/word_types-token.txt.
Compared to the original tokenization, where all words are split solely by whitespace, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197, because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.
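The distinction between word tokens and word types can be checked on a small scale. The sketch below counts both for a made-up sentence (not taken from the corpus), first splitting on whitespace only and then separating punctuation. The regular expression is a simplified stand-in for the tokenizer built in this section, and all names in it are assumptions for illustration.

```python
import re
from collections import Counter

# Toy sentence, invented for this example.
text = 'Emory is in Atlanta, and Atlanta is in Georgia.'

# Whitespace-only splitting keeps punctuation attached to words.
ws_tokens = text.split()

# Simplified punctuation-aware splitting: runs of word characters,
# or any single character that is neither a word character nor whitespace.
punct_tokens = re.findall(r'\w+|[^\w\s]', text)

for name, tokens in [('whitespace', ws_tokens), ('punctuation-aware', punct_tokens)]:
    types = Counter(tokens)
    print(f'{name}: {len(tokens)} tokens, {len(types)} types')
```

On this sentence the counts go from 9 tokens and 7 types to 11 tokens and 8 types: the comma and period become tokens of their own, while "Atlanta," and "Atlanta" are no longer counted as different types.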
Q6: The use of a more advanced tokenizer mitigates the issue of sparsity. What exactly is the sparsity issue, and how can appropriate tokenization help alleviate it?
References
Source:
- a heuristic-based tokenizer
L3: If no delimiter is found, return a list containing word as a single token.
L4: If a delimiter is found, create a list tokens to store the individual tokens.
L6: If the delimiter is not at the beginning of word, add the characters before the delimiter as a token to tokens.
L7: Add the delimiter itself as a separate token to tokens.
L9-10: If there are characters after the delimiter, call delimit() recursively on the remaining part of word and extend() the tokens list with the result.
def delimit(word: str, delimiters: set[str]) -> list[str]:
    i = next((i for i, c in enumerate(word) if c in delimiters), -1)
    if i < 0: return [word]
    tokens = []

    if i > 0: tokens.append(word[:i])
    tokens.append(word[i])

    if i + 1 < len(word):
        tokens.extend(delimit(word[i + 1:], delimiters))

    return tokens
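The recursive definition above makes one call per delimiter found in the word. As a sketch of an alternative, the same splitting can be done in a single left-to-right pass. The function name delimit_iterative is hypothetical and is not part of the original source. It is intended to produce the same tokens for non-empty words; for an empty string it returns an empty list, whereas delimit() returns [''].

```python
def delimit_iterative(word: str, delimiters: set[str]) -> list[str]:
    """Split word around delimiters in one pass, keeping each delimiter as a token."""
    tokens, start = [], 0
    for i, c in enumerate(word):
        if c in delimiters:
            # Emit the text accumulated since the previous delimiter, if any.
            if start < i:
                tokens.append(word[start:i])
            tokens.append(c)
            start = i + 1
    # Emit whatever follows the last delimiter.
    if start < len(word):
        tokens.append(word[start:])
    return tokens

delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}
print(delimit_iterative('(R&D)', delims))             # ['(', 'R&D', ')']
print(delimit_iterative('centers.[21][22]', delims))  # ['centers', '.', '[', '21', ']', '[', '22', ']']
```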
delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}

input = [
    '"R1:',
    '(R&D)',
    '15th-largest',
    'Atlanta,',
    "Department's",
    'activity"[26]',
    'centers.[21][22]',
    '149,000',
    'U.S.'
]

output = [delimit(word, delims) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
def postprocess(tokens: list[str]) -> list[str]:
    i, new_tokens = 0, []

    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == "'" and tokens[i + 1].lower() == 's':
            new_tokens.append(''.join(tokens[i:i + 2]))
            i += 1
        elif i + 2 < len(tokens) and \
                ((tokens[i] == '[' and tokens[i + 1].isnumeric() and tokens[i + 2] == ']') or
                 (tokens[i].isnumeric() and tokens[i + 1] == ',' and tokens[i + 2].isnumeric())):
            new_tokens.append(''.join(tokens[i:i + 3]))
            i += 2
        elif i + 3 < len(tokens) and ''.join(tokens[i:i + 4]) == 'U.S.':
            new_tokens.append(''.join(tokens[i:i + 4]))
            i += 3
        else:
            new_tokens.append(tokens[i])
        i += 1
    return new_tokens
output = [postprocess(delimit(word, delims)) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
    with open(corpus) as fin:
        words = fin.read().split()
    return [token for word in words for token in postprocess(delimit(word, delimiters))]
from collections import Counter
from src.frequency_analysis import save_output

corpus = 'dat/emory-wiki.txt'
output = 'dat/word_types-token.txt'

words = tokenize(corpus, delims)
counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')

save_output(counts, output)