# Tokenization

**Tokenization** is the process of breaking down a text into smaller units, typically words or subwords, known as tokens. **Tokens** serve as the basic building blocks used for a specific task.

{% hint style="warning" %}
**Q3**: What is the difference between a **word** and a **token**?
{% endhint %}

When examining the [dat/word\_types.txt](https://github.com/emory-courses/nlp-essentials/tree/main/dat/word_types.txt) from the previous section, you notice several words that need further tokenization, where many of them can be resolved by leveraging punctuation:

* "R1: → `['"', "R1", ":"]`
* (R\&D) → `['(', 'R&D', ')']`
* 15th-largest → `['15th', '-', 'largest']`
* Atlanta, → `['Atlanta', ',']`
* Department's → `['Department', "'s"]`
* activity"\[26] → `['activity', '"', '[26]']`
* centers.\[21]\[22] → `['centers', '.', '[21]', '[22]']`

{% hint style="info" %}
Depending on the task, you may want to tokenize "\[26]" into `['[', '26', ']']` for more generalization. In this case, however, we consider "\[26]" as a unique identifier for the corresponding reference rather than as the number "26" surrounded by square brackets. Thus, we aim to recognize it as a single token.
{% endhint %}

## Delimiters <a href="#delimiters" id="delimiters"></a>

Let us write the `delimit()` function that takes a word and a set of delimiters, and returns a list of tokens by splitting the word using the delimiters:

{% code lineNumbers="true" %}

```python
def delimit(word: str, delimiters: set[str]) -> list[str]:
    i = next((i for i, c in enumerate(word) if c in delimiters), -1)
    if i < 0: return [word]
    tokens = []

    if i > 0: tokens.append(word[:i])
    tokens.append(word[i])

    if i + 1 < len(word):
        tokens.extend(delimit(word[i + 1:], delimiters))

    return tokens
```

{% endcode %}

* L1: [Support for type hints](https://docs.python.org/3.12/library/typing.html).
* L2: Find the index of the first character in `word` that is in a [set](https://docs.python.org/3/library/stdtypes.html#set) of `delimiters` ([enumerate()](https://docs.python.org/3/library/functions.html#enumerate), [next()](https://docs.python.org/3/library/functions.html#next)). If no delimiter is found in `word`, return -1 ([generator expressions](https://docs.python.org/3/reference/expressions.html#generator-expressions)).
* L3: If no delimiter is found, return a list containing `word` as a single token.
* L4: If a delimiter is found, create a list `tokens` to store the individual tokens.
* L6: If the delimiter is not at the beginning of `word`, add the characters before the delimiter as a token to `tokens`.
* L7: Add the delimiter itself as a separate token to `tokens`.
* L9-10: If there are characters after the delimiter, call `delimit()` recursively on the remaining part of `word` and [extend()](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) the `tokens` list with the result.

Let us define a set of delimiters and test `delimit()` using various input:

{% tabs %}
{% tab title="Code" %}
{% code lineNumbers="true" %}

```python
delims = {'"', "'", '(', ')', '[', ']', ':', '-', ',', '.'}

input = [
    '"R1:',
    '(R&D)',
    '15th-largest',
    'Atlanta,',
    "Department's",
    'activity"[26]',
    'centers.[21][22]',
    '149,000',
    'U.S.'
]

output = [delimit(word, delims) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
```

{% endcode %}

* L1: [Set types](https://docs.python.org/3/library/stdtypes.html?highlight=list#set-types-set-frozenset).
* L17: Iterate over the two lists, `input` and `output`, in parallel using the [zip()](https://docs.python.org/3/library/functions.html#zip) function.
* L18: [Format Specification Mini-Language](https://docs.python.org/3/library/string.html#format-specification-mini-language)
  {% endtab %}

{% tab title="Output" %}

```
"R1: -> ['"', 'R1', ':']
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'", 's']
activity"[26] -> ['activity', '"', '[', '26', ']']
centers.[21][22] -> ['centers', '.', '[', '21', ']', '[', '22', ']']
149,000 -> ['149', ',', '000']
U.S. -> ['U', '.', 'S', '.']
```

{% endtab %}
{% endtabs %}

{% hint style="warning" %}
**Q4**: All delimiters used in our implementation are **punctuation marks**. What types of tokens should not be split by such delimiters?
{% endhint %}

## Post-Processing

When reviewing the output of `delimit()`, the first four test cases yield accurate results, while the last five are not handled properly, which should have been tokenized as follows:

* Department's → `['Department', "'s"]`
* activity"\[26] → `['activity', '"', '[26]']`
* centers.\[21]\[22] → `['centers', '.', '[21]', '[22]']`
* 149,000 → `['149,000']`
* U.S. → `['U.S.']`

To handle these special cases, let us post-process the tokens generated by `delimit()`:

{% code lineNumbers="true" %}

```python
def postprocess(tokens: list[str]) -> list[str]:
    i, new_tokens = 0, []

    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == "'" and tokens[i + 1].lower() == 's':
            new_tokens.append(''.join(tokens[i:i + 2]))
            i += 1
        elif i + 2 < len(tokens) and \
                ((tokens[i] == '[' and tokens[i + 1].isnumeric() and tokens[i + 2] == ']') or
                 (tokens[i].isnumeric() and tokens[i + 1] == ',' and tokens[i + 2].isnumeric())):
            new_tokens.append(''.join(tokens[i:i + 3]))
            i += 2
        elif i + 3 < len(tokens) and ''.join(tokens[i:i + 4]) == 'U.S.':
            new_tokens.append(''.join(tokens[i:i + 4]))
            i += 3
        else:
            new_tokens.append(tokens[i])
        i += 1

    return new_tokens
```

{% endcode %}

* L2: Initialize variables `i` for the current position and `new_tokens` for the resulting tokens.
* L4: Iterate through the input tokens.
* L5: Case 1: Handling apostrophes for contractions like "*'s*" (e.g., *it's*).
  * L6: Combine the apostrophe and "*s*" and append it as a single token.
  * L7: Move the position indicator by 1 to skip the next character.
* L8-10: Case 2: Handling numbers in special formats like \[##], ###,### (e.g., \[42], 12,345).
  * L11: Combine the special number format and append it as a single token.
  * L12: Move the position indicator by 2 to skip the next two characters.
* L13: Case 3: Handling acronyms like "*U.S.*".
  * L14: Combine the acronym and append it as a single token.
  * L15: Move the position indicator by 3 to skip the next three characters.
* L17: Case 4: If none of the special cases above are met, append the current token.
* L18: Move the position indicator by 1 to process the next token.
* L20: Return the list of processed tokens.

Once the post-processing is applied, all outputs are now handled properly:

{% tabs %}
{% tab title="Code" %}
{% code lineNumbers="true" %}

```python
output = [postprocess(delimit(word, delims)) for word in input]

for word, tokens in zip(input, output):
    print('{:<16} -> {}'.format(word, tokens))
```

{% endcode %}
{% endtab %}

{% tab title="Output" %}

```
"R1: -> ['"', 'R1', ':']
(R&D) -> ['(', 'R&D', ')']
15th-largest -> ['15th', '-', 'largest']
Atlanta, -> ['Atlanta', ',']
Department's -> ['Department', "'s"]
activity"[26] -> ['activity', '"', '[26]']
centers.[21][22] -> ['centers', '.', '[21]', '[22]']
149,000 -> ['149,000']
U.S. -> ['U.S.']
```

{% endtab %}
{% endtabs %}

{% hint style="warning" %}
**Q5**: Our tokenizer uses **hard-coded rules** to handle specific cases. What would be a **scalable** approach to handling more diverse cases?
{% endhint %}

## Tokenizing

Finally, let us write `tokenize()` that takes a path to a corpus and a set of delimiters, and returns a list of tokens from the corpus:

{% code lineNumbers="true" %}

```python
def tokenize(corpus: str, delimiters: set[str]) -> list[str]:
    with open(corpus) as fin:
        words = fin.read().split()
    return [token for word in words for token in postprocess(delimit(word, delimiters))]
```

{% endcode %}

* L2: Read the `corpus` file.
* L3: Split the text into words.
* L4: Tokenize each word in the corpus using the specified delimiters. `postprocess()` is used to process the special cases further. The resulting tokens are collected in a list and returned ([list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)).

Given the new tokenizer, let us recount word types in the corpus, [emory-wiki.txt](https://github.com/emory-courses/nlp-essentials/blob/main/dat/text_processing/emory-wiki.txt), and save them:

{% tabs %}
{% tab title="Code" %}
{% code lineNumbers="true" %}

```python
from collections import Counter
from src.frequency_analysis import save_output

corpus = 'dat/emory-wiki.txt'
output = 'dat/word_types-token.txt'

words = tokenize(corpus, delims)
counts = Counter(words)

print(f'# of word tokens: {len(words)}')
print(f'# of word types: {len(counts)}')

save_output(counts, output)
```

{% endcode %}

* L2: Import `save_output()` from the [src/frequency\_analysis.py](https://github.com/emory-courses/nlp-essentials/blob/main/src/frequency_analysis.py) module.
* L13: Save the tokenized output to [dat/word\_types-token.txt](https://github.com/emory-courses/nlp-essentials/blob/main/dat/word_types-token.txt).
  {% endtab %}

{% tab title="Output" %}

```
# of word tokens: 363
# of word types: 197
```

[dat/text\_processing /word\_types-token.txt](https://github.com/emory-courses/nlp-essentials/blob/main/dat/text_processing/word_types-token.txt)
{% endtab %}
{% endtabs %}

Compared to the original tokenization, where all words are split solely by whitespaces, the more advanced tokenizer increases the number of word tokens from 305 to 363 and the number of word types from 180 to 197 because all punctuation symbols, as well as reference numbers, are now introduced as individual tokens.

{% hint style="warning" %}
**Q6**: The use of a more advanced tokenizer mitigates the issue of sparsity. What exactly is the **sparsity issue**, and how can appropriate tokenization help alleviate it?
{% endhint %}

## References

1. Source: [src/tokenization.py](https://github.com/emory-courses/nlp-essentials/blob/main/src/tokenization.py)
2. [ELIT Tokenizer](https://github.com/emorynlp/elit-tokenizer) - a heuristic-based tokenizer
