Frequency Analysis
Frequency analysis examines how often each word appears in a corpus, which helps us understand patterns in the language's structure and usage.
Word Counting
Consider the following text from Wikipedia about Emory University (as of 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text.
Q1: What is the difference between a word token and a word type?
A simple way of accomplishing this task is to split the text by whitespace and count the resulting strings:
L1: Import the Counter class, a special type of dictionary, from the collections package.
L3: Use typing to indicate the parameter type (str) and the return type (Counter).
L4: open() the corpus file.
L6: Count the occurrences of each word and return the results as a Counter.
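The steps above can be sketched as follows; the function name count_words and the exact line layout are illustrative, not necessarily those of src/frequency_analysis.py:

```python
from collections import Counter

def count_words(corpus: str) -> Counter:
    # split the text by whitespace and count each resulting string
    fin = open(corpus)
    words = fin.read().split()
    return Counter(words)
```

Splitting on whitespace is deliberately naive here; later steps show why punctuation makes this inaccurate.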
L1: The corpus can be found at dat/emory-wiki.txt.
L4: Save the total number of word tokens in the corpus, which is the sum() of counts values.
L5: Save the number of unique word types in the corpus, which is the len() of counts.
L7-8: Print the values using formatted string literals.
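A minimal sketch of these steps, wrapped in a helper so the counting logic stays reusable (the names count_words and word_stats are assumed, not taken from the source file):

```python
from collections import Counter

def count_words(corpus: str) -> Counter:
    # split the corpus text by whitespace and count each string
    return Counter(open(corpus).read().split())

def word_stats(corpus: str) -> tuple[int, int]:
    counts = count_words(corpus)
    n_tokens = sum(counts.values())  # total word tokens: sum of all counts
    n_types = len(counts)            # unique word types: number of keys
    print(f'# of word tokens: {n_tokens}')
    print(f'# of word types: {n_types}')
    return n_tokens, n_types
```
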
When running this program, you may encounter FileNotFoundError. To fix this error, follow these steps to set up your working directory:

1. Go to [Run] > [Edit Configurations] in the menu.
2. Select "frequency_analysis" from the sidebar.
3. Change the working directory to the top-level "nlp-essentials" directory.
4. Click [OK] to save the changes.
Word Frequency
In this task, we aim to retrieve the top-k most or least frequently occurring word types in the text:
L2: Sort words in counts in ascending order and save them into asc as a list of (word, count) tuples.
L4: Iterate over the top k most frequent words in the sorted list using slicing, and print each word along with its count.
L5: Iterate over the top k least frequent words in the sorted list and print each word along with its count.
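One way the sorting and slicing might look; only the name asc follows the description above, while most_least and the exact slices are assumptions:

```python
from collections import Counter

def most_least(counts: Counter, k: int) -> None:
    # sort (word, count) tuples by count in ascending order
    asc = sorted(counts.items(), key=lambda t: t[1])
    # top-k most frequent: the last k entries, reversed for display
    for word, count in asc[-k:][::-1]:
        print(word, count)
    # top-k least frequent: the first k entries
    for word, count in asc[:k]:
        print(word, count)
```

Counter also provides most_common(k) for the most-frequent side, but sorting once lets both ends be sliced from the same list.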
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This occurs because the text was split only by whitespace, without considering punctuation. Consequently, these strings are counted separately from the word types "Atlanta" and "Georgia", so the token and type counts computed above do not necessarily represent the distributions of the text accurately.
Q2: How can we interpret the most frequent words in a text?
Save Output
Finally, let us save all word types in alphabetical order to a file:
L2: Open outfile in write mode (w).
L4: Iterate over unique word types (keys) of counts in alphabetical order.
L5: Write each word followed by a newline character to fout.
L7: Close the output stream.
L1: Creates the dat/word_types.txt file if it does not exist; otherwise, its previous contents will be completely overwritten.
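These steps can be sketched as below; the helper name save_word_types is an assumption, while fout and the write-mode behavior follow the notes above:

```python
from collections import Counter

def save_word_types(counts: Counter, outfile: str) -> None:
    # write mode creates the file, or overwrites its previous contents
    fout = open(outfile, 'w')
    # iterating over a Counter yields its keys (the word types)
    for word in sorted(counts):
        fout.write(word + '\n')
    fout.close()  # close the output stream
```
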
References
Source: src/frequency_analysis.py
Sequence Types, The Python Standard Library - Built-in Types
Mapping Types, The Python Standard Library - Built-in Types
Input and Output, The Python Tutorial