Frequency Analysis
Frequency Analysis examines how often each word appears in a corpus. These counts help us understand the patterns and structure of the language used in the text.
Consider the following text from Wikipedia (retrieved 2023-10-18):
Our task is to determine the number of word tokens and unique word types in this text.
Q1: What is the difference between a word token and a word type?
A simple way of accomplishing this task is to split the text by whitespace and count the resulting strings:
L1: Import the Counter class, a special type of a dictionary, from the collections package.
L3: Use type hints to indicate the parameter type (str) and the return type (Counter).
L4: Open the corpus file.
L5: Read the contents of the file as a string, split it into a list of words, and store them in words.
L6: Count the occurrences of each word and return the results as a Counter.
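Based on the line-by-line description above, the function might look like the following sketch (the function name `count_words` and the parameter name are assumptions; `words` and `Counter` come from the description):

```python
from collections import Counter

def count_words(corpus_path: str) -> Counter:
    """Count the occurrences of each word in a text file."""
    # Open the corpus file and read its contents as a single string.
    with open(corpus_path) as fin:
        # Split the text by whitespace into a list of words.
        words = fin.read().split()
    # Count the occurrences of each word and return them as a Counter.
    return Counter(words)
```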
In this task, we aim to retrieve the top-k most or least frequently occurring word types in the text:
L2: Sort words in counts in ascending order and save them into asc as a list of (word, count) tuples.
L5: Iterate over the top k least frequent words in the sorted list and print each word along with its count.
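A sketch of this snippet, assuming counts is a Counter produced as above (the names asc and k come from the description; the function name is an assumption):

```python
from collections import Counter

def print_least_frequent(counts: Counter, k: int):
    # Sort (word, count) tuples by count in ascending order.
    asc = sorted(counts.items(), key=lambda x: x[1])
    # Iterate over the top k least frequent words and print each with its count.
    for word, count in asc[:k]:
        print(word, count)
```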
Notice that the top-10 least-frequent word list contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This occurs because the text was split only on whitespace, without considering punctuation. Consequently, these tokens are counted separately from the word types "Atlanta" and "Georgia", so the token and type counts computed above do not necessarily represent the distributions in the text accurately.
Q2: How can we interpret the most frequent words in a text?
Finally, let us save all word types in alphabetical order to a file:
L4: Iterate over the unique word types (keys) of counts in alphabetical order.
L5: Write each word followed by a newline character to fout.
L7: Close the output stream.
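Putting these steps together, the snippet might look as follows (fout and outfile come from the description; the function name is an assumption):

```python
from collections import Counter

def save_word_types(counts: Counter, outfile: str):
    # Open the output file in write mode ('w'): this creates the file if it
    # does not exist, and overwrites any previous contents otherwise.
    fout = open(outfile, "w")
    # Iterate over the unique word types (keys) in alphabetical order.
    for word in sorted(counts):
        # Write each word followed by a newline character.
        fout.write(word + "\n")
    # Close the output stream.
    fout.close()
```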
L1: The corpus can be found .
L4: Save the total number of word tokens in the corpus, which is the sum of all values in counts.
L5: Save the number of unique word types in the corpus, which is the length of counts.
L7-8: Print the values using formatted string literals (f-strings).
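These steps can be sketched as follows (the function name and the exact output format are assumptions):

```python
from collections import Counter

def print_stats(counts: Counter):
    # Total number of word tokens: the sum of all values in counts.
    n_tokens = sum(counts.values())
    # Number of unique word types: the number of keys in counts.
    n_types = len(counts)
    # Print both values using f-strings (the wording here is an assumption).
    print(f"# of word tokens: {n_tokens}")
    print(f"# of word types: {n_types}")
```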
L1: Sort words in counts in descending order and save them into dec as a list of (word, count) tuples, from the most frequent to the least frequent words.
L4: Iterate over the top k most frequent words in the sorted list and print each word along with its count.
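A sketch mirroring the least-frequent case, now in descending order (dec and k come from the description; the function name is an assumption):

```python
from collections import Counter

def print_most_frequent(counts: Counter, k: int):
    # Sort (word, count) tuples from the most to the least frequent word.
    dec = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    # Iterate over the top k most frequent words and print each with its count.
    for word, count in dec[:k]:
        print(word, count)
```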
L2: Open outfile in write mode (w).
Note: Opening a file in w mode creates the file if it does not exist; otherwise, its previous contents will be completely overwritten.
Sources:
- The Python Standard Library, Built-in Types
- The Python Tutorial