Word Counting

Update: 2023-10-31

Consider the following text from Wikipedia about Emory University (as of 2023-10-18):

Emory University is a private research university in Atlanta, Georgia. Founded in 1836 as Emory College by the Methodist Episcopal Church and named in honor of Methodist bishop John Emory.[18]

Emory University has nine academic divisions. Emory Healthcare is the largest healthcare system in the state of Georgia[19] and comprises seven major hospitals, including Emory University Hospital and Emory University Hospital Midtown.[20] The university operates the Winship Cancer Institute, Yerkes National Primate Research Center, and many disease and vaccine research centers.[21][22] Emory University is the leading coordinator of the U.S. Health Department's National Ebola Training and Education Center.[23] The university is one of four institutions involved in the NIAID's Tuberculosis Research Units Program.[24] The International Association of National Public Health Institutes is headquartered at the university.[25]

Emory University has the 15th-largest endowment among U.S. colleges and universities.[9] The university is classified among "R1: Doctoral Universities - Very high research activity"[26] and is cited for high scientific performance and citation impact in the CWTS Leiden Ranking.[27] The National Science Foundation ranked the university 36th among academic institutions in the United States for research and development (R&D) expenditures.[28][29] In 1995 Emory University was elected to the Association of American Universities, an association of the 65 leading research universities in the United States and Canada.[5]

Emory faculty and alumni include 2 Prime Ministers, 9 university presidents, 11 members of the United States Congress, 2 Nobel Peace Prize laureates, a Vice President of the United States, a United States Speaker of the House, and a United States Supreme Court Justice. Other notable alumni include 21 Rhodes Scholars and 6 Pulitzer Prize winners, as well as Emmy Award winners, MacArthur Fellows, CEOs of Fortune 500 companies, heads of state and other leaders in foreign government.[30] Emory has more than 149,000 alumni, with 75 alumni clubs established worldwide in 20 countries.[31][32][33]

Word Count

Our task is to determine the number of word tokens and unique word types in this text. A simple way to do this is to split the text on whitespace and count the resulting strings:

What is the difference between a word token and a word type?

from collections import Counter

corpus = 'dat/text_processing/emory-wiki.txt'
words = open(corpus).read().split()
word_counts = Counter(words)

print('# of word tokens: {}'.format(len(words)))
print('# of word types: {}'.format(len(word_counts)))

L1: Import the Counter class from the collections module.
L4: Open the corpus file (open()), read the contents of the file as a string (read()), split it into a list of words (split()), and store them in the words list.
L5: Use Counter to count the occurrences of each word in words and store the results in the word_counts dictionary.
L7: Print the total number of word tokens in the corpus (format()), which is the length of words (len()).
L8: Print the number of unique word types in the corpus, which is the length of word_counts.

# of word tokens: 305
# of word types: 180
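
To make the distinction between tokens and types concrete, here is a minimal sketch that applies the same counting to a short in-memory string instead of the corpus file. The sample sentence and the variable names are made up for illustration and are not part of the original source:

```python
from collections import Counter

# Hypothetical sample sentence (not taken from the corpus).
sample = 'the cat saw the other cat'

tokens = sample.split()   # every whitespace-separated string is one token
types = Counter(tokens)   # every distinct string is one type

print('# of word tokens: {}'.format(len(tokens)))  # 6
print('# of word types: {}'.format(len(types)))    # 4
```

Here "the" and "cat" each occur twice, so they contribute four tokens but only two types.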

Top-k Frequent Words

In this task, we list the k most frequent and the k least frequent words in this text:

wc_des = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
for word, count in wc_des[:10]: print(word, count)

wc_asc = sorted(word_counts.items(), key=lambda x: x[1])
for word, count in wc_asc[:10]: print(word, count)

L1: Sort items in word_counts in descending order of count (sorted(), items(), lambda) and save them into wc_des as a list of (word, count) tuples, from the most frequent to the least frequent word.
L2: Iterate over the first 10 (most frequent) words in the sorted list (slice) and print each word along with its count.
L4: Sort items in word_counts in ascending order of count and save them into wc_asc as a list of (word, count) tuples.
L5: Iterate over the first 10 (least frequent) words in the sorted list and print each word along with its count.

the 18
and 15
of 12
Emory 11
in 10
University 7
is 7
university 6
United 6
research 5

private 1
Atlanta, 1
Georgia. 1
Founded 1
1836 1
College 1
by 1
Episcopal 1
Church 1
named 1
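
As an alternative to sorting the items by hand, Counter provides a most_common() method that returns (word, count) tuples from the most to the least frequent. The following is a small sketch on toy data; toy_counts and its input string are assumptions made for illustration only:

```python
from collections import Counter

# Hypothetical toy input standing in for the corpus.
toy_counts = Counter('a b a c b a d'.split())

# The 2 most frequent words, from most to least frequent.
top_2 = toy_counts.most_common(2)

# The 2 least frequent words: take the full ranking and slice it from the end.
bottom_2 = toy_counts.most_common()[:-3:-1]

print(top_2)     # [('a', 3), ('b', 2)]
print(bottom_2)  # [('d', 1), ('c', 1)]
```

Words with equal counts may appear in a different order than in wc_asc, because the slice walks the descending ranking backwards instead of sorting in ascending order.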

Notice that the list of the 10 least frequent words contains unnormalized strings such as "Atlanta," (with the comma) and "Georgia." (with the period). This happens because the text was split on whitespace only, without handling punctuation. As a result, these strings are counted as word types distinct from "Atlanta" and "Georgia", respectively. Hence, the token and type counts above do not necessarily reflect the true word distribution of the text.
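
One naive way to reduce this effect is to strip leading and trailing punctuation from each string before counting. The sketch below is only an illustration under that assumption, not a proper tokenizer; the names norm_words and norm_counts are hypothetical:

```python
import string
from collections import Counter

# First sentence of the text above, used as a small sample.
sample = 'Emory University is a private research university in Atlanta, Georgia.'

# Strip punctuation from both ends of each whitespace-separated string.
norm_words = [w.strip(string.punctuation) for w in sample.split()]
norm_words = [w for w in norm_words if w]  # drop strings that were only punctuation

norm_counts = Counter(norm_words)
print(norm_counts['Atlanta'], norm_counts['Georgia'])  # 1 1
```

This approach has clear limits: "University" and "university" are still counted as different types, and punctuation inside a string, such as a citation marker attached to a word, is only partly removed.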

Save Output

Finally, save all word types in alphabetical order to a file:

fout = open('dat/text_processing/word_types.txt', 'w')
for key in sorted(word_counts.keys()): fout.write('{}\n'.format(key))
fout.close()

L1: Open a file word_types.txt in write mode (w). If the file does not exist, it will be created. If it does exist, its previous contents will be overwritten.
L2: Iterate over unique word types (keys) of word_counts in sorted order, and write each word followed by a newline character to fout. Note that sorted() compares strings by Unicode code point, so capitalized words come before lowercase ones.
L3: Close the file (close()).
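
The same save step can also be written with a with statement, which closes the file automatically even if an error occurs while writing. This is a minimal sketch using toy counts and a temporary output path, both of which are assumptions made for illustration:

```python
import os
import tempfile
from collections import Counter

# Hypothetical toy counts and a temporary output path, used only in this sketch.
toy_counts = Counter('b a c a'.split())
out_path = os.path.join(tempfile.mkdtemp(), 'word_types.txt')

# The file is closed automatically when the with block ends.
with open(out_path, 'w') as fout:
    for key in sorted(toy_counts.keys()):
        fout.write('{}\n'.format(key))

# Read the file back to check what was written.
with open(out_path) as fin:
    saved = fin.read()
print(saved)
```

With this form there is no separate call to close(), so the file cannot be left open by mistake.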

References

Source: word_counting.py

Mapping Types, The Python Standard Library - Built-in Types.
Sequence Types, The Python Standard Library - Built-in Types.
Input and Output, The Python Tutorial.
