Frequency Analysis

Frequency analysis examines how often each word appears in a corpus. These counts help reveal patterns in language use and structure.

Word Counting

Consider the following text from Wikipedia about Emory University (as of 2023-10-18):


Our task is to determine the number of word tokens and unique word types in this text.


Q1: What is the difference between a word token and a word type?

A simple way of accomplishing this task is to split the text by whitespace and count the resulting strings:

  • L1: Import the Counter class, a special type of dictionary, from the collections package.

  • L3: Use type hints (typing) to indicate the parameter type (str) and the return type (Counter).

  • L4: open() the corpus file.

  • L1: The corpus can be found at dat/emory-wiki.txt.

  • L4: Save the total number of word tokens in the corpus, which is the sum() of the values in counts.

  • L5: Save the number of unique word types in the corpus, which is the len() of counts.
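The same counting steps can be tried on a short in-memory string, which avoids any file access. This is only a minimal sketch: the sample sentence and the helper name `count_tokens_and_types` are illustrative assumptions, not part of the original program.

```python
from collections import Counter
from typing import Tuple


def count_tokens_and_types(text: str) -> Tuple[int, int]:
    """Return (number of word tokens, number of word types) for whitespace-split text."""
    counts = Counter(text.split())
    # Tokens: every occurrence counts. Types: each distinct string counts once.
    return sum(counts.values()), len(counts)


sample = 'the cat saw the dog and the dog saw the cat'  # made-up sentence
n_tokens, n_types = count_tokens_and_types(sample)
print(f'# of word tokens: {n_tokens}')
print(f'# of word types: {n_types}')
```

Here the sentence has 11 whitespace-separated strings but only 5 distinct ones, so the two numbers differ whenever a word repeats.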


When running this program, you may encounter FileNotFoundError. To fix this error, follow these steps to set up your working directory:

  1. Go to [Run] > [Edit Configurations] in the menu.
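Before changing the configuration, it can help to confirm which directory the program is actually running from, since relative paths such as dat/emory-wiki.txt are resolved against it. A minimal sketch using only the standard library; the helper name `corpus_exists` is hypothetical:

```python
from pathlib import Path


def corpus_exists(relative_path: str) -> bool:
    """Report whether relative_path points to a file under the current working directory."""
    path = Path.cwd() / relative_path
    print(f'working directory: {Path.cwd()}')
    print(f'looking for: {path}')
    return path.is_file()


if not corpus_exists('dat/emory-wiki.txt'):
    print('Corpus not found; the working directory is probably not the project root.')
```

If the printed working directory is not the top-level project directory, the relative path will not resolve and open() raises FileNotFoundError.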

Word Frequency

In this task, we aim to retrieve the top-k most or least frequently occurring word types in the text:

  • L1: Sort words in counts in descending order and save them into des as a list of (word, count) tuples, sorted from the most frequent to the least frequent words (sorted(), items(), lambda).

  • L2: Sort words in counts in ascending order and save them into asc as a list of (word, count) tuples.
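As an alternative sketch (not the approach used in this section), Counter also provides a most_common() method that returns (word, count) tuples in descending order of count. The counts below are made up for illustration.

```python
from collections import Counter

counts = Counter({'the': 3, 'cat': 2, 'dog': 1})  # made-up counts for illustration

top = counts.most_common(2)            # the 2 most frequent (word, count) tuples
bottom = counts.most_common()[:-3:-1]  # the 2 least frequent, least frequent first

print(top)
print(bottom)
```

Because the reversed slice walks the list backwards, words with equal counts may come out in a different order than with an ascending sorted() call.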

Notice that the list of the 10 least frequent words contains unnormalized words such as "Atlanta," (with the comma) and "Georgia." (with the period). This occurs because the text was split only on whitespace, without considering punctuation. Consequently, these strings are counted separately from the word types "Atlanta" and "Georgia". The counts of word tokens and types computed above therefore do not necessarily represent the distribution of words in the text accurately.
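One possible way to reduce this effect, shown here only as a sketch, is to strip leading and trailing punctuation from each whitespace-separated string before counting. The helper name `count_words_stripped` and the sample sentence are hypothetical, and the stripping rule is an assumption of this example. It does not cleanly handle citation markers attached to words (e.g., "Georgia[19]") or abbreviations such as "U.S.".

```python
import string
from collections import Counter


def count_words_stripped(text: str) -> Counter:
    """Count whitespace-separated strings after stripping surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.split())
    return Counter(w for w in words if w)  # skip strings that were only punctuation


sample = 'Emory University is in Atlanta, Georgia. Atlanta is in Georgia'  # made-up text
counts = count_words_stripped(sample)
print(counts['Atlanta'], counts['Georgia'])
```

With this rule, "Atlanta," and "Atlanta" fall under the same word type, as do "Georgia." and "Georgia".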


Q2: How can we interpret the most frequent words in a text?

Save Output

Finally, let us save all word types in alphabetical order to a file:

  • L2: Open outfile in write mode (w).

  • L4: Iterate over unique word types (keys) of counts in alphabetical order.

  • L5: Write each word followed by a newline character to fout.

  • L1: Creates the dat/word_types.txt file if it does not exist; otherwise, its previous contents are completely overwritten.
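A common variation, sketched below, opens the file in a with block so that the output stream is closed automatically, even if an error occurs while writing. The function name `save_output_with`, the sample counts, and the temporary output path are illustrative assumptions.

```python
import os
import tempfile
from collections import Counter


def save_output_with(counts: Counter, outfile: str) -> None:
    """Write each word type on its own line, in sorted order."""
    with open(outfile, 'w') as fout:  # closed automatically when the block exits
        for word in sorted(counts.keys()):
            fout.write(f'{word}\n')


counts = Counter('b a c a'.split())  # made-up input
outfile = os.path.join(tempfile.gettempdir(), 'word_types_demo.txt')  # hypothetical path
save_output_with(counts, outfile)
```

Note that sorted() orders strings by code point, so capitalized words come before lowercase ones.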

References

  1. Source: src/frequency_analysis.py

  2. Sequence Types, The Python Standard Library - Built-in Types

  3. Mapping Types, The Python Standard Library - Built-in Types

  • L5: read() the contents of the file as a string, split() it into a list of words, and store them in words.

  • L6: Count the occurrences of each word and return the results as a Counter.

  • L7-8: Print the values using formatted string literals.

  2. Select "frequency_analysis" from the sidebar.

  3. Change the working directory to the top-level "nlp-essentials" directory.

  4. Click [OK] to save the changes.

    Configure the working directory.
  • L4: Iterate over the top k most frequent words in the sorted list using slicing, and print each word along with its count.

  • L5: Iterate over the top k least frequent words in the sorted list and print each word along with its count.

  • L7: Close the output stream.

  • Input and Output, The Python Tutorial
    Emory University is a private research university in Atlanta, Georgia. Founded in 1836 as Emory College by the Methodist Episcopal Church and named in honor of Methodist bishop John Emory.[18]
    
    Emory University has nine academic divisions. Emory Healthcare is the largest healthcare system in the state of Georgia[19] and comprises seven major hospitals, including Emory University Hospital and Emory University Hospital Midtown.[20] The university operates the Winship Cancer Institute, Yerkes National Primate Research Center, and many disease and vaccine research centers.[21][22] Emory University is the leading coordinator of the U.S. Health Department's National Ebola Training and Education Center.[23] The university is one of four institutions involved in the NIAID's Tuberculosis Research Units Program.[24] The International Association of National Public Health Institutes is headquartered at the university.[25]
    
    Emory University has the 15th-largest endowment among U.S. colleges and universities.[9] The university is classified among "R1: Doctoral Universities - Very high research activity"[26] and is cited for high scientific performance and citation impact in the CWTS Leiden Ranking.[27] The National Science Foundation ranked the university 36th among academic institutions in the United States for research and development (R&D) expenditures.[28][29] In 1995 Emory University was elected to the Association of American Universities, an association of the 65 leading research universities in the United States and Canada.[5]
    
    Emory faculty and alumni include 2 Prime Ministers, 9 university presidents, 11 members of the United States Congress, 2 Nobel Peace Prize laureates, a Vice President of the United States, a United States Speaker of the House, and a United States Supreme Court Justice. Other notable alumni include 21 Rhodes Scholars and 6 Pulitzer Prize winners, as well as Emmy Award winners, MacArthur Fellows, CEOs of Fortune 500 companies, heads of state and other leaders in foreign government.[30] Emory has more than 149,000 alumni, with 75 alumni clubs established worldwide in 20 countries.[31][32][33]
    # of word tokens: 305
    # of word types: 180
    the 18
    and 15
    of 12
    Emory 11
    in 10
    University 7
    is 7
    university 6
    United 6
    research 5
    private 1
    Atlanta, 1
    Georgia. 1
    Founded 1
    1836 1
    College 1
    by 1
    Episcopal 1
    Church 1
    named 1
    from collections import Counter
    
    def count_words(corpus: str) -> Counter:
        with open(corpus) as fin:
            words = fin.read().split()
        return Counter(words)
    corpus = 'dat/emory-wiki.txt'
    counts = count_words(corpus)
    
    n_tokens = sum(counts.values())
    n_types = len(counts)
    
    print(f'# of word tokens: {n_tokens}')
    print(f'# of word types: {n_types}')
    des = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    asc = sorted(counts.items(), key=lambda x: x[1])
    
    for word, count in des[:10]: print(word, count)
    for word, count in asc[:10]: print(word, count)
    def save_output(counts: Counter, outfile: str):
        fout = open(outfile, 'w')
    
        for word in sorted(counts.keys()):
            fout.write(f'{word}\n')
    
        fout.close()
    save_output(counts, 'dat/word_types.txt')