Thus far, I’ve pseudo-justified why a collection of NYT articles by Thomas Friedman would be interesting to study, actually compiled/scraped the text and metadata (see Background and Creation post), improved/verified the quality of the data, and computed a handful of simple, corpus-level statistics (see Data Quality and Corpus Stats post). Now, onward to actual natural language analysis!


I would argue that the frequency of occurrence of words and other linguistic elements is the fundamental measure on which much NLP is based. In essence, we want to answer “How many times did something occur?” in both absolute and relative terms. Since words are probably the most familiar “linguistic elements” of a language, I focused on word occurrence; however, other elements may also merit counting, including morphemes (“bits of words”) and parts-of-speech (nouns, verbs, …).

Note: In the past I’ve been confused by the terminology used for absolute and relative frequencies; I’m pretty sure it’s used inconsistently in the literature. I use count to refer to absolute frequencies (whole, positive numbers: 1, 2, 3, …) and frequency to refer to relative frequencies (rational numbers between 0.0 and 1.0). These definitions sweep certain complications under the rug, but I don’t want to get into it right now…

Anyway, in order to count individual words, I had to split the corpus text into a list of its component words. I’ve discussed tokenization before, so I won’t go into details. Given that I scraped this text from the web, though, I should note that I cleaned it up a bit before tokenizing: namely, I decoded any HTML entities; removed all HTML markup, URLs, and non-ASCII characters; and normalized white-space. Perhaps controversially, I also unpacked contractions (e.g., “don’t” => “do not”) in an effort to avoid weird tokens that creep in around apostrophes (e.g., “don” + “’” + “t” or “don” + “’t”). Since any mistakes in tokenization propagate to results downstream, it’s probably best to use a “standard” tokenizer rather than something homemade; I’ve found NLTK’s defaults to be good enough (usually). Here’s some sample code:

from itertools import chain
from bs4 import BeautifulSoup
from nltk import sent_tokenize, word_tokenize
# docs: list of article dicts, each with a 'full_text' key
# combine all articles into single block of text
all_text = ' '.join(doc['full_text'] for doc in docs)
# partial cleaning as example: strip residual HTML markup
# (nltk.clean_html raises NotImplementedError in NLTK 3, which points to BeautifulSoup's get_text() instead)
cleaned_text = BeautifulSoup(all_text, 'html.parser').get_text()
# tokenize text into sentences, sentences into words
tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(cleaned_text)]
# flatten list of lists into a single words list
all_words = list(chain.from_iterable(tokenized_text))
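The other cleaning steps mentioned above (entity decoding, URL and non-ASCII removal, contraction unpacking, white-space normalization) could look roughly like the following. This is a minimal standard-library sketch, not the code I actually ran: the function name `clean_text`, the three-entry `CONTRACTIONS` table, and the regexes are all illustrative assumptions.

```python
import html
import re

# Hypothetical, deliberately tiny contraction table; a real one would be much longer.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(raw):
    """Apply the cleaning steps described above to one raw string."""
    text = html.unescape(raw)                    # decode HTML entities (&amp; -> &)
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags (crude regex, not a real parser)
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = text.replace("\u2019", "'")           # curly apostrophe -> straight, before the ASCII filter
    text = text.encode("ascii", "ignore").decode("ascii")  # drop remaining non-ASCII characters
    for short, full in CONTRACTIONS.items():     # unpack contractions (replacement is lowercase)
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()     # normalize white-space

sample = "<p>I don&#8217;t  know &amp; neither do you; see http://example.com/x</p>"
print(clean_text(sample))
```

Converting the curly apostrophe before the ASCII filter matters: otherwise “don’t” would collapse to “dont” and the contraction table would never match.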

Now I had one last set of decisions to make: Which words do I want to count? Depends on what you want to do, of course! For example, this article explains how filtering for and studying certain words helped computational linguists identify J.K. Rowling as the person behind the author Robert Galbraith. In my case, I just wanted to get a general feeling for the meaningful words Friedman has used the most. So, I filtered out stop words and bare punctuation tokens, and I lowercased all letters, but I did not stem or lemmatize the words; the total number of words dropped from 2.96M to 1.43M. I then used NLTK’s handy FreqDist() class to get counts by word. Here are both counts and frequencies for the top 30 “good” words in my Friedman corpus:


You can see that the distributions are identical, except for the y-axis values: as discussed above, counts are the absolute number of occurrences for each word, while frequencies are those counts divided by the total number of words in the corpus. It’s interesting but not particularly surprising that Friedman’s top two meaningful words are mr. and said: he’s a journalist, after all, and he’s quoted a lot of people. (Perhaps he met them on the way to/from a foreign airport…) Given what we know about Friedman’s career (as discussed in (1)), most of the other top words also sound about right: Israel/Israeli, president, American, people, world, Bush, …
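For the curious, the filter-then-count step can be sketched with nothing but the standard library. NLTK’s FreqDist is a subclass of collections.Counter, so the counting logic is the same idea. The tiny `STOP_WORDS` set, the helper names, and the choice to divide by the number of kept words (rather than the unfiltered total) are assumptions made for illustration.

```python
from collections import Counter
import string

# Hypothetical mini stop list; NLTK's stopwords corpus would be the fuller choice.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "not", "do"}

def count_words(tokens):
    """Lowercase tokens, drop stop words and bare punctuation, and count the rest."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue  # token is made up entirely of punctuation characters
        kept.append(tok)
    return Counter(kept)

def to_frequencies(counts):
    """Turn absolute counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

tokens = ["The", "president", "said", ",", "the", "world", "is", "flat", ".", "Said", "who", "?"]
counts = count_words(tokens)
freqs = to_frequencies(counts)
for word, n in counts.most_common(3):
    print(word, n, round(freqs[word], 3))
```

Dividing by a different total only rescales the y-axis, which is why the count and frequency plots have the same shape.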

On a lark, I compared word counts for the five presidents who have held office during Friedman’s NYT career: Ronald Reagan, George H.W. Bush, Bill Clinton, George W. Bush, and Barack Obama:

  • “reagan”: 761
  • “bush”: 3582
  • “clinton”: 2741
  • “obama”: 964

Yes, the two Bushes got combined, and Hillary is definitely contaminating Bill’s counts (I didn’t feel like doing reference disambiguation on this, sorry!). I find it more interesting to plot conditional frequency distributions, i.e. a set of frequency distributions, one for each value of some condition. So, taking the article’s year of publication as the condition, I produced this plot of presidential mentions by year:


Nice! You can clearly see frequencies peaking during a given president’s term(s), which makes sense. Plus, they show Friedman’s change in focus over time: early on, he covered Middle Eastern conflict, not the American presidency; in 1994, a year in which Clinton was mentioned particularly frequently, Friedman was specifically covering the White House. I’m tempted to read further into the data, such as the long decline of W. Bush mentions throughout (and beyond) his second term possibly indicating his slide into irrelevance, but I shouldn’t without first inspecting context. Some other time, perhaps.
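The tallies behind a plot like this are just one counter per condition. Here is a rough standard-library sketch of that idea. NLTK’s ConditionalFreqDist does the same bookkeeping when fed (condition, sample) pairs. The `year` and `tokens` keys and the function name `counts_by_year` are hypothetical, and the toy documents are made up.

```python
from collections import Counter, defaultdict

def counts_by_year(docs, targets):
    """Return {year: Counter of target-word mentions} for a list of documents."""
    targets = set(targets)
    by_year = defaultdict(Counter)
    for doc in docs:
        for tok in doc["tokens"]:
            tok = tok.lower()
            if tok in targets:
                by_year[doc["year"]][tok] += 1
    return by_year

# Toy documents; the 'year' and 'tokens' field names are assumptions.
docs = [
    {"year": 1994, "tokens": ["Clinton", "met", "Clinton", "aides"]},
    {"year": 1994, "tokens": ["Bush", "left", "office"]},
    {"year": 2003, "tokens": ["Bush", "and", "Bush", "again"]},
]
by_year = counts_by_year(docs, ["bush", "clinton"])
for year in sorted(by_year):
    print(year, dict(by_year[year]))
```

These are raw counts; dividing each by the number of tokens published that year would give per-year frequencies, which is fairer when some years contain far more text than others.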

I made a few other conditional frequency distributions using NLTK’s ConditionalFreqDist() class, just for kicks. Here are two, presented without comment (only hints of a raised eyebrow on the author’s part):


These plots-over-time lead naturally into the concept of dispersion.


Although frequencies of (co-)occurrence are fundamental and ubiquitous in corpus linguistics, they are potentially misleading unless one also gives a measure of dispersion, i.e. the spread or variability of a distribution of values. It’s Statistics 101: You shouldn’t report a mean value without an associated dispersion!

Counts/frequencies of words or other linguistic elements are often used to indicate importance in a corpus or language, but consider a corpus in which two words have the same counts, only the first word occurs in 99% of corpus documents, while the second word is concentrated in just 5%. Which word is “more important”? And how should we interpret subsequent statistics based on these frequencies if the second word’s high value is unrepresentative of most of the corpus?

In the case of my Friedman corpus, the conditional frequency distributions over time (above) visualize, to a certain extent, those terms’ dispersions. But we can do more. As it turns out, NLTK includes a small module to plot dispersion; like so:

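from nltk.draw import dispersion_plot
# assumption: all_words is the token list built earlier; ignore_case=True because it was not lowercased
dispersion_plot(all_words, ['reagan', 'bush', 'clinton', 'obama'], ignore_case=True)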

To be honest, I’m not even sure how to interpret this plot; for starters, why does Obama appear at what I think is the beginning of the corpus?! Clearly, it would be nice to quantify dispersion as, like, a single, scalar value. Many dispersion measures have been proposed over the years (see [1] for a nice overview), but in the context of linguistic elements, most are poorly known, little studied, and suffer from a variety of statistical shortcomings. Also in [1], the author proposes an alternative, conceptually simple measure of dispersion called DP, for deviation of proportions, whose derivation he gives as follows:

  • Determine the sizes s of each of the n corpus parts (documents), which are normalized against the overall corpus size and correspond to expected percentages which take differently-sized corpus parts into consideration.
  • Determine the frequencies v with which word a occurs in the n corpus parts, which are normalized against the overall number of occurrences of a and correspond to an observed percentage.
  • Compute all n pairwise absolute differences of observed and expected percentages, sum them up, and divide the result by two. The result is DP, which can theoretically range from approximately 0 to 1, where values close to 0 indicate that a is distributed across the n corpus parts as one would expect given the sizes of the n corpus parts. By contrast, values close to 1 indicate that a is distributed across the n corpus parts exactly the opposite way one would expect given the sizes of the n corpus parts.
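The three steps above translate almost line for line into code. This is a minimal sketch rather than my actual implementation: the function name `deviation_of_proportions` and the representation of corpus parts as plain token lists are assumptions, and the toy corpus is made up to show two words with identical counts but very different spreads.

```python
def deviation_of_proportions(word, parts):
    """DP of `word` over `parts`, a list of token lists (one per corpus part)."""
    sizes = [len(part) for part in parts]          # step 1: size of each part
    hits = [part.count(word) for part in parts]    # step 2: occurrences of word in each part
    total_size = sum(sizes)
    total_hits = sum(hits)
    if total_hits == 0:
        raise ValueError("word does not occur in the corpus")
    expected = [s / total_size for s in sizes]     # expected proportions
    observed = [h / total_hits for h in hits]      # observed proportions
    # step 3: sum the absolute differences and halve the result
    return sum(abs(o - e) for o, e in zip(observed, expected)) / 2

# Toy corpus: four equally sized parts; both words occur exactly 4 times overall.
parts = [
    ["even", "clump", "clump", "clump", "clump"],
    ["even", "x", "x", "x", "x"],
    ["even", "x", "x", "x", "x"],
    ["even", "x", "x", "x", "x"],
]
print(deviation_of_proportions("even", parts))   # spread evenly across the parts
print(deviation_of_proportions("clump", parts))  # concentrated in a single part
```

With equal counts, the evenly spread word gets DP = 0 while the clumped word gets DP = 0.75, which is exactly the distinction that raw counts hide.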

Sounds reasonable to me! (Read the cited paper if you disagree; I found it very convincing.) Using this definition, I calculated DP values for all words in the Friedman corpus and plotted those values against their corresponding counts:


As expected, the most frequent words tend to have lower DP values (be more evenly distributed in the corpus), and vice versa; however, note the wide spread in DP for a fixed count, particularly in the middle range. Many words are definitely distributed unevenly in the Friedman corpus!

A common (but not entirely ideal) way to account for dispersion in corpus linguistics is to compute the adjusted frequency of words, which is often just frequency multiplied by dispersion. (Other definitions exist, but I won’t get into it.) Such adjusted frequencies are by definition some fraction of the raw frequency, and words with low dispersion are penalized more than those with high dispersion. Here, I plotted the frequencies and adjusted frequencies of Friedman’s top 30 words from before:


You can see that the rankings would change if I used adjusted frequency to order the words! This difference can be quantified with, say, a Spearman correlation coefficient, for which a value of 1.0 indicates identical rankings and -1.0 indicates exactly opposite rankings. I calculated a value of 0.89 for frequency-ranks vs adjusted frequency-ranks: similar, but not the same! It’s clear that the effect of (under-)dispersion should not be ignored in corpus linguistics. My big issue with adjusted frequencies is that they are more difficult to interpret: What, exactly, does frequency*dispersion actually mean? What units go with those values? Maybe smarter people than I will come up with a better measure.
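To make the ranking comparison concrete, here is a small sketch that adjusts a few frequencies and computes Spearman’s coefficient by hand. Two loud assumptions: since DP is 0 for a perfectly even spread, I use (1 - DP) as the multiplier, which may not match every definition of adjusted frequency; and the numbers below are invented for illustration, not taken from the Friedman corpus. The rank formula used is the textbook one that is only valid when there are no ties.

```python
def ranks(values):
    """Rank positions (1 = largest value); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Made-up frequencies and DP values for four words (illustration only).
words = ["said", "israel", "world", "bush"]
freq = [0.010, 0.008, 0.006, 0.004]
dp = [0.20, 0.60, 0.10, 0.30]

# Assumption: adjusted frequency = frequency * (1 - DP), so clumped words shrink the most.
adjusted = [f * (1 - d) for f, d in zip(freq, dp)]

print(ranks(freq))      # ranking by raw frequency
print(ranks(adjusted))  # ranking after adjusting for dispersion
print(spearman(freq, adjusted))
```

In this toy case the clumped word “israel” drops below “world” once dispersion is taken into account, and the coefficient falls below 1.0 accordingly. For real data with ties, a library routine such as scipy.stats.spearmanr is the safer choice.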

Well, I’d meant to include word co-occurrence in this post, but it’s already too long. Congratulations on making it all the way through! :) Next time, then, I’ll get into bigrams/trigrams/n-grams and association measures. And after that, I get to the fun stuff!

[1] Gries, Stefan Th. “Dispersions and adjusted frequencies in corpora.” International Journal of Corpus Linguistics 13.4 (2008): 403-437.