Intro to Automatic Keyphrase Extraction

I often apply natural language processing to automatically extract structured information from unstructured (text) datasets. One such task is the extraction of important topical words and phrases from documents, commonly known as terminology extraction or automatic keyphrase extraction. Keyphrases provide a concise description of a document’s content; they are useful for document categorization, clustering, indexing, search, and summarization; for quantifying semantic similarity with other documents; and for conceptualizing particular knowledge domains.
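
As a quick taste, here is a minimal sketch of one common first step, selecting noun-phrase-like candidates, assuming Python with NLTK and its tokenizer/tagger models installed; the chunk grammar below is just one popular heuristic, not the full approach covered in the post.

```python
import nltk

def extract_candidate_phrases(text, grammar=r"KP: {<JJ>* <NN.*>+}"):
    """Return noun-phrase-like chunks (adjectives + nouns) as candidate keyphrases."""
    chunker = nltk.RegexpParser(grammar)
    candidates = []
    for sent in nltk.sent_tokenize(text):
        # POS-tag each sentence, then chunk it with the simple grammar above
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "KP"):
            candidates.append(" ".join(word for word, tag in subtree.leaves()))
    return candidates

print(extract_candidate_phrases(
    "Automatic keyphrase extraction pulls important topical phrases out of text documents."))
```

Candidate selection like this is usually followed by a scoring or ranking step (frequency-based, graph-based, or supervised), which is where the interesting differences between methods show up.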

Read More »

On Starting Over with Jekyll

After another lengthy hiatus from blogging, I’m back! Long story short, I got so frustrated with Blogger’s shortcomings and complications, not to mention the general lack of control over my content, that I lost the will to update my old blog. At the same time, I was putting in longer hours at Harmony Institute and volunteering on the side for DataKind, so I didn’t have much to say outside of official channels. That said, my data life has not gone entirely un-blogged:

Read More »

Friedman Corpus (3) — Occurrence and Dispersion

Thus far, I’ve pseudo-justified why a collection of NYT articles by Thomas Friedman would be interesting to study, actually compiled/scraped the text and metadata (see Background and Creation post), improved/verified the quality of the data, and computed a handful of simple, corpus-level statistics (see Data Quality and Corpus Stats post). Now, onward to actual natural language analysis!

Read More »

Friedman Corpus (2) — Data Quality and Corpus Stats

With a full-text Friedman corpus finally in hand (see Background and Creation post), my first task was to verify data quality. Given “Garbage In, Garbage Out”, the fun stuff (analysis! plots! Friedman_ebooks?!) had to wait. Yes, it’s a pain in the ass, but this step is really important.

Read More »

Friedman Corpus (1) — Background and Creation

Much work in Natural Language Processing (NLP) begins with a large collection of text documents, called a corpus, that represents a written sample of language in a particular domain of study. Corpora come in a variety of flavors: mono- or multi-lingual; category-specific or a representative sampling from a variety of categories, e.g. genres, authors, time periods; simply “plain” text or annotated with additional linguistic information, e.g. part-of-speech tags, full parse trees; and so on. They allow for hypothesis testing and statistical analysis of natural language, but one must be very cautious about applying results derived from a given corpus to other domains.
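
For a concrete (if tiny) example of an annotated, categorized corpus, here is a sketch using NLTK’s copy of the Brown corpus, assuming it has been downloaded via nltk.download('brown'); the same text can be read as plain tokens or with its part-of-speech annotations, filtered by category.

```python
from nltk.corpus import brown

# The Brown corpus is split into categories (news, fiction, ...) and is
# POS-annotated, so it can be read at different levels of linguistic detail.
print(brown.categories()[:5])
print(brown.words(categories="news")[:10])         # plain tokens
print(brown.tagged_words(categories="news")[:10])  # (token, POS tag) pairs
```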

Read More »

While I Was Away

I’ve not posted in almost six months, but I was, like, totally busy. Here’s what I’ve been up to:

Read More »

Intro to Natural Language Processing (2)

A couple of months ago, I posted a brief, conceptual overview of Natural Language Processing (NLP) as applied to the common task of information extraction (IE), that is, the process of extracting structured data from unstructured data, the majority of which is text. A significant component of my job at HI involves scraping text from websites, press articles, social media, and other sources, then analyzing the quantity and especially the quality of the discussion as it relates to a film and/or social issue. Although humans are inarguably better than machines at understanding natural language, it’s impractical for humans to analyze large numbers of documents for themes, trends, content, sentiment, etc., and to do so consistently throughout. This is where NLP comes in.
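
To make “structured data from unstructured data” concrete, here is a minimal sketch of one IE step, named entity recognition, assuming Python with NLTK and its ne_chunker resources; it is an illustration, not the pipeline used at HI.

```python
import nltk

def extract_entities(text):
    """Pull (entity type, entity text) records out of raw text."""
    entities = []
    for sent in nltk.sent_tokenize(text):
        # tokenize -> POS-tag -> chunk named entities
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees(lambda t: t.label() in ("PERSON", "ORGANIZATION", "GPE")):
            entities.append((subtree.label(), " ".join(word for word, tag in subtree.leaves())))
    return entities

print(extract_entities("Thomas Friedman has written for The New York Times since 1981."))
```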

Read More »

A Data Science Education?

Given that you’re currently reading a data science blog, you’re probably well aware that online resources for an informal education in data science abound. Blogs are a great place to start (here, here, here, here, here), but topics and pedagogical quality are, let’s be honest, scattershot at best. No comment on the usefulness of this particular blog…

Read More »

Connecting to the Data Set

As a relative newcomer to the field, I’ve been learning and doing data science largely on my own. This is okay, I guess, given access to Stack Overflow, MOOCs, and a handful of O’Reilly’s textbooks, but not ideal. Fortunately, the data science community here in New York seems to be big and active, so opportunities to connect are plentiful.

Read More »

Data, Data, Everywhere

As I’ve mentioned before, the Internet is a huge (and ever huger!) repository of data. Much of that is in the form of unstructured text (for which natural language processing comes in handy), but an impressive variety of structured datasets can be found and downloaded, too, if you know where to look. Here are some of my favorite sources…

Read More »