First, the big news: I got a job! I’m now a data scientist at a non-profit organization here in Manhattan called Harmony Institute, where we study the science of influence through entertainment. Basically, simple metrics like box office sales and television viewers don’t adequately quantify a film or show’s social impact; we use theory-driven methodology to more fully assess this impact in individuals and across networks. In the case of, say, a social justice documentary that was made specifically to have such an impact, we are able to quantify the film’s level of success. More on this over time, I’m sure.
Second, more big news: “Decay,” the feature-length, physics-themed zombie movie I co-produced, filmed, and edited while also earning my physics PhD at CERN, has been released online! You can watch or download it for free! We’ve had over 200,000 views and downloads in the first week alone, and a fair amount of international press. I’ve been tracking our “buzz” for the past couple of months and will do some fancy analytics after we’re off the post-release peak, so stay tuned. And check out the movie; people seem to have really enjoyed it. :)
Now, back to data science! I wanted to properly introduce Natural Language Processing (NLP for short) after briefly mentioning it in my previous post on web scraping. As you probably know from personal experience, much of the information available online comes in the form of “natural language” like English or Spanish (as opposed to “structured language” like Python or mathematics). Broadly speaking, NLP is computer manipulation of natural language: from word counts to AutoCorrect, machine translation to sentiment analysis, part-of-speech tagging to speech recognition. NLP is a huge and increasingly vital field, allowing for more intuitive human-computer interaction and, of particular importance to data scientists, more effective extraction of structured information from unstructured text.
Here I’ll focus on that last bit, the very useful task of information extraction. Essentially, we want to identify information expressed in a natural language document and convert it into a structured, machine-friendly representation for further analysis. In practice, however, it’s much easier to focus on asking specific questions, i.e. looking for specific “entity relations” in the text. Let’s say we want to know who created the X-men comics given the first paragraph of their Wikipedia page:
The X-Men are a superhero team in the Marvel Comics Universe. They were created by writer Stan Lee and artist Jack Kirby, and first appeared in The X-Men #1 (September 1963). The basic concept of the X-Men is that under a cloud of increasing anti-mutant sentiment, Professor Xavier created a haven at his Westchester mansion to train young mutants to use their powers for the benefit of humanity, and to prove mutants can be heroes. Xavier recruited Cyclops, Iceman, Angel, Beast, and Marvel Girl, calling them “X-Men” because they possess special powers due to their possession of the “X-gene,” a gene which normal humans lack and which gives mutants their abilities.
A person can read this and readily answer the question (Stan Lee and Jack Kirby), but the complexity of natural language makes it difficult for a machine to do the same. It helps to split the task into an ordered pipeline of sub-tasks, starting from the raw text of a document and ending with a list of relations:
- Sentence Segmentation: Before manipulating text at the level of individual words, it is often necessary to split or “segment” the text into sentences. This isn’t trivial, since periods are used in acronyms (U.S.A., Mr.) as well as sentence endings (sometimes simultaneously), and other sentence-ending punctuation (?!) may be used in different, non-standard ways. (Note that this is for English; other languages do things differently!) Using the raw text as input, this step outputs a list of its constituent sentences, e.g. [‘The X-men are a superhero team in the Marvel Comics Universe.’, ‘They were created by writer Stan Lee and artist Jack Kirby, and first appeared in The X-Men #1 (September 1963).’, …]
- Word Tokenization: Before trying to understand the meanings of words, we first have to identify the words themselves. In English, words are generally delimited by white space, though that fails in the common case of contractions (“it’s” = “it” and “is”; “won’t” = “will” and “not”). In other languages such as Chinese or Thai, where words are not delimited, word tokenization is much harder. In this step, our input is a list of sentences and our output is a nested list of sentences in which each sentence is represented by a list of its constituent words, e.g. [[‘The’, ‘X-men’, ‘are’, ‘a’, ‘superhero’, ‘team’, ‘in’, ‘the’, ‘Marvel’, ‘Comics’, ‘Universe’, ‘.’], …]
- Part-of-speech Tagging: The process of classifying words by their parts of speech and labeling them accordingly is known as part-of-speech tagging, or POST. This is another necessary precursor to understanding the relationships between words in a sentence, given that the same words may represent different parts of speech in different contexts. For example: “Gas prices are up [adverb].” vs “He climbed up [preposition] the ladder.” vs “They’ve had some ups [noun] and downs.” From a nested list of sentences of words, this step outputs a nested list in which each word is stored as a pair, with one value for the word itself and another for its part of speech, e.g. [[(‘The’, ‘DT’), (‘X-Men’, ‘JJ’), (‘are’, ‘VBP’), (‘a’, ‘DT’), (‘superhero’, ‘NN’), (‘team’, ‘NN’), (‘in’, ‘IN’), (‘the’, ‘DT’), (‘Marvel’, ‘NNP’), (‘Comics’, ‘NNP’), (‘Universe’, ‘NNP’), (‘.’, ‘.’)], …]
- Entity Recognition: Higher-level conceptual entities are recognized as such through a process called chunking, a common precursor to relation recognition (our end goal) in which linked sets of words are grouped or “chunked” together. You can chunk noun phrases, verb phrases, or prepositional phrases based on words’ ordering and parts of speech in a sentence; often, you want to look for named entities (“NE”) that correspond to people, places, organizations, etc. The output of this step is a list of hierarchical trees, e.g. [Tree(‘S’, [(‘The’, ‘DT’), (‘X-Men’, ‘JJ’), (‘are’, ‘VBP’), (‘a’, ‘DT’), (‘superhero’, ‘NN’), (‘team’, ‘NN’), (‘in’, ‘IN’), (‘the’, ‘DT’), Tree(‘ORGANIZATION’, [(‘Marvel’, ‘NNP’), (‘Comics’, ‘NNP’), (‘Universe’, ‘NNP’)]), (‘.’, ‘.’)]), …] Note: The chunker correctly recognized “Marvel Comics Universe” but missed “The X-Men,” partly because the POST incorrectly classified “X-Men” as an adjective (“JJ”). NLP is a statistical process, and errors happen!
- Relation Recognition: Finally, we can try to identify the relations that exist between entities. This may be possible through a regular expression parser that looks for a particular order of consecutive parts of speech, or something fancier using the hierarchical relationships between named entities. In the case of the former, searching for one or more verbs followed by a preposition followed by one or more nouns (“<VB.*>+<IN><NN.*>+”) would find information of the form “created by writer Stan Lee.” As it turns out, that phrase is the only one in our example paragraph that fits the pattern, and it is indeed a partial answer to our question. Note, however, that it misses the full answer: “created by writer Stan Lee and artist Jack Kirby.” This can be accounted for in the parser’s pattern, sure, but it’s much harder to know ahead of time what to search for and whether you’ve correctly found it.
This has been a little preview of the power of natural language processing. Throughout, I’ve used a comprehensive Python package called the Natural Language Toolkit (NLTK), available for free download here; happily, a companion textbook is also available for free, and together they provide an accessible introduction to NLP that will take you pretty far. If you want to learn more, you might also sign up for an upcoming Coursera class on the topic, though its date is still TBD.
I’ve been using NLP and NLTK quite a bit in my new job (not to mention web scraping, among other things), so I’m sure this topic will keep coming up. Stay tuned.