Intro to Natural Language Processing (1)

First, the big news: I got a job! I’m now a data scientist at a non-profit organization here in Manhattan called Harmony Institute, where we study the science of influence through entertainment. Basically, simple metrics like box office sales and television viewers don’t adequately quantify a film or show’s social impact; we use theory-driven methodology to more fully assess this impact in individuals and across networks. In the case of, say, a social justice documentary that was made specifically to have such an impact, we are able to quantify the film’s level of success. More on this over time, I’m sure.

Web Scraping and HTML Parsing (2)

As I wrote last time, the Internet is chock-full of data, but much of it is “messy” and unstructured and spread throughout an HTML tree — in other words, not ready for analysis. Fortunately, web scraping and HTML parsing allow for the automated extraction of online data and its conversion into a more analysis-friendly form; unfortunately, it can be an awful lot of work. In fact, data scientists often spend more of their time getting and cleaning data than analyzing it!
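To make "extraction and conversion" a bit more concrete, here is a minimal sketch of the parsing half, using only Python's standard-library `html.parser`. It is not the code from the full post: the `TableParser` class, the `parse_table` helper, and the `SAMPLE_HTML` snippet (with its made-up film titles) are all illustrative assumptions, and a real page would need fetching first and would almost certainly be messier.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being built, or None outside <tr>
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a cell; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


# Made-up sample markup standing in for a scraped page.
SAMPLE_HTML = """
<table>
  <tr><th>title</th><th>year</th></tr>
  <tr><td>Film A</td><td>2011</td></tr>
  <tr><td>Film B</td><td>2012</td></tr>
</table>
"""


def parse_table(html):
    """Turn the first header row plus body rows into a list of dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
```

Each record is a plain dictionary keyed by column name, which is about as analysis-friendly as an HTML tree gets without reaching for a third-party library.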

Web Scraping and HTML Parsing (1)

Hi! I haven’t posted in a while: I got displaced by Sandy, distracted by job applications, and overrun by zombies. It happens. But back to business…

Classification of Hand-written Digits (4)

In my previous posts (Part 1 | Part 2 | Part 3), I described the k-nearest neighbors algorithm, applied a benchmark model to the classification of hand-written digits, then chose an optimal value for k as the one that minimized the model’s prediction error on a dedicated validation data set. I also excluded about 2/3 of the features (image pixels) from the model because they had near-zero variance, thereby improving both performance and runtime. Now, I’d like to add one last complication to the kNN model: weighting.
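For readers who want the gist in code, the steps summarized above can be sketched roughly as follows. This is a pure-Python toy, not the implementation from the posts: the function names, the variance threshold, and the inverse-distance weighting scheme are my assumptions for illustration, and other weighting functions are equally valid.

```python
import math
from collections import defaultdict


def variance_mask(rows, threshold=1e-3):
    """Flag features (columns) whose variance exceeds an assumed threshold."""
    n = len(rows)
    keep = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        keep.append(var > threshold)
    return keep


def apply_mask(row, keep):
    """Drop the features that were flagged as near-zero variance."""
    return [x for x, k in zip(row, keep) if k]


def knn_predict(train_X, train_y, query, k, weighted=True):
    """Predict a label from the k nearest training points.

    With weighted=True, each neighbor's vote counts 1/distance (an assumed
    scheme); otherwise every neighbor gets one equal vote.
    """
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


def choose_k(train_X, train_y, val_X, val_y, candidates):
    """Pick the candidate k with the lowest error on the validation set."""
    def error(k):
        wrong = sum(
            knn_predict(train_X, train_y, x, k) != y
            for x, y in zip(val_X, val_y)
        )
        return wrong / len(val_X)
    return min(candidates, key=error)
```

On a tiny one-dimensional example, with a label-0 point at distance 1 from the query and two label-1 points at distances 3 and 4, an unweighted vote with k = 3 picks label 1, while the distance-weighted vote picks label 0 because the single close neighbor outweighs the two distant ones.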

Classification of Hand-written Digits (3)

Now for the fun part! In Part 1, I described the machine learning task of classification and some well-known examples, such as predicting the values of hand-written digits from scanned images. In Part 2, I outlined a general analysis strategy and visualized the training set of hand-written digits, gleaning at least one useful insight from that. Now, in Part 3, I pick a learning algorithm, train and optimize a model, and make predictions about new data!

A Shout-out for Liberal Arts

This past weekend I was at Kalamazoo College for my five-year college reunion. I mentally prepared myself to feel really old but came out of it feeling young and refreshed instead. Funny how that works out.

Classification of Hand-written Digits (2)

In Classification of Hand-written Digits (1), I qualitatively described the machine learning task of classification and sketched out two classic examples, then went into more detail about another well-known example: the classification of hand-written digits. The challenge here is to program a classifier that correctly predicts the value represented in a scanned image of a hand-written digit.

Classification of Hand-written Digits (1)

In the last few posts, I’ve attempted to lay a basic foundation explaining what data science is generally about; in the next few posts, I’d like to delve deeper into a specific example.

Version Control Is Important!

I recently came across a post on Kaggle’s No Free Hunch blog, “Engineering Practices in Data Science”, in which Chris Clark describes a set of best practices for those who work in the medium of code, specifically those practices common among software engineers but not among data scientists. I was much chagrined by the first wag of his finger: many data scientists don’t use version control (a logical way to manage multiple versions of the same information, e.g. source code), preferring instead to save files with elaborate names and/or back them up on Dropbox. Ugh, I realized, he’s talking about me.

Data as Force for Positive Change

I’ve been reading a lot about how big data is revolutionizing, or is going to revolutionize, the world we live in. Yes, some of this is hype; yes, more data means more potential for statistical shenanigans and bad analysis; and yes, some changes may not necessarily be for the better (think: violation of privacy and other unethical abuses of personal data). But it’s important to recognize that data can be a powerful force for positive change in the world.
