One of the biggest tasks facing data scientists, and one that distinguishes them from traditional business analysts, is fetching and cleaning data. Once upon a time (or so I’m told), data was kept in orderly, consolidated databases, there wasn’t so much of it that size was an issue, and analysts could access it in a relatively straightforward manner through, say, Structured Query Language (SQL). Then the Internet happened, data got Big and messy, and getting it into shape for analysis turned into a chore. Unfortunately, point-and-click doesn’t scale.
Let’s consider a simple case: you want to know tomorrow’s local weather forecast, but you don’t feel like going to the website, typing in the search bar, and dealing with all the advertisements. So you write a little program that sends an HTTP request to weather.com’s server, which responds with (among other things) the HTML content you asked for. You then parse that HTML to find, embedded in a deep hierarchy of tags, the string of characters corresponding to the temperature: 48.
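That request-parse-extract loop is short enough to sketch. The snippet below fakes the "request" step with a canned `FORECAST_HTML` string (the markup and class names are made up; weather.com’s real page looks nothing this tidy) and uses the standard library’s `html.parser` to dig the temperature out of the tag hierarchy:

```python
from html.parser import HTMLParser

# Canned stand-in for the HTML a real request would return. A live fetch
# would look something like:
#   import urllib.request
#   html = urllib.request.urlopen("https://weather.com/...").read().decode()
FORECAST_HTML = """
<div class="forecast">
  <span class="label">Tomorrow</span>
  <span class="temp">48</span>
</div>
"""

class TempParser(HTMLParser):
    """Walk the tag hierarchy and grab the text inside <span class="temp">."""
    def __init__(self):
        super().__init__()
        self.in_temp = False
        self.temp = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "span" and ("class", "temp") in attrs:
            self.in_temp = True

    def handle_data(self, data):
        if self.in_temp:
            self.temp = data.strip()
            self.in_temp = False

parser = TempParser()
parser.feed(FORECAST_HTML)
print(parser.temp)  # → 48
```

With the libraries below installed, the fetch collapses to `requests.get(url).text` and the tag search to a BeautifulSoup one-liner, but the shape of the loop is exactly the same.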
This example is kind of ridiculous, I know (just bookmark the damn site!), but vastly more complicated web scraping tasks can be built up from this basic procedure: request HTML, parse HTML, extract data (repeat). Maybe you’d like to compile a list of the abilities of all superheroes on Wikipedia, get U.S. election results by district from this guy without paying lots of money for his already-structured and -cleaned Excel spreadsheets, or pull the Metacritic scores of every horror film from the past ten years. Sure, given enough time and patience, you could probably do all of this manually, but it’s much, much easier to automate through code.
Although you can do limited web scraping tasks directly from the command line (with awk, among other tools), it’s nicer to work in a full-fledged scripting language. Python, for example, is great for web scraping and HTML parsing! Happily, people have written libraries and even entire frameworks specifically for these purposes:
- Scrapy: Free web scraping and crawling framework. You pick a website, specify the kind of data you want to extract and the rules to follow in finding/extracting it, then let Scrapy do its thing. It has built-in support for reading and cleaning the scraped data, and much more.
- Requests: Free and user-friendly HTTP library. Easily add options to your web query, read and properly encode the web server’s response, deal with authentication, etc.
- BeautifulSoup: Free HTML parsing library. Provides methods for navigating, searching, and modifying the parse tree, which saves you a lot of time.
- Mechanize: Free library to emulate a web browser in your script. This includes basic functionality like downloads, cookies, form-filling, and history.
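To make the "navigating and searching the parse tree" idea concrete: with BeautifulSoup you’d write something like `soup.find_all("li", class_="ability")`. The sketch below does the same kind of search with the standard library’s `xml.etree.ElementTree` instead, on a made-up, well-formed snippet (real scraped HTML is rarely this clean, which is exactly why BeautifulSoup exists), just to keep it dependency-free:

```python
import xml.etree.ElementTree as ET

# A well-formed stand-in for a scraped superhero page (the markup is invented).
PAGE = """
<ul class="abilities">
  <li class="ability">Flight</li>
  <li class="ability">Super strength</li>
  <li class="ability">X-ray vision</li>
</ul>
"""

root = ET.fromstring(PAGE)
# Search the parse tree: every <li> whose class attribute is "ability".
abilities = [li.text for li in root.iter("li") if li.get("class") == "ability"]
print(abilities)  # → ['Flight', 'Super strength', 'X-ray vision']
```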
This isn’t an exhaustive list, by any means. I’ve mostly used Requests and BeautifulSoup so far, though I’d like to add Mechanize to the mix. I tried Scrapy but didn’t care for it; maybe I needed to give it more time, since it seems to be the most full-featured (and complicated) of the bunch. I should also add that modern web browsers let you inspect the HTML of a web page right in the browser, which can be very handy when you’re trying to figure out what, exactly, to automate in your code. The weather example above used the developer tools bundled with Chrome; if you prefer Firefox, get the Firebug add-on; if you’re using Internet Explorer, there’s probably no hope for you. Just… stop.