As I’ve mentioned before, the Internet is a huge (and ever huger!) repository of data. Much of that is in the form of unstructured text, for which natural language processing comes in handy, but an impressive variety of structured datasets can be found and downloaded, too, if you know where to look. Here are some of my favorite sources…

  • Data.gov: Free access to thousands of datasets maintained by the U.S. Federal government, from White House visitor logs to cantaloupe statistics (seriously), with pretty good search functionality. Dozens of other national governments offer something similar, as does the United Nations at UNdata.
  • The U.S. Census Bureau offers free access to a wealth of demographics data, from the broad decennial census to detailed American Community Surveys. Check out American FactFinder for fancy filtering that enables you to inspect very specific slices of the population.
  • Infochimps: In addition to providing a platform for big data analysis, Infochimps maintains a “data marketplace” where thousands of free and paid datasets can be downloaded or accessed via API, from a Twitter census to a corpus of several thousand erotica stories (for NLP training, of course…).
  • Datamob: “Public data put to good use.” Aggregates hundreds of data sources (plus apps and general resources) with tags and descriptions, covering sports, government, media, science, etc. Check out the tags list.
  • The New York Times: Provides a wide range of APIs to access the paper’s extensive article archives, Congressional records, NYC real estate sales data, etc. They even provide a handy API Tool for testing out your queries.
  • The Guardian: Besides having a great data blog, they also make all of that data available to the public! Here’s a full list of their datasets, covering a wide range of topics.
  • Sunlight Labs: The data arm of the Sunlight Foundation, dedicated to government accountability and transparency. Provides several APIs to access data on state legislatures, campaign contributions, and the words actually spoken on the record in Congress. As it turns out, they’re hosting a hackathon in a couple of weeks, and I’ll be there! :)
  • Reddit: Hosts a datasets archive filled with the sort of thing you might expect from the front page/gutter of the internet: 10,000 images of cats (for training a classifier…?), people with questions and trolls with answers, and occasional pointers to something interesting.

Obviously, this is nowhere near a comprehensive list, and many others have compiled longer/better lists of their own, e.g. those by data scientists Peter Skomoroch and Hilary Mason. If you find yourself looking for but unable to find a particular dataset, Google is your mostly-not-evil friend.

In my list I mentioned a few APIs (Application Programming Interfaces), which, I should note, are distinct from structured datasets ready for download. Web APIs provide a relatively consistent and stable channel to access a website’s data, returning it in a standardized format like XML or JSON. If done well, APIs can be a great help, although they do have some drawbacks: registering for an official access key, limits on the rate at which you can request data, and occasional clamp-downs. Still, it’s nice to have options; to get you going, here’s a massive API directory and a brand-new Codecademy learning track specifically on APIs in JavaScript, Python, and Ruby. And when all else fails, you can always fall back on web scraping. In fact, some prefer it that way…
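To make that request-and-parse cycle concrete, here’s a minimal Python sketch using only the standard library. The endpoint URL, the `api-key` parameter name, and the response shape are hypothetical stand-ins — every real API documents its own — but the pattern (build a query string, fetch, decode JSON, walk the results) is the same almost everywhere:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and key; substitute the real base URL and the
# access key you got when registering with whatever API you're calling.
BASE_URL = "https://api.example.com/v1/articles"
params = {"q": "open data", "api-key": "YOUR_KEY_HERE", "page": 0}

# Most web APIs take query parameters like these; urlencode builds
# the properly escaped query string for us.
url = BASE_URL + "?" + urlencode(params)

# A real call would be something like:
#   raw = urllib.request.urlopen(url).read()
# Here we parse a canned JSON string shaped like a typical API response,
# so the example runs without hitting the network (or a rate limit).
raw = '{"status": "OK", "results": [{"title": "Cantaloupe Statistics", "year": 2011}]}'
response = json.loads(raw)

if response["status"] == "OK":
    for item in response["results"]:
        print(item["title"], "-", item["year"])
```

Mind the rate limits mentioned above: in a real loop over many pages, you’d typically `time.sleep()` between requests so the provider doesn’t cut you off.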

The data’s out there, so happy fetching! :)