I’ve not posted in almost six months, but I was, like, totally busy. Here’s what I’ve been up to:

Way back in February, I participated in a hackathon with a few data friends from CSV Soundsystem; we made a Financial Management Service symphony, and it won Best in Show. Rather than let the project die at the end of the hackathon, we applied for (and received!) a Code Sprint grant from the Knight Foundation to build it out. I performed an epic, damn near endless feat of data munging, and the other guys did everything else. The end result was treasury.io (and its companion tweetbot, @TreasuryIO). It provides the first-ever electronically searchable database of the federal government’s daily revenues, spending, and borrowing. It lets you do lots of cool things, like plot public debt against the debt ceiling over time:


I’ve also been working hard at Harmony Institute on (among other things) a massive interactive web app that maps the landscape of films around social issues, positioning them along the issues’ conversational zeitgeist, and allowing for deep examination and comparison of films’ social impacts. It’s called ImpactSpace… until we decide on a name that wasn’t recently claimed by someone else. Damn! I’ve done a great deal of data mining from dozens of sources via web crawls, web scrapes, API access, and structured data dumps; performed still more epic feats of data munging; dived into cutting-edge NLP research and come out with fancy algorithms that I then implemented in Python; and even gotten my feet wet in social and semantic network analysis. Much work remains, but we’re making good progress! :)


I’ve tried to keep up with developments in data science… Some seriously cool code, projects, and papers have come out in the past few months. In case you missed them:

  • Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach: In a nutshell, the words we use to express ourselves on social media are strongly indicative of our personality, age, and gender. Or, as Gawker put it, “science shows men and women are both awful stereotypes on Facebook.”
  • prettyplotlib: A Python package built on top of the de facto plotting standard matplotlib that produces pretty plots by default, saving you a great deal of trouble. Inspired by Tufte!
  • TextBlob: A Python package that simplifies and improves a number of basic natural language processing tasks like part-of-speech tagging and noun phrase extraction. It builds upon the already-impressive NLTK and pattern packages.
  • bibviz: An interactive resource for exploring some of the more negative aspects of holy books, such as Bible contradictions, biblical inerrancy, and the Bible as a source of morality. Fun and fascinating.
  • Paperscape: An interactive tool to visualize the arXiv, an open, online repository for scientific research papers, as a network of papers connected by citations.
  • NLP with Deep Learning: Google went ahead and applied deep learning techniques to language analysis with pretty spectacular results, and they open-sourced it! Python ports appeared quickly.

Oh man, there’s so much more… but you’ll have to search through my Twitter feed. :)

Where else has the time gone? Well, I went to a handful of weddings, moved into an apartment in Chelsea, spent ten days in Scandinavia with my boyfriend, got 241 out of 242 power stars in Super Mario Galaxy 2, and resumed regular gym-going.


Finally, my on-again, off-again data side-project, the creation and analysis of a Thomas L. Friedman corpus, will be the subject of my next few blog posts. And no, it won’t be years until my next entry. I’m no George R.R. Martin.