If I’m to write about becoming a data scientist, I should first define what I mean by data science. A simple Google search yields over one billion results… so I’ll do my best to summarize. (This is easier said than done, of course: the concept has been around since the 1970s and has evolved considerably since then, and no generally accepted definition appears to exist.)

Data science is a relatively new field that lies at the intersection of math and statistics, computing and hacking, machine learning and data mining. As such, its practitioners (data scientists) are inherently interdisciplinary, problem-solving generalists who, according to Mike Loukides in his seminal article “What is data science?”, “can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’” An interview with Peter Skomoroch of LinkedIn characterizes the “Voltron of data science” as someone with the technical ability to code, the mathematical know-how to build algorithms, and overall business intelligence. DJ Patil describes data scientists in “Building data science teams” as “those who use both data and science to create something new.” He goes on to stress the importance of curiosity and cleverness as personality traits of successful data scientists.

A visual attempt at definition comes in the form of The Data Science Venn Diagram by Drew Conway:


The emphasis here is, again, on the interdisciplinary nature of data science, which lies at the intersection of three general domains of knowledge and experience: hacking skills, math and statistics knowledge, and substantive expertise. Conway’s inclusion of “substantive expertise” points to what makes data science (and data scientists) new and distinct from, say, business intelligence analysts: it’s not just about the existence of the data and the ability to analyze it quantitatively; data science is about testing hypotheses, deriving new knowledge from the data, and then making sure that the conclusions are valid. It’s about discovery.

Another way to define data science is to describe the sort of work that data scientists actually perform. In “A Taxonomy of Data Science,” Hilary Mason and Chris Wiggins list the steps in approximate chronological order: obtain (finding and getting sufficient amounts of data from a variety of sources); scrub (cleaning up messy and/or incomplete data to make analysis possible); explore (looking at the data by reading numbers, basic plotting, and unsupervised clustering techniques); model (producing the most predictive model of the data possible and quantifying the accuracy of its predictions); and interpret (gleaning generalized insight from the model to produce data products and suggest directions for further inquiry). Not surprisingly, data scientists perform a wide array of specific tasks. According to Jeff Hammerbacher, who established the data science group at Facebook:

“… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
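
To make the obtain, scrub, explore, model, interpret sequence a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the toy data, the function names, and the choice of a one-variable least-squares fit are all hypothetical and are not taken from Mason and Wiggins or from Hammerbacher's team.

```python
from statistics import mean


def obtain():
    # Hypothetical toy records of (x, y); real work would pull data from
    # files, APIs, or databases. None marks a missing value.
    return [(1.0, 52.0), (2.0, 55.0), (None, 61.0), (3.0, 61.0),
            (4.0, 64.0), (5.0, None), (6.0, 70.0)]


def scrub(rows):
    # Drop records with a missing field so that analysis is possible.
    return [(x, y) for x, y in rows if x is not None and y is not None]


def explore(rows):
    # Look at the data through a few basic summary numbers.
    xs, ys = zip(*rows)
    return {"n": len(rows), "mean_x": mean(xs), "mean_y": mean(ys)}


def model(rows):
    # Fit y = intercept + slope * x by ordinary least squares and
    # quantify the fit with R^2.
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


def interpret(slope, intercept, r2):
    # Turn the fitted numbers into a statement a colleague can act on.
    return (f"Each extra unit of x goes with about {slope:.2f} more y "
            f"(R^2 = {r2:.2f}).")


rows = scrub(obtain())
summary = explore(rows)
slope, intercept, r2 = model(rows)
print(summary)
print(interpret(slope, intercept, r2))
```

Each function stands in for a whole stage of real work; the point is only the shape of the pipeline, where the output of one step feeds the next.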

That’s an amazing range for a single team, as Hammerbacher describes it. Mike Driscoll details the three sexy skills of data geeks (statistics, data munging, and visualization), the last of which is a critical component of data science that I’ve not yet mentioned. Presenting the data such that its underlying structure is clear and visible facilitates a better understanding of the dataset itself, not to mention communication of your conclusions to others! On that note, here’s a final, visual take on tasks now associated with data science, taken from a 2004 dissertation on computational information design by Ben Fry:


So, as far as I can tell, that’s data science in a nutshell. But, given that it’s a new and varied field, I’ve probably missed some important points; if anybody out there has something to add, please do so in the comments!