During this year’s STRATA conference, President Obama introduced Dr. DJ Patil as his new Chief Data Scientist in a video message. DJ is a well-known data scientist and is even credited by some with coining the term “data science”. During his introduction of DJ, Obama said that he wanted to tell a joke about data science but noted “half of the stuff my staff came up with was below average.”
Let’s decode this sentence into “stats speak” for a moment. One way to read it is that the median quality of those jokes was lower than their mean. That would suggest a skewed distribution: most of the suggested jokes sat at the bad end of the scale, with a few better ones pulling the average up, and so he decided to drop the joke. That may have been a mistake, however, because all he needed was one joke. Even if all but one of the jokes were terrible, that one good joke could have opened his introduction of DJ.
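To make the “stats speak” concrete, here is a minimal sketch in Python. The scores are invented purely for illustration (a hypothetical 0 to 10 quality rating for eight jokes); nothing here comes from the actual White House joke pile.

```python
from statistics import mean, median

# Hypothetical joke-quality scores on a 0-10 scale (made up for illustration).
scores = [1, 2, 2, 3, 4, 4, 5, 10]

avg = mean(scores)      # pulled upward by the single great joke
mid = median(scores)    # the "typical" joke

# Count how many jokes fall strictly below the average.
below_average = sum(score < avg for score in scores)

# The only number that matters if you need just one joke.
best = max(scores)

print(f"mean={avg}, median={mid}")
print(f"{below_average} of {len(scores)} jokes are below average")
print(f"best joke scores {best}")
```

In this toy sample, half of the jokes are below average and the median is lower than the mean, yet the best joke is still a 10. The summary statistics describe the pile, not the one item you actually need.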
This omission is precisely why we need data scientists like DJ. We do not need all of the big data; we need the right data, and sometimes a single dataset is all we need. Even when most of the data within our Big Data cloud is ‘bad’ (a.k.a. useless), we might be able to pull off a great prediction if we get the right dataset.
The same point holds for models. We do not need many models (with perhaps the exception of methods such as “random forest”, which deliberately combine many decision trees), but rather one model that strikes a sufficient balance between accuracy and speed.
DJ, your knowledge and insights are needed. Let’s look for the right data within those 135 000 datasets that were made available to the public. We are looking forward to great changes based on data…