Data Demystified

“Data Demystified” is the best way to describe this year’s Strata Conference. The conference has grown substantially: more visitors, more talks, more vendors at the exhibition, and more space.

But most importantly the topics displayed have matured. Data is not only becoming easier, the science of data itself has become demystified. It is no longer the task of a few highly specialized data scientists or engineers: today, data is available for everyone. Let’s look at some trends that go beyond the usual infrastructure sumo wrestling.

TREND 1 – Easier Data

Every infrastructure vendor – from storage to database vendors – will tell you that you are going to have more data. That is no surprise. However, this year one can see the trend of easy data. Getting data has become much simpler.

One way of getting data is scraping. But scraping used to be only for those who knew how to code. With the industry becoming more mature, this is no longer the case. Take Andrew Fogg’s (@andrewfogg) startup import.io as an example. Last year this was just an idea at the Strata startup showcase. This year they presented a very good and stable solution at their own booth. Import.io makes scraping as easy as point-and-click. I actually used it to scrape Autotrader.com to help me find the best-priced car when I moved to the US. The important part is that this kind of scraping does not require code: everyone can do it. Everyone can become data driven. We do not have to worry about technology, but can simply think about the right business question.

Another company in this space is enigma.io. They were my personal favorite of this year’s startup showcase. Their vision is also to make data more accessible: Enigma is a data platform for government data. Open data – which used to be a big topic at Strata years ago – exists now, as governments have opened their data stores. However, this data is often not easy to use, as it is not in the right format. That is nothing a data scientist or analyst could not fix, but the trend nowadays is to make data retrieval easy. Enigma sources the data and makes it available to everyone. Again, the underlying idea is to focus on the business question rather than the technology.

TREND 2 – Easier Clean Up

What Enigma is doing for public data, Joe Hellerstein (@joe_hellerstein), co-founder of Trifacta, is doing for the world. Today the biggest part of data science is to ‘wrangle’ data: in other words, to clean and re-structure it. Whether you want to transform all European date formats into US formats or fill in missing values in a table, you normally need to code or script. The trend at this year’s Strata is to talk about how to make this task easier. Trifacta’s approach is that you can simply point and click to do all kinds of ‘grep’-like regular expressions. The aim here is that dealing with data becomes easier, so that you do not need to worry about technology anymore.

There is, however, a second reason why I am mentioning Trifacta. The computer industry has long anticipated that we would one day be able to “write code” by pointing and clicking, producing something ready to be deployed without actually typing it. Trifacta has really taken this idea to a new level. In the same way that programming languages represented the end of the assembly coder, companies like Trifacta might spell the end of the data wrangler.
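To give a feeling for the hand-written scripting that tools like Trifacta want to spare you, here is a minimal sketch in Python of the two wrangling chores mentioned above: turning European dates into US dates and filling in missing values. This is my own illustration, not Trifacta's method. The sample rows, the function names, the accepted date patterns, and the choice to fill gaps with the mean are all assumptions made for the example.

```python
from datetime import datetime
from statistics import mean


def eu_to_us_date(text):
    """Convert 'DD.MM.YYYY' or 'DD/MM/YYYY' into US-style 'MM/DD/YYYY'.

    Assumption: every input is day-first. A string like '05/02/2014' is
    ambiguous on its own, so the script has to be told which reading is right.
    """
    for fmt in ("%d.%m.%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next pattern
    raise ValueError(f"unrecognized date: {text!r}")


def fill_missing(values):
    """Replace None with the mean of the known values (one possible policy)."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Made-up sample table, loosely inspired by a used-car listing.
rows = [
    {"date": "31.12.2013", "price": 12000},
    {"date": "05/02/2014", "price": None},
    {"date": "14.02.2014", "price": 9000},
]

prices = fill_missing([row["price"] for row in rows])
cleaned = [
    {"date": eu_to_us_date(row["date"]), "price": price}
    for row, price in zip(rows, prices)
]

for row in cleaned:
    print(row)
```

Even in this tiny case the script has to encode decisions (which date patterns to accept, how to fill a gap) that a point-and-click tool would let an analyst make without writing code.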

TREND 3 – Charting, Charting, Charting

You see charts everywhere you go at Strata. Charts are often the first and easiest win. Not only might they help you explore the data, they also often astonish your audience. Thus it is no surprise that a lot of companies try to automate this process, from Tableau, the top dog, to newcomers like DataHero or Chartio. Their promise: with a lot of connectors we can connect to everything from spreadsheets to Hadoop, and then all the data is charted. Thus there is no need to think visually anymore – just load the data and see colorful charts. But have you ever heard statements like this from your audience:

That is so cool… if I could just understand what it means?

Of course, these tools will not solve a single business question for you. But they do give you a nice way of representing data, so that you have all hands free to think about the real measurement you want. What is the right way to display this data so that it fits your business?

TREND 4 – Easier Predictions

This last trend is about pure data science. How many people in your organization can run a naïve Bayes? Or an SVM? Or k-nearest neighbors? Not many… and this is true not only for you but for many organizations (unless you are a company like LinkedIn, and yes, we want to hire even more such people!). Tweet to us if you want to join.

But with the data industry maturing, data science is also maturing. Easy plug-and-play solutions have become more and more available. Take as an example companies like wise.io, co-founded by Joshua Bloom (@profjsb), or Skytree, founded by Martin Hack (@mhackster). They offer tools to simplify predictive algorithms. It is like the WEKA package on steroids. Just upload your data and then score it, rank it, rate it… all automated. Worry free. Again – the underlying trend is to free us from technology so that the business focus can become important again.

The case for demystified data science was best made by John Foreman (@john4man), author of the book Data Smart. In order to demystify funky artificial intelligence packages, his tutorial taught the audience how to use Excel to build machine learning programs. Really? Excel? I thought John was out of his mind. But he is not. It works beautifully. According to John:

Artificial intelligence is just counting stuff…. Excel can do this.

And thus after 45 minutes of Excel operations I had a naïve Bayes model that classified 19 out of 20 tweets correctly. Very much fun! And surprisingly enough, John was not the only one talking about Excel as a tool for data science. Felienne Hermans (@felienne) from the Delft University of Technology introduced several plug-ins for Excel (see her blog) to make it a better tool for the data world. This also shows how much the data world has matured: we are starting to offer tools that every business consultant knows, so that they can think with us about the right application of data.
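To make the “just counting stuff” point concrete outside of Excel, here is a minimal sketch of a naïve Bayes text classifier in Python. It is not John's spreadsheet. The toy tweets, the labels, and the function names are invented for illustration, and the add-one smoothing is one common choice I am assuming here. The whole model is nothing more than word counts per label.

```python
import math
from collections import Counter


def train(labeled_texts):
    """Count words per label and documents per label. That is the whole model."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training texts
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability under naive Bayes."""
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: how often this label appeared in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data: tweets about an email app versus everything else.
training = [
    ("the app api sends email fast", "app"),
    ("new api release for the email app", "app"),
    ("saw a monkey at the zoo today", "other"),
    ("the zoo monkey was very loud", "other"),
]

model = train(training)
print(classify("email api is down", *model))  # prints: app
print(classify("loud monkey", *model))        # prints: other
```

Everything in `train` is a tally, and `classify` only adds up logarithms of those tallies, which is exactly the kind of arithmetic a spreadsheet handles well.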

The Future: Data Demystified

What is next? We get easier data, and it will be easier to clean, and easier to chart, and using this data we can more easily predict…. predict… predict. Yes, what is it that we actually wanted to predict?

Data demystified does not mean that we have solved our problems in this world. It just means we have better technology that enables us to focus on what is important. We work with data because we want to change a behavior or get an action. No one said this better than British science historian James Burke during the main Strata plenary session:

Information is causing change… if it is not causing change, it is no information.

He then looked in the audience and said,

No information: you are sitting in a seat.

Information: the person next to you has a miasmal disease.

Yes… we tend to forget this. We are here in the data world for a reason. We want to change the world with data. Technology is making this job easier and the tools have become better, but what counts are the results. And those results can also be found at Strata.

For example, Chris Harland from Microsoft explained how he uses data to improve the business of bars. He measures the behavior of guests at bars, and one of his stunning facts was that Corona beer is a good predictor of higher spending. But dear bar owner, please do not force your customers to drink Corona… that would be confusing correlation with cause.

Corona is a good predictor for spending behavior. Do not mix cause and correlation.

Another fascinating (meaning action-oriented) talk was given by Drew Sullivan, on the Organized Crime and Corruption Reporting Project. He used data on money movement to show how to detect fraudulent activities in Montenegro.

Now since data is demystified, let’s apply it: turn our businesses around and become data-driven. This means that data scientists should learn more about business, and businesses should become more like data scientists. Perhaps a new role might be that of a business scientist?

But do not be fooled. Getting the question right is the hardest part. I describe this issue in depth in my book Ask-Measure-Learn (O’Reilly). Take Monica Rogati’s (@mrogati) presentation as an example. She is the famous data scientist who showed in a well-received talk that women sleep 20 minutes longer on average. OK, but how does that help me? This insight is amusing at best. The real question is what to do with this kind of information. Knowing Monica, she already has a new product based on data in mind. Let’s see what she will say at the next Strata. I will be there.
