Data Demystified


[vc_row][vc_column width=”1/1″][vc_column_text]“Data Demystified” is the best way to describe this year’s Strata Conference. The conference has grown substantially: more visitors, more talks, more vendors at the exhibition, and more space.

But most importantly, the topics previously displayed have matured. Working with data is not only becoming easier, the science of data itself has become demystified. It is no longer the task of a few highly specialized data scientists or engineers: today, data is available for everyone. Let’s look at some trends that go beyond the usual infrastructure sumo wrestling.


TREND 1 – Easier Data

Every infrastructure vendor – from storage vendors to database vendors – will tell you that you are going to get more data from buying their product. That is no surprise. However, this year one can see the trend of easy data. Getting data has become much simpler.

One way of getting data is scraping. Scraping used to be reserved for those who knew how to code. With the industry becoming more mature, this is no longer the case. Take Andrew Fogg’s (@andrewfogg) startup import.io as an example. Last year, this was just an idea at the Strata startup showcase. This year, they presented a very good and stable solution at their own booth. Import.io makes scraping as easy as point-and-click. I actually used it to scrape Autotrader.com to help me find the best-priced car when I moved to the U.S. The important part is that this kind of scraping does not require code: everyone can do it. Everyone can become data-driven. We do not have to worry about technology, but can simply formulate the right business question.

Another company in this space is enigma.io. They were my personal favorite in this year’s startup showcase. Their vision is also to make data more accessible: Enigma is a data platform for government data. Open data – which was a big topic at Strata years ago – exists now, as governments have opened their data stores. However, this data is often not easy to use, as it is not in the right format. It is nothing a data scientist or analyst could not fix, but the trend nowadays is to make data retrieval easy. Enigma sources the data and makes it available to everyone. Again, the underlying idea is to focus on the business question, rather than the technology.

TREND 2 – Easier Clean Up

What Enigma is doing for public data is what Joe Hellerstein (@joe_hellerstein), co-founder of Trifacta, is doing for the world. Today, the biggest part of data science is ‘wrangling’ data: in other words, cleaning and re-structuring data. Whether you want to transform all European date formats into U.S. formats or fill in the missing values in a table, you will generally need to code or script to make this happen. The trend at this year’s Strata is to talk about how to make this task easier. Trifacta’s approach is that you can simply point and click to do all kinds of ‘grep’-like regular expressions. The goal of this is to make dealing with data easier so that you do not have to worry about technology anymore.
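To make the two wrangling chores mentioned above concrete, here is a minimal Python sketch of what one would otherwise script by hand. It is not Trifacta’s method. The sample rows, the DD.MM.YYYY input format, the helper names, and the choice to fill gaps with the column mean are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical sample rows: European dates (DD.MM.YYYY) and a price column with a gap.
rows = [
    {"date": "31.12.2013", "price": "12500"},
    {"date": "05.02.2014", "price": ""},
    {"date": "17.02.2014", "price": "11900"},
]

# The kind of 'grep'-like pattern a point-and-click tool would build for you.
EUROPEAN_DATE = re.compile(r"^(\d{2})\.(\d{2})\.(\d{4})$")


def to_us_date(text):
    """Turn DD.MM.YYYY into MM/DD/YYYY; leave anything else untouched."""
    match = EUROPEAN_DATE.match(text)
    if not match:
        return text
    day, month, year = match.groups()
    # Going through datetime makes impossible dates (e.g. 31.02.2014) raise an error.
    parsed = datetime(int(year), int(month), int(day))
    return parsed.strftime("%m/%d/%Y")


def clean(table):
    """Convert the dates and fill empty prices with the mean of the known prices."""
    known = [float(r["price"]) for r in table if r["price"] != ""]
    # Assumption: the column mean is an acceptable filler; None if nothing is known.
    filler = sum(known) / len(known) if known else None
    return [
        {
            "date": to_us_date(r["date"]),
            "price": float(r["price"]) if r["price"] != "" else filler,
        }
        for r in table
    ]


cleaned = clean(rows)
for row in cleaned:
    print(row["date"], row["price"])
```

Even this toy version needs a regular expression, a date library, and a decision about how to fill gaps, which is exactly the effort the point-and-click tools promise to take away.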

There is, however, a second reason why I am mentioning Trifacta. The computer industry has long since anticipated that we will be able to “write code” by pointing and clicking, making it ready for deployment without actually writing it. Trifacta has really taken this idea to a new level. In the same way that programming languages represented the end of the assembly coder, companies like Trifacta might serve as the end to the data wrangler.

TREND 3 – Charting, Charting, Charting

You see charts everywhere you go at Strata. Charts are often the first and easiest way to win. Not only can they be helpful to you in exploring the data, they often astonish your audience. Thus, it is no surprise that a lot of companies try to automate this process, starting with Tableau, the top dog, followed by newcomers like DataHero or Chartio. Their promise: with a lot of connectors, we can connect to everything – from spreadsheets to Hadoop – and then see all of the data charted. Thus, there is no need to think visually anymore – just load the data and look at the colorful charts. And then, have you ever heard your audience make statements such as this?

That is so cool… If only I could understand what it means!?

Of course, these tools will not solve a single business question for you. But they do give you a nice way of representing data, so that you have all hands free to think about the real measurement you want. What is the right way to display this data so that it fits your business?

TREND 4 – Easier Predictions

This last trend relates to pure data science. How many people in your organization can run naive Bayes? Or SVM? Or k-nearest neighbors? Not many… And this is true not only for you, but for many organizations (unless you are a company like LinkedIn – and yes, we are looking to hire even more of the aforementioned people!).

With the data industry maturing, data science is also maturing. Easy plug-and-play solutions have become more and more available. For example, take a look at companies like wise.io, co-founded by Joshua Bloom (@profjsb), or Skytree, founded by Martin Hack (@mhackster). They offer tools to simplify predictive algorithms. It is like the WEKA package on steroids. Just upload your data and then score it, rank it, rate it… All automated. Worry free. Again – the underlying trend is to free us from technology so that the most important focus is put on the business.

A point on demystifying data science was best made by John Foreman (@john4man), author of the book Data Smart. In order to demystify funky artificial intelligence packages, his tutorial trained the audience on how to use Excel to build machine learning programs. Really? Excel? I thought John was out of his mind. But he is not. It works beautifully. According to John:

Artificial intelligence is just counting stuff…. Excel can do this.

And thus, after 45 minutes of Excel operations I had a naive Bayes model that classified 19 out of 20 tweets correctly. It was a lot of fun, and surprisingly enough, John was not the only one talking about Excel as a tool for data science. Felienne Hermans (@felienne) from the Delft University of Technology introduced several plug-ins for Excel (see her blog) to make it a better tool for the data world. This also shows how much the data world has matured: we have started to offer tools that every business consultant knows, so that they can think with us about the right application of data.
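The “just counting stuff” point can be sketched in a few lines of Python. This is not John’s spreadsheet or his data: the four training tweets, the two labels, and the function names are hypothetical, and add-one smoothing is my assumption for handling words a class has never seen.

```python
import math
from collections import Counter

# Hypothetical training tweets with made-up labels.
training = [
    ("the new email app sends mail fast", "app"),
    ("app api sends email to users", "app"),
    ("funny costume at the party tonight", "other"),
    ("party pictures from last night", "other"),
]


def train(labeled_texts):
    """Count words per class; that is all the training naive Bayes needs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Prior: the share of training tweets that carry this label.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
print(classify("email api for my app", model))
print(classify("party last night", model))
```

Everything here is a count, a division, and a logarithm, which is why a spreadsheet can do the same job.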

The Future: Data Demystified

What is next? It will be easier to get data, and it will be easier to clean, and easier to chart, and by using this data, we can more easily predict…. predict… predict. Yes, what is it that we actually wanted to predict?

Data demystified does not mean that we have solved our problems in this world. It just means we have better technology that enables us to focus on what is important. We work with data because we want to change a behavior or for it to result in an action. No one said this better than British science historian James Burke during the main Strata plenary session:

Information is causing change… If it is not causing change, it is not information.

He then looked in the audience and said,

No information: you are sitting in a seat.

Information: the person next to you has a miasmal disease.

Yes… We tend to forget this. We are here in the data world for a reason. We want to change the world with data. Technology is making this job easier and the tools have become better, but what counts are the results. And those results can also be found at Strata.

For example, Chris Harland from Microsoft explained how he uses data to improve business at bars. He measures the behavior of guests at bars, and one of his stunning facts was that ordering Corona beer is a good predictor of higher spending. But dear bar owners, please do not force your customers to drink Corona… as that would be confusing correlation with causation.

Corona is a good predictor for spending behavior. Do not mix up cause and correlation.

Another fascinating (meaning action-oriented) talk was given by Drew Sullivan, on the Organized Crime and Corruption Reporting Project. He used data on money movement to show how to detect fraudulent activities in Montenegro.

Now since data is demystified, let’s apply it: turn our businesses around and become data-driven. This means that data scientists should learn more about business, and businesses should become more like data scientists. Perhaps a new role might be that of a business scientist?


But do not be fooled. Getting the question right is the hardest part. I describe this issue in depth in my book Ask-Measure-Learn (O’Reilly). Take Monica Rogati’s (@mrogati) presentation as an example. She is the famous data scientist who demonstrated in a well-received talk that women sleep 20 minutes longer than men, on average. OK, but how does that help me? This insight is amusing at best. The real question is what to do with this kind of information. Knowing Monica, she already has a new data-based product in mind. Let’s see what she will have to say about it at the next Strata. I know I will be there.[/vc_column_text][/vc_column][/vc_row]
