Data Demystified

“Data Demystified” is the best way to describe this year’s Strata Conference. The data conference STRATA has grown substantially: more visitors, more talks, more vendors at the exhibitions, and more space.

But most importantly, the topics on display have matured. Data is not only becoming easier to work with; the science of data itself has been demystified. It is no longer the task of a few highly specialized data scientists or engineers: today, data is available to everyone. Let’s look at some trends that go beyond the usual infrastructure sumo wrestling.

TREND 1 – Easier Data

Every infrastructure vendor – from storage to database vendors – will tell you that you are going to have more data. That is no surprise. However, this year one can see the trend of easy data. Getting data has become much simpler.

One way of getting data is scraping. But scraping used to be only for those who know how to code. With the industry becoming more mature, this is no longer the case. Take Andrew Fogg’s (@andrewfogg) startup import.io as an example. Last year this was just an idea at the Strata startup showcase. This year they presented a very good and stable solution at their own booth. Import.io makes scraping as easy as point-and-click. I actually used it to scrape Autotrader.com and help me find the best priced car as I moved over to the US. The important part is that this kind of scraping does not require code: everyone can do it. Everyone can become data driven. We do not have to worry about technology, but rather simply think about the right business question.

Another company in this space is enigma.io. They were my personal favorite of this year’s startup showcase. Their vision is also to make data more accessible. It is a data platform for government data. Open data – which used to be a big topic at Strata years ago – exists now that governments have opened their data stores. However, this data is often not easy to use, as it is not in the right format. That is nothing a data scientist or analyst could not fix, but the trend nowadays is to make data retrieval easy. Enigma sources the data and makes it available to everyone. Again, the underlying idea is to focus on the business question rather than the technology.

TREND 2 – Easier Clean Up

What Enigma is doing for public data alone, Joe Hellerstein (@joe_hellerstein), co-founder of Trifacta, is doing for the world. Today the biggest part of data science is to ‘wrangle’ data: in other words, to clean and restructure it. Whether you want to transform all European date formats into US formats or fill in missing values in a table, you normally need to code or script. The trend at this year’s Strata is to talk about how to make this task easier. Trifacta’s approach is to let you point and click to apply all kinds of ‘grep’-like regular expressions. The aim is to make dealing with data easier so that you do not need to worry about technology anymore.
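To make the two wrangling chores mentioned above concrete, here is a minimal Python sketch of what one would otherwise script by hand. It is not Trifacta's method. The function names, the accepted date formats, the mean-based fill, and the sample rows are all assumptions chosen for illustration.

```python
from datetime import datetime
from statistics import mean


def european_to_us(date_text):
    """Convert a DD.MM.YYYY or DD/MM/YYYY string to US-style MM/DD/YYYY."""
    # Assumption: the input is known to be day-first; only two layouts are tried.
    for fmt in ("%d.%m.%Y", "%d/%m/%Y"):
        try:
            return datetime.strptime(date_text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_text!r}")


def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to learn a fill value from
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Hypothetical sample rows: (date, price), with one missing price.
rows = [("31.12.2013", 100.0), ("01/02/2014", None), ("15.03.2014", 200.0)]
dates = [european_to_us(d) for d, _ in rows]
prices = fill_missing([p for _, p in rows])
```

Even this toy version has to decide which date layouts to accept and how to fill a gap. Those are exactly the decisions a point-and-click tool would surface without making you write code.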

There is, however, a second reason why I am mentioning Trifacta. The computer industry has long since anticipated that we will be able to “write code” by pointing and clicking, ready to be deployed, without actually writing it. Trifacta has really taken this idea to a new level. In the same way that programming languages represented the end of the assembly coder, companies like Trifacta might serve as the end to the data wrangler.

TREND 3 – Charting, Charting, Charting

You see charts everywhere you go at Strata. Charts are often the first and easiest win. Not only might they help you explore the data, they also often astonish your audience. Thus it is no surprise that a lot of companies try to automate this process, from Tableau, the top dog, to newcomers like DataHero or Chartio. Their promise: with a lot of connectors we can connect to everything from spreadsheets to Hadoop, and then all the data is charted. Thus there is no need to think visually anymore – just load the data and see colorful charts. But have you ever heard a statement like this from your audience:

That is so cool… if I could just understand what it means?

Of course, these tools will not solve a single business question for you. But they do give you a nice way of representing data, so that you have all hands free to think about the real measurement you want. What is the right way to display this data so that it fits your business?

TREND 4 – Easier Predictions

This last trend is about pure data science. How many people in your organization can run a naïve Bayes classifier? Or an SVM? Or k-nearest neighbors? Not many… and this is true not only for you but for many organizations (unless you are a company like LinkedIn, and yes, we want to hire even more such people! Tweet to us if you want to join).

But with the data industry maturing, data science is also maturing. Easy plug-and-play solutions have become more and more available. Take as an example companies like wise.io, co-founded by Joshua Bloom (@profjsb), or Skytree, founded by Martin Hack (@mhackster). They offer tools to simplify predictive algorithms. It is like the WEKA package on steroids. Just upload your data and then score it, rank it, rate it… all automated. Worry free. Again – the underlying trend is to free us from technology so that the business focus can become important again.

The case for demystified data science was best made by John Foreman (@john4man), author of the book Data Smart. In order to demystify funky artificial intelligence packages, his tutorial taught the audience how to use Excel to build machine learning programs. Really? Excel? I thought John was out of his mind. But he is not. It works beautifully. According to John:

Artificial intelligence is just counting stuff…. Excel can do this.

And thus after 45 minutes of Excel operations I had a naïve Bayes model that classified 19 out of 20 tweets correctly. Very much fun! And surprisingly enough, John was not the only one talking about Excel as a tool for data science. Felienne Hermans (@felienne) from the Delft University of Technology introduced several plug-ins for Excel (see her blog) to make it a better tool for the data world. This also shows how much the data world has matured: we start to offer tools that every business consultant knows, so that they can think with us about the right application of data.
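The "just counting stuff" idea can be sketched outside Excel as well. The following Python snippet is a minimal naïve Bayes text classifier built from nothing but word counts and add-one smoothing. It is not John's spreadsheet. The function names, the labels, and the four training sentences are hypothetical and exist only to show the counting.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label; examples is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Prior: share of training examples carrying this label.
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this sketch.
examples = [
    ("great talk on hadoop at strata", "conference"),
    ("loved the strata keynote on data", "conference"),
    ("cold beer at the bar tonight", "other"),
    ("the bar has great beer", "other"),
]
word_counts, label_counts = train(examples)
guess = classify("strata data talk", word_counts, label_counts)
```

Everything here is a tally followed by a handful of logarithms, which is why a spreadsheet with count and sum formulas can do the same job.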

The Future: Data Demystified

What is next? We get easier data, and it will be easier to clean, and easier to chart, and using this data we can more easily predict…. predict… predict. Yes, what is it that we actually wanted to predict?

Data demystified does not mean that we have solved our problems in this world. It just means we have better technology that enables us to focus on what is important. We work with data because we want to change a behavior or get an action. No one said this better than British science historian James Burke during the main Strata plenary session:

Information is causing change… if it is not causing change, it is no information.

He then looked in the audience and said,

No information: you are sitting in a seat.

Information: the person next to you has a miasmal disease.

Yes… we tend to forget this. We are here in the data world for a reason. We want to change the world with data. Technology is making this job easier and the tools have become better, but what counts are the results. And those results can also be found at Strata.

For example, Chris Harland from Microsoft explained how he uses data to improve the business of bars. He measures the behavior of guests at bars, and one of his stunning facts was that Corona beer is a good predictor for more spending. But dear bar owner, please do not force your customers to drink Corona… that would be mixing cause and correlation.

Corona is a good predictor for spending behavior. Do not mix cause and correlation.

Another fascinating (meaning action-oriented) talk was given by Drew Sullivan, on the Organized Crime and Corruption Reporting Project. He used data on money movement to show how to detect fraudulent activities in Montenegro.

Now since data is demystified, let’s apply it: turn our businesses around and become data-driven. This means that data scientists should learn more about business, and businesses should become more like data scientists. Perhaps a new role might be that of a business scientist?

But do not be fooled. Getting the question right is the hardest part. I describe this issue in depth in my book Ask-Measure-Learn (O’Reilly). Take Monica Rogati’s (@mrogati) presentation as an example. She is the famous data scientist who showed in a well-received talk that women sleep 20 minutes longer on average. OK, but how does that help me? This insight is amusing at best. The real question is what to do with this kind of information. Knowing Monica, she already has a new product based on data in mind. Let’s see what she will say at the next Strata. I will be there.

Strata Talk – Suggested Readings

Thanks for watching my Strata Talk or checking out the presentation. Please see below for a list of suggested readings, which provides further research on bots:

Pew: Half of Americans get news digitally, topping newspapers, radio
by Andrew Beaujon
http://j.mp/Vt9DG8

55% Of Journalists Worldwide Use Twitter, Facebook To Source News Stories [STUDY]
By Shea Bennett
http://j.mp/VExCER

Infographics – every 60 s
by GO-Globe
http://j.mp/WU38NA

SPAM Clock
http://j.mp/YkbA4t

Study Spammers Conversion – 1 Conversion in whole of LA
http://j.mp/1537ALV

2010 – Twitter claims that they managed the SPAM issue
by Twitter
http://j.mp/Xmregx

2010 – Great Research article on Facebook SPAM
by Hongyu Gao et al.
http://j.mp/W1pXjL

Crime in the Internet. Internetsafety 101
http://j.mp/XilebE

2010 – Uncovering Social Spammers: Social Honeypots + Machine Learning
by Kyumin Lee et al.
http://j.mp/WLsc7N

2011 – The Socialbot Network – When Bots Socialize for Fame and Money
by Yazan Boshmaf et al.
http://j.mp/UD5rED

How the ‘Good Life’ is Threatened in Cyberspace
by Huma Shah
http://j.mp/Zdj9gf

2005 – Social Phishing
by Tom Jagatic
http://j.mp/Ywys2I

Consumers pay more attention to Reviews than to Social Networks
by eccomplished
http://j.mp/XKAMRE

Spotting fake reviewer groups in consumer reviews
by Arjun Mukherjee et al.
http://j.mp/VEKYRA

2013 – CATS: Characterizing Automation of Twitter Spammers
by Amit A. Amleshwaram et al.
http://j.mp/Y65Hdk

2012 – Understanding and Combating Link Farming in the Twitter Social Network
by Saptarshi Ghosh et al.
http://j.mp/W7VOtU

STRATA – A Media Overview

This year’s STRATA Conference has just finished. It was a highly professionally organized meeting. Hats off to Edd Dumbill, his team, and the Co-Chair Alistair Croll.

Below, you will find a few media facts taken from the Fisheye Analytics tool (disclaimer: I am a co-founder of Fisheye Analytics).

OVERALL

Amount of Articles on Strata 2013

Within one week, there were more than 11,000 mentions of Strata on Twitter and Facebook. Most of those mentions happened while the conference took place.

(Note on the keyword. The keyword determines how many articles are filtered. In this case I had set up: “@Strata”, “#StrataConf” and “Strata Conference”. )
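How a keyword set decides what gets counted can be sketched in a few lines of Python. This is not how Fisheye Analytics matches articles internally. The case-insensitive substring rule, the function name, and the sample mentions are assumptions made for illustration. Only the three keywords come from the setup above.

```python
# The three keywords from the setup above, lowercased for matching.
KEYWORDS = ("@strata", "#strataconf", "strata conference")


def matches(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical mentions, invented for this sketch.
mentions = [
    "Heading to #StrataConf tomorrow",
    "Geological strata are layers of rock",
    "Great keynote at the Strata Conference",
    "Thanks @Strata for a great week",
]
filtered = [m for m in mentions if matches(m)]
```

The bare word "strata" is deliberately not a keyword, so the geology sentence is dropped. A looser keyword set would raise the mention count and also let in more noise.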

Surely, not all of those mentions are equally important. One can see how often articles are retweeted and commented on, and thus dive into the most important ones. They came from: Mashable, CSDN.net, TechCrunch, Money CNN, Forbes, The Daily Beast, and many other sources.

MAIN TOPICS

Trending Hashtags – Strata 2013

In one of my other posts on participation, I joked that fewer and fewer “data scientists” and more and more “managers” seemed to attend STRATA. However, if you look at the topics covered, you can see that STRATA dove deep into technology. The usage of big data, or ANALYTICS, was very much at the forefront of the discussions. The trending topics in those 11,000 articles were (listed by number of appearances):

  • analytical database
  • predictive analytics
  • Elasticube
  • multi-dimensional data sources
  • massive scale
  • Hadoop
  • on-the-fly analytics

TOP CONTRIBUTOR

The top contributors to the Twitter discussions were Joseph Lalonde, who tweeted 292 times about #StrataConf, followed by Daniel Tunkelang (196 tweets) and Brad Lynch (120 tweets).
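A ranking like this boils down to counting tweets per author. Here is a minimal Python sketch, assuming the tweets are available as (author, text) pairs. The function name, the author handles, and the texts are hypothetical and are not the data behind the numbers above.

```python
from collections import Counter


def top_contributors(tweets, n=3):
    """Return the n authors with the most tweets as (author, count) pairs."""
    counts = Counter(author for author, _ in tweets)
    return counts.most_common(n)


# Hypothetical (author, text) pairs, invented for this sketch.
tweets = [
    ("alice", "Keynote starting now #StrataConf"),
    ("bob", "Great Hadoop session #StrataConf"),
    ("alice", "Loved the Excel tutorial #StrataConf"),
    ("carol", "Booth crawl time #StrataConf"),
    ("alice", "Wrapping up day one #StrataConf"),
    ("bob", "Slides are online #StrataConf"),
]
ranking = top_contributors(tweets, n=2)
```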

MOST RETWEETED

The most retweeted tweet was Monica Rogati’s tweet predicting this year’s focus. Her tweet was retweeted 85 times, followed by @BigDataBorat’s tweet: “Toughest job as Data Scientist is when data is bias against your pre-ordained conclusion” (65 retweets).

Who Is Coming To Strata?

Directors and Managers are the Majority

O’Reilly Media started their STRATA Conferences way before the big hype… in 1997… whoa! That’s visionary. Times have changed… and a brief analysis of who is coming reveals that managers and directors make up the majority. Where are all the needed data scientists?

Microsoft dominates

The company overview tells a clear story. What company first comes to mind if you think “BIG DATA”… really? Microsoft.

If you want to grab and analyze your own data from the attendee list… here is the code: