Category Archives: Data Noise

The Force Awakens In Data – Industry Leaders Comment

How does the latest “Star Wars” movie parallel the data industry? Both have a force that awakens. However, while Rey, the tough scavenger in “Star Wars” (played well by Daisy Ridley), saw her force come to life in less than 30 minutes, the data industry has been waiting half a decade for that to happen. Finally, we seem to be at the tipping point, at least if you believe the latest Gartner report. Working with data has shifted, according to the report, “from IT-led enterprise reporting to business-led self-service analytics.” In other words, business-focused analytics and data discovery are on the rise.

For a while, the common consensus was that the hardest part of data science would be finding the actionable insights. A data platform should empower business users. It should offer businesses easy and agile ways of working with data directly, without the need to go through an IT or BI department. The reality in many companies around the world, unfortunately, was far from that. Over the last few years, we have seen a lot of innovation. Some tools even let users simply type their business questions, and an algorithm translates them into a data query. Let’s look at the areas that a data-enabling platform needs to cover.
Has the Force Really Awakened In Data

An easy way to load data

Data scientists often complain that they feel more like “data janitors”. Most of their time is taken up by extracting the data (e.g., from a website), transforming it (e.g., cleaning it up) and loading it into a database that one can start working with. Especially in companies that do not have their natural foundation in data, this can be a daunting task. A data platform needs not only to be able to connect to different data sources, but also to simplify this process for the ‘force’ to awaken, as Joe Hellerstein, founder of the platform Trifacta, thoughtfully pointed out.

If 80% of the work is data wrangling, then the biggest productivity gains can be found there. Business users need platforms that let them see their data, assess quality, and transform the data into shape for analysis. (@joe_hellerstein)
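To make the extract, transform and load steps described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the raw CSV text, the `sales` table and the column names are all made up for this example, and a real platform would of course handle far messier sources.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with stray whitespace, mixed casing and a missing value.
RAW_CSV = """customer,country,revenue
 Alice ,us,120.50
Bob,US,
carol,De,80
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, normalize casing, drop rows without revenue."""
    cleaned = []
    for row in rows:
        revenue = (row["revenue"] or "").strip()
        if not revenue:
            continue  # a row without revenue cannot be analyzed, so skip it
        cleaned.append(
            (row["customer"].strip().title(),
             row["country"].strip().upper(),
             float(revenue))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a database table (assumed name: sales)."""
    conn.execute("CREATE TABLE sales (customer TEXT, country TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
load(transform(extract(RAW_CSV)), conn)

# Once the data is loaded, the business question itself is a one-liner.
result = conn.execute(
    "SELECT country, SUM(revenue) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(result)
```

Note how little of the code is the actual question: most of it is cleaning, which is exactly the ‘janitor’ complaint.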

Data Wrangling Is The Hardest Part

Analytical agility

If your sales went down, you would want to know, right now, why that happened. And, if you have data that no one else has, you will want to play around with new product ideas.

Agility, a concept we know very well from the software development space, has made it over into our data world. Stefan Groschupf, CEO and Founder at Datameer, pointed out that the ‘force’ only awakens under the following condition:

For real business value, an analyst should be able to dig into their data to understand what’s in it and what new insights it can reveal. A good data platform needs to allow an easy exploration of data with the fastest time to insight without needing engineering skills or knowledge. (@StefanGroschupf)

Governance and metadata

The easier it is to explore data, the more people will do it. We saw this in the mid-2000s with the onset of social media analytics. Platforms offered so-called insights with the ease of a mouse click. Suddenly, business folks created a plethora of new metrics, many of them useless, as I pointed out in my book, “Ask Measure Learn”.

But high-quality data is unquestionably a prerequisite for sound decision making, and is the most important criterion for any organization. Thus, in the last few years, data governance and data lineage have become focal points of the industry. William Kan, Senior Product Manager at LinkedIn, who created a scalable way to define and manage metrics via a unified data platform at LinkedIn, explains:

People often associate the term governance with unnecessary overhead and processes. Our idea of governance is to put checks and balances in place with the goal to provide the best data quality but at the same time to be as low touch as possible via machine learning and automation. (@WillKanman)

Checks & Balances - Ensures Quality Data


Accounting is a notion that Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel and Thomas W. Oestreich did not mention in their reports. However, since data processing comes at a cost, no one should be surprised that there will soon be a need to ‘regulate’ the usage of data platforms. Not everyone should be able, with a few clicks, to bring the server down (or our budget, as the server now scales into the cloud). But hold on… didn’t we try, very hard, to make data accessible? Correct. Thus, this time, we should not make it more complex, but we should ask for higher accountability. As Gregory Piatetsky-Shapiro, the well-known data scientist and co-founder of the KDD conferences, said:

The impact of a more complex machine learning algorithm might not always drive the wanted insight. Organizations will need to balance what is feasible and what is useful. (@kdnuggets)

Automated insights

I normally start my courses at Harvard with the question, “Do we have this kind of hype in data?” The answer is “Yes”, as it’s all about pattern recognition. Hidden patterns in large datasets help predict user behavior and help improve the value proposition of our products. Typically, a data scientist or an analyst will dig through the data in order to surface patterns such as statistically relevant correlations or outliers.

This process can be automated to a good degree. This is where automated advanced analytics comes into play. Automated analytics is like a form of the ‘death star’ when it comes to the industry. With one stroke, a group of algorithms goes in parallel through the data, in order to detect correlations, clusters, outliers, anomalies, linkages, trends… you name it. It’s the brute force approach. 

But a correlation by itself might not make an insight, let alone create an action. See the graph below showing the correlation between the divorce rate and margarine consumption. Gartner likewise highlighted “that most business users do not have the training necessary to accurately conduct or interpret analysis.”

Spurious correlations
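The brute-force idea, and why it needs a human check, can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product works: the function names are my own, and the numbers are invented for the example and are not real statistics. The scan compares every pair of columns and flags those whose Pearson correlation exceeds a threshold.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented illustrative numbers (NOT real statistics): two unrelated series
# that both happen to decline over time, plus one noisy series.
columns = {
    "divorce_rate": [5.0, 4.7, 4.6, 4.4, 4.3, 4.1],
    "margarine_kg": [8.2, 7.0, 6.5, 5.3, 5.2, 4.0],
    "website_visits": [100, 140, 90, 130, 95, 120],
}


def scan(columns, threshold=0.9):
    """Brute force: test every pair of columns and keep the strong correlations."""
    hits = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            hits.append((a, b, round(r, 2)))
    return hits


hits = scan(columns)
for a, b, r in hits:
    # A flagged pair is only a candidate; it says nothing about causation.
    print(f"{a} vs {b}: r = {r}")
```

The scan happily flags the divorce and margarine pair, simply because both series trend downward. The algorithm finds the pattern; deciding whether it means anything is still up to the analyst.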

This is where BeyondCore gets involved. Arijit Sengupta, founder of BeyondCore, built this platform not only to surface all kinds of correlations, but also to warn the business user about potential hidden factors, in an effort to protect the user from statistically unsound decisions.

Most business users see a pretty graph and think they can take action based on it. In reality they regularly take actions based on misleading or misunderstood graphs. Automated analysis software needs to guide us to statistically sound, actionable insights and explain the patterns in detail so that business users can act with confidence. (Arijit Sengupta)


We all know that a picture says more than 1,000 words. Thus, a data platform needs to be visual and allow the business user to showcase the most important insight. With the onset of Highcharts, we have seen many companies try to outdo their competitors by offering a larger number of chart types. But beware: one can create a good-looking visualization even without actionable insights. We call this “beautiful but useless”. As Meta Brown, the author of the book “Data Mining for Dummies”, rightly said:

Tools are just… tools. They should not define your work. It’s your job to understand the problem, identify goals and choose the tools and processes that help you reach those goals as easily as possible. (@metabrown312)

Looks Beautiful - If I just would know what it means

Outside world

More power leads to more publicity. In the past, the BI team and their insights were tucked away in the basements of companies. But now, data has become a first-class citizen within those companies. It is thus no surprise that it has become important to communicate insights to the outside world. Only insights that are seen can be acted on. Thus, the new set of data platforms makes it easy to publish findings to an internal audience, as well as to embed those insights into products and services for customers. As Chris Wintermeyer told me lately at a dinner table:

Much of the success of any data platform will hinge on the way that the insights generated are shared and discussed. (@ChrisAtDomo)

The future

With the business force awakening, the future seems bright. Most companies, by now, have the right vision. That’s not really hard, since we have talked about the data needed ‘now’ for at least half a decade. However, the Gartner Magic Quadrant does not list any ‘challenger’. Is this the end of innovation in the data space?

Maybe. But maybe the true challenges today are no longer within technology as such, but in finding a balance: using the insights to support our decisions rather than letting them determine our actions blindly. As the Netflix CEO, Reed Hastings, recently pointed out: data has a support function. (@reedhastings)

For humans, with or without the force, that rule is still true: “Actionable your data must be.”


This article was originally published on my Forbes Blog.

Binladen Map

Knowing Osama’s Whereabouts

In May 2011, United States special forces descended on the hideout of Osama bin Laden, leader of the terrorist group al-Qaida. The ensuing raid killed bin Laden, who had spent over a decade living in hiding and directing attacks through his followers. So who knew where he was located?

The answer may surprise you:

We ALL did.

According to Kalev Leetaru, a researcher at the University of Illinois at Urbana-Champaign, an analysis of public news articles about bin Laden pinpointed his location to within an area 200 kilometers in diameter. In a very real sense, one of the world’s most secretive hiding places may have ultimately revealed itself from the mosaic of individual data points. Each journalist had an opinion about the location, and all opinions together formed a true answer. The catch here: no survey was conducted, and no journalist was actually asked where they thought Osama was hiding. They revealed their opinions on bin Laden’s whereabouts through their articles. This is the power of public and unstructured data.
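The principle of many small opinions adding up to one answer can be shown with a toy sketch in Python. To be clear, this is not Leetaru’s actual method, which I am not reproducing here. The place names, coordinates and article snippets below are all hypothetical, and averaging latitude and longitude is a crude shortcut that is only reasonable over small areas.

```python
from collections import Counter

# Hypothetical gazetteer: place name -> (latitude, longitude).
GAZETTEER = {
    "place_a": (34.0, 73.0),
    "place_b": (34.4, 73.4),
    "place_c": (30.0, 67.0),
}

# Hypothetical article snippets; no journalist is asked anything directly.
ARTICLES = [
    "Sources suspect he is hiding near Place_A.",
    "Officials mention Place_B as a possible refuge.",
    "Reports again point to place_a and place_b.",
    "An unrelated story about the weather.",
]


def count_mentions(articles, gazetteer):
    """Count in how many articles each known place name appears."""
    counts = Counter()
    for text in articles:
        lowered = text.lower()
        for place in gazetteer:
            if place in lowered:
                counts[place] += 1
    return counts


def weighted_centroid(counts, gazetteer):
    """Average the coordinates of mentioned places, weighted by mention count."""
    total = sum(counts.values())
    lat = sum(gazetteer[p][0] * n for p, n in counts.items()) / total
    lon = sum(gazetteer[p][1] * n for p, n in counts.items()) / total
    return lat, lon


counts = count_mentions(ARTICLES, GAZETTEER)
estimate = weighted_centroid(counts, GAZETTEER)
print(counts, estimate)
```

No single snippet names the answer, yet the weighted average lands between the most frequently mentioned places.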


Most likely, the US forces did not rely on crowdsourced wisdom like this. We know today that US government agencies like the NSA are tapping into all kinds of different data sources, from spying on the phones of top-level politicians to tapping into everyone’s communications through email providers. However, the principle is the same: actionable intelligence was derived from the aggregation of individual, and in this case, seemingly random, data points.

Herein lies the promise of what we call big data. It has become one of the trendiest digital buzzwords of the new millennium. It involves gathering business intelligence from the heretofore forbidden territory of data sets that are too large to curate or maintain as databases, often encompassing terabytes or even petabytes of information.

The text is part of the INTRO chapter from Ask – Measure – Learn.

Death by Infographics or Dashboard



For the past 2 years, we have had a new quest: to make raw data more understandable by making it good looking.

The people charged with this quest have added the word ‘data’ in front of their job titles, and it sounds very hip. We have ‘data journalists’, ‘data scientists’ and ‘data artists’. Yet, before these jobs existed, didn’t scientists and journalists already use data for their insights?

Those insights are now presented in the form of infographics, which have replaced the good old PowerPoint presentation. Unfortunately, not every so-called infographic is easier to understand than the raw data. The good old ‘death by PowerPoint’ has met its maker: its reincarnation, ‘death by infographics.’

The main idea of the infographic or the dashboard is correct. A well-placed graphic can say more than 1,000 words, but one needs to start with the content first. Which issue do we want to solve? Which story do we want to tell? What is the underlying business problem?

How does one create a useful dashboard or infographic? As the most important rule, one should first think about the issue that is to be solved with data. For example:

Marketing: Which media channels should I use to best reach my customers?
PR: Can you warn me if there is an online reputation case concerning me or my competitors?
Customer Care: Which customers should I focus on first?
Content Creation: What content is most interesting to my clients?
Sales: How do I best attract potential clients?
Business Intelligence: What kind of products are my competitors planning to use? 

Thus, one must FIRST formulate the question, THEN think about the data and, only at the end, PUT it all together graphically. The master in visualization of data is Hans Rosling. See here for his stunning TED talk on world poverty.