Data can fail us

Data Can Fail Us – Like Brazil Did

[vc_row][vc_column width=”1/1″][vc_column_text]Germany defeated Brazil in the World Cup semi-finals. It was a brutal 7-1 drubbing and was totally unexpected! So, how does this relate to data?

Many of us who work with data hail all possibilities (see the Age of the Algorithm). But the two sentences above point to the two most common pitfalls I see when I work with business partners.

Do not trust it too much – Data may fail us

Many systems use data to predict soccer results. Take ESPN’s Soccer Power Index (SPI). It scores the defensive and offensive power of each player based on historic goal differences and sums these scores for each team constellation. The Elo Rating system uses historical match data, but weights it according to the importance of the match. And Transfermarkt relies on the actual monetary value of each player.
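To make the Elo idea a bit more tangible, here is a minimal sketch of the standard Elo update, where the weight K stands in for the "importance of the match". The K values and function names below are hypothetical and chosen only for illustration; real soccer rating systems use their own weights and often further adjustments.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected result for team A on a 0..1 scale (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, result_a: float, k: float) -> float:
    """Move team A's rating toward the actual result, scaled by the match weight k."""
    return rating_a + k * (result_a - expected_score(rating_a, rating_b))


# Hypothetical importance weights (assumptions, not taken from any real system).
K_FRIENDLY = 20.0
K_WORLD_CUP = 60.0

# Two equally rated teams: a win in an important match moves the rating
# three times as far as the same win in a friendly.
after_friendly = update_rating(1500.0, 1500.0, 1.0, K_FRIENDLY)
after_world_cup = update_rating(1500.0, 1500.0, 1.0, K_WORLD_CUP)
```

The point of the weighting is simply that a surprising result in a World Cup match shifts a team's rating far more than the same result in a friendly.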


(seen on Facebook)

However, none of those systems would have ever predicted a 1:7 defeat of Brazil. Nate Silver, the mastermind behind ESPN’s score, tweeted that the likelihood of this outcome was 1 in 4000.

What did we learn from this? Data can fail us! Yes, no one wants to hear this story in the midst of the big data hype – but we are often faced with questions where there is not enough data to give a good prediction. Whether the predicted likelihood was one in 400 or one in 4000, anyone would have said that 7 goals were highly unlikely. But “highly unlikely” does not mean impossible. Not yet convinced? The world is full of ‘unlikely to happen’ events. In soccer, you only need to look at Czechoslovakia beating Argentina in the 1958 World Cup with a very unlikely 6 to 1 final score (see a short video here).

We need to understand when working with data that models have their restrictions. Predictions are probabilities: they do not tell the absolute truth and can only give an indication. Business leaders who understand that predictions are just probabilities will have so-called business continuity planning for the very unlikely (but still possible) case of all facilities being lost in a fire … or an employee’s terrible decision that costs the firm dearly … or an unexpected uprising in a country where a majority of products are produced … or (you fill in the blank).
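A small back-of-the-envelope calculation shows why "highly unlikely" events keep happening. The sketch below assumes independent events that each have the same small probability, which is a simplification, and the function name is made up for this illustration.

```python
def prob_at_least_once(p: float, n: int) -> float:
    """Chance that an event with per-trial probability p happens at least once in n independent trials."""
    return 1.0 - (1.0 - p) ** n


# A 1-in-4000 outcome, given 4000 independent chances to occur,
# is more likely than not to show up at least once.
p_rare = 1.0 / 4000.0
chance_over_many_trials = prob_at_least_once(p_rare, 4000)
```

Under these assumptions the chance comes out at roughly 63 percent. A single 1-in-4000 outcome is a surprise, but across enough matches, decisions, or business days, some outcome of that kind is close to expected.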


(FiveThirtyEight’s reaction)

Learn about the shortcomings

Yes, we all love data and it does not always have to be big data, as I pointed out in this blog post. Data will help us to predict, but we need to learn to use it correctly by not underestimating the unlikely probabilities. If you want to learn more about how to work with data, check out my book “Ask Measure Learn” from O’Reilly Media.[/vc_column_text][/vc_column][/vc_row]

Big Data

The Reason Why We Love Big Data: Recommendations

[vc_row][vc_column width=”1/1″][vc_column_text]Recommendation Engines have gained the most attention in the big data world. Why is that? Big data has created three distinct types of data-driven products:

  • Data used to benchmark
  • Data used for recommendation and filter systems
  • Data used for predictions

Benchmarking is often the first quick win when embarking on the world of big data. We have, however, been doing benchmarking for centuries, and benchmarking is not the reason for the big data hype. Benchmarking often needs an educated decision maker on the other side of the screen to explain what the benchmark shows: Why is A performing better than B? Why did the curve drop? Thus, benchmarking products are often not scalable – the more dashboards we build from big data, the more educated decision makers (also called “the analysts”) are needed.

The fame of data products is driven by something else: recommendation engines. Recommendations narrow down what could have been a complex decision into just a few recommendations. Big data allowed us to do recommendations on a new scale that we had not seen before. The most well-known example is how the Google search algorithm trumped AltaVista by recommending the best websites to view. Another well-known example is Amazon’s recommendation engine, based on the past reading behaviors of other readers. Both of those systems are based on algorithms that “learn” from past data.

A recommendation system outdoes benchmarking because it does not need an analyst at the end. It reduces big data to small data (see my opinion on why small data is important). A recommendation system suggests a few data points out of a large pool of data. Take LinkedIn as an example: The data product “people you may know” recommends only a few members out of a database of 300,000,000 members.

Thus, recommendation engines are becoming more and more important. Logically, the world of startups is filled with companies creating recommendation products in one way or another. alone lists hundreds of startups claiming to “recommend.” From the right restaurants (recommenu by Jake Bailey) to films (foundd by Lasse Clausen) to products (Linkcious by Weichang Lai) … All of those companies try to find a smarter way of making sense of data.

But what is a recommendation engine exactly? I asked Anmol Bhasin, who is one of the leading experts in the field of recommendation. Watch this two-minute video to learn about the difference between Content-Based Recommendation Engines and Collaborative Filtering.
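As a rough illustration of the two approaches named above (not a summary of the video), here is a minimal sketch on made-up toy data. Content-based recommendation compares item features with a profile built from what one user liked; collaborative filtering ignores item features and borrows ratings from users with similar taste. All names, items, and ratings are hypothetical, and using cosine similarity on raw ratings is a simplification.

```python
from math import sqrt


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy data: item features and user ratings (1 to 5).
ITEM_FEATURES = {
    "space_opera": {"scifi": 1.0, "action": 1.0},
    "alien_drama": {"scifi": 1.0, "drama": 1.0},
    "courtroom": {"drama": 1.0, "crime": 1.0},
    "heist": {"crime": 1.0, "comedy": 1.0},
}
RATINGS = {
    "ann": {"space_opera": 5.0, "courtroom": 1.0},
    "bob": {"space_opera": 5.0, "alien_drama": 4.0},
    "cat": {"courtroom": 5.0, "heist": 4.0},
}


def content_based(user: str) -> str:
    """Recommend the unseen item whose features best match the user's own rated items."""
    profile: dict = {}
    for item, rating in RATINGS[user].items():
        for feature, weight in ITEM_FEATURES[item].items():
            profile[feature] = profile.get(feature, 0.0) + rating * weight
    unseen = [i for i in ITEM_FEATURES if i not in RATINGS[user]]
    return max(unseen, key=lambda i: cosine(profile, ITEM_FEATURES[i]))


def collaborative(user: str):
    """Recommend the unseen item rated highest by users with similar rating patterns."""
    scores: dict = {}
    for other, other_ratings in RATINGS.items():
        if other == user:
            continue
        similarity = cosine(RATINGS[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in RATINGS[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return max(scores, key=scores.get) if scores else None
```

In this toy data both approaches suggest the same item to "ann", but for different reasons: the first because its features resemble what she rated highly, the second because a user with similar ratings liked it.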

But, before you now rush off and invest your money in recommendation engines, beware: life might not be that easy. There are major technology challenges in recommendation engines:

  • Cold Start Problem: The heart of a recommendation system is that a computer learns from data, i.e., who has read this book before, who connected to this person before, etc. One of the biggest challenges can be that there is not enough historical data at the start. Take FOUNDD, a young Berlin-based startup for movie recommendations. It did not have a long purchase history like the one Netflix has today, so the algorithm could not recommend anything useful in the beginning. Fully aware of that issue, the founder Lasse Clausen created a “hot or flop” page for new users. Each customer has to rate 10 movies before the system begins to recommend anything.
  • No Surprises: Even with enough historical data, the second problem with recommendation engines – if executed badly – is that there might be no surprises. Advising someone to read the book Harry Potter 3 after they looked at Harry Potter 6 might not be all too insightful. It just states the obvious. Recommendation engines therefore work best in the long tail of the data – because that is where the unexpected results are.
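The "rate 10 movies first" approach to the cold start problem can be sketched as a simple gate in front of the recommender. This is only an assumption about how such a gate might look, not FOUNDD's actual implementation; the function and parameter names are hypothetical.

```python
MIN_RATINGS = 10  # threshold taken from the example above


def recommend_or_onboard(user_ratings: dict, recommender, onboarding_pool: list):
    """Ask new users for ratings until there is enough history, then recommend.

    user_ratings maps movie title -> rating; recommender is any callable that
    turns such a dict into a list of suggestions (hypothetical interface).
    """
    missing = MIN_RATINGS - len(user_ratings)
    if missing > 0:
        # Not enough history yet: show movies the user has not rated so far.
        to_rate = [m for m in onboarding_pool if m not in user_ratings]
        return ("rate_these", to_rate[:missing])
    return ("recommendations", recommender(user_ratings))


# Usage with made-up data: a user with 3 ratings is asked for 7 more.
pool = [f"movie_{i}" for i in range(12)]
new_user = {"movie_0": 5, "movie_1": 2, "movie_2": 4}
step, items = recommend_or_onboard(new_user, lambda r: ["some_suggestion"], pool)
```

The design choice here is to trade a little onboarding friction for recommendations that are based on at least some real data about the user.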

The two main industries that at this moment benefit strongly from recommendation engines are the retail industry and the media industry, because both have a lot of data in the long tail, and both have a lot of data to overcome the cold-start problem.


(Adapted from Oğuzhan Abdik under the Creative Commons licence)

But other industries are beginning to use recommendation engines more and more. In the transportation industry, we see more and more intelligent navigation systems, either for personal use (Waze) or as traffic control systems (IBM). Or take a look at the airline industry – GE started a Kaggle competition to find the best routes to save energy for the airline industry.

The recommendation engine is the shining star of big data and we will see many more applications in the future. Read the next post (Sep 16th) to learn more about the third and last element of the data products: predictions. Can’t wait? Subscribe to my newsletter to get some free resources about data products – such as my latest talk at the Harvard Business Review Conference.

(The article was originally published by FORBES)[/vc_column_text][/vc_column][/vc_row]


Our Future: Free Will vs. Predictions with Data

[vc_row][vc_column width=”1/1″][vc_column_text]What is the future of big data? It will be all about predictions! Predictions based on data have come into our world and we often do not even know it. In many cities in the US, for example, it is no longer a coincidence when you meet a police officer: they are getting dispatched based on the models created by George Mohler, a seismologist who has found a way to help predict where the next crime is about to happen (read more here). When you get a flyer in the mail, it might be because your neighborhood retailer has tried to predict what you will soon need. Sometimes they do this too well: Target once made it into the news (read more here) because they knew that an underage girl was pregnant long before her own father knew it.

But let’s not get carried away by the big data world. Predictions are nothing new. Magicians like the famous Alexander Seer promised at the beginning of the last century to “know, see and tell” it all. Even with new technology, predictions based on data are the most difficult data products we have to work with. Technically, the difference between predictions and recommendation engines (read here about them) is small. Most recommendations could be rephrased as predictions. The difference lies in our own free will.

Data products are differentiated by the amount of support that is needed before ‘actionable insights’ can be drawn from the data. Benchmarking, the most basic data product, requires the interpretation of an “analyst” to make sense of the data. Recommendations only need the “user” to decide what to do next. And predictions? Predictions need no one. Read that sentence again! Predictions know the answer already – there is no further need to investigate or to make choices.


(Taken and adapted from Pieterjan Vandaele under the creative commons license)

The idea is that the more data we have, the better our recommendation engines will be – so that they can become predictions. This view is best summarized as the “end of theory.” Chris Anderson (@chrisanders) argued that, in the future, we will have sufficient data to predict anything, and thus there will be no need for theoretical models anymore.

But often, it is not the amount of data that matters when creating a valid prediction. For example, the Incas predicted the best time of year to plant crops. Their dataset might have been as small as 3650 data points (= 10 years of daily observations) – which is virtually nothing in our big data world. 500 years later, we have companies like Google that measure a lot about our online behavior, but despite all of this data, predictions are not necessarily easy to make. For example, New York Times bestselling business author Carol Roth once complained in her blog that Google infers that she is a male over age 65, when in fact she is a woman, and decades younger.

Why is this? Because not all of the data Google has aggregated is really helpful for the specific prediction they are trying to make. The fact that not all data is useful was best seen at the onset of social media. Suddenly, there were massive amounts of data and many of us thought that this could predict amazing things. For example, we saw many companies that claimed they could predict stock price movements from social media content. Most of them (if not all) have vanished by now, since it turns out that social media chatter is just too “noisy” and thus cannot really help to make predictions.

Allow me to make a prediction about data products as such: predictive algorithms will become a bigger part of our life, and will probably change our society more than the Internet has. The Internet enabled us to do things faster and more conveniently. However, predictions based on our data trails aim even further because they enable us to forecast human behavior in a way we have never been able to before.

The biggest danger to the success of predictions is us – the “users” – who do not yet understand that a prediction is just a trained algorithm that could go wrong. Even if the right data set was used – for example, the wisdom of the crowd – it might not be the right crowd for us. Think about the student who is forced to change his major because he was “off track” for too long, so the algorithm assumed a low likelihood of success (read more about this here). Such strict rules might mean the end of “out-of-the-box” thinking.

Our world is full of wrong predictions – even if they are based on data – and a wrong prediction might easily destroy our future. But when we learn as consumers to take predictions as what they are – as likelihoods that advise (but do not dictate) our lives – predictions based on data will benefit all of us.

Do you want to learn more? Subscribe to my newsletter for more free resources on data products.

(The article was originally published by FORBES)[/vc_column_text][/vc_column][/vc_row]

Binladen Map

Knowing Osama’s Whereabouts

In May 2011, United States Special Forces descended on the hideout of Osama bin Laden, leader of the terrorist group al-Qaida. The ensuing raid killed bin Laden after about a decade of living in hiding and directing attacks through his followers. So who knew where he was located?

The answer may surprise you:

We ALL did.

According to Kalev Leetaru, a researcher at the University of Illinois at Urbana-Champaign, an analysis of public news articles about bin Laden pinpointed his location to within an area 200 kilometers in diameter. In a very real sense, one of the world’s most secretive hiding places may have ultimately revealed itself from the mosaic of individual data points. Each journalist had an opinion about the location, and all opinions together formed a true answer. The catch here: no survey was conducted, and no journalist was actually asked where they thought Osama was hiding. They revealed their opinions on bin Laden’s whereabouts through their articles. This is the power of public and unstructured data.
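How many individual guesses can add up to one usable answer can be shown with a toy aggregation. This is not Leetaru's actual method, which is not described here; it is a minimal sketch that assumes each article has already been reduced to one location guess on a flat grid (in kilometers) and takes the coordinate-wise median so that a few far-off guesses do not drag the result away.

```python
from statistics import median


def aggregate_location(guesses: list) -> tuple:
    """Combine (x, y) location guesses into a single estimate using per-axis medians."""
    xs = [x for x, _ in guesses]
    ys = [y for _, y in guesses]
    return (median(xs), median(ys))


# Hypothetical guesses extracted from articles; one is a clear outlier.
article_guesses = [(10, 20), (12, 18), (11, 22), (300, -50), (9, 21)]
estimate = aggregate_location(article_guesses)
```

No single guess needs to be right. The estimate comes from where most of the guesses cluster, and the outlier barely matters.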


Most likely, the US forces did not rely on crowdsourced wisdom like this. We know today that US government agencies like the NSA are tapping into all kinds of different data sources, from spying on the phones of top-level politicians to tapping into everyone’s communications through email providers. However, the principle is the same: actionable intelligence was derived from the aggregation of individual and, in this case, seemingly random data points.

Herein lies the promise of what we call big data. It has become one of the trendiest digital buzzwords of the new millennium. It involves gathering business intelligence from the heretofore forbidden territory of data sets that are too large to curate or maintain as databases, often encompassing terabytes or even petabytes of information.

The text is part of the INTRO chapter from Ask – Measure – Learn.