Predicting is Nothing New?

The magician Alexander Seer promised as early as 1915 to KNOW, SEE and TELL it all. It took about a century longer until his marketing promises started coming true. The magical new term is PREDICTION.

Predictive analytics is nothing new for quants. Your insurance company already has lots of predictions about you. It knows how likely it is that you will crash your car tomorrow or become sick. Retailers, too, study your behavior very closely in order to offer you the right advertisements. (Here is a brilliant NYTimes article on how the retailer Target started to predict customers’ pregnancies based on their purchase histories.)

But discussions of big data and predictive analytics are all the rage. Google searches for “predictive analytics” have more than doubled in the last two years. Why is that?

There are two main reasons for this increased focus: A) more data is publicly available, and B) the technology to process large amounts of data now exists.

More Data

Today, more and more data is easily accessible to everyone. This is new: the data your insurance company has about you and your peer group is highly confidential. But today Twitter gives out some parts of its 140 million tweets, and Stack Overflow regularly makes its complete set of questions and answers available for download. In addition to those companies, there are data markets offering access to all kinds of data, free or paid.

Data today represents a new class of economic asset. Internet-savvy users are very much aware that they pay with their own data in the freemium model. As Andrew Lewis put it: “If you’re not paying for something, you’re not the customer; you’re the product being sold.”

Better Technology

The second reason for the high hopes for predictive analytics is that today, the technology exists to deal with large quantities of decentralized data. In the past, predictive analytics meant ‘loading’ highly structured data into a big data warehouse and processing all of it there. That approach breaks down when the quantity of data is unknown, and when the data can be stored anywhere, in any quality and any structure.

A yellow stuffed elephant named Hadoop became the logo for this new technology. Apache Hadoop can query very large, distributed, loosely structured datasets.

  • Instead of keeping all data in one database, one can work with distributed databases.
  • Instead of doing all the processing on one server, one can distribute the work across many systems.
  • Instead of using only structured content, one can also work with unstructured content.
  • Instead of ‘hindsight’ results, a business can get ‘near real-time’ results.
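The bullet points above describe the MapReduce pattern at the heart of Hadoop: each node counts over its own chunk of data, and the partial results are then merged. A minimal sketch in plain Python (the chunks and function names here are illustrative, not Hadoop’s actual API):

```python
from collections import Counter
from functools import reduce

# Illustrative corpus, split into chunks as if stored on different nodes.
chunks = [
    "know see tell",
    "see and tell all",
    "know all",
]

def map_phase(chunk):
    """Map step: each node counts words in its own chunk, independently."""
    return Counter(chunk.split())

def reduce_phase(counts_a, counts_b):
    """Reduce step: merge the partial counts from two nodes."""
    return counts_a + counts_b

# In a real cluster the map calls run in parallel on many machines.
partial_counts = [map_phase(c) for c in chunks]
total = reduce(reduce_phase, partial_counts, Counter())
print(total["tell"])  # 2
```

The point is that no single machine ever needs to hold all the data: the map step touches only local chunks, and only the small partial counts travel over the network to be reduced.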

Read here for one of the many overview articles on big enterprise data. Cloudera, Hadoop, MapReduce, and Greenplum are about to create a new revolution, one that could easily be bigger than the one created by Web 2.0. This technology will enable massive amounts of data to be analyzed, crunched and compared, which in turn should allow us to predict and to act: exactly what Alexander Seer promised, to KNOW, SEE and TELL it all.