Category Archives: Blog

Racial Injustice In NYC Revealed By Data

Every morning, New York City police officers receive insights from computers, directing them to areas where crime is likely to occur. If that sounds similar to the 2002 sci-fi movie Minority Report, it’s because it is quite similar. While the NYPD may not use futuristic “precogs” to target specific individuals before they commit a crime, the department does use a computer program to identify “hotspots” where crime is likely to occur. But in both the movie and in New York, the prediction is just that — a prediction. The actions taken by the police are the reality, and unfortunately, sometimes a racially unjust reality.

We analyzed the NYPD’s stop-and-frisk program and found that while the overall number of incidents has declined significantly since a redesign of the policy, there is an unsettling increase in racial imbalance in at least eight NYC precincts.

Stop and Frisk

The “stop-and-frisk” program

From its inception, the stop-and-frisk program sought to reduce crime by giving police officers the authority to identify and search suspicious individuals for weapons and contraband. The tactics gained greater traction with the implementation of CompStat in the 1990s, a process that employed Geographic Information Systems to map crime and identify problems, taking past criminal trends in neighborhoods into account and allowing officers to quickly address crime spikes. Over the past two decades, the practice evolved into the “stop-and-frisk” program we know today, where at one point stops exceeded fifty thousand a month, largely targeting minorities. In 2012, racial minorities accounted for 92% of all stop-and-frisk incidents. The disparate impact of the stops on minorities led to public outcry and received extensive coverage from The New York Times, Slate, and The Atlantic, eventually resulting in a legal case against the city.

Today, the NYPD is hoping to improve its policing tactics and is conducting a two-year trial run with predictive policing software, Hunchlab. While the Hunchlab algorithm does not take into account individual characteristics such as race, ethnicity, or gender, it does incorporate factors such as “socioeconomic indicators; historic crime levels; and near-repeat patterns to help police departments understand and respond more effectively to crime.” This raises doubt as to whether the software will reduce the disproportionate burden of current policing tactics on minority communities. In the meantime, we should be asking what individual precincts can learn from one another and what they can do to improve their policing practices for all.

Breaking it down by precinct

While prior analyses of the stop-and-frisk program mainly focused on the overall discriminatory practices of the policy, we looked at precinct-level data to identify areas with the largest disparities between the racial makeup of a community and the racial makeup of the stop-and-frisk incidents within it. In other words, holding all else equal, in a precinct where 10% of the community is black, black residents should represent only 10% of the total number of stop-and-frisk incidents occurring there. If that number is substantially higher than 10%, then racial profiling may be a contributor to the disparity. To compare precincts, we created an index called the Racial Disparity Index (RDI).
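The exact RDI formula is not published in this post, so here is a minimal sketch of one plausible construction: score a precinct by how far each group’s share of stops deviates from its share of the population, in percentage points. Both the formula and the numbers below are illustrative assumptions, not the study’s actual method or data.

```python
# Hypothetical sketch of a Racial Disparity Index (RDI). The study's exact
# formula is not published; this version sums the absolute gaps, in
# percentage points, between each group's share of stops and its share of
# the precinct population.

def rdi(population_share, stop_share):
    """Both arguments map racial group -> fraction of the total (0..1)."""
    groups = set(population_share) | set(stop_share)
    total_gap = sum(
        abs(stop_share.get(g, 0.0) - population_share.get(g, 0.0))
        for g in groups
    )
    # Divide by 2 so over- and under-representation aren't double-counted.
    return total_gap * 100 / 2

# Illustrative precinct: black residents are 10% of the population
# but 40% of the stops.
score = rdi(
    {"black": 0.10, "white": 0.80, "other": 0.10},
    {"black": 0.40, "white": 0.50, "other": 0.10},
)
```

Under this construction, a precinct whose stops exactly mirror its population scores 0, and larger scores mean larger imbalance, which matches how the index is used in the text.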

Our study showed that neighborhoods with the greatest racial disparity between the race of people stopped and the racial makeup of the residents are predominantly white neighborhoods, with the exception of Chinatown. These neighborhoods include the Upper East Side, Greenwich Village, Upper West Side, Park Slope, Tribeca, Soho, Brooklyn Heights, Midtown East, and Chinatown. For example, Precinct 19, part of the Upper East Side, recorded the highest level of racial disparity in 2015, reaching a score of 31.41 on our index. Out of the 278 stops conducted there, nearly half were against black residents, even though they make up a mere 2.3% of the population.

RDI across all precincts

Many precincts showed an increase in Racial Disparity Index (RDI) scores over the past five years, even after the NYPD enacted reforms in an attempt to reduce racial profiling within the program. In 2012, the average RDI across all precincts was 7.6; by 2015, after the program changed, it had risen to 8.9, indicating that the stop-and-frisk program has become more imbalanced. Said differently, if you are not part of the racial majority of a specific community, you still have a higher likelihood of being stopped. In 2015, out of the 270 stops that occurred in Precinct 84, Brooklyn Heights, three-fourths were carried out against black residents, a minority group in the community.

It is unclear what leads to this increased disparity over time, but we see several plausible explanations. For one, the way crime is committed may have changed between 2012 and today: criminals are mobile and may move into other precincts, shifting the RDI, in which case we are simply seeing a reflection of that movement. On the other hand, the system itself could be skewed, and there are two possible elements here: the human and the machine.

Something is working

Undoubtedly, racial discrimination has plagued the stop-and-frisk program from the beginning. As one can see in the image below, public outcry and the resulting policy reform led to a dramatic decrease in the number of stops conducted.

Number of Stops

At the same time, the overall effectiveness of the policy, as measured by the percentage of stops in which contraband is recovered, has increased: the stops have become fewer but more effective.

Our analysis breaks down the success of the program by precinct. For example, Precinct 72, which includes Greenwood and Sunset Park, has managed to lower the racial disparity of stop-and-frisk occurrences while simultaneously increasing the effectiveness of its tactics from 3.69% to 7.38% over the past five years. The racial makeup of the precinct is well mixed between Asian, Hispanic, and white residents. Meanwhile, the adjacent Precinct 78 (Park Slope) has seen the greatest increase in disparity, from an RDI score of 18.17 in 2010 to 25.13 in 2015, while only marginally improving its contraband discovery. This suggests that the increased policing of racial minorities has not proven effective at recovering contraband in the precinct.

Percentage of Contraband

Other Possibilities

Racial disparity is not the only issue predictive policing algorithms face. Keep in mind that we use “percent of contraband found” as the program’s key performance indicator because the city uses this criterion, but it might be a misleading metric. For example, assume the amount of contraband in the city is increasing as the city becomes less safe. The likelihood of finding contraband during a stop-and-frisk incident would then increase as well. In that scenario, a rising hit rate doesn’t indicate a successful program; it indicates the opposite. Moreover, these algorithms only spot correlations based on past data. Sending police to a hotspot does not expose the underlying cause of why the crime actually occurred.

While predictive policing is a booming sector, with software coming from companies such as Azavea, PredPol, and Hitachi, an analysis of the data from the NYPD’s stop-and-frisk program reveals legitimate concerns about the data-driven approach. How city tactics change over time will need to be closely monitored to ensure a fair and non-discriminatory approach, while also examining the root causes of criminal activity.

Who we are

The analysis of the NYPD’s stop-and-frisk program was performed by Maciej Szelazek, Maggie Barnes, and Derek Cutting together with Lutz Finger as part of his course on Data Products at Cornell Tech. Further insights into the data can be seen here.

This article was first published at Forbes.

The Force Awakens In Data – Industry Leaders Comment

How does the latest “Star Wars” movie parallel the data industry? Both have a force that awakens. However, while Rey, the tough scavenger in “Star Wars” (played well by Daisy Ridley), saw her force come to life in less than 30 minutes, the data industry has been waiting for that to happen for half a decade. Finally, we seem to be at the tipping point, at least if you believe the latest Gartner report. Working with data has shifted, according to the report, “from IT-led enterprise reporting to business-led self-service analytics.” In other words, business-focused analytics and data discovery are on the rise.

For a while, the common consensus was that the hardest part of data science would be finding the actionable insights. A data platform should empower business users, offering them easy and agile ways of working with data directly, without the need to go through an IT or BI department. The reality in many companies around the world, unfortunately, was far from that. Over the last few years, we have seen a lot of innovation: some tools even let users simply type their business questions, and an algorithm translates them into a data query. Let’s look at the areas a data-enabling platform will need to cover.
Has the Force Really Awakened In Data

An easy way to load data

Data scientists often complain that they feel more like “data janitors.” Most of their time is taken up by extracting the data (e.g., from a website), transforming it (e.g., cleaning it up), and loading it into a database one can actually work with. Especially in companies that do not have their natural foundation in data, this can be a daunting task. A data platform needs not only to connect to different data sources but also to simplify this process for the ‘force’ to awaken, as Joe Hellerstein, founder of the platform Trifacta, thoughtfully pointed out:

If 80% of the work is data wrangling, then the biggest productivity gains can be found there. Business users need platforms that let them see their data, assess quality, and transform the data into shape for analysis. (@joe_hellerstein)

Data Wrangling Is The Hardest Part

Analytical agility

If your sales went down, you would want to know, right now, why that happened. And, if you have data that no one else has, you will want to play around with new product ideas.

Agility, a concept we know well from the software development world, has made its way into the data world. Stefan Groschupf, CEO and founder of Datameer, pointed out that the ‘force’ only awakens under the following condition:

For real business value, an analyst should be able to dig into their data to understand what’s in it and what new insights it can reveal. A good data platform needs to allow an easy exploration of data with the fastest time to insight without needing engineering skills or knowledge. (@StefanGroschupf)

Governance and metadata

The easier it is to explore data, the more people will do it. We saw this in the mid-2000s with the onset of social media analytics. Platforms offered so-called insights with the ease of a mouse click. Suddenly, business folks created a plethora of new metrics, many of them highly useless, as I pointed out in my book, Ask Measure Learn.

But high-quality data is unquestionably a prerequisite to sound decision making, and it is the most important criterion for any organization. Thus, in the last few years, data governance and data lineage have become focal points of the industry. William Kan, Senior Product Manager at LinkedIn, who created a scalable way to define and manage metrics via a unified data platform at LinkedIn, explains:

People often associate the term governance with unnecessary overhead and processes. Our idea of governance is to put checks and balances in place with the goal of providing the best data quality while at the same time being as low touch as possible via machine learning and automation. (@WillKanman)

Checks & Balances - Ensures Quality Data

Accounting

Accounting is a notion that Josh Parenteau, Rita L. Sallam, Cindi Howson, Joao Tapadinhas, Kurt Schlegel, and Thomas W. Oestreich did not mention in their reports. However, data processing comes at a cost, so no one should be surprised that there will soon be a need to ‘regulate’ the usage of data platforms. Not everyone should be able, with a few clicks, to bring the server down (or our budget, now that the server scales into the cloud). But hold on… didn’t we try very hard to make data accessible? Correct. Thus, this time, we should not make it more complex; we should ask for higher accountability. As Gregory Piatetsky-Shapiro, the well-known data scientist and co-founder of the KDD conferences, said:

A more complex machine learning algorithm might not always drive the wanted insight. Organizations will need to balance what is feasible and what is useful. (@kdnuggets)

Automated insights

I normally start my courses at Harvard with the question, “Do we have this kind of hype in data?” The answer is “Yes,” as it’s all about pattern recognition. Hidden patterns in large datasets help predict user behavior and improve the value proposition of our products. Typically, a data scientist or an analyst will dig through the data to surface patterns such as statistically relevant correlations or outliers.

This process can be automated to a good degree, which is where automated advanced analytics comes into play. Automated analytics is like the industry’s ‘Death Star’: with one stroke, a group of algorithms goes through the data in parallel to detect correlations, clusters, outliers, anomalies, linkages, trends… you name it. It’s the brute-force approach.

But a correlation by itself might not make an insight, let alone create an action; see the graph below showing the correlation between the divorce rate and margarine consumption. Or as Gartner formulated it, “most business users do not have the training necessary to accurately conduct or interpret analysis.”

Spurious correlations
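To see how easily a brute-force scan surfaces a strong but meaningless correlation, here is a small sketch with made-up numbers (not the actual divorce-rate or margarine figures): two series that merely trend downward together produce a near-perfect Pearson correlation, exactly the kind of “finding” an automated scan would flag.

```python
# Two co-trending but causally unrelated synthetic series. Any automated
# correlation scan would flag them as strongly related, although neither
# causes the other. (Illustrative numbers only.)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

divorce_rate = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]   # made up
margarine_lbs = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]  # made up

r = pearson(divorce_rate, margarine_lbs)  # close to 1.0, yet meaningless
```

The near-perfect score comes purely from the shared downward trend, which is why tools like BeyondCore try to warn users about hidden factors before they act on such a graph.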

This is where BeyondCore gets involved. Arijit Sengupta, founder of BeyondCore, built this platform not only to surface all kinds of correlations but also to warn the business user about potential hidden factors, in an effort to protect the user from statistically unsound decisions.

Most business users see a pretty graph and think they can take action based on it. In reality they regularly take actions based on misleading or misunderstood graphs. Automated analysis software needs to guide us to statistically sound, actionable insights and explain the patterns in detail so that business users can act with confidence. (Arijit Sengupta)

Visualization

We all know that a picture says more than a thousand words. Thus, a data platform needs to be visual and allow the business user to showcase the most important insight. With the onset of HighCharts, we have seen many companies try to outdo their competitors with a superior number of chart types. But be aware: one can create a good-looking visualization even without actionable insights. We call this “beautiful but useless.” As Meta Brown, author of the book Data Mining for Dummies, rightly said:

Tools are just… tools. They should not define your work. It’s your job to understand the problem, identify goals and choose the tools and processes that help you reach those goals as easily as possible. (@metabrown312)

Looks beautiful – if only I knew what it means

Outside world

More power leads to more publicity. In the past, the BI team and their insights were tucked away in the basements of companies. But now data has become a first-class citizen within those companies, so it is no surprise that communicating insights to the outside world has become important. Only insights that are seen can be acted on. Thus the new set of data platforms makes it easy to publish findings to an internal audience as well as to embed those insights into products and services for customers. As Chris Wintermeyer recently told me over dinner:

Much of the success of any data platform will hinge on the way the insights generated are shared and discussed. (@ChrisAtDomo)

The future

With the business force awakening, the future seems bright. Most companies by now have the right vision; that’s not really hard, since we have talked about the data needed ‘now’ for at least half a decade. However, the Gartner magic quadrant does not list any ‘challenger’. Is this the end of innovation in the data space?

Maybe. But maybe the true challenges today are no longer within technology as such, but in striking a balance: using insights to support our decisions rather than letting them determine our actions blindly. As the Netflix CEO, Reed Hastings, recently pointed out: “Data has a support function.” (@reedhastings)

For humans, with or without the force, that rule still holds true: “Actionable your data must be.”

Future

This article was originally published on my Forbes Blog.

The New Facebook Likes

Social media is becoming more and more like a survey tool. You can now express different types of “likes.” Thumbs up or not? It’s your choice! “Not every moment is a good moment,” said “Zuck.” These new stickers open up a lot of new possibilities to survey or analyze people’s posts. But as always, more options mean less data per option, and less data makes analytics more complex. I am curious to hear how the analytics team will handle it.


The full overview was reported by the Next Web:

Facebook Like stickers


How Cornell Trains Future Data Managers

How do you create amazing new data-driven products? By studying deep, deeper, and the deepest machine learning algorithms? Nope: by enabling business folks to talk and think data.

Recently I had the opportunity to test this theory. I taught a course for MBA students at Cornell University’s Johnson Business School and Cornell Tech campus. In the course, we covered the basic components of data science, including modeling, visualization, scraping technologies, and databases. At the end, each team built a stunning data app: from predicting future Starbucks locations to setting prices for old vinyl records.

data to manager

It is the business mindset that drives many of our product ideas. Thus, while data scientists are tough to hire (I am hiring at LinkedIn; join our great team), it is the business focus that is missing. McKinsey recently called out that we face a shortage of about 150,000 data scientists as well as 1.5 million managers. Really? Are those data scientists so hard to manage that we need ten managers to take care of each of them? No. The truth is that we are missing data-minded managers, because we need to include a business-related component in any data discussion. If you hear “actionable analytics,” it only means something like “business thinking inside.” Analytics should be focused on an action to improve or change our business.

My course at Cornell had two main objectives: to take away the fear of big data and to create a common language for MBAs to use. But how do we “take away the fear”? The answer is by building your own data applications. Many data science syllabi teach students tools such as R (a programming language for statisticians), Python, and the like. Don’t get me wrong: these are great tools, but they are useless for teaching MBAs. No one will remember a few months down the road how to even load a dataset.

Thus I focus on simplicity. For data scraping we used import.io (a great tool founded by Andrew Fogg), for visualization we used plot.ly (a very simple visualization tool by Matt Sundquist), and for the predictive layer we used either BigML or Excel. Yes, Excel. It can, with a bit of hand-holding, recreate many data science models. It is learning by doing. If you want to dig into this, I highly recommend the book Data Smart by John W. Foreman.
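As an illustration of how simple many of these models really are, here is the kind of ordinary-least-squares fit that Excel’s LINEST function (or a chart trendline) reproduces. The record ages and prices below are made-up numbers for the sketch, not data from the course.

```python
# Minimal ordinary-least-squares fit of a line y = slope * x + intercept,
# the same model Excel's LINEST computes. Data is illustrative only:
# record age in years -> price in dollars.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx

ages = [10, 20, 30, 40]
prices = [12.0, 18.0, 25.0, 31.0]

slope, intercept = fit_line(ages, prices)
predicted = slope * 25 + intercept  # price estimate for a 25-year-old record
```

The point of using Excel in class is exactly this: once students see that a “predictive model” can be two summary statistics and a division, the fear is gone.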

data_product_thinking

Scraping, plotting, and crunching are important… but it does not take long for any smart MBA to ask the “so what?” question. So we start with a framework of actionability and applicability discussed in my recent book Ask Measure Learn (find it here). Students learn that data and algorithms are nothing if you cannot create an action or build a product.

The course finishes with a term project where teams can use any data they can get, as well as any complexity of model, as long as they define their own product use case for the app. This complete freedom created amazing results. One student, an avid collector of vinyl records, built an engine to better determine their price point. Another team built an app to find the best location for party-and-bar-loving MBAs at a given price. Yet another team analyzed feedback from various classes to determine how professors need to change to improve their teaching.

More importantly, some of these projects went far beyond being cool and made a very strong business case. Here is one example from our course that showcases what MBAs can do with just a little data science background:

  • Jacob Jordan predicted with an 80% likelihood where Starbucks will open its next store. With his model, he went even further and analyzed the claim that Starbucks drives gentrification, but could not find a high correlation with typical gentrification factors.

At the end of the course there were many high-quality business apps powered by data. A complete list of all project videos can be found here. This Cornell course is proof, for me, that in order to unleash amazing capabilities for innovation, companies need to teach business managers basic data science techniques.

This fall I will teach this course again at Harvard Business School (together with Prof. Datar) as well as at Cornell’s Johnson School. Let’s see what kind of innovation we will get!

lutz-cornell

(Lutz Finger talks about his book “Ask Measure Learn” at Cornell University. Photo: Bryan Russett)

This article was first published in Forbes.


Spring 2015 – Best of Cornell

How do you create amazing new data-driven products? By studying deep, deeper, and the deepest machine learning algorithms? Nope: by enabling business folks to talk and think data. I recently had the opportunity to test this theory. I taught a course about data science for MBA students at Cornell University’s Johnson School and Cornell Tech campus. The student projects are listed here: from predicting future Starbucks locations to setting prices for old vinyl records.


Fair Rental Finder
This tool identifies whether an apartment in Boston is fairly priced.

Analyzing Yelp – Small Business Tool
A tool to help restaurant owners analyze reviews efficiently and use them effectively to improve their restaurants.

Predictive Pricing Guidance on Amazon for Vinyl Records
This tool analyzes the pricing strategy of Amazon and informs go/no-go decisions for a music executive deciding whether to sell, or even produce, vinyl records on Amazon.

Car Value by Location Estimator
This data product predicts how much you should pay for a particular car in your region and which regions have the cheapest prices for it.

Automatically populating product lines onto e-commerce sites
This tool helps simplify the listing process on sites like eBay by analyzing product descriptions and inferring attributes from them.

Zillow Predictions
This tool predicts which Zillow listings will help investors quickly recoup their investment.

Starbucks Predictions
This tool quickly identifies potential locations for new Starbucks stores. It also helps you determine whether Starbucks’ store location strategy is changing.

Professor2Rockstar
This tool analyzes students’ comments about professors and matches them with their respective ratings. The Professor2Rockstar data product can help professors determine which aspects of their course to prioritize in order to improve overall course satisfaction.

Analytics of Cornell’s Animal Health Diagnostic Center
This analytics tool helps analyze the customer feedback survey of Cornell’s Animal Health Diagnostic Center.

Nightlife and Apartment Finder
This tool helps young people to find an affordable apartment within their budget that is also close to a desired number of bars and clubs. The app analyzes apartment rentals and combines them with bar listings.

Predicting StartUp Funding Rounds
This tool aims to predict the likelihood of a successful funding round, allowing investors to gauge the likelihood of a safe and lucrative investment.