"Big data" is the jargon du jour, the tech world's one-size-fits-all (so long as it's triple XL) answer to solving the world's most intractable problems. The term is commonly used to describe the art and science of analyzing massive amounts of information to detect patterns, glean insights, and predict answers to complex questions. It might sound a bit dull, but from stopping terrorists to ending poverty to saving the planet, there's no problem too big for the evangelists of big data.
"The benefits to society will be myriad, as big data becomes part of the solution to pressing global problems like addressing climate change, eradicating disease, and fostering good governance and economic development," crow Viktor Mayer-Schönberger and Kenneth Cukier in modestly titled Big Data: A Revolution that Will Transform How We Live, Work, and Think.
So long as there are enough numbers to crunch -- whether it's data from your iPhone, grocery store purchases, online dating profile, or, say, the anonymized health records of an entire country -- the insights that can be gleaned from our growing ability to decode this raw data are innumerable. Even Barack Obama's administration has jumped on the bandwagon with both feet, releasing on May 9 a "groundbreaking" trove of "previously inaccessible or unmanageable data" to entrepreneurs, researchers, and the public.
"One of the things we're doing to fuel more private-sector innovation and discovery is to make vast amounts of America's data open and easy to access for the first time in history. And talented entrepreneurs are doing some pretty amazing things with it," said President Obama.
But is big data really all it's cracked up to be? Can we trust that so many ones and zeros will illuminate the hidden world of human behavior? Foreign Policy invited Kate Crawford of the MIT Center for Civic Media to go behind the numbers. —Ed.
"With Enough Data, the Numbers Speak for Themselves."
Not a chance. The promoters of big data would like us to believe that behind the lines of code and vast databases lie objective and universal insights into patterns of human behavior, be it consumer spending, criminal or terrorist acts, healthy habits, or employee productivity. But many big-data evangelists avoid taking a hard look at the weaknesses. Numbers can't speak for themselves, and data sets -- no matter their scale -- are still objects of human design. The tools of big-data science, such as the Apache Hadoop software framework, do not immunize us from skews, gaps, and faulty assumptions. Those factors are particularly significant when big data tries to reflect the social world we live in, yet we can often be fooled into thinking that the results are somehow more objective than human opinions. Biases and blind spots exist in big data as much as they do in individual perceptions and experiences. Yet there is a problematic belief that bigger data is always better data and that correlation is as good as causation.
For example, social media is a popular source for big-data analysis, and there's certainly a lot of information to be mined there. Twitter data, we are told, shows that people are happier when they are farther from home and saddest on Thursday nights. But there are many reasons to ask questions about what this data really reflects. For starters, we know from the Pew Research Center that only 16 percent of online adults in the United States use Twitter, and they are by no means a representative sample -- they skew younger and more urban than the general population. Further, we know many Twitter accounts are automated response programs called "bots," fake accounts, or "cyborgs" -- human-controlled accounts assisted by bots. Recent estimates suggest there could be as many as 20 million fake accounts. So even before we get into the methodological minefield of how you assess sentiment on Twitter, let's ask whether those emotions are expressed by people or just automated algorithms.
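The sampling problem described above is easy to demonstrate with a toy simulation. Every number in the sketch below is invented for illustration (the 16 percent figure from Pew is the only one drawn from the text): if the people who post are systematically younger and more upbeat than the population at large, the platform's average "mood" will overstate how positive everyone actually feels, no matter how many tweets you collect.

```python
import random

def estimate_bias(seed=42, n=100_000):
    """Compare the true average sentiment of a simulated population with
    the average visible on a Twitter-like platform whose users skew young.
    Sentiment is a score in [-1, 1]; all parameters are hypothetical."""
    rng = random.Random(seed)
    true_total, sample_total, sample_n = 0.0, 0.0, 0
    for _ in range(n):
        young = rng.random() < 0.3  # assume 30% of the population is "young"
        # Hypothetical: younger people express somewhat more positive sentiment
        sentiment = rng.uniform(-0.2, 1.0) if young else rng.uniform(-1.0, 0.4)
        true_total += sentiment
        # Posting rates are skewed: the young post far more often,
        # so the visible sample over-represents their sentiment
        posts = rng.random() < (0.4 if young else 0.06)
        if posts:
            sample_total += sentiment
            sample_n += 1
    return true_total / n, sample_total / sample_n

true_avg, twitter_avg = estimate_bias()
```

With these made-up parameters, the true population sentiment is slightly negative while the platform average comes out clearly positive: collecting ten times more tweets would not shrink the gap, because the error comes from who posts, not from how many posts you have.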
But even if you're convinced that the vast majority of tweeters are real flesh-and-blood people, there's the problem of confirmation bias. For example, to determine which players in the 2013 Australian Open were the "most positively referenced" on social media, IBM conducted a large-scale analysis of tweets about the players via its Social Sentiment Index. The results put Victoria Azarenka at the top of the list. But many of those mentions of Azarenka on Twitter were critical of her controversial use of medical timeouts. So did Twitter love her or hate her? It's difficult to trust that IBM's algorithms got it right.
Once we get past the dirty-data problem, we can consider the ways in which algorithms themselves are biased. News aggregator sites that use your personal preferences and click history to funnel in the latest stories on topics of interest also come with their own baked-in assumptions -- for example, assuming that frequency equals importance or that the most popular news stories shared on your social network must also be interesting to you. As an algorithm filters through masses of data, it is applying rules about how the world will appear -- rules that average users will never get to see, but that powerfully shape their perceptions.
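The baked-in assumptions described above can be made concrete in a few lines. The scoring rule below is a deliberately crude invention, not any real aggregator's logic: it equates share counts with importance and past clicks with future interest, which is exactly the kind of hidden editorial judgment an average user never gets to see.

```python
def rank_stories(stories, user_clicks):
    """Rank news stories by popularity plus topic affinity.
    `stories` is a list of (title, topic, shares); `user_clicks`
    maps a topic to the user's past click count. All weights and
    field names here are hypothetical, chosen only to expose the
    'popular = important' assumption as an explicit rule."""
    def score(story):
        title, topic, shares = story
        # frequency-equals-importance, plus a bonus for familiar topics
        return shares + 10 * user_clicks.get(topic, 0)
    return sorted(stories, key=score, reverse=True)

stories = [
    ("Niche investigative report", "politics", 40),
    ("Viral celebrity story", "celebrity", 5000),
    ("Local council budget", "politics", 12),
]
ranked = rank_stories(stories, {"politics": 3})
```

Here the viral celebrity item outranks the investigative report no matter how much the user has engaged with politics in the past -- a small, legible example of a rule about how the world will appear that, at scale and behind closed doors, powerfully shapes perception.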
Some computer scientists are moving to address these concerns. Ed Felten, a Princeton University professor and former chief technologist at the U.S. Federal Trade Commission, recently announced an initiative to test algorithms for bias, especially those that the U.S. government relies upon to assess the status of individuals, such as the infamous "no-fly" list that the FBI and Transportation Security Administration compile from the numerous big-data resources at the government's disposal and use as part of their airport security regimes.