When Is Big Data Enough Data?

Big data has arrived. And I am not the only one saying so. The New York Times, The Economist, The Wall Street Journal, TED, and many others have all run recent pieces on this topic. But the practical and ethical implications of this trend are not yet well defined.

Big data is a loosely defined term that refers to data that is often unstructured, such as text, images, archives, and web logs, among others. Big data, as its name suggests, comes in voluminous quantities. Structured and unstructured data are more readily available than ever before and, perhaps most importantly, the computer-assisted means to crunch all these bits and pieces of data are becoming more common and more powerful.

Political science has not been left out of this revolution. It is no coincidence that the New York Times piece about the Big Data trend highlighted political scientists:

Justin Grimmer, for example, is one of the new breed of political scientists. A 28-year-old assistant professor at Stanford, he combined math with political science in his undergraduate and graduate studies, seeing “an opportunity because the discipline is becoming increasingly data-intensive.” His research involves the computer-automated analysis of blog postings, Congressional speeches and press releases, and news articles, looking for insights into how political ideas spread.

And the piece continues:

“It’s a revolution,” says Gary King, director of Harvard’s Institute for Quantitative Social Science. “We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”

While the Big Data trend has made its way into political science, I would argue that it has been restricted to those in the discipline primarily concerned with methodology and research design. So far, substantive knowledge on comparative politics, American politics, and other subfields has not been greatly influenced by the influx of new sources of data.

But governments are not moving at the same pace as the discipline. In Brazil, the government is in fact moving faster.

In recent years, the Brazilian federal government has been making available online an impressive amount of electoral, budgetary, geographic, social and economic data on Brazil. The country has no doubt come a long way since 1990, when the census was delayed a full year due to mismanagement and political maneuvering, breaking the ten-year census cycle that had been established in 1940.

The recent Law of Access to Information has made this trend of open governance combined with huge amounts of data even more salient. Since 16 May 2012, the federal government and many state and municipal governments have been making data available online about their budgets, bureaucracies, and employees. Moreover, citizens now have an online tool to demand more information.

The most controversial example of the manner in which this law has been enforced comes from the São Paulo Municipal Chamber. All of the Chamber’s personnel have had their names, official job titles, and detailed paychecks made public on the Internet.

It is not difficult to imagine what has happened since. In such an unequal country, in which citizens’ opinions of the legislative power oscillate from complete execration to total obliviousness, most people were outraged at the high paychecks and apparent distortions. An average municipal legislator in São Paulo makes seven times more than the average per capita income in São Paulo (about one thousand reais per month, a little less than 500 dollars). But, somewhat surprisingly, the main targets of public discussion were the salaries of the Chamber’s employees. More specifically, a garage manager and a nurse who received eighteen times more than the average income were used over and over again as symbols of the misuse of taxpayers’ money (read about this in the Brazilian media and The Economist).

The public employees’ unions and professional associations argued that publishing the salary information could harm the safety of these workers, since most of them are clearly well-off compared to most Brazilians. But, given the widespread, stereotypical impression among Brazilians that public employees are a privileged bunch, with unfair retirement and vacation benefits and less-than-demanding jobs, their complaints fell on deaf ears. The publication of the information was seen as an exposé of the alleged unfairness and distortions of the public system.

These events raise at least two issues related to the rise of Big Data in governments. First, we might have more data than ever, but are we asking the right questions with this data? Second, the distinction between public and private data is not always clear-cut: are we crossing the line in how we use public data?

Here I will focus only on the second question and leave the first (more difficult) one for a future post.

As a researcher very much in love with big data sets, I cannot say that I do not appreciate this new development. I would go even further: I would love to have access to income data of the kind Piketty and Saez obtained from the IRS in the United States. Their work has been highly influential in informing us about the growing inequality between the top 1% and the bottom 99%.

But we could all use a bit of common sense: is it necessary, for the sake of transparency, to release the full names and the salaries of the employees online? Would it not suffice to have the public employees’ paychecks, organized by job titles and a sort of identification code (the last three digits of their social identification number, for instance)? Is it necessary to put the public employees in a potentially dangerous position or, to say the least, make them socially uncomfortable?
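
To make that alternative concrete, here is a minimal sketch of what such a release could look like. Everything in it is an assumption for illustration: the field names, the sample records, and the `pseudonymize` and `median_pay_by_title` helpers are hypothetical, and the last three digits of an ID number are only a weak identifier, not a guarantee of anonymity.

```python
from collections import defaultdict
from statistics import median


def pseudonymize(records):
    """Return public rows without names: job title, pay, and a short ID code."""
    return [
        {
            "job_title": r["job_title"],
            # Keep only the last three digits of the (hypothetical) ID number.
            "id_code": r["social_id"][-3:],
            "monthly_pay": r["monthly_pay"],
        }
        for r in records
    ]


def median_pay_by_title(public_rows):
    """Summarize pay per job title, which is what institutional scrutiny needs."""
    grouped = defaultdict(list)
    for row in public_rows:
        grouped[row["job_title"]].append(row["monthly_pay"])
    return {title: median(pays) for title, pays in grouped.items()}


# Made-up internal records (names, IDs, and amounts are illustrative only).
records = [
    {"name": "Employee A", "social_id": "00000000191", "job_title": "Aide", "monthly_pay": 7000},
    {"name": "Employee B", "social_id": "00000000272", "job_title": "Aide", "monthly_pay": 6000},
    {"name": "Employee C", "social_id": "00000000353", "job_title": "Nurse", "monthly_pay": 6500},
]

public_rows = pseudonymize(records)
summary = median_pay_by_title(public_rows)

for row in public_rows:
    print(row)
print(summary)
```

A release like this would still let anyone spot an unusually high paycheck for a given job title and ask the institution to explain it, without attaching a person's full name to the number.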

After all, the government’s transparency is not about the individual – not most of the time, anyway. Institutions and elected officials should be held accountable, not individuals or common citizens – at least not in this case. Public employees, unless proven otherwise, are not unlawfully receiving their salaries. Government transparency and data on government are laudable goals, but there are lines that should not be crossed.

Ironically, the garage manager, who received more than 18 times the average income in São Paulo and became the symbol of distortion in the public system, was actually not working in the garage anymore. He was an aide to a councilman and his pay was about seven thousand reais, just slightly higher than the Chamber’s average of six thousand and five hundred reais (approximately 3 thousand dollars). This correction timidly made the news, but it did not make the headlines.

3 thoughts on “When Is Big Data Enough Data?”

  1. I like this post. As a research manager for a publication in Washington DC, my last (insert large number) articles have been about Big Data in the federal government. Private-sector companies, federal agencies, and media outlets are trying to get ahead of the Big Data initiative announced by the Obama administration, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release.pdf. In doing so, we are all trying to make sense of it and, like you said, ask the right questions.
    Despite this, I am not sure how we can get to the right questions. The endless amount of information is making agencies and companies reactive to the new initiative. So for now we are asking: how do we synthesize it, and what solutions can we offer to organize it? Washington isn’t asking what the repercussions of this data are. Instead, we are in the nascent phase of asking, what the hell do I do with it?

    • I completely agree with you that we don’t really know what to do with this information. It’s easy to say we have to ask the right questions, but it’s a whole different thing to actually identify good questions and ask them. The reason I discussed the implications of this trend is simply that its repercussions were neglected by most of the media outlets that actively used these data in this very specific event in Brazil. Oh, and thanks for the link. There’s actually a big Brazil-US partnership about open governance (which is also affected by Big Data).

  2. Hahaha. This whole “what is big data?” thing is cracking me up. It’s just applied statistics! It’s a mathematical method of attaining knowledge about the world, like science is an empirical method. And it’s pretty much on steroids thanks to computer technology.

