The Promise and Perils of Big Data for Social Science Research
In the late 19th century, Charles Booth painstakingly assembled color-coded maps of London describing wealth, poverty, and despair street by street. Technological advances over the following century made demographic information on the neighborhood level from the U.S. Census easily accessible to researchers and the public at large. Fast forward to the early 21st century, and large-scale, administrative or "big" data promise to transform social science research (Lazer et al. 2009). The City of New York receives millions of requests for city services that draw a detailed but flawed picture of citizen needs across time and space. Online data sources are increasingly employed to study a wide range of subjects, including spatial variations. Administrative data from student records, tax returns, and other sources are at the core of major advances in social science research. Harnessing these treasure troves opens major opportunities for social science, including the measurement of neighborhood characteristics on an unprecedented scale. However, it is important to discuss both the promise and perils of "big" data. In this essay, I reflect on recent work with large-scale, administrative or "big" data, including 311 service requests (Legewie and Schaeffer 2015), stop-and-frisk operations (Legewie 2015), and administrative student records (Legewie 2015). Based on this experience, I discuss two often-underappreciated challenges related to the data collection process and the measurement of sociological concepts.
Big data and the data-generating process
The Oxford English Dictionary defines big data as "data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges," as well as "the branch of computing involving such data." The definition is relative and ambiguous, and the general use of the term is murky. Prominent commentators admit that "there is no rigorous definition of big data" (Mayer-Schönberger and Cukier 2014). The term often refers not only to the large volume of data, but also implies an unstructured and unorganized nature. This aspect, which is sometimes treated as defining, presents unique challenges for social science research. As with conventional, often survey-based data, understanding the data-generating process is essential for working with large-scale administrative or "big" data. In contrast to traditional data sources such as surveys, however, the process is often opaque. Often there is no carefully compiled documentation, and codebooks are rudimentary. Knowledge of the data collection procedure might be confined to the few who work in the respective organization and have the technical expertise.
This problem is not unique to big data, but it is far more common. At worst, these uncertainties about the data collection process undermine the principles of social science research that call for rigorous, transparent, and reproducible data collection protocols. They make it difficult to evaluate and impossible to replicate findings. Careful processing and data cleaning can mitigate the problem and reveal information about the data-generating process. Consider the 311 data from New York. 311 is a centralized non-emergency telephone number, Internet platform, and smartphone application that allows city residents to file a request for or complain about issues as diverse as birth certificate services, fallen tree removal, or broken heating. Initial analysis showed a surprising pattern in the time of day of service requests: the number of requests spikes at midnight, day after day. Further investigation revealed that different city agencies started supplying accurate time information step by step, and the 311 system assigned 12:00 AM to all service requests without accurate time information. This quirk of the data-generating process drives the temporal pattern in the type of service requests but is entirely undocumented. While some information can be deduced from careful processing and cleaning, detailed knowledge about the data-generating process generally relies on direct access to information about procedures and practices at the respective organization and the data-recording system. This information is essential for the future use of big data in the social sciences. It allows others to assess data quality, evaluate findings, and compare results to research based on other sources.
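The kind of check that surfaces such a placeholder timestamp can be sketched in a few lines. The example below is a minimal illustration and not the procedure used in the original analysis: the agency labels, the timestamps, the timestamp format, and the 0.5 threshold are all assumptions made up for this sketch.

```python
from collections import Counter
from datetime import datetime, time

# Hypothetical records: (agency, created timestamp). All values are invented.
records = [
    ("AGENCY_A", "2015-03-01 00:00:00"),
    ("AGENCY_A", "2015-03-02 00:00:00"),
    ("AGENCY_A", "2015-03-03 00:00:00"),
    ("AGENCY_B", "2015-03-01 08:15:42"),
    ("AGENCY_B", "2015-03-01 13:02:10"),
    ("AGENCY_B", "2015-03-02 00:00:00"),
    ("AGENCY_B", "2015-03-02 21:47:05"),
]


def midnight_share_by_agency(rows, fmt="%Y-%m-%d %H:%M:%S"):
    """Return, per agency, the share of requests stamped exactly 00:00:00."""
    totals, at_midnight = Counter(), Counter()
    for agency, stamp in rows:
        created = datetime.strptime(stamp, fmt)
        totals[agency] += 1
        if created.time() == time(0, 0, 0):
            at_midnight[agency] += 1
    return {agency: at_midnight[agency] / totals[agency] for agency in totals}


shares = midnight_share_by_agency(records)

# Arbitrary cutoff (an assumption): flag agencies whose requests mostly
# carry the exact midnight stamp, which suggests a placeholder time.
suspect = sorted(agency for agency, share in shares.items() if share > 0.5)
print(shares, suspect)
```

Breaking the share down by agency matters here because a pooled hourly histogram shows only that midnight is overrepresented, not which source is responsible for it.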
Measurement of Sociological Concepts
Big data provides fine-grained information on an entirely new scale that promises to revolutionize social science research. Yet big data also poses important challenges for the measurement of key sociological concepts. Traditionally, the measurement of theoretical concepts is based on carefully designed survey items. However, researchers are generally not involved in the design and collection of large-scale administrative or big data. The operationalization of key theoretical concepts is post hoc and relies on information collected for an entirely different purpose. Administrative student records, for example, include limited information on parental background, so research relies on "free or reduced lunch" as a flawed measure of parental background. To my knowledge, no existing study compares "free or reduced lunch" with more established survey-based measures. Indeed, a simple regression based on data from the ECLS-K reveals that free lunch status is a highly significant predictor of household income but explains only 22.7% of the variation.
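To make the gap between "significant predictor" and "explained variation" concrete, the sketch below computes the R-squared of a regression of income on a single 0/1 indicator. It is an illustration under stated assumptions: the incomes and free-lunch flags are invented and are not ECLS-K data, and the function name is hypothetical. With one binary predictor, the fitted values of an ordinary least squares regression are simply the two group means.

```python
from statistics import mean


def r_squared_binary(y, x):
    """R^2 from an OLS regression of y on a single 0/1 indicator x.

    With one binary predictor, the fitted value for each observation
    is the mean of y within that observation's group.
    """
    group_means = {
        g: mean([v for v, flag in zip(y, x) if flag == g]) for g in set(x)
    }
    grand_mean = mean(y)
    ss_total = sum((v - grand_mean) ** 2 for v in y)
    ss_resid = sum((v - group_means[flag]) ** 2 for v, flag in zip(y, x))
    return 1 - ss_resid / ss_total


# Made-up household incomes (in $1,000s) and free-lunch flags; NOT ECLS-K data.
income = [18, 25, 40, 60, 30, 55, 80, 120]
free_lunch = [1, 1, 1, 1, 0, 0, 0, 0]

r2 = r_squared_binary(income, free_lunch)
print(round(r2, 3))
```

In this toy data the two groups differ clearly in mean income, yet incomes overlap across groups, so most of the variation remains within groups and is left unexplained by the indicator.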
Some of my own work with Merlin Schaeffer uses data from 7.7 million time- and geo-coded 311 service requests to track when and where New Yorkers complain about their neighbors making noise, drinking in public, or blocking driveways (Legewie and Schaeffer 2015). Complaints about neighbors filed through the 311 system "indicate tensions and conflicts that are not resolved in a neighborly way by knocking on someone’s door" (Legewie and Schaeffer 2015, p. 16). As such, 311 complaint calls about neighbors potentially capture an important aspect of everyday neighborhood conflicts that has largely escaped quantitative research. But complaints about neighbors might also reflect the tendency of residents to use the 311 system, and not all calls can be coded unambiguously. Complaints about loud residential noise might be a clear example of conflict, but other complaints are not. Our own study carefully codes over 1,300 call types, uses several alternative coding schemes that are more or less restrictive, and adjusts for differences in the reporting rate by conditioning on the number of 311 service requests that are unrelated to conflict between neighbors. The adjusted measure is related to a number of neighborhood characteristics that have played an important role in previous research. For example, the adjusted number of complaints about neighbors is higher in areas with fewer homeowners and many residents who recently moved into the neighborhood. Residential instability seems to undermine friendly relations between neighbors so that residents are more likely to call 311 instead of knocking on someone’s door. Such associations with well-established predictors from the neighborhood literature support the validity of our measure. Nonetheless, they do not replace a careful comparison with established data sources. O'Brien et al. (2015), for example, develop a measure of physical disorder based on 311 data from Boston.
Physical disorder generally refers to concrete signs of decay or negligence, such as litter, graffiti, and abandoned cars. 311 service requests capture these instances of physical disorder, but the challenges of ambiguous service requests and differences in reporting rates across neighborhoods remain. O'Brien et al. (2015) develop a systematic approach. They start with a content and factor analysis to group different request types, continue with a comparison with neighborhood audits to assess biases and ensure the validity of the measure, and finally evaluate measurement reliability based on comparisons for different spatial and temporal scales. Importantly, their approach builds on a conceptual understanding of reporting behavior that incorporates the civic response rate and other aspects. The future use of large-scale administrative or big data in social science research relies on such a scientific approach to measurement that helps us to overcome limitations and ensure measurement quality.
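The concern about differences in reporting rates, raised above for both the New York and the Boston data, can be illustrated with a deliberately simple adjustment. The sketch below expresses neighbor complaints relative to the volume of unrelated 311 requests in the same area. This ratio is a crude stand-in chosen for illustration, not the conditioning used in Legewie and Schaeffer (2015) or the approach of O'Brien et al. (2015); the tract names, the counts, and the function name are hypothetical.

```python
# Hypothetical tract-level counts; every number is invented for illustration.
tracts = {
    "tract_1": {"neighbor_complaints": 120, "other_requests": 4000},
    "tract_2": {"neighbor_complaints": 120, "other_requests": 1000},
    "tract_3": {"neighbor_complaints": 30, "other_requests": 200},
}


def complaints_per_1000_other(counts):
    """Neighbor complaints per 1,000 unrelated 311 requests in each area."""
    return {
        area: 1000 * c["neighbor_complaints"] / c["other_requests"]
        for area, c in counts.items()
        if c["other_requests"] > 0  # skip areas with no baseline 311 use
    }


adjusted = complaints_per_1000_other(tracts)

# Rank areas from highest to lowest adjusted complaint rate.
ranking = sorted(adjusted, key=adjusted.get, reverse=True)
print(adjusted, ranking)
```

In the invented data, tract_1 and tract_2 have identical raw complaint counts but very different adjusted rates, and tract_3 has the fewest raw complaints yet the highest adjusted rate. This is why raw counts alone can mislead when the propensity to use 311 varies across neighborhoods.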
Without doubt, the rise of big data opens major opportunities for social science research, including the measurement of neighborhood characteristics on an unprecedented scale. However, it is important to discuss both its advantages and its limitations. The collection of big data generally does not follow a carefully designed protocol, which undermines the principles of social science research that call for rigorous, transparent, and reproducible data collection procedures based on measurement tools that are designed to capture the underlying theoretical concept. Few acknowledge these limitations. As researchers increasingly adopt big data, it is important that we push not only for access, but also for transparency about the data-generating process, ensure scientific standards for the measurement of key sociological concepts, and in the end acknowledge remaining limitations.
References
Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915):721–23.
Legewie, Joscha and Merlin Schaeffer. 2015. “Contested Boundaries: Explaining Where Ethno-Racial Diversity Provokes Neighborhood Conflict.” Working paper, New York University.
Legewie, Joscha. 2015. “Racial Profiling in Stop-and-Frisk Operations: How Local Events Trigger Periods of Increased Discrimination.” Working paper, New York University.
Legewie, Joscha. 2015. “Disruptive Change: Peer Effects and the Social Adjustment Process of Mobile Students.” Working paper, New York University.
Mayer-Schönberger, Viktor and Kenneth Cukier. 2014. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston: Eamon Dolan/Mariner Books.
O’Brien, Daniel Tumminelli, Robert J. Sampson, and Christopher Winship. 2015. “Ecometrics in the Age of Big Data: Measuring and Assessing ‘Broken Windows’ Using Large-Scale Administrative Records.” Sociological Methodology (forthcoming).