Issues of Construct Validity and Reliability in Massive, Passive Data Collections
We are at the beginning of a data-driven transformation of our understanding of human society. Data are at the foundation of any science, and vastly more data than ever before are now being collected on human behavior. Cities will be at the center of the development of a new science of society for several reasons. Cities represent a majority and increasing share of humanity, and likely a supermajority of human transactions. For practical reasons, cities often reflect a coherent and curated concatenation of many data, and thus provide powerful laboratories for creating generalizable theories about human society. Elsewhere I have written about the intellectual opportunities of “big data” to understand human societies (e.g., Lazer et al 2009); here I will focus on some of the challenges that need to be surmounted in the emerging field of computational social science. In particular, I will focus on the “massive passive”: the ongoing passive collection of data, largely the digital refuse of our daily lives, e.g., our Internet surfing, our phone calls, and our use of Fastlane, credit cards, and the like.
The rhetoric (sometimes ominous, sometimes enthusiastic) of big data often suggests that everything is being recorded. This is dramatically wrong, of course: the data that are collected are but an infinitesimal fraction of the data that could conceivably be collected, or even of the much smaller amount of data that scientists have the imagination to conceive of collecting. Further, given a choice of petabytes of data to collect from humans, the petabytes that are being collected are not the ones that scientists would choose. What is left is a recycling task for scientists: how to repurpose existing data into scientifically usable form?
How formidable a task this is varies with (1) how well curated the data are, and (2) how well they align with scientifically useful constructs. Chetty et al (2010) offer a best case scenario in their use of IRS data. Income, as reported in IRS returns, is only the tiniest of jumps away from central constructs in economics. It is measured with a precision down to the dollar. The IRS has a power to compel compliance that researchers, in their less charitable days, would fantasize about. All US citizens are subject to the same IRS rules. The measurement is not subject to the well known vagaries of human memory or biases toward giving socially desirable answers. Yet even these data are imperfect. People sometimes fail to report all of their income. The data are truncated: income below certain thresholds is not reported. Criteria for what should be reported change over time. Many closely related constructs that ideally would be incorporated into any analyses involving income (e.g., gifts, inheritances, and, most importantly, wealth) are not included. How problematic these issues are needs to be interpreted within the context of a particular research question; it seems unlikely, for example, that they are material to the Chetty et al findings.
Other big data offer (substantially) bigger challenges. Twitter has emerged as what is likely the single most studied source of big data about human behavior. This reflects its availability, and its richness. Twitter data offer textual and image content, graph structure, time, and (sometimes) geographical information. However, Twitter data are rife with messiness. A large (but unknown) fraction of Twitter accounts are bots. Many legitimate accounts don’t represent individuals but organizations, and many that represent individuals are representing individuals within certain roles (e.g., consider the many celebrities and news reporters on Twitter). Conventions regarding Twitter use are rapidly evolving and are quite heterogeneous.
The contrast between IRS and Twitter data highlights three key issues around big data: the mapping of big data to relevant social constructs; the instability of big data constructs; and the challenges around the silo-ization of big data.
Mapping constructs: Behavioral trace data typically do not map neatly to important social science constructs. For example, what is a phone call? Is a phone call a meaningful social construct? In fact, it is a very heterogeneous construct, capturing everything from calls to the dry cleaner to check if clothing is ready, to one’s spouse to pick up the children, to a hotel to confirm a reservation. In short, call records do not map neatly to existing social constructs; they add together apples, oranges, and elephants into a single construct. It may be possible to build strong classification systems mapping these constructs to behavioral patterns, e.g., because friends behave in distinctive ways (Eagle et al 2009). However, it may also prove impossible to build such classification systems, in which case the question becomes: (how) can such a heterogeneous construct be used? Note, of course, that the social sciences are rife with heterogeneous constructs; e.g., a standard survey item in studying social networks asks whom the respondent discusses “important matters” with. As Bearman and Parigi (2004) point out, important matters are open to wide interpretation. However, while the heterogeneity issue may not be distinctive to big data, it may often be more severe there.
Construct instability: Constructs are often rife with ambiguity and controversy in the social sciences. Social science constructs are also often unstable, making longitudinal comparisons difficult. Consider, for example, the evolution of racial self-categorization on census forms: beginning in 2000, the census allowed individuals to identify with multiple racial groups (Allen and Turner 2001). Or consider the seemingly straightforward measurement of income across time: can a dollar in 1900 be compared to a dollar in 2000? Such comparisons are made, e.g., through adjustments for prices in those years, and yet the basket of goods available in those two years is dramatically different. To an individual with a life threatening infection, that dollar in 2000, which can buy antibiotics, is worth infinitely more than the dollar in 1900.
The typical solution with traditional data is to assume that these issues are small over short time scales. The issue of construct instability, however, is much more severe with many big data. A change in a socio-technical system can have dramatic and instant effects on the signals in the data collected. Further, the timing, cause, and likely consequences of any changes are often much more opaque than with traditional data collection, and the changes are often made without consideration of the implications for data quality. A comparison of the Census to Google is illustrative. The Census does on occasion change categories in its forms, resulting in discontinuities, e.g., with respect to racial categories, as mentioned. However, such changes are made with an eye on data quality (e.g., regarding the empirical relevance of those categories in contemporary US society), are tested and mapped to prior methods, and are only adopted after considering the losses from lack of continuity. Google Flu Trends (GFT), which is one of the leading examples of use of big data, offers a useful counterpoint. GFT is an effort by Google to “nowcast” flu prevalence based on searches (Ginsberg et al 2009). The essential intuition is that when people have the flu, they likely do flu related searches. The initial effort had many flaws (Lazer et al 2014), but most relevant here is that it seems likely that Google changed its search algorithm over a few years in ways that encouraged more health (and thus flu) related searches, resulting in dramatic overestimates of flu prevalence. Unsurprisingly, Google changed its search algorithm without evaluating the impact that doing so would have on GFT; the GFT effort, in turn, was built on the flawed assumption that the relationship between flu related searches and flu prevalence is stable.
Construct instability is not just the result of algorithmic change, but also of emergent and cultivated conventions (consider the use of hashtags on Twitter); regular attacks on the informational integrity of sociotechnical systems when there are incentives to do so (e.g., Ratkiewicz et al 2011); and even marketing campaigns and pricing strategies.
Construct instability is not an insurmountable issue, but it does point to the need to develop new approaches and methods that are robust to such abrupt changes. For example, with respect to flu detection, a near-constant recalibration of the underlying model, plus the fusion of multiple and algorithmically unrelated data streams related to flu, would likely offer an effective solution to the failures of GFT.
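One way to make the idea of recalibration plus fusion concrete is the minimal sketch below. It is an illustration under stated assumptions, not the method of GFT or of any cited study: the function names (fit_line, nowcast), the rolling window, the use of a simple one-predictor regression, the choice of the median as the fusion rule, and all of the numbers are hypothetical. Each stream is refit against the most recent ground truth, and the per-stream estimates are then combined so that a single stream that has shifted cannot dominate the result.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant signal: fall back to the mean of y
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def nowcast(streams, truth, window=8):
    """Estimate the not-yet-reported value of `truth` from several signals.

    streams: dict mapping a stream name to a list of signal readings that has
             one more entry than `truth` (the newest reading has no ground
             truth yet, because official figures arrive with a lag).
    truth:   list of past ground-truth values, oldest first.
    """
    estimates = []
    for signal in streams.values():
        # Recalibrate on the most recent `window` periods only, so that an
        # old mapping between signal and truth is not carried forward.
        xs = signal[-(window + 1):-1]
        ys = truth[-window:]
        a, b = fit_line(xs, ys)
        estimates.append(a + b * signal[-1])
    # Fuse the per-stream estimates; the median ignores one wild stream.
    return median(estimates)


# Invented illustrative data: three hypothetical streams tracking the same
# quantity, one of which jumps in the newest period (e.g. after a platform
# change) while the other two stay consistent with a true value of 18.
truth = [10, 12, 15, 20, 26, 30, 28, 22]
streams = {
    "searches": [2 * t for t in truth] + [36],
    "pharmacy_sales": [t + 5 for t in truth] + [23],
    "shifted_stream": [3 * t for t in truth] + [200],
}
estimate = nowcast(streams, truth)
print(round(estimate, 2))
```

In this toy setup the shifted stream alone would imply a value several times too high, while the fused estimate stays with the two streams that agree. Recalibration handles gradual changes in how a signal relates to the outcome; fusion handles abrupt changes in any one system.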
Too many big data: One of the largest challenges for the scientific use of big data is the proliferation of functionally equivalent socio-technical systems, creating multiple big data silos that do not talk to each other. This creates the possibility of all types of artifacts that might affect analysis. Telco data (and equivalent) are exemplary. Beginning with Onnela et al (2007), there has been a surge of research using Call Detail Records (CDRs) from telcos, which capture source and destination phone numbers, timing of call, and rough location of the caller and recipient. However, generally these analyses only capture calls involving subscribers with that particular telco; e.g., if individuals A and B are subscribers with another company, calls between A and B will be missing from that dataset. This potentially creates peculiar patterns of missingness that might affect inferences being drawn from the data. The challenge is particularly acute when studying networks: if a carrier only has a 20% market share, data are fully missing for 64% of the dyads, and at least partially missing for about 99% of the triads. Further, the missingness would likely be systematic in some fashion, e.g., because a carrier might market more to some demographics or regions than others, or because individuals might have incentives to have the same plans as family and friends.
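The dyad and triad figures can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated in the text: each person subscribes to the carrier independently with probability equal to its market share, and a triad counts as complete only when all three members are subscribers (the reading that yields the 99% figure). The function name missing_shares is hypothetical.

```python
def missing_shares(market_share):
    """Shares of dyads and triads affected when only one carrier's CDRs are used.

    Assumes independent subscription with probability `market_share`, which is
    an illustrative simplification rather than a claim about real markets.
    """
    p = market_share
    return {
        # A dyad is fully missing when neither party is a subscriber.
        "dyads_fully_missing": (1 - p) ** 2,
        # A triad is treated as complete only if all three are subscribers.
        "triads_partially_missing": 1 - p ** 3,
    }


shares = missing_shares(0.20)
print(f"dyads fully missing:      {shares['dyads_fully_missing']:.1%}")       # 64.0%
print(f"triads partially missing: {shares['triads_partially_missing']:.1%}")  # 99.2%
```

The independence assumption is the optimistic case: where subscription clusters by demographics, region, or family, the observed network is both incomplete and unrepresentative.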
This understates the issue, however. Generally, the social science construct of interest will not be “phone call” but, at a minimum, something along the lines of “communicate with” or “talk with” (and often something more specific, such as talking about important matters, getting emotional support, or finding out about potential jobs). There are many ways to talk with people who are at a physical distance that are functionally equivalent, and which should be lumped together. Is there a reason, for example, to distinguish between a call made on a cell phone and a call made using Skype from that same cell phone?
These issues are endemic in many big data: do we care, for example, whether someone reads a New York Times article on a tablet, on a laptop, or in print? Does it matter whether someone communicates via e-mail rather than Facebook messaging?
IRS data offer a useful contrast: the IRS has a monopoly on tax collection (at the federal level), and the data offer, in principle, a census of individuals who paid taxes. However, in many domains there has been a radical fracturing of systems connecting and getting information to people. This might be laudable for many reasons, but scientifically the Ma Bell of the 1960s was a much better instrument for data collection than the heavily fractured communication landscape of the 21st century.
Big data will transform how we understand society. It is possible to look at society with a temporal, spatial, and interactive granularity that just a few years ago would have seemed the stuff of science fiction. However, the infinite gap between “big” and “all” presents distinctive challenges to researchers trying to understand how the world works from these amazing but peculiar samples of human behaviors. Development of valid, reliable constructs from behavioral trace data is a substantial but not insurmountable challenge, and there are several ways forward. It will be necessary to study the evolution of the use of emerging media for communication. Some of this will involve use of “small” data paradigms, such as using high quality samples linked to big data to allow inferences to the entire corpus (O’Brien et al 2015; Margolin et al 2013). To the extent possible, building bridges across data silos will be quite powerful. Device based data collection might offer particular opportunities, since many sociotechnical systems are mediated through a few devices (increasingly, smartphones). Fusion of multiple data sources measuring similar constructs will provide stability in the face of algorithmic and administrative fickleness. These are old issues in new forms, and addressing these issues in developing useful constructs from digital trace data is the necessary spadework for the social sciences of the 21st century.
Allen, J. P., & Turner, E. (2001). Bridging 1990 and 2000 census race data: Fractional assignment of multiracial populations. Population Research and Policy Review, 20(6), 513-533.
Bearman, P., & Parigi, P. (2004). Cloning headless frogs and other important matters: Conversation topics and network structure. Social Forces, 83(2), 535-557.
Chetty, R., Friedman, J. N., Hilger, N., Saez, E., Schanzenbach, D. W., & Yagan, D. (2010). How does your kindergarten classroom affect your earnings? Evidence from Project STAR (No. w16381). National Bureau of Economic Research.
Eagle, N., Pentland, A. S., & Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences, 106(36), 15274-15278.
Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Flammini, A., & Menczer, F. (2011, July). Detecting and tracking political abuse in social media. In ICWSM.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012-1014.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343(6176), 1203-1205.
Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., ... & Van Alstyne, M. (2009). Computational social science. Science, 323(5915), 721-723.
Margolin, D., Lin, Y. R., Brewer, D., & Lazer, D. (2013, June). Matching data and interpretation: Towards a rosetta stone joining behavioral and survey data. In Seventh International AAAI Conference on Weblogs and Social Media.
O’Brien, D., Sampson, R., & Winship, C. (2015). Ecometrics in the age of big data: Measuring and assessing “broken windows” using large-scale administrative records. Sociological Methodology, 45 (online first version). DOI: 10.1177/0081175015576601.
Onnela, J. P., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., ... & Barabási, A. L. (2007). Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences, 104(18), 7332-7336.