Privacy Safetyism: finding balance with innovation and transparency

Protecting privacy is a profoundly inoffensive principle. Perhaps for this reason, lawmakers across the political spectrum and around the world have embraced new regulations on data that purportedly protect privacy. Yet abstract concerns about privacy – those that do not mitigate a significant risk of physical harm, economic injury, or other material damage – increasingly motivate regulatory and legislative actions that inhibit innovation and research. In this way, excessive regulation on the grounds of protecting privacy is a kind of safetyism – elevating the safety or comfort of individuals out of proportion to actual risk and without regard to competing priorities.

From fining researchers for publishing public Tweets to intentionally damaging the accuracy of Census data, a misguided conception of privacy risks – privacy safetyism – threatens innovation from genomics to public health. I describe two manifestations of this problem: first, the neutering of the US Census by measures taken ostensibly to protect privacy and, second, the stifling of research and innovation created, in Europe, by an overzealous, albeit well-intentioned, behemoth privacy law and, in the US, by a patchwork of data privacy regulations across states and across different federal agencies.

US Census abandons accuracy for phantom privacy protection

Title 13 of the U.S. Code prohibits the Census Bureau from publishing any data that enables any particular establishment or individual to be identified. Though the law has not changed, the Census Bureau has reasoned that technological advancement (e.g., greater computing power) has made it easier to link datasets together and to make inferences about individuals from analyzing a large number of summary statistics. This concern resulted in the use of differential privacy procedures in the 2020 Census release – in essence, making it harder to learn about any individual by deliberately adding inaccuracies to the data, albeit in a systematic way where the errors cancel out at higher levels of aggregation. This decision prompted backlash from various stakeholders: state governments (e.g., the state of Alabama filed a lawsuit, supported by 16 other states, to stop these changes), local governments (who gain little from data that are accurate only at high levels of aggregation), and social scientists and decision makers concerned that the changes will introduce disproportionate errors that adversely impact redistricting, policy analysis, and research on topics ranging from COVID and health disparities to county-level migration. Privacy concerns continue to gnaw at the Census Bureau, which plans to scale back the release of granular data in the American Community Survey (ACS) and Current Population Survey (CPS), harming the work of researchers, businesses, and planners. For example, the proposed rounding of wage data in the CPS would break the Atlanta Federal Reserve’s Wage Tracker tool.[1]
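
The aggregation property that makes this tradeoff defensible is easy to see in a toy sketch. The snippet below is a minimal illustration, not the Census Bureau’s actual algorithm; the block counts, noise scale, and epsilon are all made up. It adds zero-mean Laplace noise to hypothetical block-level population counts and shows that while each block’s count is visibly distorted, the errors largely cancel in the aggregate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block-level population counts (not real Census data).
true_block_counts = rng.integers(50, 500, size=1000)

# Differential-privacy-style noise in sketch form: add zero-mean Laplace
# noise to each block count. Scale = sensitivity / epsilon; changing one
# person changes a count by at most 1, so sensitivity is 1.
epsilon = 0.5
noisy_block_counts = true_block_counts + rng.laplace(scale=1 / epsilon, size=1000)

# Per-block relative error is noticeable...
block_rel_err = np.mean(
    np.abs(noisy_block_counts - true_block_counts) / true_block_counts
)

# ...but the zero-mean errors largely cancel when blocks are aggregated,
# so the total (think: a county or state) stays close to the truth.
total_rel_err = (
    abs(noisy_block_counts.sum() - true_block_counts.sum())
    / true_block_counts.sum()
)

print(f"mean per-block relative error: {block_rel_err:.4f}")
print(f"aggregate relative error:      {total_rel_err:.6f}")
```

This is why the Census can claim errors "cancel out at higher levels of aggregation" – and why the critics quoted above note that the cost falls hardest on users of small-area data.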

The Census Bureau decided to implement differential privacy in 2018, motivated by the results from a 2016 Census “reconstruction attack” in which Census researchers systematically analyzed the published, block-level summary statistics from the 2010 Census to determine the possible combinations of individual attributes consistent with the summary tables about age, sex, ethnicity, and race of residents. Then, they checked whether these inferred individual-level attributes (the reconstructed dataset) matched records in the Census’s (private, not released for 72 years) individual response data. About half of the reconstructed records matched an actual record. Finally, to check the accuracy of these matches, the Census compared the “putative matches” from the database reconstruction with an individual-level database obtained from a commercial vendor (a credit agency). They found that about 38% of the putative matches had a corresponding record in the credit agency’s data.[2] Census Bureau Chief Scientist John Abowd characterized these kinds of database reconstructions as the “death knell for public-use detailed tabulations and microdatasets as they have been traditionally prepared.”
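
The logic of a reconstruction attack can be made concrete with a toy example. The sketch below is purely illustrative – the block size, published statistics, and age range are invented, and the real attack solved for consistency across many interlocking tables rather than brute-forcing – but it shows how published summaries constrain the set of possible individual-level records:

```python
from itertools import product

# A toy "block" with three residents. Published summary statistics of the
# kind released in Census tables -- all numbers here are hypothetical:
TOTAL = 3       # published resident count
MEAN_AGE = 30   # published mean age
NUM_MALE = 2    # published count of males

# Brute-force reconstruction: enumerate every combination of individual
# (age, sex) records consistent with the published summaries.
ages = range(18, 46)   # assume adults only, to keep the search small
sexes = ("M", "F")

consistent = [
    combo
    for combo in product(product(ages, sexes), repeat=TOTAL)
    if sum(a for a, _ in combo) / TOTAL == MEAN_AGE
    and sum(1 for _, s in combo if s == "M") == NUM_MALE
]

print(f"{len(consistent)} candidate individual-level datasets match the summaries")
```

With only two published statistics, many candidate datasets remain; each additional cross-tabulation (race by age by sex, say) eliminates candidates, which is how the Census researchers could reconstruct plausible records from the 2010 tables.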

This concern is contested, however. According to an academic study, the Census’s “reidentification attack” performed about as well as a random simulation; sampling from the national population’s age-sex distribution and assigning each person the plurality race and ethnicity of their Census block produced almost as many matches. Steven Ruggles, a professor at the University of Minnesota, said “The whole problem is made up… [Abowd] has gotten a whole lot of people worked up about this dire threat that doesn’t exist.” Whatever the theoretical risk that a person could back out the number of, say, 28-year-old White, non-Hispanic males in a Census block by computing the logically possible individual-level datasets consistent with Census summary statistics, the practical risk to human beings is small. After all, the Census checked the validity of its test by comparing its results with de-identified, individual-level data obtained from a commercial credit agency. Why hack the Census for a good guess about an individual when you could get all that and more from a private data vendor? The cost of these changes to data utility is real, while the gain in safety is contrived.

The Census Bureau claims that these accuracy-diminishing changes are mandated by Title 13 to mitigate data leaks about individuals from summary statistics. Congress can help the Census course correct by clarifying its intentions: updating Title 13 to specify that while the Census should not disclose individuals’ responses until 72 years have passed (current law), aggregate data should not be substantially distorted to mitigate database reconstruction, and granular data products should continue to be released. Data accuracy is a paramount value that the Census should balance against privacy concerns.

Data privacy laws burden research and innovation

More broadly, data privacy concerns have given rise to a thicket of laws and regulations. In 2018, lawmakers in Europe passed a sweeping data privacy law (the GDPR), and several US states have since passed their own. Such regulations have real economic costs that disproportionately burden smaller companies. Attempts to quantify the negative economic impact reach staggering conclusions: the Information Technology and Innovation Foundation estimates that a GDPR-type regulation would cost the US economy about $122 billion annually – about $500 per American adult.[3] The vast majority of this cost – about $100 billion – is attributed to the dampening effect of privacy regulations on data access ($70 billion) and reduced advertisement effectiveness ($30 billion).

The GDPR’s restrictions are already taking a toll on scholarly research. Belgian authorities fined a local NGO 2,700 euros and an academic collaborator 1,200 euros for publishing the raw data (publicly shared Tweets) they used to study state involvement in circulating stories about a French scandal. That the raw data consisted of Tweets voluntarily shared by users to be viewable by anyone on the internet did not matter, nor did the fact that the data enabled a study of the spread of fake news, an important social topic, because “…public information does not fall outside the scope of the GDPR, even if these data are used with the best intentions and for journalistic or scientific purposes.”[4] The implication seems to be that the right to privacy requires that users who choose to post publicly on a social media website be protected from other users seeing that content – particularly when the content is “sensitive” in nature, such as political posts. This principle – that public information should be guarded as private – is strange indeed. People who choose to post publicly about politics on Twitter do not need to be protected from people seeing their posts, and sanctioning researchers for “showing their work” is antithetical to good science. Fear of such fines will chill access to data that could enable research and innovation.

On a more mundane level, demonstrating compliance with data privacy regulations – drafting lengthy data-handling policies that show how a study’s data collection satisfies the complex tests specified by law – will suck time and energy away from productive uses. The law establishes a few different legal rationales for research using personal data (e.g., public interest, scientific research, and legitimate interest), but demonstrating compliance is complex and onerous, particularly when different standards apply to public vs. private universities (presumably, a collaboration between researchers at each would require documenting compliance with both standards).[5] Better-resourced institutions might mitigate this problem by throwing money at it (hiring employees devoted to documenting compliance), but smaller academic institutions and NGOs may well give up on conducting empirical research that risks being sanctioned.

These constraints will not affect research equally. Humanities research is unlikely to be burdened. On the other hand, genomics research is very sensitive to constraints placed on how data can be stored and shared; the GDPR already constrains research on biobanks, in part because different countries have chosen to interpret its provisions differently.[6]

In the US, several existing federal laws around data privacy (e.g., HIPAA for health care, FERPA for education) suppress scholarly research involving data collected by federal agencies. Federal health care data, for example, omits data about substance use disorder diagnoses, undermining research into the effectiveness of policies and programs to ameliorate addiction.[7] In many ways, private, profit-seeking companies face fewer restrictions on using data for research than do university, government, or non-profit researchers.[8]

Numerous state laws have arisen to address how user data can be collected and shared, resulting in conflicting regulations around data collection, storage, sharing, and sale (a litigator’s dream). To provide consistent and steady guidance, Congress must preempt state laws in this area and establish a single set of rules governing data privacy. It is important that Congress not merely establish a floor of minimum rules – that would allow a patchwork of state regulations to accumulate on top; instead, the congressional rules should apply uniformly in all states. Rules for independent research on data collected by federal agencies are similarly inconsistent: some federal programs allow researchers to access data to perform research, others allow partial access (e.g., FERPA permits data disclosure for educational research, but not for other types of research), and still others (e.g., Title X and SNAP) allow no researcher access at all, prohibiting independent assessments of efficacy.[9] A comprehensive federal privacy statute should also address these inconsistencies by providing data access to researchers seeking to use deidentified public data for academic research or public health.

Privacy concerns untethered from the risk of real harm are justifying increasing restrictions on data. Left unchecked, privacy safetyism will increasingly threaten the data sharing that could enable new technological breakthroughs. Federal legislation is needed to unlock the full potential of data collected by federal agencies and to enable sharing of data collected by private actors for research and innovation.

[1] See illustration in John Robinson’s Tweet. H/T Yglesias, M. “Privacy concerns are breaking the Census.”

[2] Table 2 (PDF page 20) from Abowd’s supplemental declaration in the Alabama v Department of Commerce case.

[3] McQuinn, Alan and Castro, Daniel. 2019. “The Cost of an Unnecessarily Stringent Federal Data Privacy Law.”

[4] D’Hulst, Thibault and Reyns, Charlotte. 2022. “Re-use of Twitter data: Belgian DPA fines NGO for ‘fake news’ study.” Van Bael & Bellis.

[5] Quinn, Paul. “Research under the GDPR–a level playing field for public and private sector research?.” Life Sciences, Society and Policy 17, no. 1 (2021): 1-33.

[6] Peloquin, David, Michael DiMaio, Barbara Bierer, and Mark Barnes. “Disruptive and avoidable: GDPR challenges to secondary research uses of data.” European Journal of Human Genetics 28, no. 6 (2020): 697-705.

[7] Frakt, A.B. and Bagley, N., 2015. Protection or harm? Suppressing substance-use data. The New England Journal of Medicine, 372(20), pp.1879-1881.

[8] Schmit, C., Giannouchos, T., Ramezani, M., Zheng, Q., Morrisey, M.A. and Kum, H.C., 2021. US Privacy Laws Go Against Public Preferences and Impede Public Health and Research: Survey Study. Journal of Medical Internet Research, 23(7), p.e25266.

[9] Hulkower, R., Penn, M. and Schmit, C., 2020. Privacy and confidentiality of public health information. In Public Health Informatics and Information Systems (pp. 147-166). Springer, Cham. See, especially, Table 9.1.