7 Things You Should Know About Data De-Identification and Anonymization


As the types and amounts of personal data increase, users and institutions need to strengthen the ways they protect the sensitive information they collect and use.



Elizabeth Stanton was a faculty member and public health researcher at a state university who had long studied the demographic factors that influence health conditions and outcomes, particularly among lower-income individuals in rural settings. Her extensive datasets, covering nearly 20 years of study, included sensitive information on many thousands of research subjects, information that might be used prejudicially in hiring and other economic matters, as well as in healthcare decisions and community contexts. She had long used a consistent set of techniques to modify her data in ways that would protect the identities of those whose data were included in her research. Stanton typically used datasets for several years in longitudinal research, and for much of her career, she shared her datasets with other researchers, both at her institution and at other entities, including colleges and universities as well as state and national public health organizations.

Over the years she began to hear stories—which grew louder, more frequent, and more concerning—about instances in which data subjects' identities had been exposed. The thread that emerged was that bad actors were becoming increasingly effective at scrutinizing and combining datasets in ways that allowed them to associate individual persons with facts about them that should have been kept private. Then one day she heard about a data breach that revealed a trove of public health data, some of which had been de-identified using the same procedures she employed, and the data had been used to compromise health information for several thousand subjects.

Stanton reached out to faculty and graduate students in the mathematics and statistics departments to understand newer and emerging techniques to either de-identify her data or, when appropriate, anonymize the data. They also helped her better understand the ways in which significant increases in processing power, along with the growing sophistication and reach of artificial intelligence tools, meant that previous methods to protect data were quickly becoming inadequate. Newer approaches were vital for safeguarding the identities of research subjects and, increasingly, to meet the grant requirements of funding agencies. Some additional evidence also persuaded her to alter her research methodology to collect less data from subjects, which constrained some of the downstream uses of her data but, in the end, struck her as a reasonable trade-off for the added security from simply not gathering certain data elements.

1. What Is It?

Data de-identification and anonymization attempt to break the link between data values and data subjects. In the information economy, data are the coin of the realm, and increasing amounts and types of data are being collected by the technology systems and devices that people use every day. Many forms of data include sensitive information about individuals, and data stewards use a variety of techniques to safeguard data, shielding people from a long list of harms they could suffer when personal data are associated with particular individuals. Medical research is an area of obvious concern for personally identifiable information (PII), but privacy risks apply to numerous other kinds of information, including educational, financial, behavioral, location, biometric, and genetic data.

Some kinds of information are sensitive on their own, such as a person's full name, Social Security number, medical records, or credit card information. In other cases, discrete data values that cannot alone identify an individual—such as zip code, date of birth, or ethnicity—can be combined with other data to triangulate and pinpoint a data subject. An important principle of any approach to data management is minimization, which aims to reduce the data elements that are collected, used, or combined with other data, thereby limiting the risks to data subjects. For data that are collected, de-identification and anonymization reduce the possibility that data can be re-identified and re-associated with individuals.
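To make the triangulation risk concrete, the following minimal Python sketch joins a "de-identified" table to a public roster on zip code and date of birth. Every record, field name, and function name here is hypothetical and invented for illustration; real linkage attempts typically draw on far larger and messier sources.

```python
from collections import defaultdict

# Hypothetical "de-identified" research records: names removed, quasi-identifiers kept.
health_records = [
    {"zip": "87501", "dob": "1980-03-14", "diagnosis": "diabetes"},
    {"zip": "87501", "dob": "1975-11-02", "diagnosis": "asthma"},
    {"zip": "87544", "dob": "1980-03-14", "diagnosis": "hypertension"},
]

# Hypothetical public roster (for example, a membership list) that includes names.
public_roster = [
    {"name": "A. Example", "zip": "87501", "dob": "1980-03-14"},
    {"name": "B. Sample", "zip": "87544", "dob": "1980-03-14"},
    {"name": "C. Person", "zip": "87544", "dob": "1980-03-14"},
]

QUASI_IDENTIFIERS = ("zip", "dob")


def link_records(records, roster, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where exactly one roster entry matches."""
    by_key = defaultdict(list)
    for person in roster:
        by_key[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in records:
        candidates = by_key.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link_records(health_records, public_roster)
```

In this toy data, only the first health record is re-identified: the second has no roster match, and the third matches two people, so it stays ambiguous. Collecting fewer quasi-identifiers, or coarser ones, makes unique matches less likely.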

The terms de-identification and anonymization can be difficult to define because they are often used interchangeably and because various laws and regulations propose differing definitions. In broad terms, de-identification is the process of deleting, altering, or limiting certain elements of a dataset to remove the ability to identify a data subject. In some cases, however, de-identification might be done in a reversible manner so that the data can be re-associated through something like a codebook. The term de-identification has compliance implications in certain regulatory contexts, particularly for healthcare regulations such as HIPAA. Anonymization refers to a process that permanently breaks the link between data values and data subjects, though many would contend that such efforts are never fully irreversible.
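The codebook idea can be sketched in a few lines of Python. This is an illustrative sketch under assumed names (`pseudonymize`, `reidentify`, and the `subject_id` field are all hypothetical), not a compliance-ready procedure: it replaces a direct identifier with a random code and keeps the mapping in a separate codebook.

```python
import secrets


def pseudonymize(records, id_field="subject_id"):
    """Replace direct identifiers with random codes; return new records and the codebook."""
    codebook = {}  # original identifier -> random code; store separately and securely
    coded = []
    for record in records:
        original = record[id_field]
        if original not in codebook:
            codebook[original] = secrets.token_hex(8)
        coded.append({**record, id_field: codebook[original]})
    return coded, codebook


def reidentify(coded_records, codebook, id_field="subject_id"):
    """Reverse the mapping; only possible for someone who holds the codebook."""
    reverse = {code: original for original, code in codebook.items()}
    return [{**r, id_field: reverse[r[id_field]]} for r in coded_records]


# Hypothetical longitudinal records keyed by an email address.
records = [
    {"subject_id": "jane.doe@example.edu", "visit": 1, "systolic_bp": 128},
    {"subject_id": "jane.doe@example.edu", "visit": 2, "systolic_bp": 131},
    {"subject_id": "sam.roe@example.edu", "visit": 1, "systolic_bp": 117},
]

coded, codebook = pseudonymize(records)
restored = reidentify(coded, codebook)
```

The same subject receives the same code across visits, so longitudinal analysis still works on the coded data. Destroying the codebook removes this particular path back to the subject, but the remaining fields may still permit re-identification by other means.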

2. How Does It Work?

Many methods and techniques can be used to anonymize or de-identify data, and often multiple methods are used in combination. As with anonymization and de-identification in general, however, many of the techniques do not have universally agreed definitions that describe clearly how they work. This ambiguity in what certain methods actually do complicates the task of trying to ensure that a dataset can be used in specific ways by particular systems without posing a risk to the individuals whose data are contained in the dataset. Many approaches are relatively simplistic, such as scrambling (randomly reordering alphanumeric characters), shuffling (randomly rearranging values within a specific column), and substitution (replacing values within the dataset with corresponding entries from a lookup table).
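The three simple techniques named above might look like the following in Python. Because these methods lack universally agreed definitions, this sketch follows only the parenthetical descriptions given here; the function names, sample rows, and lookup table are hypothetical.

```python
import random


def scramble(value, rng):
    """Scrambling: randomly reorder the characters within a single value."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def shuffle_column(rows, column, rng):
    """Shuffling: randomly rearrange one column's values across the rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def substitute(rows, column, lookup):
    """Substitution: replace each value with its entry from a lookup table."""
    return [{**row, column: lookup[row[column]]} for row in rows]


rng = random.Random(0)  # fixed seed so the demo is repeatable; not for real use

rows = [
    {"name": "Alice Nguyen", "student_id": "A1234567", "gpa": 3.4},
    {"name": "Bilal Khan", "student_id": "B7654321", "gpa": 2.9},
    {"name": "Carmen Ortiz", "student_id": "C2468135", "gpa": 3.8},
]
lookup = {
    "Alice Nguyen": "Subject 001",
    "Bilal Khan": "Subject 002",
    "Carmen Ortiz": "Subject 003",
}

scrambled_id = scramble(rows[0]["student_id"], rng)
shuffled = shuffle_column(rows, "gpa", rng)
substituted = substitute(rows, "name", lookup)
```

Each technique leaves information behind: scrambling keeps the same set of characters, shuffling keeps the column's overall distribution, and substitution is reversible by anyone who obtains the lookup table.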

Newer methods employ mathematical techniques that are more reliable and provide stronger assurances of safety. These methods include k-anonymity, which ensures that each individual in a dataset is indistinguishable among a group of at least "k" individuals, thereby minimizing the risk of re-identification; differential privacy, which injects calibrated noise into the data, providing a mathematical guarantee of privacy protection; and synthetic data, which involves generating artificial data that closely approximate the statistical properties of the original dataset while providing enhanced privacy protection.
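As a rough illustration of two of these methods, the Python sketch below measures k for a small table, generalizes ages into bands to raise k, and answers a counting query with Laplace noise, one common way to apply differential privacy to query results rather than to the stored data. The function names, sample rows, and the epsilon value are assumptions made for the example. The floating-point noise sampling is for illustration only; production systems generally rely on vetted libraries. Synthetic data generation is not shown.

```python
import random
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalize_age(row, width=10):
    """Coarsen an exact age into a band such as '30-39' so that groups grow larger."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(rows, predicate, epsilon, rng):
    """Noisy count: a count changes by at most 1 per person, so the scale is 1/epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records; "zip" and "age" are treated as quasi-identifiers.
rows = [
    {"zip": "87501", "age": 34, "condition": "asthma"},
    {"zip": "87501", "age": 37, "condition": "diabetes"},
    {"zip": "87501", "age": 31, "condition": "asthma"},
    {"zip": "87501", "age": 52, "condition": "asthma"},
    {"zip": "87501", "age": 58, "condition": "hypertension"},
]

k_before = k_anonymity(rows, ["zip", "age"])  # every age is unique
generalized = [generalize_age(row) for row in rows]
k_after = k_anonymity(generalized, ["zip", "age"])  # ages now fall into two bands

noisy_asthma_count = dp_count(
    rows, lambda row: row["condition"] == "asthma", epsilon=1.0, rng=random.Random(42)
)
```

Smaller values of epsilon add more noise and give stronger protection at the cost of accuracy. Larger values of k require coarser generalization, which likewise trades analytical detail for privacy.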

Most anonymization and de-identification techniques are performed on structured datasets, and multiple techniques are often used in combination with other data-protection methods, such as encryption, to protect the privacy of data subjects. Some anonymization and de-identification techniques can be applied during data collection. For example, certain data could be left on a user's device, with only summarized or cohort-level data sent to a dataset. Other methods include modifying a research study design to incorporate anonymity in the study itself, before data are recorded. These types of techniques are extremely effective at breaking the link between data values and data subjects because they often limit the recording of identifiable data in the first place.
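The on-device approach can be sketched as follows in Python. This is a hypothetical design, not a description of any particular system: the function names, the step-count data, and the minimum cohort size are all assumptions. The suppression of small cohorts is an added safeguard that the text above does not mention.

```python
from collections import defaultdict
from statistics import mean

MIN_COHORT_SIZE = 3  # assumed threshold; a real project would set this by policy


def summarize_on_device(raw_readings, cohort):
    """Runs on the user's device: raw readings stay local, only a summary leaves."""
    return {"cohort": cohort, "mean_steps": mean(raw_readings)}


def aggregate_cohorts(summaries, min_size=MIN_COHORT_SIZE):
    """Runs on the server: report a cohort average only if enough devices contributed."""
    by_cohort = defaultdict(list)
    for summary in summaries:
        by_cohort[summary["cohort"]].append(summary["mean_steps"])
    return {
        cohort: round(mean(values), 1)
        for cohort, values in by_cohort.items()
        if len(values) >= min_size
    }


# Hypothetical daily step counts that never leave each device in raw form.
summaries = [
    summarize_on_device([4000, 6000], cohort="18-24"),
    summarize_on_device([7000, 9000], cohort="18-24"),
    summarize_on_device([5000, 5000], cohort="18-24"),
    summarize_on_device([3000], cohort="65+"),
]

report = aggregate_cohorts(summaries)
```

Here the "65+" cohort is dropped from the report because a single contributor would make the cohort average identical to that person's own value.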

3. Who's Doing It?

Colleges and universities are subject to a range of laws, regulations, and contracts that govern the protection of certain kinds of data and, in some cases, require institutions to report breaches or other unauthorized disclosures. The jurisdictions for these rules can be international, national, state, or institutional, or the rules can be part of research grants or otherwise specific to particular projects. Units on campus including legal counsel, IT, cybersecurity, privacy, and research have a stake in data protection, and in some cases, a statistical consulting group might be able to furnish expertise in the work of anonymization and de-identification. It's an evolving area, and given the variability, responsibility is often distributed across departments and individuals. The bottom line is that the applicable regulatory framework is often what determines who on a campus is responsible for doing the work of de-identification or anonymization. Individual researchers might possess or acquire these skills, or, for a large project, someone might be hired to provide guidance or actually do the work necessary to comply with relevant data protection rules.

Colleges and universities often work with vendors in ways that require the sharing of data. These companies have their own practices surrounding data, and sometimes a vendor is the source of a data breach. Whenever an institution is negotiating a vendor contract, a crucial part of that work should be to understand how the vendor will handle data and where the lines of responsibility fall.

4. Why Is It Significant?

The need to protect the privacy of individuals has been a concern for as long as anyone has been collecting sensitive data about people. The digital age creates vastly more opportunities to generate and amass data, including new kinds of data, in seemingly limitless amounts. Various tools usher in free or inexpensive ways to store, transmit, and process data, creating new ways to triangulate data elements and deduce or guess identities. Cybersecurity measures can help prevent unauthorized access to data, but even when users have legitimate access to certain kinds or combinations of data, individuals' identities and confidential information about them could be exposed. Entities that collect, store, or process data have an obligation to protect people and institutions that could be harmed if they were identified in the data. Harms can be physical, psychological, economic, or reputational, and they can be rooted in or exacerbated by biases related to gender or gender identity, race/ethnicity, age, or any other protected class. Identity theft and financial frauds are two common kinds of risks to data subjects. In regions with very small populations, knowing just a zip code and a diagnosis can be enough to identify a data subject's private health information.

A growing number of laws and regulations pertain to data privacy. The Family Educational Rights and Privacy Act (FERPA) covers the use and protection of educational data. The Health Insurance Portability and Accountability Act (HIPAA) has long governed the use and protection of health-related data, and the law carries a specific meaning and requirements for de-identification. The European General Data Protection Regulation (GDPR) and the Chinese Personal Information Protection Law (PIPL) represent two international regulations that have implications for data stewards around the world. Compliance with all applicable privacy rules can be expensive, and some laws and regulations explicitly exclude de-identified data from those requirements—if data can be anonymized and still be valuable, there will be significant cost savings.

5. What Are the Downsides?

Failing to properly de-identify or anonymize data carries significant risks, both for the data steward and for those whose information is included in datasets. But thoroughly de-identifying or anonymizing data can be a highly complex and expensive undertaking. Determining conclusively what a direct or indirect identifier might be can be difficult, given that greater access to datasets and increased computational capacity mean that the answer might lie in what others know rather than in discrete characteristics of the data. As the landscape of data and information systems changes, those charged with managing data must continually expand their knowledge of how data can be misused. Such misuse compromises the trust of those who furnish data, and if that trust is broken, individuals and entities will be less likely to share data in the future, compromising research and other initiatives that rely on data. The difficulty of completely anonymizing data might create a false sense of security or privacy if the risk of re-identification is misunderstood or misrepresented within the consent process. Meanwhile, the process of anonymizing or de-identifying data inherently reduces the value of the data because of the reduction in the number of data elements that can be analyzed. This diminished value can affect current research, and it can render the data less useful for future research.

6. Where Is It Going?

New capabilities to access, process, and manipulate data will keep data stewards on their toes. Among these novel approaches is artificial intelligence, which promises to bring new sorts of computing power to datasets and to find uses for data that are both beneficial and potentially harmful. The arms race between those tasked with protecting data and those seeking to exploit it will continue, and more regulations are certain to come. The mathematical complexity of de-identification and anonymization techniques will increase, and newer methods such as homomorphic encryption—which allows data to be processed and analyzed without decrypting it—might become standard practice. The focus might shift away from traditional de-identification approaches to privacy-preserving or privacy-enhancing data analysis practices.
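To give a feel for what computing on encrypted data means, here is a toy Python sketch of the Paillier scheme, which is only additively (partially) homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny primes, function names, and sample values are assumptions chosen for readability, and the code is deliberately insecure. Fully homomorphic schemes support richer computation and are considerably more involved. The sketch needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets


def generate_keys(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (insecure by design)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, message):
    """Encrypt an integer 0 <= message < n using a fresh random blinding value."""
    n, g = public_key
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public_key, private_key, ciphertext):
    """Recover the plaintext integer from a ciphertext."""
    n, _ = public_key
    lam, mu = private_key
    return ((pow(ciphertext, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Combine two ciphertexts so that the result decrypts to the sum of the plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


public_key, private_key = generate_keys()

# Two hypothetical readings are encrypted by the data holder.
c1 = encrypt(public_key, 120)
c2 = encrypt(public_key, 135)

# An analyst without the private key can still add them.
encrypted_total = add_encrypted(public_key, c1, c2)
total = decrypt(public_key, private_key, encrypted_total)
```

The analyst in this sketch never sees 120 or 135, only ciphertexts, yet the key holder can decrypt the combined result and obtain 255.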

New laws and regulations surrounding data—reflecting evolving attitudes about data privacy among populations of users—will change the requirements for data de-identification and anonymization. Vendors of products that handle sensitive data might face a "highest-common denominator" situation in which, to be able to market their products in all jurisdictions, they choose to meet the highest applicable bar rather than offering different products for different sets of rules. This market-access dynamic could unlock partnerships between data owners/producers and those who seek to capitalize on the power of data.

7. What Are the Implications for Higher Education?

Colleges and universities handle significant amounts of sensitive data, including student records, financial data, human-subjects research data, and health information from medical schools. Higher education institutions are subject to numerous overlapping and intersecting regulations, some of which include notification requirements for data breaches. These regulations vary and can span state and national borders—in addition to GDPR and PIPL, Canada has issued its own guidelines for ethical and responsible conduct regarding private data, and in the United States, the National Institute of Standards and Technology includes data-protection guidelines in many of its standards. The reputational risks are significant for higher education, particularly in heated political times, and any entity or individual working with sensitive data needs to understand the ways in which the confidentiality of a dataset can be compromised. Institutions that can demonstrate compliance with de-identification and privacy requirements might gain a competitive advantage when seeking funding from certain government agencies or other sources. Cultivating these data skills in-house or being able to contract for them will be an increasingly important aspect of institutional operations, not only to meet compliance obligations but also to earn and keep the confidence of the many individuals whose data are entrusted to colleges and universities.


Aaron Collie is Research Data Security Manager at Princeton University.

Jeff Gassaway is Information Security & Privacy Officer at the University of New Mexico.

Michael Laurentius is Research Information Security Specialist at the University of Toronto.

Randy Marchany is University IT Security Officer at Virginia Tech.

© 2024 EDUCAUSE. The content of this work is licensed under a Creative Commons BY 4.0 International License.