Data deidentification
Data deidentification (also spelled “de-identification”) is a general term for removing the association between personal data and the individual it identifies.
“Deidentification” is loosely defined and may refer to different approaches, algorithms, and tools that can be applied to data, with varying effectiveness and data utility. For example, it is sometimes used interchangeably with the terms “anonymization” and “pseudonymization”.
A more systematic approach is to treat deidentification as an umbrella term, with anonymization and pseudonymization as two different techniques of deidentification, distinguished as follows:
- Anonymization is irreversible: personal data is transformed in such a way that any link to the original individual is removed, making reidentification infeasible.
- Pseudonymization is reversible, allowing data to be mapped back to an individual, provided that the entity requesting the original personal data has sufficient permissions.
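The distinction between the two techniques can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen for illustration only: the `PseudonymVault` class, the `anonymize` function, the `authorized` flag standing in for a real permission check, and the sample record. The pseudonymization side keeps a lookup table so that tokens can be mapped back; the anonymization side keeps nothing and coarsens the remaining fields. Note that dropping names and generalizing an age does not by itself guarantee anonymity; it only shows the direction of the transformation.

```python
import secrets


class PseudonymVault:
    """Hypothetical in-memory token vault; a real system would use protected storage."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def pseudonymize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps to the same pseudonym.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def reidentify(self, token: str, authorized: bool) -> str:
        # Reversal is only allowed for callers with sufficient permissions
        # (the boolean flag is a stand-in for a real access-control check).
        if not authorized:
            raise PermissionError("caller may not reverse pseudonyms")
        return self._token_to_value[token]


def anonymize(record: dict) -> dict:
    """Drop direct identifiers and coarsen the age; no mapping back is kept."""
    decade = (record["age"] // 10) * 10
    return {"age_range": f"{decade}-{decade + 9}", "city": record["city"]}


record = {"name": "Alice Example", "email": "alice@example.com", "age": 34, "city": "Lyon"}

vault = PseudonymVault()
pseudonymized = {
    **record,
    "name": vault.pseudonymize(record["name"]),
    "email": vault.pseudonymize(record["email"]),
}
anonymized = anonymize(record)
```

Here `vault.reidentify(pseudonymized["name"], authorized=True)` recovers the original name, whereas nothing in `anonymized` can be passed back to the vault, because no token or mapping was retained for it.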
It is reasonable for the same company to use both deidentification techniques: pseudonymization for handling personal data internally and in production systems, and anonymization for testing, publishing aggregated data, or sharing it with third parties.
These deidentification techniques occupy different positions on the data privacy-to-utility spectrum. Anonymization sacrifices a significant portion of utility in favor of privacy, whereas pseudonymization preserves more utility but offers weaker privacy protection. This seemingly subtle difference matters for compliance: properly anonymized data is no longer classified as personal data under most privacy laws, while pseudonymized data generally still is. The key word here is “properly”, as truly anonymizing unstructured data (that is, documents as opposed to structured databases) is notoriously hard.
Regulations treat the two techniques differently. The General Data Protection Regulation (GDPR) suggests pseudonymization as a means of protecting personal data, whereas the California Privacy Rights Act (CPRA) requires businesses to deidentify information if they intend to freely collect, store, use, sell, share, or disclose it. According to the CPRA:
“Deidentified” means information that cannot reasonably be used to infer information about, or otherwise be linked to, a particular consumer…
The problem of reidentification risk is often discussed alongside deidentification. There is a category of cyberattacks called “reidentification attacks”, which may involve combining several datasets or using advanced data analytics techniques. For businesses implementing deidentification, it is important to remain aware of the evolving cybersecurity landscape and periodically reevaluate the effectiveness of deidentification practices to keep up with state-of-the-art reidentification methods.
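To make the idea of combining datasets concrete, the following Python sketch joins a deidentified dataset with a public one on shared attributes. All data, field names (`zip`, `birth_year`, `sex`, `diagnosis`), and function names are made up for illustration, and the choice of which attributes to join on is an assumption; real attacks and real risk assessments are considerably more involved. The `smallest_group` helper is one simple indicator a business might track when reevaluating its practices: a result of 1 means at least one record is unique on the chosen attributes.

```python
from collections import Counter

# Attributes assumed to appear in both datasets (hypothetical choice).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def quasi_key(row):
    """Project a row onto the attributes shared by both datasets."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(deidentified, public):
    """Return {name: deidentified row} for keys that are unique in both datasets."""
    deid_counts = Counter(quasi_key(r) for r in deidentified)
    pub_counts = Counter(quasi_key(r) for r in public)
    deid_by_key = {quasi_key(r): r for r in deidentified}
    matches = {}
    for person in public:
        key = quasi_key(person)
        # Only an unambiguous one-to-one match counts as a reidentification.
        if deid_counts[key] == 1 and pub_counts[key] == 1:
            matches[person["name"]] = deid_by_key[key]
    return matches


def smallest_group(rows):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    return min(Counter(quasi_key(r) for r in rows).values())


deidentified = [
    {"zip": "02138", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "migraine"},
]
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990, "sex": "M"},
]

reidentified = link(deidentified, public)
```

In this toy data, the first public record matches exactly one deidentified row and is therefore linked to a diagnosis, while the second matches two rows and stays ambiguous. Coarsening or suppressing attributes until `smallest_group` rises above 1 is one way to reduce this particular risk, at the cost of utility.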