Data Anonymization Key Concepts: Preserving Privacy

Jun 16, 2021

Updated: May 9, 2022

Growing privacy regulation and frequent data breaches have made it increasingly challenging to store and share data safely. To cope with this, organizations need to adopt techniques that mask sensitive information to preserve the privacy of individuals.

Data anonymization is one such technique that removes or encrypts identifiers that relate an individual to the stored data. It doesn’t allow the information to be traced back to a specific individual, thus protecting privacy while also maintaining the data’s credibility.

Recently, the UK’s Information Commissioner's Office (ICO) published draft guidance that lays out good-practice recommendations for data anonymization. The guidance comes at a time when several states and nations are adopting laws to enhance the privacy of their citizens.

What is the need for data anonymization?

For an organization operating in several regions, complying with local data protection laws while storing and sharing data can get complicated. But data protection laws don’t apply to truly anonymous data. For instance, GDPR allows the collection, storage, and use of anonymized data without consent if all identifiers are removed from the data.

Additionally, the growing awareness around privacy has urged several companies to be more transparent and accountable to the public. The tech giant Google has mentioned anonymization as a critical component of its commitment to privacy.

A 2020 McKinsey survey revealed that 87% of people would not do business with a company if they had concerns about its security practices, while 71% would stop doing business with a company if it gave away sensitive data without permission.

In this digital era driven by data, it is not possible to sustain one’s business without harnessing the data. Data analytics can help you obtain crucial insights and unlock the full potential of your business operations. But in doing so, you may have to share your data with other parties.

Anonymization lowers the data protection risk even as you share the data with other organizations or the public. It can keep sensitive information private by masking certain attributes of the data while still allowing you to derive business value from it. In short, data anonymization lets you put your data to use without compromising user privacy.

The anonymization process can save you precious time and money that would otherwise go towards developing the skills and infrastructure needed to comply with the various laws. It’s a win-win for businesses as well as individuals.

This article evaluates the different data anonymization techniques available to any organization dealing with large volumes of private or sensitive information. This includes any company handling personally identifiable information (PII) such as names, phone numbers, and social security numbers.

Data Anonymization Techniques

Data Masking: It replaces the data with modified values so that the originals cannot easily be reverse-engineered. A mirror copy of the database is created, and alteration techniques like character shuffling, character substitution, or encryption are applied to it.
Generalization: It removes or coarsens only part of the data, making it less identifiable while keeping it accurate.
Data Swapping/Permutation/Shuffling: It swaps and rearranges dataset attribute values to prevent the data from matching the original information.
Data Perturbation: It makes minor alterations to the original dataset by rounding numbers and adding random noise.
Synthetic Data: It creates artificial datasets based on patterns in the actual dataset. The artificial datasets can be created using statistical techniques like medians, standard deviations, linear regression, etc.
Pseudonymization: It replaces personally identifying data fields with fake identifiers or pseudonyms.
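
To make three of the techniques above concrete, here is a minimal Python sketch of masking, generalization, and perturbation applied to single values. The function names, the choice to keep the last four digits, the ten-year age bands, and the noise and rounding parameters are all illustrative assumptions, not recommendations from any regulator.

```python
import random


def mask_phone(phone: str, visible: int = 4) -> str:
    """Data masking: replace every digit except the last `visible` ones with '*'."""
    total_digits = sum(ch.isdigit() for ch in phone)
    seen = 0
    out = []
    for ch in phone:
        if ch.isdigit():
            seen += 1
            # Keep only the trailing `visible` digits readable.
            out.append(ch if seen > total_digits - visible else "*")
        else:
            out.append(ch)  # separators such as '-' are left untouched
    return "".join(out)


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value: float, base: int = 1000, noise: float = 800.0, rng=None) -> int:
    """Perturbation: add bounded random noise, then round to a multiple of `base`."""
    rng = rng or random.Random()
    noisy = value + rng.uniform(-noise, noise)
    return base * round(noisy / base)


print(mask_phone("555-123-4567"))   # ***-***-4567
print(generalize_age(34))           # 30-39
print(perturb(52340))               # a nearby multiple of 1000; varies per run
```

Note that the masked phone number still reveals its length and format, which is one reason masking alone may not be enough for every dataset.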

Which one to choose?

The anonymization technique should be selected based on a number of factors, like the purpose of anonymization, characteristics of the technique, availability of tools, and the recipient the data is intended for.

Apart from evaluating the above factors, it is important to perform an identifiability assessment before and after applying the anonymization technique. It should assess the risk of inference from the data before anonymization and determine the residual risk of re-identification after anonymization. We’ll outline a few use cases for the anonymization techniques explained above.

Data masking can be performed when the data is a string of characters and masking part of it provides the desired level of anonymity. Check whether the length of the masked value reveals anything about the original data. In the case of partial masking, subject matter knowledge is needed to ensure the right characters are masked.
Generalization is suitable for data that can be used even after generalization. For example, converting a person’s age into an age range makes it less precise but still useful.
Data swapping can be used when there is no need for analysis of relationships between data attributes at the record level.
Data perturbation is used for numbers and dates that may be identifying when combined with other data sources. The degree of perturbation should be based on the range of values of the attribute.
Synthetic data is employed when a large volume of data is required for testing while retaining certain aspects of the original, like the relation among attributes, format, etc. It is not very useful for data analysis, as the data is not real.
Pseudonymization is reversible, as it does not remove all potential identifiers from the data. It should be used when the data values must remain unique but carry no trace of information from the original attribute.
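
As a rough illustration of the swapping and synthetic data use cases, the sketch below shuffles one attribute across records and generates artificial numeric values from the mean and standard deviation of real ones. The helper names and the choice of a normal distribution are assumptions made for this example; real synthetic data tools model far more structure than this.

```python
import random
import statistics


def swap_column(records, column, rng=None):
    """Data swapping: shuffle one attribute's values across all records.

    The column's overall distribution is unchanged, but the link between
    that value and the rest of each record is broken.
    """
    rng = rng or random.Random()
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


def synthesize_numeric(values, n, rng=None):
    """Synthetic data: draw n artificial values from a normal distribution
    fitted to the real values (sample mean and standard deviation)."""
    rng = rng or random.Random()
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


people = [
    {"city": "Leeds", "salary": 41000},
    {"city": "York", "salary": 58000},
    {"city": "Bath", "salary": 36500},
]
swapped = swap_column(people, "salary")
fake_salaries = synthesize_numeric([p["salary"] for p in people], n=5)
```

Because swapping preserves column totals and averages, aggregate reports still work, while record-level analysis (for example, salary by city) no longer reflects reality.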

The GDPR considers pseudonymized data to be personal data if the identifiers can still be exposed with the proper key. Pseudonymization allows the data to be identified when needed but prevents a significant level of unauthorized usage. This may be required in many cases; for instance, fully anonymized data cannot be used to personalize the user experience.
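
The reversibility described here can be sketched with a simple lookup table that plays the role of the key. The `Pseudonymizer` class and the `user_` token prefix below are hypothetical names for illustration only; a production system would store the mapping separately from the data and under strict access control.

```python
import secrets


class Pseudonymizer:
    """Replaces identifiers with random tokens and keeps a private mapping."""

    def __init__(self):
        self._forward = {}  # original value -> token
        self._reverse = {}  # token -> original value (the "key")

    def pseudonymize(self, value: str) -> str:
        # Reuse the same token for a repeated value so records stay linkable.
        if value not in self._forward:
            token = "user_" + secrets.token_hex(8)
            while token in self._reverse:  # collisions are extremely unlikely
                token = "user_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token: str) -> str:
        # Only a holder of the mapping can recover the original value.
        return self._reverse[token]


p = Pseudonymizer()
token = p.pseudonymize("alice@example.com")
original = p.reidentify(token)  # "alice@example.com"
```

Anyone who holds only the tokens learns nothing about the original values, but whoever holds the mapping can re-identify every record, which is why such data still counts as personal data.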

Bottom Line

Anonymization can be considered another tool that goes a long way in reducing unnecessary exposure of personal data. Though it is advisable to de-identify personal information if you don’t need it, you should decide on the anonymization technique based on the circumstances and the needs of your organization.

It is advisable to employ one or more of the anonymization techniques, based on your requirements, in order to safeguard the data. Note that a single layer of anonymization may not offer the desired level of protection; in that case, it should be applied in multiple layers.

The ICO warns that a dataset that is anonymous to one organization may not be anonymous to another. Whether a dataset is truly anonymous depends on how likely it is that the data can be re-identified. The applied anonymization technique should be assessed against the risk of re-identification while ensuring the data is still usable.
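
One rough way to put a number on re-identification risk is to count how many records share the same combination of indirectly identifying attributes, a measure commonly known as k-anonymity. This metric is brought in here purely as an illustration and is not presented as the ICO's method; the function name, the sample records, and the choice of age and ZIP code as quasi-identifiers are assumptions.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share the
    same values for all the given quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


raw = [
    {"age": 34, "zip": "90210"},
    {"age": 36, "zip": "90211"},
    {"age": 52, "zip": "10001"},
    {"age": 58, "zip": "10002"},
]

# Generalize: ten-year age bands and ZIP codes truncated to three digits.
generalized = [
    {"age": f"{r['age'] // 10 * 10}-{r['age'] // 10 * 10 + 9}",
     "zip": r["zip"][:3] + "**"}
    for r in raw
]

print(smallest_group_size(raw, ["age", "zip"]))          # 1: every record is unique
print(smallest_group_size(generalized, ["age", "zip"]))  # 2: each record has a twin
```

A value of 1 means at least one person is uniquely identifiable from those attributes alone. Running the same check before and after anonymization gives a simple view of residual risk, though it says nothing about what other datasets a recipient might combine the data with.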

The process of anonymization needs to be governed using a risk-based approach, with the data owners taking responsibility for monitoring and managing any associated risks. The data owners should work closely with the General Counsel's Office, the Information Security Office, the Privacy team, and other subject matter experts to ensure compliance with the various regulatory requirements.

The viability of the anonymization technique should be determined based on the factors of transparency, disclosure, references, confidentiality, integrity, etc. Independent third-party assessments of processes and safeguards should also be taken into account, especially if the anonymization process is being outsourced.

Data anonymization can provide a range of benefits for your organization; for example, it could make cross-border data transfers easier. Also, if your data is truly anonymized, you are generally not required to notify the affected users in case of a data breach, as there is no actual link between the individuals and the information.

Such measures can save you from reputational damage at a time when data breaches are becoming increasingly rampant. However, a simple approach to anonymization with no due diligence will not meet the requirements of privacy laws.