An introduction to data privacy and anonymization:


Data privacy is a fundamental right of every individual. As people’s online presence grows, it matters more than ever, because handling large amounts of data while protecting the rights of an individual is a complex task.

With the rise of Generative AI, a lot of your data can be used without your consent. Sometimes the raw data can even introduce biases into a model, making the model impractical for real-life use.

Data anonymization refers to masking sensitive user data so that the identity of the user is protected and, if the data is released, it cannot be traced back to the user. In other words, you privatize your data before making it public. It sounds counterintuitive and counterproductive, but it serves a crucial purpose.

Data anonymization is a tradeoff between the usability of the data, in terms of its statistical properties, and the privacy of the users. Doing it properly means combining statistical modeling with domain knowledge, so that the anonymized data keeps the statistical behavior of the original while the users stay protected.
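To make the tradeoff concrete, here is a minimal sketch in plain Python. It assumes a made-up list of ages and one simple masking technique (replacing each age with the midpoint of a bin); the function names `generalize` and `evaluate` are hypothetical and chosen only for this illustration. It reports how far the mean drifts from the true value and how small the smallest group of indistinguishable users is for several bin widths.

```python
from collections import Counter
from statistics import mean

# Hypothetical ages of ten users (made-up values for illustration only).
ages = [23, 25, 31, 34, 38, 41, 47, 52, 58, 63]

def generalize(age, width):
    """Replace an exact age with the midpoint of its bin of the given width."""
    low = (age // width) * width
    return low + width / 2

def evaluate(values, width):
    """Return (absolute error in the mean, size of the smallest bin)."""
    coarse = [generalize(v, width) for v in values]
    utility_loss = abs(mean(coarse) - mean(values))
    smallest_group = min(Counter(coarse).values())
    return utility_loss, smallest_group

# Sweep a few bin widths to see both sides of the tradeoff.
for width in (5, 10, 20, 40):
    loss, k = evaluate(ages, width)
    print(f"bin width {width:>2}: mean error {loss:.2f}, smallest group {k}")
```

Wider bins tend to hide each user among more people, but they also tend to move statistics such as the mean further from their true values. Choosing the bin width is where domain knowledge comes in.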

Why is data anonymization needed?

One of the best-known examples of why anonymizing data matters is the 2006 AOL search data leak. Here are some key facts about it:

  • On the 4th of August 2006, the search data of approximately 650,000 users, containing about 20 million search queries, was leaked

  • The data was removed relatively quickly on the 7th of August 2006

  • AOL did not identify the users in the data, as their names were not explicitly included

  • However, The New York Times was able to identify at least one user by cross-referencing the search data with other sources such as phone book listings
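The cross-referencing step can be sketched as a simple join on shared attributes. This is a simplified analogue of such a linkage attack, not a reconstruction of what was done with the AOL data: all records, the field names (`zip`, `birth_year`), and the `link` helper are made up for illustration.

```python
from collections import defaultdict

# "Anonymized" release: no names, but quasi-identifiers remain (made-up data).
released = [
    {"user_id": 1, "zip": "30301", "birth_year": 1944, "searches": 120},
    {"user_id": 2, "zip": "30301", "birth_year": 1980, "searches": 45},
    {"user_id": 3, "zip": "10001", "birth_year": 1980, "searches": 78},
]

# A hypothetical public directory that does contain names (also made up).
directory = [
    {"name": "A. Example", "zip": "30301", "birth_year": 1944},
    {"name": "B. Sample", "zip": "10001", "birth_year": 1980},
    {"name": "C. Sample", "zip": "10001", "birth_year": 1980},
]

def link(released_rows, directory_rows, keys=("zip", "birth_year")):
    """Map each released user_id to directory names sharing its quasi-identifiers."""
    index = defaultdict(list)
    for person in directory_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])
    return {
        row["user_id"]: index.get(tuple(row[k] for k in keys), [])
        for row in released_rows
    }

matches = link(released, directory)

# A user is re-identified when exactly one directory entry matches.
reidentified = {uid: names[0] for uid, names in matches.items() if len(names) == 1}
print(reidentified)
```

Users whose attributes match several directory entries, or none, stay ambiguous. The risk comes from combinations of attributes that are unique to one person.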

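Another well-known case is that of Netflix, which released an anonymized dataset of movie ratings for the Netflix Prize. Researchers at the University of Texas at Austin were able to trace some of the records back to individual users by cross-referencing them with public IMDb ratings.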

Although proper care should be taken that data does not get leaked in the first place, it is equally necessary to make sure that, if the data does get leaked, sensitive information about the users is not revealed.

To ensure the privacy of the data, a five-parameter framework is used:

  1. Ensure that the data is safe

  2. Ensure that the people working on the data are safe

  3. Ensure that the scope of the project is viable

  4. Ensure that proper compliance standards are in place to keep the data safe

  5. Ensure that the disclosure of output data is monitored so that sensitive data is not leaked

Fig 1 The process of anonymization

Types of anonymization

There are two main types of anonymization:

  1. Static anonymization: The data is anonymized all at once and then released to the public or to a third party such as a vendor. Often only a subset of the original data is released after anonymization. Popular static anonymization tools include ARX and Amnesia.

  2. Dynamic anonymization: The data is anonymized at query time. The full dataset is typically made available for querying, and the anonymization happens in real time as each query is answered. Dynamic anonymization is considered more reliable than static anonymization because a static release has to specify its anonymization techniques very carefully; otherwise, subsets of the data can be extracted multiple times and combined to paint a picture of the actual data. Popular dynamic anonymization tools include the R package diffpriv and Google’s RAPPOR.
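As a rough illustration of the static approach, the sketch below anonymizes a small made-up table once and then checks it before release. It does not use ARX or Amnesia. It assumes k-anonymity style generalization as one example of a static technique, and the records, field names, and helpers (`anonymize`, `k_of`) are hypothetical.

```python
from collections import Counter

# Made-up raw records: "name" is a direct identifier,
# "zip" and "age" are quasi-identifiers, "diagnosis" is the sensitive value.
raw = [
    {"name": "u1", "zip": "30301", "age": 23, "diagnosis": "flu"},
    {"name": "u2", "zip": "30305", "age": 27, "diagnosis": "asthma"},
    {"name": "u3", "zip": "30309", "age": 21, "diagnosis": "flu"},
    {"name": "u4", "zip": "10001", "age": 45, "diagnosis": "diabetes"},
    {"name": "u5", "zip": "10003", "age": 48, "diagnosis": "flu"},
    {"name": "u6", "zip": "10002", "age": 41, "diagnosis": "asthma"},
]

def anonymize(row):
    """Drop the direct identifier and generalize the quasi-identifiers."""
    decade = (row["age"] // 10) * 10
    return {
        "zip": row["zip"][:3] + "**",        # keep only the ZIP prefix
        "age": f"{decade}-{decade + 9}",     # exact age becomes a ten-year range
        "diagnosis": row["diagnosis"],
    }

def k_of(rows, keys=("zip", "age")):
    """Smallest number of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[k] for k in keys) for r in rows)
    return min(groups.values())

# The whole table is anonymized once; only `release` would be published.
release = [anonymize(r) for r in raw]
print("k before:", k_of(raw), "k after:", k_of(release))
```

The value returned by `k_of` is the size of the smallest group of records that look identical on the quasi-identifiers. A larger value means each released record hides among more people.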

Static anonymization

Dynamic Anonymization

Ultimately, the choice of privacy technique depends on the user and the purpose they are trying to achieve. For instance, dynamic anonymization might not be the best fit for people working in academia, because its results cannot be reproduced exactly.
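The reproducibility concern is easy to see in a small sketch of query-time anonymization. The example below does not use diffpriv or RAPPOR. It assumes the Laplace mechanism from differential privacy as a representative dynamic technique, and the data, the `epsilon` value, and the helper names (`laplace_noise`, `noisy_count`) are made up for illustration.

```python
import random

# Made-up private dataset that is never released directly.
ages = [23, 25, 31, 34, 38, 41, 47, 52, 58, 63]

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(values, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1/epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def over_40(age):
    return age > 40

rng = random.Random()  # unseeded on purpose: fresh noise on every run
first = noisy_count(ages, over_40, epsilon=0.5, rng=rng)
second = noisy_count(ages, over_40, epsilon=0.5, rng=rng)
print(f"same query, two answers: {first:.2f} and {second:.2f}")
```

Asking the same question twice gives two different answers, both scattered around the true count. That is what protects individual users, and it is also why exact results are hard to reproduce.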

Static anonymization proves useful when the user has well-defined parameters and a clear idea of how to deal with them. Static anonymization algorithms have been well researched, and many open-source and proprietary solutions support them. Among these, ARX has been a steadily popular choice. The next few posts in this series will dive deeper into building solutions using ARX. Stay tuned for them!