History of ARX:
ARX is an open-source data anonymization tool, and ARXaaS ("ARX as a Service") exposes its anonymization and privacy-enhancing functions to organizations as a web service. Despite the name, ARX is not related to the ARX (add-rotate-XOR) family of cryptographic primitives. It protects data by transforming sensitive information, for example by generalizing or suppressing values, while preserving as much of the statistical properties of the original data as possible.
ARX offers a range of anonymization techniques, including k-anonymity, l-diversity, t-closeness and differential privacy, which help organizations comply with data protection regulations while preserving data utility. The service is accessed through an API that allows developers to integrate the anonymization capabilities of ARXaaS into their existing applications. ARX is also available as standalone software, but it has limited use cases and cannot be customised. There is also a Python module called ‘pyarxaas’, which lets users access the ARXaaS service directly from their Python IDE.
Why do we need ARX:
ARX might seem like a ‘nice-to-have’ for many businesses that work with data. However, as concerns about data privacy and ethics grow, its applications are becoming far more widespread. Here are some use cases that define the role of ARX in an organization:
Compliance: Many organizations are subject to data protection regulations, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), that require them to protect the privacy of their customers' data. Arxaas provides privacy-enhancing technologies that help organizations comply with these regulations.
Risk management: Organisations may want to reduce the risk of data breaches or other security incidents by de-identifying sensitive data. Arxaas can help organizations minimize the risk of data exposure by anonymizing sensitive data before it is processed or stored.
Data analysis: Organisations may want to use data for analysis or research purposes while protecting the privacy of individuals whose data is being used. Arxaas can help organizations preserve the utility of the data while ensuring that the privacy of individuals is protected.
One example of an organization using Arxaas is the UK's National Health Service (NHS), which has used the service to anonymize patient data for research purposes. Another example is the German National Library of Science and Technology (TIB), which has used Arxaas to anonymize data on digital preservation and research data management. Additionally, several academic institutions have used Arxaas for research purposes, including the University of Copenhagen and the University of Vienna.
Installation guide for arx:
Installing ARXAAS:
ARXaaS can be installed by pulling its Docker image onto your system and running that image locally.
To run the Docker image locally, make sure that Docker (Docker Engine or Docker Desktop) is installed.
Before installing Docker, curl and sudo need to be installed or updated on the system.
Installing docker for Ubuntu 22.04
Before we install the docker engine on our system, it is crucial to update our packages and dependencies and that can be done by running the following command:
sudo apt-get update
After updating the package index, we install the packages that allow apt to use a repository over HTTPS:
sudo apt-get install \
ca-certificates \
curl \
gnupg \
lsb-release
Now we add Docker's official GPG key:
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
We set up the repository:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
After that is done, we refresh the package index so that apt sees the new repository, and then install Docker:
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
We can verify the installation by running ‘hello world’ using the command:
sudo docker run hello-world
After Docker has been installed, we can pull the Docker image of ARXaaS:
docker pull navikt/arxaas
We then start the image locally and expose it on port 8080:
docker run -p 8080:8080 navikt/arxaas
Once the container is running, open the following URL in your browser to reach your local ARXaaS instance:
http://localhost:8080/
This is what the arx local docker image looks like
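The browser check can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `arxaas_is_reachable` and the timeout value are illustrative assumptions, and the function only checks that something answers at the URL, not that it is really ARXaaS.

```python
import urllib.request


def arxaas_is_reachable(url="http://localhost:8080/", timeout=3.0):
    """Return True if an HTTP server answers successfully at the given URL.

    Hypothetical helper: it only confirms that something responds there.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False


if __name__ == "__main__":
    print("ARXaaS reachable:", arxaas_is_reachable())
```

If the function returns False right after `docker run`, the container may simply still be starting up.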
Installing pyarxaas traditionally:
The pyarxaas source code can be obtained by cloning the GitHub repository:
git clone https://github.com/navikt/pyarxaas
pyarxaas can also be installed using the package manager pip.
One thing to note before installing pyarxaas: unless you switch from the latest version of Python (3.10 or 3.11) to an older one (3.8.10 or older), pyarxaas might fail to install with errors.
Hence we need a separate Python environment. We can either downgrade the system-wide Python (not recommended, because it might break other software on the system that depends on the current version) or install multiple versions of Python side by side, using a virtual environment or pyenv.
Installing pyenv for Ubuntu 22.04
Update the system packages and run the pyenv installer using the commands
sudo apt-get upgrade
curl https://pyenv.run | bash
After the installation is complete, we add pyenv to the shell configuration in ~/.bashrc so that it is also available in new shells:
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
pyenv is now installed; after restarting the shell, this can be verified with the pyenv --version command.
Note: you might need to restart the shell before the newly installed pyenv and its changes become visible. We can then install Python 3.8.10 and make it the default version:
pyenv install -v 3.8.10
pyenv global 3.8.10
At the time of writing, the pyarxaas module is somewhat outdated and only worked with Python 3.8.10. Installing that specific version with pyenv, as shown above, is therefore probably the easiest way to go about it.
Once Python 3.8.10 is active, we can install pyarxaas using the following command and then confirm that the package is present:
pip install pyarxaas
pip show pyarxaas
After installing Python 3.8.10 and pyarxaas, we still have to switch the Python interpreter used by the editor. In VS Code, this can be done by opening the command palette, running ‘Python: Select Interpreter’ and choosing the desired version of Python.
The Python kernel is then ready to run our code.
Please note that if you choose a version of Python other than 3.8.10, the pyarxaas code will simply not run, because pyarxaas has only been installed for Python 3.8.10.
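A quick way to confirm that the selected interpreter is the right one is to ask Python itself. The sketch below uses only the standard library; the helper names `python_matches` and `module_available` are illustrative assumptions and are not part of pyarxaas.

```python
import importlib.util
import sys


def python_matches(required=(3, 8), version=None):
    """Return True if the interpreter version starts with the required parts."""
    version = sys.version_info if version is None else version
    return tuple(version[:len(required)]) == tuple(required)


def module_available(name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Running Python 3.8:", python_matches())
    print("pyarxaas importable:", module_available("pyarxaas"))
```

If the first check passes but the second fails, pyarxaas was most likely installed for a different interpreter than the one currently selected.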
What is pyarxaas
pyarxaas is the Python wrapper for accessing the ARX functions of an ARXaaS instance, such as the local one set up above. It can be downloaded from GitHub or installed using pip.
Functions supported:
AttributeType
AttributeType is used to assign an attribute type to each column of a given dataset. Usually the columns of a dataset fall into one of these categories:
Identifying: A column is called identifying when it contains values that can be used to directly identify a person. E.g. name, e-mail, Aadhaar card number etc.
Quasi-Identifying: Quasi-identifying columns do not identify a person directly, but they can be combined with other columns or other data to re-identify the person.
Let us take an example to understand the quasi-identifying columns a little better:
A dataset contains zip code, gender and items purchased. On their own, these columns might not be enough to identify the user who made a purchase, but they can be used in combination. E.g. the zip code narrows down the area the user lives in, and the gender column narrows the candidates further. In conjunction with customer reviews or, say, orders placed, this dataset can be used to identify the user and hence reveal sensitive personal information.
Sensitive: Sensitive attributes are the columns that contain important and crucial information about the user which must not be revealed or linked back to that person. E.g. a medical diagnosis or financial details. In ARX these columns are typically kept in the output and protected with privacy models such as l-diversity or t-closeness, while directly identifying columns are removed.
Insensitive: Insensitive columns do not contain any identifying or sensitive information and need not be anonymized. E.g. in a customer dataset, insensitive information could be the brand the customer rejected or the number of items they purchased.
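To make the quasi-identifier idea concrete, here is a minimal sketch in plain Python (it does not use the pyarxaas API). It labels the columns of a made-up purchase table and lists the zip code and gender combinations that occur only once, which are the records easiest to re-identify. The column names, the sample rows and the helper `unique_combinations` are all illustrative assumptions.

```python
from collections import Counter
from enum import Enum


class ColumnRole(Enum):
    """The four attribute categories described above (illustrative names)."""
    IDENTIFYING = "identifying"
    QUASI_IDENTIFYING = "quasi-identifying"
    SENSITIVE = "sensitive"
    INSENSITIVE = "insensitive"


# Hypothetical purchase records: (name, zip code, gender, items purchased).
COLUMNS = ("name", "zip_code", "gender", "items_purchased")
ROWS = [
    ("Asha", "47677", "F", 3),
    ("Ben", "47677", "M", 1),
    ("Carl", "47602", "M", 5),
    ("Dev", "47602", "M", 2),
]

# Hypothetical role assignment for the columns above.
ROLES = {
    "name": ColumnRole.IDENTIFYING,
    "zip_code": ColumnRole.QUASI_IDENTIFYING,
    "gender": ColumnRole.QUASI_IDENTIFYING,
    "items_purchased": ColumnRole.INSENSITIVE,
}


def unique_combinations(rows, columns, roles):
    """Return the quasi-identifier combinations that occur exactly once."""
    qi_positions = [
        i for i, col in enumerate(columns)
        if roles[col] is ColumnRole.QUASI_IDENTIFYING
    ]
    counts = Counter(tuple(row[i] for i in qi_positions) for row in rows)
    return sorted(combo for combo, n in counts.items() if n == 1)


at_risk = unique_combinations(ROWS, COLUMNS, ROLES)
```

In this toy table, two of the three zip code and gender combinations belong to a single person, even though neither column identifies anyone on its own.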
Types of hierarchy supported
A hierarchy defines the levels of generalization that can be applied to a column when anonymizing the data. The pyarxaas module supports a variety of hierarchy types. You can either supply your own generalization hierarchy or create one in pyarxaas. Here are some of the supported hierarchy types in pyarxaas:
Order-based Hierarchy: Order-based hierarchies are suited for categorical variables, i.e. variables whose values can only come from a specified list of values. The values are put in order and neighbouring values are grouped under a common label.
Here, we have a list of diseases for a patient dataset
This is what the hierarchy looks like for the following diseases:
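The idea behind an order-based hierarchy can be sketched in plain Python as follows. This is not the pyarxaas hierarchy builder; the function `order_based_hierarchy`, the disease list and the group labels are illustrative assumptions. Each row holds the original value, its group label, and a fully generalized "*".

```python
def order_based_hierarchy(ordered_values, groups):
    """Build hierarchy rows of the form [value, group label, "*"].

    ordered_values: values in the order the groups should follow.
    groups: list of (size, label) pairs that partition ordered_values.
    """
    if sum(size for size, _ in groups) != len(ordered_values):
        raise ValueError("group sizes must cover every value exactly once")
    rows = []
    position = 0
    for size, label in groups:
        for value in ordered_values[position:position + size]:
            rows.append([value, label, "*"])
        position += size
    return rows


# Hypothetical, pre-ordered list of diseases from a patient dataset.
diseases = [
    "bronchitis", "flu", "pneumonia",
    "gastritis", "gastric ulcer", "stomach cancer",
]
disease_hierarchy = order_based_hierarchy(
    diseases,
    [(3, "respiratory"), (3, "stomach-related")],
)
```

The order of the input list matters: values are assigned to groups purely by position, so related values have to be placed next to each other first.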
Redaction-based hierarchy: Redaction-based hierarchies are best suited for values that consist of digits but are really categorical, e.g. phone numbers or zip codes. They take a list of values and mask one character at a time until the privacy criteria are met.
Example:
Here we have a list of zip codes
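A redaction-based hierarchy is simple enough to sketch in a few lines of plain Python. The function below is an illustration rather than the pyarxaas builder; the name `redaction_hierarchy`, the sample zip codes, right-to-left masking and the "*" mask character are assumptions made for this example.

```python
def redaction_hierarchy(values, mask="*"):
    """Build rows that mask one more trailing character at each level.

    Level 0 is the original value; the last level is fully masked.
    """
    lengths = {len(value) for value in values}
    if len(lengths) != 1:
        raise ValueError("this sketch expects values of equal length")
    width = lengths.pop()
    return [
        [value[:width - level] + mask * level for level in range(width + 1)]
        for value in values
    ]


# Hypothetical list of zip codes.
zip_codes = ["47677", "47602", "47905"]
zip_hierarchy = redaction_hierarchy(zip_codes)
```

At level 2 the first two zip codes both become "476**", which is how masking makes records indistinguishable from one another.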
Interval-based hierarchy: Interval-based hierarchies typically work well for continuous numeric values such as age, height or weight. The values in the attribute column are replaced by the intervals they fall into, which get wider at each generalization level, instead of the actual numbers.
Let's say we have a list of the age group for the customer data:
This is what the final hierarchy looks like:
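As an illustration of interval-based generalization, the following plain-Python sketch maps ages to intervals of increasing width. It is not the pyarxaas interval builder; the function names, the sample ages and the widths of 10 and 20 years are assumptions chosen for the example.

```python
def interval_label(value, start, width):
    """Return the half-open interval [low, high) of the given width containing value."""
    low = start + ((value - start) // width) * width
    return f"[{low}, {low + width})"


def interval_hierarchy(values, start, widths):
    """Build rows [value, one interval per width..., "*"], widths from fine to coarse."""
    rows = []
    for value in values:
        row = [str(value)]
        row.extend(interval_label(value, start, width) for width in widths)
        row.append("*")
        rows.append(row)
    return rows


# Hypothetical customer ages.
ages = [23, 27, 34, 41, 58]
age_hierarchy = interval_hierarchy(ages, start=0, widths=[10, 20])
```

Choosing widths that are multiples of each other keeps the levels nested, so every narrow interval lies inside exactly one wider interval.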
Privacy models
What are privacy models?
Privacy models refer to a set of techniques, methods, and frameworks that are used to protect the privacy of individuals in a data collection or analysis process. These models provide a structured approach to ensure that sensitive or personally identifiable information is not disclosed, while still allowing useful information to be extracted for research or analysis purposes.
Privacy models often involve a combination of statistical, cryptographic, and computational techniques to achieve their goals. They can be used to enforce different levels of privacy protection, depending on the specific needs and requirements of a particular use case or application.
Privacy Models supported by arx:
ARX supports the following privacy models:
K-anonymity:
K-Anonymity is a privacy model designed to protect the identity of individuals whose personal information is contained in a dataset. It makes re-identification harder by ensuring that each record is indistinguishable from at least k-1 other records with respect to the quasi-identifying attributes.
In other words, K-Anonymity anonymizes data by grouping individuals with the same (generalized) quasi-identifier values into groups, called equivalence classes, each containing at least k individuals. An attacker who knows someone's quasi-identifiers can therefore narrow them down to a group of at least k records, not to a single record.
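The definition can be checked mechanically. Below is a minimal plain-Python sketch (independent of pyarxaas) that groups records by their quasi-identifier values and verifies that every group has at least k members. The record layout, column names and function names are illustrative assumptions.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(size >= k for size in sizes.values())


# Hypothetical, already generalized patient records.
records = [
    {"zip_code": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip_code": "476**", "age": "[20, 30)", "disease": "gastritis"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
]

two_anonymous = is_k_anonymous(records, ["zip_code", "age"], k=2)
three_anonymous = is_k_anonymous(records, ["zip_code", "age"], k=3)
```

Here both equivalence classes contain two records, so the table is 2-anonymous but not 3-anonymous.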
L-diversity:
L-Diversity is a privacy model that protects the sensitive information of individuals by requiring diversity among the sensitive values within each group of indistinguishable records. K-anonymity alone does not prevent disclosure when everyone in a group shares the same sensitive value; L-Diversity makes it difficult to link a specific sensitive value to a particular individual.
In other words, L-Diversity in its simplest (distinct) form ensures that every group of individuals in the dataset contains at least l different values of a sensitive attribute such as race, religion, or medical condition, so that the attribute cannot be linked to a specific individual in the group.
L-Diversity is particularly useful in scenarios where sensitive attributes need to be protected while still allowing data to be used for analysis, research, or other purposes. It is used in many different applications, including healthcare, finance, and social research.
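A minimal plain-Python sketch of the distinct variant of L-Diversity is shown below. It does not use pyarxaas; the sample records, column names and the function `is_distinct_l_diverse` are illustrative assumptions. The records are 2-anonymous, yet one group contains only a single disease.

```python
from collections import defaultdict


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, min_distinct):
    """Return True if every equivalence class holds at least min_distinct
    different values of the sensitive column."""
    values_per_class = defaultdict(set)
    for record in records:
        key = tuple(record[column] for column in quasi_identifiers)
        values_per_class[key].add(record[sensitive])
    return all(len(values) >= min_distinct for values in values_per_class.values())


# Hypothetical, already generalized patient records.
records = [
    {"zip_code": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip_code": "476**", "age": "[20, 30)", "disease": "gastritis"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
]

two_diverse = is_distinct_l_diverse(records, ["zip_code", "age"], "disease", 2)
```

The second group shares the single value "flu", so anyone known to be in that group is revealed to have the flu even though the table is 2-anonymous. That is exactly the gap L-Diversity is meant to close.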
T-closeness:
T-closeness is a privacy model that requires the distribution of a sensitive attribute within each group of indistinguishable records to be similar to its distribution in the dataset as a whole. The aim is to limit how much an attacker can learn about an individual's sensitive attribute from knowing which group that individual belongs to.
In other words, T-closeness ensures that the distribution of a sensitive attribute (such as race or medical condition) in any group does not differ significantly from the distribution of that attribute in the overall dataset, to avoid revealing sensitive information about specific individuals.
T-closeness is measured using a distance metric, such as the Earth Mover's Distance, between the distribution of a sensitive attribute in a group and its distribution in the overall dataset. The dataset satisfies T-closeness when this distance does not exceed the threshold t for any group.
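The check can be sketched in plain Python as follows. As a simplifying assumption, the sketch uses the variational distance (half the sum of absolute differences), which is what the Earth Mover's Distance reduces to for a categorical attribute when all values are treated as equally far apart; hierarchical or ordered distances would need a different function. The records and function names are illustrative and not part of pyarxaas.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Return the relative frequency of each value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def max_class_distance(records, quasi_identifiers, sensitive):
    """Largest distance between any class distribution and the overall one."""
    overall = distribution([record[sensitive] for record in records])
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in quasi_identifiers)
        classes[key].append(record[sensitive])
    return max(
        variational_distance(distribution(values), overall)
        for values in classes.values()
    )


def is_t_close(records, quasi_identifiers, sensitive, t):
    """Return True if no equivalence class is further than t from the overall distribution."""
    return max_class_distance(records, quasi_identifiers, sensitive) <= t


# Hypothetical, already generalized patient records.
records = [
    {"zip_code": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip_code": "476**", "age": "[20, 30)", "disease": "gastritis"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip_code": "479**", "age": "[40, 50)", "disease": "flu"},
]

worst_distance = max_class_distance(records, ["zip_code", "age"], "disease")
```

A smaller t is a stricter requirement: it forces every group to mirror the overall distribution more closely, usually at the cost of more generalization.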
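from pyarxaas.privacy_models import KAnonymity, LDiversityDistinct
# arxaas (the ARXaaS connection) and dataset (a pyarxaas Dataset) are assumed to exist already
kanon = KAnonymity(2)
ldiv = LDiversityDistinct(2, "disease") # in this example the dataset has a disease field
anonymize_result = arxaas.anonymize(dataset, [kanon, ldiv], 0.2)
anonymized_dataset = anonymize_result.dataset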
End-to-end data anonymization process with pyarxaas