A gentle introduction to configuring ARX:


History of ARX:

ARX is open-source software for anonymizing sensitive personal data, and ARXaaS ("ARX as a Service") wraps it in a web service that provides data anonymization and privacy-enhancing technologies to organizations. Despite the shared name, it is unrelated to the ARX (add-rotate-XOR) family of cryptographic primitives. It protects data by transforming sensitive information, for example by generalizing or suppressing values, while preserving as much of the statistical properties of the original data as possible.

ARX offers a range of privacy models, including k-anonymity, L-diversity, T-closeness and differential privacy, which help organizations comply with data protection regulations while preserving data utility. ARXaaS is accessed through an API that allows developers to easily integrate these anonymization capabilities into their existing applications. ARX is also available as standalone software, but in that form it has limited use cases and cannot be customised as easily. There is also a Python module called ‘pyarxaas’, which lets users access the ARXaaS service right from their Python IDE.

Why do we need ARX:

ARX might seem like a ‘nice-to-have’ for many ordinary businesses that deal with data. However, with concerns about data privacy and ethics on the rise, its applications are far and wide. Here are some use cases that define the role of ARX in an organization:

  1. Compliance: Many organizations are subject to data protection regulations, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), that require them to protect the privacy of their customers' data. Arxaas provides privacy-enhancing technologies that help organizations comply with these regulations.

  2. Risk management: Organisations may want to reduce the risk of data breaches or other security incidents by de-identifying sensitive data. Arxaas can help organizations minimize the risk of data exposure by anonymizing sensitive data before it is processed or stored.

  3. Data analysis: Organisations may want to use data for analysis or research purposes while protecting the privacy of individuals whose data is being used. Arxaas can help organizations preserve the utility of the data while ensuring that the privacy of individuals is protected.

    One example of an organization using Arxaas is the UK's National Health Service (NHS), which has used the service to anonymize patient data for research purposes. Another example is the German National Library of Science and Technology (TIB), which has used Arxaas to anonymize data on digital preservation and research data management. Additionally, several academic institutions have used Arxaas for research purposes, including the University of Copenhagen and the University of Vienna.

Installation guide for ARX:

Installing ARXaaS:

ARXaaS can be installed by pulling its Docker image onto your system and running that image locally.

To run the Docker image locally, you have to make sure that Docker (Docker Engine or Docker Desktop) is installed.

Before installing Docker, make sure that curl and sudo are installed and up to date on the system.

Installing docker for Ubuntu 22.04

Before we install Docker Engine on our system, it is crucial to refresh the package index so that apt sees the latest package versions. That can be done by running the following command:

sudo apt-get update

After the update, we install the packages that allow apt to use a repository over HTTPS:

sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

Now we add Docker’s official GPG key:

sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

Then we set up the repository:

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

After that is done, we refresh the package index once more so that apt picks up the new repository, and then install Docker:

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

We can verify the installation by running ‘hello world’ using the command:

sudo docker run hello-world

After Docker has been installed, we can pull the Docker image of ARXaaS:

docker pull navikt/arxaas

and then start it, publishing the service on port 8080:

docker run -p 8080:8080 navikt/arxaas

The image now runs locally on port 8080. Opening the following URL in your browser after running the docker command takes you to your local ARXaaS instance:

http://localhost:8080/

(Screenshot: the local ARXaaS instance as it appears in the browser.)
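The browser check can also be scripted. The following is a minimal sketch that uses only the Python standard library. It assumes the container started with the docker run command above is listening on port 8080; the helper name is_reachable is made up for this example and is not part of ARXaaS or pyarxaas.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Nothing is listening there, or the connection timed out.
        return False


if __name__ == "__main__":
    # URL assumed from the docker run command in this guide (port 8080).
    print(is_reachable("http://localhost:8080/"))
```

If this prints False, the container is most likely not running or the port mapping differs from the one used here.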

Installing pyarxaas traditionally:

pyarxaas can be installed by cloning the GitHub repository:

git clone https://github.com/navikt/pyarxaas

pyarxaas can also be installed using the package manager pip.

One thing to note before installing pyarxaas: unless you switch from the latest versions of Python (3.10 or 3.11) to an older one (3.8.10 or older), the pyarxaas installation might fail with errors.

Hence we need a separate Python environment. To get one, we can either downgrade the system Python (which is not recommended, because it might break other software that depends on the current version) or install an additional version of Python alongside it, using a virtual environment or pyenv.

Installing pyenv for Ubuntu 22.04

  1. Update the system packages and run the pyenv installer script:

     sudo apt-get update && sudo apt-get upgrade
     curl https://pyenv.run | bash

  2. After the installation is complete, we add pyenv to the shell startup file (~/.bashrc) and restart the shell using the exec command:

     echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc && echo 'eval "$(pyenv init -)"' >> ~/.bashrc && exec "$SHELL"

  3. pyenv is now installed, which can be verified by running the pyenv --version command

Note: If the pyenv command is not found, or a change of version does not seem to take effect, restart the shell.


With pyenv in place, we install Python 3.8.10 and make it the default version:

pyenv install -v 3.8.10
pyenv global 3.8.10

At the time of writing, the pyarxaas module was somewhat outdated and only worked with Python 3.8.10. Hence, installing that specific version using pyenv as shown above is probably the easiest way to go about it.

With Python 3.8.10 active, we can now install pyarxaas and confirm that the package is present using the following commands:

pip install pyarxaas
pip show pyarxaas

After installing Python 3.8.10 and pyarxaas, we still have to point the editor at that interpreter. In VS Code, this can be done by opening the interpreter selection prompt (Python: Select Interpreter) and choosing the desired version of Python to run.

The Python kernel is now ready to run our code.

This works perfectly fine. However, please note that if you choose a version of Python other than 3.8.10 to run the pyarxaas code, it will simply not run, because pyarxaas was installed only for Python 3.8.10.
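A quick way to confirm which interpreter is actually running the code is to ask Python itself. The sketch below uses only the standard library. The expected version (3, 8) is an assumption taken from this guide, and the function name interpreter_matches is made up for the example.

```python
import sys

# Assumption from this guide: pyarxaas was installed under Python 3.8.
EXPECTED = (3, 8)


def interpreter_matches(version_info=sys.version_info, expected=EXPECTED) -> bool:
    """Return True if the interpreter has the expected major.minor version."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    if not interpreter_matches():
        print("Warning: this is not the 3.8 interpreter that pyarxaas was installed for.")
```

Running this in the same editor session as the anonymization code shows immediately whether the interpreter selection took effect.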

What is pyarxaas

pyarxaas is the Python wrapper for accessing ARX functions on an ARXaaS instance, such as a local one running in Docker. It can be downloaded from GitHub or installed using pip.

Functions supported:

AttributeType

AttributeType is used to assign an attribute type to each column of a dataset. Usually, the columns of a dataset fall into one of these categories:

Identifying: A column of a dataset is called identifying when it contains values that can be used to directly identify a person, e.g. name, e-mail address, Aadhaar card number etc. Identifying columns are removed from the anonymized output.

Quasi-Identifying: Quasi-identifying columns do not identify a person directly, but they can be used in combination with other columns or other data to re-identify the person.

Let us take an example to understand the quasi-identifying columns a little better:

A dataset contains zip code, gender and items purchased. On their own, these columns might not be enough to identify the user who made a purchase, but they can be used in combination. E.g. the zip code narrows down the area of the user, and the gender column narrows the candidates further. Used in conjunction with customer reviews or, let's say, orders placed, this dataset can identify the user and hence reveal sensitive personal information.

Sensitive: Sensitive attributes are the columns in your data that contain information about the user that must not be linked back to them, e.g. a medical diagnosis or bank account details. In ARX, these columns are kept in the output and are protected by privacy models such as L-diversity and T-closeness rather than being removed.

Insensitive: Insensitive columns are the columns that do not contain any identifying or sensitive information and need not be anonymized. E.g. in a customer dataset, insensitive information could be the brand a customer rejected or the number of items they purchased.
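To make the quasi-identifier example above concrete, here is a small sketch in plain Python (it does not use pyarxaas). The records are made up for illustration, and unique_on is a hypothetical helper that returns the records that stand alone on a given combination of columns.

```python
from collections import Counter

# Toy records, invented for illustration: zip code, gender, item purchased.
purchases = [
    {"zip": "47677", "gender": "F", "item": "book"},
    {"zip": "47677", "gender": "F", "item": "phone"},
    {"zip": "47602", "gender": "M", "item": "laptop"},
    {"zip": "47678", "gender": "M", "item": "book"},
    {"zip": "47678", "gender": "M", "item": "shoes"},
]


def unique_on(records, columns):
    """Return the records whose combination of values in columns occurs only once."""
    def key(record):
        return tuple(record[c] for c in columns)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


if __name__ == "__main__":
    print(len(unique_on(purchases, ["gender"])))         # 0: gender alone singles out nobody
    print(len(unique_on(purchases, ["zip", "gender"])))  # 1: the combination singles out one record
```

Neither column is identifying on its own, yet together they isolate one purchase, which is exactly why such columns are marked as quasi-identifying.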

Types of hierarchy supported

A hierarchy is simply the set of generalization levels defined for an attribute when anonymizing the data. The pyarxaas module supports a variety of hierarchy types. You can either supply your own generalisation hierarchy or create one with pyarxaas. Here are some of the supported hierarchy types in pyarxaas:

Order-based Hierarchy: Order-based hierarchies are suited for categorical variables, i.e. variables whose values can only come from a specified list of values. The values are put in order and neighbouring values are grouped under a common label.

Here, we have a list of diseases for a patient dataset


This is what the hierarchy looks like for the following diseases:
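As an illustration of the idea only, the following plain-Python sketch builds an order-based hierarchy by hand. It does not call pyarxaas; the disease list, the group labels and the function name order_based_hierarchy are all hypothetical.

```python
def order_based_hierarchy(ordered_values, groups):
    """Build rows of [value, group label, "*"] from an ordered list of values.

    groups is a list of (size, label) pairs that are consumed in order.
    """
    if sum(size for size, _ in groups) != len(ordered_values):
        raise ValueError("group sizes must add up to the number of values")
    rows, start = [], 0
    for size, label in groups:
        for value in ordered_values[start:start + size]:
            rows.append([value, label, "*"])
        start += size
    return rows


# Hypothetical disease list, ordered so that related values sit next to each other.
diseases = ["bronchitis", "lung cancer", "pneumonia", "gastritis", "stomach cancer"]
groups = [(3, "lung-related"), (2, "stomach-related")]

if __name__ == "__main__":
    for row in order_based_hierarchy(diseases, groups):
        print(row)
```

Each row reads from left to right as increasing generalization: the exact value, its group, and finally full suppression.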

Redaction-based hierarchy: Redaction-based hierarchy is best suited for values that are written as digits but are really categorical, e.g. phone numbers or zip codes. It takes a list of values and redacts (masks) one character at a time from the attribute column until the privacy criteria are met.

Example:

Here we have a list of zip codes
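The redaction idea can be sketched in a few lines of plain Python. This is an illustration rather than pyarxaas code: the zip codes are made up, the function name redaction_hierarchy is hypothetical, and the sketch assumes that all values have the same length and are masked from right to left.

```python
def redaction_hierarchy(values, mask="*"):
    """Return one row per value: the value itself, followed by copies with one
    more trailing character masked at each level, ending fully masked."""
    if len({len(v) for v in values}) > 1:
        raise ValueError("this sketch assumes all values have the same length")
    rows = []
    for value in values:
        n = len(value)
        rows.append([value[: n - i] + mask * i for i in range(n + 1)])
    return rows


# Hypothetical zip codes.
zip_codes = ["47677", "47602", "47678"]

if __name__ == "__main__":
    for row in redaction_hierarchy(zip_codes):
        print(row)
```

The further right a level sits in a row, the less of the original value it reveals.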

Interval-based hierarchy: Interval-based hierarchy typically works well for continuous numeric values such as age, height, weight etc. At each generalisation level, the values in the attribute column are replaced by the interval (range) they fall into instead of the actual numbers.

Let's say we have a list of the age group for the customer data:

This is what the final hierarchy looks like:
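A rough plain-Python sketch of an interval-based hierarchy follows. It is not the pyarxaas builder: the ages, the interval widths and the function name interval_hierarchy are assumptions made for the example, and each width is chosen as a multiple of the previous one so that the levels nest.

```python
def interval_hierarchy(values, widths):
    """For each integer value, return [value, interval per width..., "*"]."""
    rows = []
    for value in values:
        row = [str(value)]
        for width in widths:
            low = (value // width) * width  # lower bound of the enclosing interval
            row.append(f"[{low}, {low + width})")
        row.append("*")
        rows.append(row)
    return rows


# Hypothetical customer ages.
ages = [23, 27, 34, 41, 58]

if __name__ == "__main__":
    for row in interval_hierarchy(ages, widths=[10, 20]):
        print(row)
```

With widths of 10 and 20, the age 23 generalizes to [20, 30), then to [20, 40), and finally to "*".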

Privacy models

What are privacy models?

Privacy models refer to a set of techniques, methods, and frameworks that are used to protect the privacy of individuals in a data collection or analysis process. These models provide a structured approach to ensure that sensitive or personally identifiable information is not disclosed, while still allowing useful information to be extracted for research or analysis purposes.

Privacy models often involve a combination of statistical, cryptographic, and computational techniques to achieve their goals. They can be used to enforce different levels of privacy protection, depending on the specific needs and requirements of a particular use case or application.

Privacy models supported by ARX:

ARX supports the following privacy models:

K-anonymity:

K-Anonymity is a privacy model designed to protect the identity of individuals whose personal information is being used in a dataset. It makes re-identification harder by ensuring that each record in the dataset is indistinguishable from at least k-1 other records with respect to its quasi-identifying attributes.

In other words, K-Anonymity is a technique that anonymizes data by grouping individuals with similar characteristics into groups (equivalence classes), where each group contains at least k individuals. By doing so, it makes it harder for an attacker to single out a particular individual in the dataset.
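The definition can be checked by hand on a small table. The sketch below is plain Python and does not use pyarxaas: the already-generalized records are invented for illustration, and k_of is a hypothetical helper that returns the size of the smallest group.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    quasi-identifier values (0 for an empty dataset)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values(), default=0)


# Invented, already-generalized records.
rows = [
    {"zip": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip": "476**", "age": "[20, 30)", "disease": "asthma"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
]

if __name__ == "__main__":
    print(k_of(rows, ["zip", "age"]))  # 2: every record shares its zip and age with one other
```

A dataset satisfies K-Anonymity for a chosen k whenever this value is at least k.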

L-diversity:

L-Diversity is a privacy model that extends K-Anonymity to protect the sensitive attributes of individuals in a dataset. K-Anonymity alone does not help if everyone in a group shares the same sensitive value, so L-Diversity requires enough diversity among the sensitive values within each group to make it difficult to link a specific sensitive value to a particular individual.

In other words, L-Diversity ensures that every group of individuals in the dataset contains at least l different values of a sensitive attribute such as race, religion, or medical condition, so that it is not possible to link these attributes to a specific individual in the group.

L-Diversity is particularly useful in scenarios where sensitive attributes need to be protected while still allowing data to be used for analysis, research, or other purposes. It is used in many different applications, including healthcare, finance, and social research.
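The simplest variant, distinct L-Diversity, counts the different sensitive values in each group. The following plain-Python sketch illustrates it on invented records; distinct_l is a hypothetical helper and not part of pyarxaas.

```python
from collections import defaultdict


def distinct_l(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    group of records sharing the same quasi-identifier values."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min((len(values) for values in groups.values()), default=0)


# Invented, already-generalized records: 2-anonymous on zip and age.
rows = [
    {"zip": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip": "476**", "age": "[20, 30)", "disease": "asthma"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
]

if __name__ == "__main__":
    print(distinct_l(rows, ["zip", "age"], "disease"))  # 1: the second group only contains "flu"
```

Here every group has two records, yet anyone known to be in the second group is known to have the flu, which is the gap that L-Diversity closes.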

T-closeness:

T-closeness is a privacy model that requires the distribution of a sensitive attribute within each group of records to be close to its distribution in the dataset as a whole. The aim is to prevent attackers from learning much about an individual's sensitive value just by knowing which group that individual belongs to.

In other words, T-closeness ensures that the distribution of sensitive attributes (such as age, race, or medical condition) in each group is not significantly different from the distribution of those attributes across the entire dataset, to avoid revealing sensitive information about specific individuals.

T-closeness is measured using a distance metric between the distribution of a sensitive attribute in a group and its distribution in the entire dataset. The goal is to keep this distance, or "closeness", at or below a threshold t for every group, to protect the privacy of individuals in the dataset.
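For a categorical sensitive attribute where all values are treated as equally far apart, the distance reduces to half the sum of absolute differences between the two distributions. The plain-Python sketch below computes the largest such distance over all groups. It is an illustration under that assumption, uses invented records, and its function names are hypothetical rather than part of pyarxaas.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value in a non-empty list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def variational_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def t_of(records, quasi_identifiers, sensitive):
    """Largest distance between any group's distribution and the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(variational_distance(distribution(v), overall) for v in groups.values())


# Invented, already-generalized records.
rows = [
    {"zip": "476**", "age": "[20, 30)", "disease": "flu"},
    {"zip": "476**", "age": "[20, 30)", "disease": "asthma"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
    {"zip": "479**", "age": "[40, 50)", "disease": "flu"},
]

if __name__ == "__main__":
    print(t_of(rows, ["zip", "age"], "disease"))  # 0.25
```

The dataset would satisfy T-closeness for any threshold t of 0.25 or more, and fail it for a stricter threshold.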

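In pyarxaas, privacy models are created as objects and passed to the anonymize call:

from pyarxaas.privacy_models import KAnonymity, LDiversityDistinct

# arxaas (the ARXaaS connection) and dataset are assumed to have been created earlier
kanon = KAnonymity(2)
ldiv = LDiversityDistinct(2, "disease")  # in this example the dataset has a disease field
anonymize_result = arxaas.anonymize(dataset, [kanon, ldiv], 0.2)
anonymized_dataset = anonymize_result.dataset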
    

End-to-end data anonymization process with pyarxaas