Anonymization of data using AWS Glue DataBrew
AWS has been a popular choice for cloud solutions, and for good reason. The following article suggests some approaches to designing a data privacy solution on AWS; these approaches were part of my bachelor's thesis.
Diagram
Components:
AWS S3: As explained in the previous solution, S3 is an object storage service used to store the data.
AWS Glue DataBrew: A service from Amazon that helps data scientists and machine learning engineers clean and normalize data.
AWS Athena: Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.
Explanation:
The data to be anonymized is stored in an S3 bucket. Amazon Macie can be used to detect sensitive data already present in the bucket. The data from S3 is then loaded into an AWS Glue DataBrew project, where jobs mask the sensitive PII column by column. Once the data is masked, it is written back to S3, and an external table is created in Athena on top of the masked output so it can be queried.
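Below is a minimal sketch, using boto3, of how these pieces could be wired together. The job name, bucket paths, database, and table columns are assumptions for illustration, and the masking recipe itself is assumed to have been defined beforehand as a DataBrew recipe job.

import time
import boto3

databrew = boto3.client("databrew")
athena = boto3.client("athena")

# Assumed names: the DataBrew recipe job "mask-pii-job" already exists and
# writes its masked output to s3://my-bucket/masked/.
JOB_NAME = "mask-pii-job"
MASKED_LOCATION = "s3://my-bucket/masked/"

# 1. Trigger the DataBrew recipe job that masks the PII columns.
run = databrew.start_job_run(Name=JOB_NAME)
run_id = run["RunId"]

# 2. Poll until the job run reaches a terminal state.
while True:
    state = databrew.describe_job_run(Name=JOB_NAME, RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

# 3. Create an external table in Athena over the masked output in S3.
ddl = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS masked_users (
    user_id string,
    age_range string,
    region string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '{MASKED_LOCATION}'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "anonymized_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

Since the external table is schema-on-read, analysts can query the masked data in place without another copy being made.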
Drawbacks:
Using AWS Macie to detect sensitive data might work for use cases where a user has accidentally placed sensitive data, such as credit card information, in the S3 bucket. In our use case, however, the data we want to mask is not just directly sensitive data but also data that can be combined with other data to re-identify users, and this is where Macie falls short. Additionally, AWS Glue DataBrew has limited support for masking sensitive data, and if an attacker knows the method or technique used to mask the data, re-identification may be easy.
Data Anonymization Pipeline using Snowflake and Airflow
Diagram
Components:
Snowflake: Snowflake is a SaaS data warehouse solution designed solely for the cloud; it runs on popular cloud providers such as GCP, AWS, and Azure. Because Snowflake is a managed service, a data warehouse or data lake can be built right from the browser.
PyARXaaS: A Python package used here as a static anonymization tool. PyARXaaS provides a client wrapper around an ARXaaS instance, a service that exposes the ARX anonymization library.
Apache Airflow: Apache Airflow is an open-source workflow orchestration tool, commonly used for ETL, that automates the data loading and transformation pipeline.
Explanation:
The following solution provides an end-to-end pipeline for anonymizing data and gives users the flexibility to scale the solution with their demand or data volume. The pipeline works as follows (code sketches of the anonymization step and of an orchestrating Airflow DAG follow the list):
The pipeline takes in raw data either from the source systems or from a SQL database. What makes the solution truly customizable is that Snowflake can ingest data from multiple sources: it natively supports Avro, Parquet, CSV, JSON, and ORC, so data from almost any source can be loaded.
The raw data is then loaded into Snowflake to create a data warehouse. Access to this warehouse is heavily restricted. This step is also customizable and scalable: if certain users need access to the raw data warehouse, they can simply be granted a role, and roles can be managed at the Snowflake organization level, reducing costs and increasing flexibility.
After that, an ETL job runs on the data warehouse; it validates the data if needed, sets the anonymization hierarchies for the data, and performs the actual anonymization. The anonymization used here is a manually written function that is triggered some time after the data lands in the warehouse. Since the columns and the details of the data are already known, static anonymization is used instead of dynamic anonymization, although dynamic anonymization could be used as well (see the anonymization sketch after this list).
The anonymized data is then stored in another Snowflake table, or it can be pushed to the cloud provider of your choice. An interesting aspect of Snowflake is that it lets you create visualizations directly from the Snowflake console itself.
The masked data, which is not public facing, can be used to generate charts, analyses, and visualizations.
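Here is a minimal sketch of the anonymization step using PyARXaaS, assuming an ARXaaS service is reachable at http://localhost:8080. The columns (age, zipcode, disease), the hierarchies, and the k value are illustrative only, not the schema of any real project data.

import pandas as pd
from pyarxaas import ARXaaS, AttributeType, Dataset
from pyarxaas.privacy_models import KAnonymity

# Connect to a running ARXaaS instance (assumed URL).
arxaas = ARXaaS("http://localhost:8080")

# Hypothetical raw rows pulled from the warehouse.
df = pd.DataFrame({
    "age":     ["29", "34", "41", "52"],
    "zipcode": ["47677", "47602", "47678", "47905"],
    "disease": ["flu", "flu", "cancer", "cancer"],
})
dataset = Dataset.from_pandas(df)

# Mark the columns that, combined, could re-identify a person.
dataset.set_attribute_type(AttributeType.QUASIIDENTIFYING, "age", "zipcode")
dataset.set_attribute_type(AttributeType.SENSITIVE, "disease")

# Generalization hierarchies: each row maps an original value to
# progressively coarser representations (illustrative values only).
zipcode_hierarchy = pd.DataFrame([
    ["47677", "4767*", "476**", "*****"],
    ["47602", "4760*", "476**", "*****"],
    ["47678", "4767*", "476**", "*****"],
    ["47905", "4790*", "479**", "*****"],
])
age_hierarchy = pd.DataFrame([
    ["29", "20-39", "*"],
    ["34", "20-39", "*"],
    ["41", "40-59", "*"],
    ["52", "40-59", "*"],
])
dataset.set_hierarchy("zipcode", zipcode_hierarchy)
dataset.set_hierarchy("age", age_hierarchy)

# Anonymize with 2-anonymity and pull the result back into pandas.
result = arxaas.anonymize(dataset, [KAnonymity(2)])
anonymized_df = result.dataset.to_dataframe()
print(anonymized_df)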
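And a sketch of how Apache Airflow could orchestrate the load, anonymize, and store steps on a schedule. The connection ID, stage and table names, and the anonymize_table placeholder (which would wrap the PyARXaaS logic above) are assumptions; the Snowflake provider package and a configured "snowflake_default" connection are also assumed.

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def anonymize_table(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder: apply the PyARXaaS anonymization from the sketch above.
    return df


def load_raw_data():
    # Copy staged raw files into the restricted raw table
    # ("raw_stage" and "raw_users" are assumed names).
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    hook.run("COPY INTO raw_users FROM @raw_stage FILE_FORMAT = (TYPE = CSV)")


def anonymize_and_store():
    # Pull the raw rows, anonymize them, and write the result to a
    # separate table that analysts are allowed to query.
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    raw_df = hook.get_pandas_df("SELECT * FROM raw_users")
    anonymized_df = anonymize_table(raw_df)
    hook.insert_rows(
        table="anonymized_users",
        rows=anonymized_df.itertuples(index=False, name=None),
        target_fields=list(anonymized_df.columns),
    )


with DAG(
    dag_id="anonymization_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_raw_data", python_callable=load_raw_data)
    anonymize = PythonOperator(
        task_id="anonymize_and_store", python_callable=anonymize_and_store
    )
    load >> anonymize

Keeping the anonymization as a plain Python callable is what makes the solution easy to customize: the function can be swapped for any other algorithm without changing the rest of the DAG.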
Drawbacks:
One of the biggest advantages of this solution is its scalability in terms of compute and storage: since Snowflake is a hybrid of the shared-nothing and shared-everything architectures, it is easy to scale up or down, and the manually written anonymization gives users greater flexibility to implement their own algorithm. However, one remaining challenge is anonymizing the data automatically without knowing its schema; this is still difficult because PyARXaaS is a static anonymization tool.
References: