Paper: Bitkom: Anonymisierung und Pseudonymisierung von Daten für Projekte des maschinellen Lernens

Anonymization and Pseudonymization of Data for Machine Learning Projects

https://www.bitkom.org/sites/default/files/2020-10/201002_lf_anonymisierung-und-pseudonymisierung-von-daten.pdf

Examples given:

  • Processing of geolocation profiles (movements)
  • Google’s COVID-19 Community Mobility Reports
  • De-coupled pseudonyms, e.g. for manufacturers remotely monitoring machine performance at customer sites
  • Speech recognition as example of federated learning
  • Anonymization and pseudonymization of medical text data using Natural Language Processing
  • Use of semantic anonymization of sensitive data with inference-based AI and active ontologies in the financial industry

Key words:

    • Anonymization of structured data
      • Approaches
        • Aggregation approach
          • Generalization, Microaggregation
          • k-anonymity, l-diversity, t-closeness (a minimal sketch follows after this block)
          • Mondrian algorithm, MDAV method (Maximum Distance to Average Vector)
        • Randomization approach
          • Adding noise
        • Synthetic approach
          • (Creating a synthetic model based on original data to generate “matching” synthetic data)
      • Attacks
        • Was personal data of a known person used to generate the anonymous data?
        • Which data in the anonymous data relates to personal data of a known person?
        • Predicting attributes of a known person
      • Static anonymization, Dynamic anonymization, Interactive anonymization
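A minimal sketch of the aggregation approach, using toy records where age and zip code are the quasi-identifiers (all values hypothetical, not from the paper): generalize the quasi-identifiers into coarser bins, then check that every remaining combination occurs at least k times.

```python
from collections import Counter

# Toy records: (age, zip, diagnosis); age and zip are quasi-identifiers.
records = [
    (34, "10115", "flu"), (36, "10117", "flu"),
    (41, "10245", "asthma"), (44, "10247", "asthma"),
]

def generalize(record):
    """Coarsen quasi-identifiers: 10-year age bands, 3-digit zip prefix."""
    age, zipcode, diagnosis = record
    band = age // 10 * 10
    return (f"{band}-{band + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(rows, k):
    """k-anonymity: every quasi-identifier combination occurs >= k times."""
    groups = Counter((age, zipcode) for age, zipcode, _ in rows)
    return all(count >= k for count in groups.values())

generalized = [generalize(r) for r in records]
print(is_k_anonymous(generalized, k=2))  # True: each group has two records
```

Note that k-anonymity alone does not prevent attribute disclosure when a group shares a single sensitive value (here, every member of a group has the same diagnosis), which is the gap l-diversity and t-closeness address.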
    • Pseudonymization
      • Format preserving encryption, Tokenization, Trusted third party, Pseudonymous Authentication (PAUTH), Oblivious transfer (tokenization sketch below)
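Of the techniques listed, tokenization via a keyed hash is simple to sketch. A minimal illustration using only the standard library (key handling simplified and names hypothetical; in a trusted-third-party setup the key or token table would remain with that party, not the data recipient):

```python
import hmac
import hashlib
import secrets

# Secret pseudonymization key, held separately from the pseudonymized data.
KEY = secrets.token_bytes(32)

def pseudonymize(identifier: str) -> str:
    """Keyed hash (HMAC-SHA256): deterministic for a given key, so records
    remain linkable, but the mapping cannot be reversed without the key."""
    return hmac.new(KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("jane.doe@example.com"))  # same input, same token
print(pseudonymize("jane.doe@example.com"))  # -> records stay linkable
```

Format preserving encryption differs in that the token keeps the shape of the original value (e.g. a 16-digit number maps to another 16-digit number), which keyed hashing does not provide.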
    • Anonymization of texts
      • Ensure that free text includes no identifying terms (e.g. via organizational measures)
      • Masking of identifying terms as part of post-processing
      • Structuring via Natural Language Processing (see the masking sketch below)
      • Caveat: Author might be identifiable based on writing style
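A minimal sketch of NLP-based masking, assuming spaCy with its small English model is installed (`pip install spacy`, then `python -m spacy download en_core_web_sm`); the entity types to mask are an illustrative choice, and, per the caveat above, masking terms does not conceal writing style:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

MASK_LABELS = {"PERSON", "ORG", "GPE", "DATE"}  # types treated as identifying

def mask_entities(text: str) -> str:
    """Replace recognized identifying entities with their type label."""
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # right to left, so offsets stay valid
        if ent.label_ in MASK_LABELS:
            out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
    return out

print(mask_entities("John Smith visited Berlin on 3 March 2020."))
# e.g. "[PERSON] visited [GPE] on [DATE]."
```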
    • Anonymization of multimedia data
    • Privacy via on-prem analysis and decentralization (see also: federated learning)
      • Homomorphic encryption: fully homomorphic, partially homomorphic, somewhat homomorphic
      • Secure multi-party computation (see the sketch below)
      • Garbled circuits
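Of these primitives, secure multi-party computation is the easiest to sketch. A toy additive secret-sharing sum over hypothetical salaries, standard library only: each party learns nothing beyond its own input, random-looking shares, and the published total.

```python
import secrets

P = 2**61 - 1  # prime modulus for share arithmetic

def share(value: int, n: int) -> list[int]:
    """Split a value into n random additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

salaries = [52_000, 61_000, 48_000]  # each party's private input
n = len(salaries)

# Each party i splits its value and sends share j to party j.
all_shares = [share(s, n) for s in salaries]

# Each party j sums the shares it received and publishes only that sum.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# Combining the published partials reveals only the total, not any input.
print(sum(partial_sums) % P)  # 161000
```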
    • Privacy risks related to machine learning and controls
      • Identification of persons
      • De-anonymization of data (e.g. of blurred images)
      • Membership inference (see the sketch below)
      • Model inversion
      • Defeating noise, and others
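A minimal sketch of a membership-inference attack of the confidence-thresholding kind, run on synthetic model outputs (not data from the paper): it exploits the tendency of overfit models to be more confident on their training points than on unseen ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def confidence_attack(probs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Guess 'member' whenever the top softmax score exceeds a threshold."""
    return probs.max(axis=1) > threshold

# Hypothetical softmax outputs: peaked on (memorized) training points,
# flatter on unseen test points.
train_probs = rng.dirichlet([0.05, 0.05, 0.05], size=100)
test_probs = rng.dirichlet([1.0, 1.0, 1.0], size=100)

print("flagged as members (train):", confidence_attack(train_probs).mean())
print("flagged as members (test): ", confidence_attack(test_probs).mean())
```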
    • Federated learning
      • (Moving algorithms to the local data – instead of moving data to a central algorithm; see the averaging sketch below)
      • (Local data doesn’t leave device)
      • AI models as personal data
      • Legal advantages of federated learning
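A minimal sketch of the central federated-learning step, federated averaging (client weights and sizes are hypothetical): only locally trained parameters travel to the server, weighted by each client's data volume, while the raw data stays on the devices.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Server-side federated averaging: combine local weight vectors,
    weighted by each client's number of local training samples."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Locally trained parameter vectors from three devices.
clients = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]

print(fed_avg(clients, sizes))  # new global model parameters
```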
    • Attacks and controls
      • Model inversion
        • Querying the trained AI model to reconstruct its training data
      • Membership inference
        • Was a given data point used to train the model?
      • Model extraction
        • “Stealing” the trained model – by cloning the behaviour and predictive capabilities of a given AI model
      • Adversarial examples (creating inputs that trigger unintended responses)
      • Countermeasures
        • Restrictions on outputs
        • Adversarial Regularization
        • Distillation
        • Differential Privacy (Laplace-mechanism sketch at the end of these notes)
        • Cryptography
        • Secure multi-party computation (MPC)
        • Federated machine learning
        • Differentially Private Data Synthesis (DIPS) (e.g. via Copula functions, Generative Adversarial Networks)
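As a closing illustration of the differential-privacy countermeasure, a minimal Laplace-mechanism sketch for a counting query (sensitivity 1; the count and epsilon are illustrative only):

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism: a counting query has sensitivity 1, so adding
    Laplace noise with scale 1/epsilon yields epsilon-differential privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(412, epsilon=0.5))  # noisy count; smaller epsilon, more noise
```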