Paper: Bitkom: Anonymisierung und Pseudonymisierung von Daten für Projekte des maschinellen Lernens

Anonymization and Pseudonymization of Data for Machine Learning Projects

https://www.bitkom.org/sites/default/files/2020-10/201002_lf_anonymisierung-und-pseudonymisierung-von-daten.pdf

Examples given:

  • Processing of geolocation profiles (movements)
  • Google’s COVID-19 Community Mobility Reports
  • De-coupled pseudonyms, e.g. for manufacturers remotely monitoring machine performance at customer sites
  • Speech recognition as example of federated learning
  • Anonymization and pseudonymization of medical text data using Natural Language Processing
  • Use of semantic anonymization of sensitive data with inference-based AI and active ontologies in the financial industry

Key words:

    • Anonymization of structured data
      • Approaches
        • Aggregation approach
          • Generalization, Microaggregation
          • k-anonymity, l-diversity, t-closeness (a minimal sketch follows after this block)
          • Mondrian algorithm, MDAV method (Maximum Distance to Average Vector)
        • Randomization approach
          • Adding noise
        • Synthetic approach
          • (Creating a synthetic model based on original data to generate “matching” synthetic data)
      • Attacks
        • Was personal data of a known person used to generate the anonymous data?
        • Which data in the anonymous data relates to personal data of a known person?
        • Predicting attributes of a known person
      • Static anonymization, Dynamic anonymization, Interactive anonymization
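A minimal sketch of the aggregation approach, using toy records where age and zip code are the quasi-identifiers (all values hypothetical, not from the paper): generalize the quasi-identifiers into coarser bins, then check that every remaining combination occurs at least k times.

```python
from collections import Counter

# Toy records: (age, zip, diagnosis); age and zip are quasi-identifiers.
records = [
    (34, "10115", "flu"), (36, "10117", "flu"),
    (41, "10245", "asthma"), (44, "10247", "asthma"),
]

def generalize(record):
    """Coarsen quasi-identifiers: 10-year age bands, 3-digit zip prefix."""
    age, zipcode, diagnosis = record
    band = age // 10 * 10
    return (f"{band}-{band + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(rows, k):
    """k-anonymity: every quasi-identifier combination occurs >= k times."""
    groups = Counter((age, zipcode) for age, zipcode, _ in rows)
    return all(count >= k for count in groups.values())

generalized = [generalize(r) for r in records]
print(is_k_anonymous(generalized, k=2))  # True: each group has two records
```

Note that k-anonymity alone does not prevent attribute disclosure when a group shares a single sensitive value (here, every member of a group has the same diagnosis), which is the gap l-diversity and t-closeness address.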
    • Pseudonymization
      • Format preserving encryption, Tokenization, Trusted third party, Pseudonymous Authentication (PAUTH), Oblivious transfer (tokenization sketch below)
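Of the techniques listed, tokenization via a keyed hash is simple to sketch. A minimal illustration using only the standard library (key handling simplified and names hypothetical; in a trusted-third-party setup the key or token table would remain with that party, not the data recipient):

```python
import hmac
import hashlib
import secrets

# Secret pseudonymization key, held separately from the pseudonymized data.
KEY = secrets.token_bytes(32)

def pseudonymize(identifier: str) -> str:
    """Keyed hash (HMAC-SHA256): deterministic for a given key, so records
    remain linkable, but the mapping cannot be reversed without the key."""
    return hmac.new(KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("jane.doe@example.com"))  # same input, same token
print(pseudonymize("jane.doe@example.com"))  # -> records stay linkable
```

Format preserving encryption differs in that the token keeps the shape of the original value (e.g. a 16-digit number maps to another 16-digit number), which keyed hashing does not provide.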
    • Anonymization of texts
      • Ensure that free text includes no identifying terms (e.g. via organizational measures)
      • Masking of identifying terms as part of post-processing
      • Structuring via Natural Language Processing (see the masking sketch below)
      • Caveat: Author might be identifiable based on writing style
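A minimal sketch of NLP-based masking, assuming spaCy with its small English model is installed (`pip install spacy`, then `python -m spacy download en_core_web_sm`); the entity types to mask are an illustrative choice, and, per the caveat above, masking terms does not conceal writing style:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

MASK_LABELS = {"PERSON", "ORG", "GPE", "DATE"}  # types treated as identifying

def mask_entities(text: str) -> str:
    """Replace recognized identifying entities with their type label."""
    doc = nlp(text)
    out = text
    for ent in reversed(doc.ents):  # right to left, so offsets stay valid
        if ent.label_ in MASK_LABELS:
            out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
    return out

print(mask_entities("John Smith visited Berlin on 3 March 2020."))
# e.g. "[PERSON] visited [GPE] on [DATE]."
```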
    • Anonymization of multimedia data
    • Privacy via on-prem analysis and decentralization (see also: federated learning)
      • Homomorphic encryption: fully homomorphic, partially homomorphic, somewhat homomorphic
      • Secure multi-party computation (see the sketch below)
      • Garbled circuits
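Of these primitives, secure multi-party computation is the easiest to sketch. A toy additive secret-sharing sum over hypothetical salaries, standard library only: each party learns nothing beyond its own input, random-looking shares, and the published total.

```python
import secrets

P = 2**61 - 1  # prime modulus for share arithmetic

def share(value: int, n: int) -> list[int]:
    """Split a value into n random additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

salaries = [52_000, 61_000, 48_000]  # each party's private input
n = len(salaries)

# Each party i splits its value and sends share j to party j.
all_shares = [share(s, n) for s in salaries]

# Each party j sums the shares it received and publishes only that sum.
partial_sums = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]

# Combining the published partials reveals only the total, not any input.
print(sum(partial_sums) % P)  # 161000
```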
    • Privacy risks related to machine learning and controls
      • Identification of persons
      • De-anonymization of data (e.g. of blurred images)
      • Membership inference (see the sketch below)
      • Model inversion
      • Defeating noise, and others
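A minimal sketch of a membership-inference attack of the confidence-thresholding kind, run on synthetic model outputs (not data from the paper): it exploits the tendency of overfit models to be more confident on their training points than on unseen ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def confidence_attack(probs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Guess 'member' whenever the top softmax score exceeds a threshold."""
    return probs.max(axis=1) > threshold

# Hypothetical softmax outputs: peaked on (memorized) training points,
# flatter on unseen test points.
train_probs = rng.dirichlet([0.05, 0.05, 0.05], size=100)
test_probs = rng.dirichlet([1.0, 1.0, 1.0], size=100)

print("flagged as members (train):", confidence_attack(train_probs).mean())
print("flagged as members (test): ", confidence_attack(test_probs).mean())
```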
    • Federated learning
      • (Moving algorithms to the local data – instead of moving data to a central algorithm; see the averaging sketch below)
      • (Local data doesn’t leave device)
      • AI models as personal data
      • Legal advantages of federated learning
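A minimal sketch of the central federated-learning step, federated averaging (client weights and sizes are hypothetical): only locally trained parameters travel to the server, weighted by each client's data volume, while the raw data stays on the devices.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Server-side federated averaging: combine local weight vectors,
    weighted by each client's number of local training samples."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Locally trained parameter vectors from three devices.
clients = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([1.0, 1.0])]
sizes = [100, 300, 600]

print(fed_avg(clients, sizes))  # new global model parameters
```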
    • Attacks and controls
      • Model inversion
        • Querying the trained AI model to reconstruct its training data
      • Membership inference
        • Was a given data point used to train the model?
      • Model extraction
        • “Stealing” the trained model – by cloning the behaviour and predictive capabilities of a given AI model
      • Adversarial examples (creating inputs that trigger unintended responses)
      • Countermeasures
        • Restrictions on outputs
        • Adversarial Regularization
        • Distillation
        • Differential Privacy (Laplace-mechanism sketch at the end of these notes)
        • Cryptography
        • Secure multi-party computation (MPC)
        • Federated machine learning
        • Differentially Private Data Synthesis (DIPS) (e.g. via Copula functions, Generative Adversarial Networks)
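As a closing illustration of the differential-privacy countermeasure, a minimal Laplace-mechanism sketch for a counting query (sensitivity 1; the count and epsilon are illustrative only):

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count: int, epsilon: float) -> float:
    """Laplace mechanism: a counting query has sensitivity 1, so adding
    Laplace noise with scale 1/epsilon yields epsilon-differential privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(dp_count(412, epsilon=0.5))  # noisy count; smaller epsilon, more noise
```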