Machine Unlearning
Privacy laws such as the European Union's General Data Protection Regulation (GDPR), and similar legislation such as the California Consumer Privacy Act (CCPA) or Japan's Act on the Protection of Personal Information (APPI), define the rights of individuals with respect to their personal information and the obligations of organizations that collect it. Yet those organizations may be unintentionally violating those rights.
To take one concrete example, the GDPR grants EU citizens the right to request that their personal data be deleted (the "right to be forgotten") by a data controller, i.e., the organization that decides why and how personal data will be processed. Under the GDPR, personal data is defined as:
... any information that relates to an individual who can be directly or indirectly identified. Names and email addresses are obviously personal data. Location information, ethnicity, gender, biometric data, religious beliefs, web cookies, and political opinions can also be personal data. Pseudonymous data can also fall under the definition if it’s relatively easy to ID someone from it.
The obvious way to honor such a request is for the data controller to delete any stored data falling into the categories listed above. However, if personal data was used to train a machine learning model, the model itself can leak that data to anyone who queries it. The article What does GPT-3 “know” about me? is a great read for understanding what personal data a common machine learning model like GPT-3 can divulge about the individuals whose data it was trained on. From the article:
With a little prodding, GPT-3 told me Mat has a wife and two young daughters (correct, apart from the names), and lives in San Francisco (correct). It also told me it wasn’t sure if Mat has a dog.
A clean and conservative approach would be to retrain the model from scratch with the deleted data excluded from the training set; this approach is termed "exact data deletion". However, exact data deletion can be intractable given the time it takes to train such models in the first place, so there is a lot of interest in "approximate data deletion", which ideally achieves the same result without a full retraining.
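For concreteness, here is a minimal sketch of exact deletion with scikit-learn. The `user_ids` array and `forget_id` argument are hypothetical bookkeeping, not part of any cited API; the essential point is simply that the model is retrained from scratch on everything except the requester's rows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def exact_delete(X, y, user_ids, forget_id):
    """Retrain from scratch with one user's data removed (exact deletion).

    user_ids maps each training row to the individual it came from;
    forget_id identifies the individual requesting deletion.
    """
    keep = user_ids != forget_id      # drop every row belonging to that user
    model = LogisticRegression()
    model.fit(X[keep], y[keep])       # full retraining on the remaining data
    return model
```

By construction the resulting model has never seen the deleted data, but the cost is one full training run per deletion request, which is exactly what makes this approach intractable for large models.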
The idea of data deletion has been around for specific classes of models for a while; the paper by Ginart et al. (2019) covering k-means clustering models is one example. More recently, Izzo et al. (2021) proposed a general and efficient approach to approximate data deletion that is applicable to linear models and can additionally be applied to common non-linear models, such as deep learning models. Using their approach, Mat, and any other individual covered by the aforementioned privacy laws, can truly be forgotten.
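To give a flavor of how a model can be updated in place rather than retrained, here is a sketch for ordinary least squares. To be clear, this is not the method from Izzo et al.; it uses the classical Sherman-Morrison identity to remove one training point from the closed-form solution theta = (X^T X)^{-1} X^T y, which happens to be exact for this model while costing far less than a refit:

```python
import numpy as np

def fit_ols(X, y):
    """Closed-form least squares, caching the sufficient statistics."""
    A_inv = np.linalg.inv(X.T @ X)   # inverse Gram matrix, kept for later updates
    b = X.T @ y                      # X^T y
    return A_inv @ b, A_inv, b

def delete_point(A_inv, b, x_i, y_i):
    """Remove one training point (x_i, y_i) without retraining.

    Sherman-Morrison downdates the cached inverse Gram matrix:
    (A - x x^T)^{-1} = A^{-1} + (A^{-1} x)(A^{-1} x)^T / (1 - x^T A^{-1} x)
    """
    u = A_inv @ x_i
    A_inv_new = A_inv + np.outer(u, u) / (1.0 - x_i @ u)
    b_new = b - y_i * x_i            # strip point i's contribution from X^T y
    return A_inv_new @ b_new, A_inv_new, b_new
```

One can verify that the updated coefficients match a full refit on the remaining rows. Approximate deletion methods like the one in Izzo et al. pursue the same goal for a broader class of models, trading a small, quantifiable approximation error for a deletion cost far below that of retraining.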