Machine Unlearning

Wednesday, September 21, 2022

Privacy laws such as the General Data Protection Regulation (GDPR) in the European Union (EU), the California Consumer Privacy Act (CCPA), and the Act on the Protection of Personal Information (APPI) define the rights that individuals have over their personal information and over how organizations collect it. Even so, these organizations might be unintentionally disrespecting those rights.

Take the GDPR as an example. It grants EU citizens the right to request that their personal data be deleted by a data controller, that is, the organization that decides why and how personal data will be processed. Under GDPR, personal data is defined as:

... any information that relates to an individual who can be directly or indirectly identified. Names and email addresses are obviously personal data. Location information, ethnicity, gender, biometric data, religious beliefs, web cookies, and political opinions can also be personal data. Pseudonymous data can also fall under the definition if it’s relatively easy to ID someone from it.

The obvious way to serve such a request is for the data controller to delete any data falling into the categories listed above. However, if personal data was used to train a machine learning model, the model itself can leak that data whenever it is used. The article What does GPT-3 “know” about me? is a great read for understanding what common machine learning models like GPT-3 can divulge about individuals whose personal data was used to train them. From the article:

With a little prodding, GPT-3 told me Mat has a wife and two young daughters (correct, apart from the names), and lives in San Francisco (correct). It also told me it wasn’t sure if Mat has a dog.

A clean and conservative approach would be to retrain the model with the deleted data excluded from the training set. This approach is termed "exact data deletion". However, exact data deletion can be intractable given the time it takes to train such models in the first place. Given this challenge, there is a lot of interest in "approximate data deletion", which ideally achieves the same result without a full retraining.

The idea of data deletion has been around for specific classes of models for a while. The paper by Ginart et al. 2019 covering K-means clustering models is one example. More recently, Izzo et al. 2021 have discussed a general and efficient approach to approximate data deletion which is applicable to all linear models. In addition, their approach can be applied to common non-linear models, such as deep learning models. Using their approach, Mat and any other individuals covered by the aforementioned privacy laws can truly be forgotten.
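To make the difference between retraining and a cheaper update concrete, here is a minimal sketch in Python. It is not the method from Izzo et al. or Ginart et al.; it is a toy case I am assuming for illustration: a one-variable least-squares line, which depends on its training data only through a handful of running sums. Because of that, a data point can be forgotten by subtracting its contribution from those sums, and the result matches a full retrain. The class name LinearFit, the helper fit, and the sample points are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinearFit:
    """Least-squares line y = slope * x + intercept, kept as running sums."""
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x * x
    sxy: float = 0.0  # sum of x * y

    def add(self, x: float, y: float) -> None:
        # Learn from one data point by adding its contribution to each sum.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def remove(self, x: float, y: float) -> None:
        # Forget one data point by subtracting the same contribution.
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def coefficients(self) -> tuple[float, float]:
        # Closed-form least-squares solution computed from the sums.
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


def fit(points: list[tuple[float, float]]) -> LinearFit:
    # "Training" here is a single pass over the data.
    model = LinearFit()
    for x, y in points:
        model.add(x, y)
    return model


# Made-up training data; the third point belongs to the person to forget.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
to_forget = points[2]

# Exact data deletion: retrain from scratch without the deleted point.
retrained = fit([p for p in points if p != to_forget]).coefficients()

# Cheaper update: start from the trained model and remove the point.
model = fit(points)
model.remove(*to_forget)
updated = model.coefficients()

print("retrained:", retrained)
print("updated:  ", updated)
```

In this toy case the update costs the same no matter how large the training set is, while retraining costs a full pass over the data. Models like GPT-3 have no such compact summary of their training data, which is why approximate methods are of interest.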


Fair Use Exceptions For AI

Friday, September 16, 2022

In AI Copyright Violations, I made the assertion that the creators of AI powered content generation tools are likely guilty of copyright infringement of the content that is used to train the algorithms. I would like to justify that assertion as best as I can while acknowledging that I am not a lawyer and that I am only writing about copyright law within the United States.

Copyright ownership gives the creator of an original work exclusive rights to use the work. There are several limitations on exclusive rights within the copyright law of the United States, but the only notable limitation for the purposes of this post is Section 107, which outlines the policy of fair use: a set of rules for using copyrighted content without being liable for copyright infringement.

According to §107. Limitations on exclusive rights: Fair use:

... the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include:

  1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
  2. The nature of the copyrighted work.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
  4. The effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.

So is the usage of copyrighted content to train AI powered content generation tools fair use based on the above factors?

The purpose and character of the use

To summarize the opinions stated in Campbell v. Acuff-Rose Music, the more transformative content is by "altering the first with new expression, meaning, or message", the less significant the other factors will be in determining whether the use is fair use. The transformative nature of AI is an interesting debate on its own merits. There are many valid observations of AI memorizing data (take OpenAI's GPT as one example). Conversely, the ability to generate new content sounds entirely different in terms of the value that it delivers. Regardless, transformative content is not disqualified from fair use merely because it happens to be profitable.

The nature of the copyrighted work

Whether a copyrighted work is factual or creative, and whether it is published or unpublished, are factors in determining whether usage of that work qualifies as fair use. Usage of factual content that has been published is more likely to fall under fair use. In the context of training AI models, this factor appears to be relevant mainly when models are trained on unpublished work.

The amount and substantiality of the portion used in relation to the copyrighted work

While there are no guidelines for determining what amount of usage is substantial, it seems clear that usage in the context of training AI models is substantial given that the entire work is consumed during the process of training.

The effect of the use upon the potential market for or value of the copyrighted work

To summarize the opinions stated in Nunez v. Caribbean Int'l News Corp, usage which monetizes copyrighted work or otherwise negatively affects the copyright holder's ability to monetize their work is less likely to qualify as fair use. This is likely the most significant factor in the context of training AI models. Companies such as OpenAI and Midjourney expect to monetize their models, and the usage of these models may drastically reduce a creator's ability to monetize their own content.

It is an interesting idea that increasing demand for a copyright holder's work while monetizing it at the same time can result in the use being deemed fair use. The trouble with AI content generation tools, though, is that they do not provide attribution to the sources that were used to train the model. Therefore, demand for the copyright holder's work is likely only ever diminished.

Conclusion

Based on the factors in Section 107, the usage of copyrighted content to train models used in AI powered content generation tools is substantial and highly monetizable, while also reducing a copyright holder's ability to monetize their work directly. The transformative nature of AI generated content does not seem significant enough to qualify this use as fair use, at least in an ethical sense.


AI Copyright Violations

Wednesday, September 14, 2022

This week I have been spending a good amount of time thinking about the impact of AI on creators. This was initially prompted by the discussion on the Dithering Podcast, AI Illustrators, and then by Ben Thompson in his Monday article, The AI Unbundling. I recommend reading the articles, but I will summarize and share my thoughts below.

Both articles discuss how Charlie Warzel used Midjourney, an AI powered image generation tool, to illustrate his article Where Does Alex Jones Go From Here? The use of AI generated content drew a bit of a mob reaction, and Charlie later issued an apology in I Went Viral in the Bad Way. In Charlie's words:

I was caught up in my own work and life responsibilities and trying to get my newsletter published in a timely fashion. I went to Getty and saw the same handful of photos of Alex Jones, a man who I know enjoys when his photo is plastered everywhere. I didn’t want to use the same photos again, nor did I want to use his exact likeness at all. I also, selfishly, wanted the piece to look different from the 30 pieces that had been published that day about Alex Jones and the Sandy Hook defamation trial. All of that subconsciously overrode all the complicated ethical issues around AI art that I was well apprised of.

What are the complicated ethical issues around AI art, you ask? A salient point made in Charlie's apology is as follows:

DALL-E is trained on the creative work of countless artists, and so there’s a legitimate argument to be made that it is essentially laundering human creativity in some way for commercial product.

This is a very valid concern, to which the billions of people who have their data used to build commercial products should be very sympathetic. Legislation and licensing to protect a creator's content should be discussed as widely as protections for a user's privacy. Charlie continues:

Like others, I also have questions about the corpus used to train these art tools and the possibility that they are using a great deal of art from both big-name and lesser-known artists without any compensation or disclosure to those artists. (I reached out to Midjourney to ask some clarifying questions as to how they choose the corpus of data to train the tool, and they didn’t respond.)

While declining to respond is not an admission of guilt to copyright infringement, my understanding of how these things are usually done makes me very suspicious of companies creating AI powered content generation tools. It is all too easy to scrape Google Images indiscriminately for content to train AI models with no regard for whether an image is subject to copyright.

I see a clear parallel between the use of a user's data and the use of a creator's content to power commercial AI products. It seems to be a significant ethical (and probably legal) problem, and it is also difficult in a technical sense because it is very challenging to back out which data points were used to train a model after the fact. I plan on noodling on this in the future.

iidBlog is written by me, Josh Howard, a software engineer currently working as a Senior Engineering Manager at Starburst Data in Atlanta, Georgia, US.