Does Training AI Necessarily Entail a Copyright Violation?
Few technologies are more heavily featured in the news, or more controversial, than Artificial Intelligence (AI).[1] This post focuses on a subset of AI, Large Language Models (LLMs), which are computer algorithms that synthesize massive amounts of data in order to “learn” how human language works and recreate it.[2]
They work by synthesizing data, categorizing it, drawing connections between different types of works, determining the defining features of each type from those connections, and using that “understanding” to generate original works in the style of the works they synthesized. This process is called training.[3] After training, LLMs are capable of astonishing tasks: OpenAI’s ChatGPT 4.0 can summarize the plot of any movie, hold natural conversations with human beings, write songs, and much more.[4] In spite of the benefits they already offer and could offer in the future, LLMs have met significant social and legal controversy, particularly at the intersection of their training processes and copyright law.[5]
The Copyright Act of 1976 grants copyright owners “a bundle of exclusive rights, including the rights to ‘reproduce the copyrighted work in copies’ and ‘to prepare derivative works based upon the copyrighted work.’”[6] To establish copyright infringement, a plaintiff must prove (1) ownership of a valid copyright, and (2) that the copyrighted material was unlawfully copied.[7] Unlawful copying itself consists of two further elements: (1) actual copying, established “either by direct…or indirect evidence,” and (2) “substantial similarity” between the allegedly infringing work and the original copyrighted work.[8] Substantial similarity “requires that the copying [be] quantitatively and qualitatively sufficient to support the legal conclusion that infringement (actionable copying) has occurred.”[9]
In cases concerning the data used to train LLMs, copying has so far been relatively easy to establish. For example, at the summary judgment stage of the Thomson Reuters case,[10] the court held as a matter of law that copying had occurred because the defendant admitted to downloading and hosting the plaintiff’s copyrighted works on its servers without permission, stating that “[s]imply hosting a copy on a server might not seem like copying, but it is.”[11] Because training LLMs requires massive amounts of data, and such amounts of data are most easily acquired from other parties rather than created fresh, LLMs can be hard to train without downloading and hosting somebody else’s copyrighted data on a server.[12]
With this in mind, the training of LLMs appears to be at real risk of infringing copyright. Unsurprisingly, there has been a flood of litigation over alleged copyright infringement during the training of LLMs.[13] Since the technology is new, most of the litigation is still pending, with final rulings on many aspects of copyright infringement yet to come. The Thomson Reuters suit is one of the more advanced, but the 2023 summary judgment order declined to rule on whether substantial similarity sufficient to prove infringement existed.[14]
Another interesting aspect of the Thomson Reuters case is the potential application of the Fair Use defense. The Fair Use defense is a balancing test which, when satisfied, provides a complete defense to an otherwise valid copyright infringement claim.[15] It considers four factors: “(1) the purpose and character of the use, (2) the nature of the copyrighted work, (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and (4) the effect of the use upon the potential market for the copyrighted work.”[16] In that case, genuine disputes of material fact existed for each factor, so the court denied the motion for summary judgment, allowing the dispute to proceed to trial.[17] Of note, while the court presented a persuasive analysis of each factor, identifying considerations both for and against a finding of fair use, the fourth factor presented “a hotly debated question: Is it in the public benefit to allow AI to be trained with copyrighted material?”[18]
Courts have yet to answer that question, but many inside and outside the courts are seriously considering it.[19] Given that copying is likely to be relatively easy to prove in cases of AI training, it will be interesting to see whether the courts apply the Fair Use defense and, if so, on what grounds they justify that decision.
Footnotes