In a class action lawsuit filed in federal court in California, authors Paul Tremblay and Mona Awad allege that OpenAI violated copyright law by using their works to train ChatGPT without authorization.
The suit, filed in the U.S. District Court for the Northern District of California in San Francisco, alleges that ChatGPT, a large language model, is trained by copying vast volumes of text, which make up its training dataset, and extracting expressive information from them.
The complaint claims that Tremblay and Awad, two authors residing in Massachusetts, did not authorize the use of their works in ChatGPT’s training data. Despite this, it says, ChatGPT was trained on their copyrighted material.
According to the 17-page complaint, ChatGPT was trained on the plaintiffs’ copyrighted works and can generate summaries of them. Through ChatGPT, the complaint says, the defendants derive substantial commercial benefit and profit from the plaintiffs’ and class members’ copyrighted material.
The lawsuit refers to a paper released in June 2018 by OpenAI, in which the company said it had trained its GPT-1 model on BookCorpus, a dataset of more than 7,000 unique unpublished books in genres such as Adventure, Fantasy, and Romance.
According to OpenAI, a book dataset is valuable because it contains large blocks of text that may be used to train a generative model to make predictions based on context. The complaint states that OpenAI, Google, Amazon, and other companies have all trained language models using BookCorpus.
Intellectual property law expert Andres Guadamuz from the University of Sussex told The Guardian that this is the first copyright-related lawsuit against OpenAI.
Books are well suited to training large language models because they contain high-quality, well-edited, long-form writing, attorneys Joseph Saveri and Matthew Butterick noted. Books, in their view, represent the pinnacle of human knowledge preservation.
The complaint also alleges that the defendants breached their duties by collecting, maintaining, and controlling the plaintiffs’ and class members’ infringed works without permission, and by building and running systems, including ChatGPT, trained on those works. It accuses them of acting carelessly, recklessly, and negligently.