CASE BACKGROUND
Salesforce, Inc. (Salesforce) provides cloud-based services to its clients, with a particular focus on sales and e-commerce. In June 2023, it released its XGen series of large language models (LLMs): artificial intelligence software designed to emit convincingly naturalistic text outputs in response to user prompts. XGen is trained by copying an enormous quantity of textual works and then feeding these copies into the model. This input material is called the training dataset.
Once the LLM has copied and ingested the textual works in the training dataset, it can emit convincing simulations of natural written language in response to user prompts. Whenever an LLM generates text output in response to a user prompt, it is performing a computation with the goal of imitating the protected expression ingested from the training dataset.
Salesforce allegedly pirated hundreds of thousands of copyrighted books to develop its XGen series LLMs. The training dataset for these models consists of the RedPajama and The Pile datasets, which contain copies of these copyrighted books.
Plaintiffs and class members are authors. They own registered copyrights in certain books that were included in the RedPajama and The Pile datasets that Salesforce used to develop the XGen models. Plaintiffs and the class never authorized Salesforce to download, copy, store, or use their copyrighted works. Likewise, Salesforce has never compensated plaintiffs and class members for downloading, copying, storing, or using their copyrighted works.
Salesforce benefited commercially from its acts of massive copyright infringement, including by securing investments and contracts with customers for use of its LLMs through its Agentforce AI platform. Through the above acts, it has infringed plaintiffs’ copyrighted works, and it continues to do so by storing, copying, using, and processing the datasets containing copies of plaintiffs’ and the class’s copyrighted books.