DATABRICKS, INC. ACCUSED OF COPYRIGHT INFRINGEMENT
In May 2023, Mosaic ML, Inc. (Mosaic)—now a Databricks, Inc. (Databricks) subsidiary—released its MosaicML Pretrained Transformer (MPT) series of large language models (LLMs). LLMs are artificial intelligence software programs designed to emit convincingly naturalistic text outputs in response to user prompts.
Mosaic trained its MPT LLMs by copying an enormous quantity of textual works, extracting protected expression from these works, and transforming that protected expression into a large set of numbers called weights that are stored within the models. Much of the training dataset’s material, however, comes from copyrighted works that Mosaic copied without consent, credit, and compensation.
Mosaic announced the first MPT LLM to release publicly, MPT-7B, in a blog post. In the post, Mosaic described MPT-7B’s training dataset as a “Mosaic-curated mix of sources” that “includes elements of the [then-]recently released RedPajama dataset.” RedPajama is hosted on the Hugging Face website, which indicates that RedPajama’s “Books” component is actually a copy of the “Books3” dataset. Books3 derives from a copy of the contents of the Bibliotik collection of ebooks and other electronic resources. Bibliotik is one of multiple notorious “shadow library” websites that host and distribute vast quantities of unlicensed copyrighted material in violation of the U.S. Copyright Act.
The Books3 dataset was available from Hugging Face until October 2023. At that time, it was removed, and a message was posted in its place stating that the dataset “is defunct and no longer accessible due to reported copyright infringement.” Thus, Mosaic has admitted training its MPT models in a way that directly infringes the copyrights of the plaintiffs.
Also, in June 2023, Mosaic released MPT-30B, another model in the MPT series of LLMs. In a table describing the composition of the MPT-30B training dataset, Mosaic admitted once again that a large quantity of training data came from RedPajama—Books. By copying the RedPajama—Books dataset, Mosaic necessarily copied Books3, thereby again directly infringing authors’ copyrights.
Since it was acquired by Databricks in July 2023, Mosaic has allegedly continued to make copies of these infringed works for LLM training and other commercial purposes.

CASE FILED
On March 8, 2024, the firm filed a lawsuit on behalf of plaintiff and class-member authors who own registered copyrights in certain books that were included in the training dataset that Mosaic has admitted copying multiple times to train MPT LLMs. Plaintiffs and class members never authorized Mosaic to use their copyrighted works as training material. Mosaic has commercially benefitted from these acts of massive copyright infringement. As Mosaic’s parent company, Databricks also has commercially benefitted from these acts. The case, O’Nan v. Databricks, Inc., in the United States District Court for the Northern District of California, seeks damages for the authors and an injunction to prevent Databricks and Mosaic from further infringing on their copyrights. The lawsuit also seeks destruction or other reasonable disposition of works that Mosaic made or used in violation of the exclusive rights of plaintiffs and the class.
"Artificial intelligence is changing every aspect of the modern world. The rights of authors such as these must be recognized and protected against unlawful theft and fraud," said firm founder, Joseph Saveri. “MPT infringes these authors’ rights and endangers their livelihood. This case is part of a larger fight for preserving ownership rights for all artists and other creators."
