Through its AI model called Claude 2, Anthropic allegedly infringes the copyright in many songs (specifically, in their lyrics). Hence the lawsuit filed by several music publishers (reportedly among the largest in the world).
The Verge reports the news today, 19 October (article by Emilia David), where you can also find a link to the complaint that initiated the proceedings.
I quote below only the passages on how Claude 2's training and output work, and then on where the infringement is said to lie.
<<6. Anthropic is in the business of developing, operating, selling, and licensing AI technologies. Its primary product is a series of AI models referred to as “Claude.” Anthropic builds its AI models by scraping and ingesting massive amounts of text from the internet and potentially other sources, and then using that vast corpus to train its AI models and generate output based on this copied text. Included in the text that Anthropic copies to fuel its AI models are the lyrics to innumerable musical compositions for which Publishers own or control the copyrights, among countless other copyrighted works harvested from the internet. This copyrighted material is not free for the taking simply because it can be found on the internet. Anthropic has neither sought nor secured Publishers’ permission to use their valuable copyrighted works in this way. Just as Anthropic does not want its code taken without its authorization, neither do music publishers or any other copyright owners want their works to be exploited without permission.
Anthropic claims to be different from other AI businesses. It calls itself an AI “safety and research” company, and it claims that, by training its AI models using a so-called “constitution,” it ensures that those programs are more “helpful, honest, and harmless.” Yet, despite its purportedly principled approach, Anthropic infringes on copyrights without regard for the law or respect for the creative community whose contributions are the backbone of Anthropic’s infringing service.
As a result of Anthropic’s mass copying and ingestion of Publishers’ song lyrics, Anthropic’s AI models generate identical or nearly identical copies of those lyrics, in clear violation of Publishers’ copyrights. When a user prompts Anthropic’s Claude AI chatbot to provide the lyrics to songs such as “A Change Is Gonna Come,” “God Only Knows,” “What a Wonderful World,” “Gimme Shelter,” “American Pie,” “Sweet Home Alabama,” “Every Breath You Take,” “Life Is a Highway,” “Somewhere Only We Know,” “Halo,” “Moves Like Jagger,” “Uptown Funk,” or any other number of Publishers’ musical compositions, the chatbot will provide responses that contain all or significant portions of those lyrics>>.
<<11. By copying and exploiting Publishers’ lyrics in this manner—both as the input it uses to train its AI models and as the output those AI models generate—Anthropic directly infringes Publishers’ exclusive rights as copyright holders, including the rights of reproduction, preparation of derivative works, distribution, and public display. In addition, because Anthropic unlawfully enables, encourages, and profits from massive copyright infringement by its users, it is secondarily liable for the infringing acts of its users under well-established theories of contributory infringement and vicarious infringement. Moreover, Anthropic’s AI output often omits critical copyright management information regarding these works, in further violation of Publishers’ rights; in this respect, the composers of the song lyrics frequently do not get recognition for being the creators of the works that are being distributed. It is unfathomable for Anthropic to treat itself as exempt from the ethical and legal rules it purports to embrace>>.
How the AI training works:
<<54. Specifically, Anthropic “trains” its Claude AI models how to generate text by taking the following steps:
a. First, Anthropic copies massive amounts of text from the internet and potentially other sources. Anthropic collects this material by “scraping” (or copying or downloading) the text directly from websites and other digital sources and onto Anthropic’s servers, using automated tools, such as bots and web crawlers, and/or by working from collections prepared by third parties, which in turn may have been harvested through web scraping. This vast collection of text forms the input, or “corpus,” upon which the Claude AI model is then trained.
b. Second, as it deems fit, Anthropic “cleans” the copied text to remove material it perceives as inconsistent with its business model, whether technical or subjective in nature (such as deduplication or removal of offensive language), or for other reasons. In most instances, this “cleaning” process appears to entirely ignore copyright infringements embodied in the copied text.
c. Third, Anthropic copies this massive corpus of previously copied text into computer memory and processes this data in multiple ways to train the Claude AI models, or establish the values of billions of parameters that form the model. That includes copying, dividing, and converting the collected text into units known as “tokens,” which are words or parts of words and punctuation, for storage. This process is referred to as “encoding” the text into tokens. For Claude, the average token is about 3.5 characters long.4
d. Fourth, Anthropic processes the data further as it “finetunes” the Claude AI model and engages in additional “reinforcement learning,” based both on human feedback and AI feedback, all of which may require additional copying of the collected text.
55. Once this input and training process is complete, Anthropic’s Claude AI models generate output consistent in structure and style with both the text in their training corpora and the reinforcement feedback. When given a prompt, Claude will formulate a response based on its model, which is a product of its pretraining on a large corpus of text and finetuning, including based on reinforcement learning from human feedback. According to Anthropic, “Claude is not a bare language model; it has already been fine-tuned to be a helpful assistant.”5 Claude works with text in the form of tokens during this processing, but the output is ordinary readable text>>.
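The steps the complaint describes in paragraph 54 (scraping a corpus, “cleaning” it, and “encoding” it into tokens) can be sketched in miniature. This is a hypothetical illustration, not Anthropic's actual code: the cleaning filter and the word-level tokenizer are toy stand-ins, since production models use subword tokenizers (such as BPE), which is how an average token can be about 3.5 characters long.

```python
import re

# Hypothetical sketch of the pipeline described in paragraph 54 of the
# complaint. None of this is Anthropic's actual code.

def clean(docs):
    """Step b: 'cleaning' the scraped corpus. Here this is just
    whitespace-normalised deduplication, a stand-in for whatever
    technical or subjective filters are really applied."""
    seen, out = set(), []
    for doc in docs:
        norm = re.sub(r"\s+", " ", doc).strip().lower()
        if norm and norm not in seen:
            seen.add(norm)
            out.append(doc)
    return out

def tokenize(text):
    """Step c: 'encoding' text into tokens. Splitting on words and
    punctuation is a toy stand-in for a real subword tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)

# Toy corpus: the second document is a near-duplicate of the first.
corpus = [
    "Some scraped lyrics here.",
    "Some  scraped lyrics  here.",
    "Other harvested text.",
]
cleaned = clean(corpus)                    # duplicates collapse to one copy
encoded = [tokenize(doc) for doc in cleaned]
```

Training (steps c and d) would then fit model parameters to predict these token sequences; the key point for the complaint is that each stage begins with a copy of the collected text.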
Where the infringement lies:
<<First, Anthropic engages in the wholesale copying of Publishers’ copyrighted lyrics as part of the initial data ingestion process to formulate the training data used to program its AI models.
Anthropic fuels its AI models with enormous collections of text harvested from the internet. But just because something may be available on the internet does not mean it is free for Anthropic to exploit to its own ends.
For instance, the text corpus upon which Anthropic trained its Claude AI models and upon which these models rely to generate text includes vast amounts of Publishers’ copyrighted lyrics, for which they own or control the exclusive rights.
Anthropic largely conceals the specific sources of the text it uses to train its AI models. Anthropic has stated only that “Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide,” and that the text on which Claude 2 was trained continues through early 2023 and is 90 percent English-language.6 The reason that Anthropic refuses to disclose the materials it has used for training Claude is because it is aware that it is copying copyrighted materials without authorization from the copyright owners.
Anthropic’s limited disclosures make clear that it has relied heavily on datasets (e.g., the “Common Crawl” dataset) that include massive amounts of content from popular lyrics websites such as genius.com, lyrics.com, and azlyrics.com, among other standard large text collections, to train its AI models.7
Moreover, the fact that Anthropic’s AI models respond to user prompts by generating identical or near-identical copies of Publishers’ copyrighted lyrics makes clear that Anthropic fed the models copies of those lyrics when developing the programs. Anthropic had to first copy these lyrics and process them through its AI models during training, in order for the models to subsequently disseminate copies of the lyrics as output.
Second, Anthropic creates additional unauthorized reproductions of Publishers’ copyrighted lyrics when it cleans, processes, trains with, and/or finetunes the data ingested into its AI models, including when it tokenizes the data. Notably, although Anthropic “cleans” the text it ingests to remove offensive language and filter out other materials that it wishes to exclude from its training corpus, Anthropic has not indicated that it takes any steps to remove copyrighted content.
By copying Publishers’ lyrics without authorization during this ingestion and training process, Anthropic violates Publishers’ copyrights in those works.
Third, Anthropic’s AI models disseminate identical or near-identical copies of a wide range of Publishers’ copyrighted lyrics, in further violation of Publishers’ rights.
Upon accessing Anthropic’s Claude AI models through Anthropic’s commercially available API or via its public website, users can request and obtain through Claude verbatim or near-verbatim copies of lyrics for a wide variety of songs, including copyrighted lyrics owned and controlled by Publishers. These copies of lyrics are not only substantially but strikingly similar to the original copyrighted works>>.
<<Claude’s output is likewise identical or substantially and strikingly similar to Publishers’ copyrighted lyrics for each of the compositions listed in Exhibit A. These works that have been infringed by Anthropic include timeless classics as well as today’s chart-topping hits, spanning a range of musical genres. And this represents just a small fraction of Anthropic’s infringement of Publishers’ works and the works of others, through both the input and output of its AI models.
Anthropic’s Claude is also capable of generating lyrics for new songs that incorporate the lyrics from existing copyrighted songs. In these cases, Claude’s output may include portions of one copyrighted work, alongside portions of other copyrighted works, in a manner that is entirely inconsistent and even inimical to how the songwriter intended them.
Moreover, Anthropic’s Claude also copies and distributes Publishers’ copyrighted lyrics even in instances when it is not asked to do so. Indeed, when Claude is prompted to write a song about a given topic—without any reference to a specific song title, artist, or songwriter—Claude will often respond by generating lyrics that it claims it wrote that, in fact, copy directly from portions of Publishers’ copyrighted lyrics>>.
<<In other words, Anthropic infringes Publishers’ copyrighted lyrics not only in response to specific requests for those lyrics. Rather, once Anthropic copies Publishers’ lyrics as input to train its AI models, those AI models then copy and distribute Publishers’ lyrics as output in response to a wide range of more generic queries related to songs and various other subject matter>>.