How U.S. Copyright Law Applies to LLM Training

mrarup82, 6 hours ago


ANDREA BARTZ, CHARLES GRAEBER, and KIRK WALLACE JOHNSON v. ANTHROPIC PBC, retrieved on June 25, 2025, is part of HackerNoon’s Legal PDF Series. You can jump to any part in this filing here. This is part 4 of 10.

ANALYSIS

Section 107 of the Copyright Act identifies four factors for determining whether a given use of a copyrighted work is a fair use:

[T]he fair use of a copyrighted work . . . for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include —

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

These factors presuppose a “use.” So, at the threshold, a court must decide whether a “copyrighted [work] has been used in multiple ways,” then evaluate each. Warhol, 598 U.S. at 533. Uses do not turn on “the subjective intent of the user” but on “an objective inquiry into what use was made, i.e., what the user d[id] with the original work.” Id. at 544–45. A “use” should be construed narrowly enough to not “swallow” distinguishable infringing uses, much less categories of exclusive rights in toto. Id. at 541, 543 n.18, 546–48. Sometimes, the challenged copying involves just one use: In Perfect 10, Inc. v. Amazon.com, Inc., Google visited websites having full-sized images, made only reduced-sized copies, and incorporated those directly into its search engine — the sole use of the thumbnails being as “pointer[s]” to the images themselves. 508 F.3d 1146, 1157, 1160, 1165 (9th Cir. 2007). Sometimes, the copying involves many uses: In the Google Books cases, Google borrowed books from libraries, made both full-image and text-only copies, and incorporated different copies into different tools — one use being to reveal information “about those books,” another use being to provide the books to print-disabled patrons, and still another being to back up the print books if lost. Authors Guild v. Google, Inc., 804 F.3d 202, 217 (2d Cir. 2015) (quoted); Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 97, 101, 103 (2d Cir. 2014) (other cited uses).

Our parties debate an instructive decision. In American Geophysical Union v. Texaco Inc., Texaco employees used scientific articles in a central library, used copies of them in personal desk libraries, and used selected copies again in the scientific laboratory — the first use paid for, the second infringing, and the third plausibly fair but in fact a rare occurrence. 802 F. Supp. 1, 4–5, 14 (S.D.N.Y. 1992) (Judge Pierre Leval), aff’d, 60 F.3d 913, 918–19, 926 (2d Cir. 1994).

Here, our parties contest what use or uses are at issue. Anthropic contends it copied Authors’ books only for one use: Only to train LLMs. By contrast, Authors contend it did so for at least two uses: First to build a vast, central library of potentially useful content, and second to train specific LLMs using shifting sets and subsets of that content — over time selecting the more well-organized and well-expressed works for training. Authors also complain that the print-to-digital format change was itself an infringement not abridged as a fair use (Opp. 15, 25). Authors do not allege, however, that any LLM outputs infringing upon their works ever reached users of the public-facing Claude service. This order addresses each of the four factors in turn, pointing out how each applies to the training copies and to the purchased and pirated library copies. It concludes with an integrated analysis.

1. THE PURPOSE AND CHARACTER OF THE USE.

For a given use at issue, the first factor addresses “the purpose and character of th[at] use, including whether [it] is of a commercial nature or is for nonprofit educational purposes.” 17 U.S.C. § 107(1).

A. THE COPIES USED TO TRAIN SPECIFIC LLMS.

All agree that one use at issue was training LLMs to receive text inputs and return text outputs. More specifically, Anthropic used copies of Authors’ copyrighted works to iteratively map statistical relationships between every text-fragment and every sequence of text-fragments so that a completed LLM could receive new text inputs and return new text outputs as if it were a human reading prompts and writing responses. Authors further argue — and this order takes for granted — that such training entailed “memoriz[ing]” works by “compress[ing]” copies of those works into the LLM (Opp. 16–17; see Opp. Expert Zhao ¶ 74). The LLMs “memorize[d] A LOT, like A LOT” (Opp. Exh. 35 at -029109). Regardless, the “purpose and character” of using works to train LLMs was transformative — spectacularly so. To repeat and be clear: Authors do not allege that any LLM output provided to users infringed upon Authors’ works. Our record shows the opposite. Users interacted only with the Claude service, which placed additional software between the user and the underlying LLM to ensure that no infringing output ever reached the users. This was akin to the limits Google imposed on how many snippets of text from any one book could be seen by any one user through its Google Books service, preventing its search tool from devolving into a reading tool. Google, 804 F.3d at 222. Here, if the outputs seen by users had been infringing, Authors would have a different case. And, if the outputs were ever to become infringing, Authors could bring such a case. But that is not this case.

Instead, Authors challenge only the inputs, not the outputs, of these LLMs. They point to the fully trained LLMs and the Claude service only to shed light on how training itself uses copies of their works and the ways the Claude service could be used to produce still other works that would compete with their works. This order does the same. Authors’ arguments that the training use is not transformative are unavailing.

First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.

Second, to that last point, Authors further argue that the training was intended to memorize their works’ creative elements — not just their works’ non-protectable ones (Opp. 17). But this is the same argument. Again, Anthropic’s LLMs have not reproduced to the public a given work’s creative elements, nor even one author’s identifiable expressive style (assuming arguendo that these are even copyrightable). Yes, Claude has outputted grammar, composition, and style that the underlying LLM distilled from thousands of works. But if someone were to read all the modern-day classics because of their exceptional expression, memorize them, and then emulate a blend of their best writing, would that violate the Copyright Act? Of course not. Copyright does not extend to “method[s] of operation, concept[s], [or] principle[s]” “illustrated[ ] or embodied in [a] work.” 17 U.S.C. § 102(b); see, e.g., Nichols v. Universal Pictures Corp., 45 F.2d 119, 120–22 (2d Cir. 1930) (Judge Learned Hand) (stage properties and storytelling elements); Apple Comput., Inc. v. Microsoft Corp., 35 F.3d 1435, 1445 (9th Cir. 1994) (“user-friendly” design principles and elements); Swirsky v. Carey, 376 F.3d 841, 848 (9th Cir. 2004) (music theory principles and chord progressions).

Third, Authors next argue that computers nonetheless should not be allowed to do what people do. Authors cite a decision seeming to say as much (Opp. 16–17). But the judge there twice emphasized while discussing “purpose and character” of the use that what was trained was “not generative AI (AI that writes new content itself).” Rather, what was trained — using a proprietary system for finding court opinions in response to a given legal topic — was a competing AI tool for finding court opinions in response to a given legal topic. That was not transformative. Thomson Reuters Enter. Centre GmbH v. Ross Intell. Inc., 765 F. Supp. 3d 382, 398 (D. Del. 2025) (Judge Stephanos Bibas), appeal docketed, No. 25-8018 (3d Cir. Apr. 14, 2025).

A better analogue to our facts would be an AI tool trained — using court opinions, and briefs, law review articles, and the like — to receive legal prompts and respond with fresh legal writing. And, on facts much like those, a different court came out the other way. It found fair use. White v. W. Pub. Corp., 29 F. Supp. 3d 396, 400 (S.D.N.Y. 2014) (Judge Jed Rakoff).

The latter use stood sufficiently “orthogonal” to anything that any copyright owner rightly could expect to control. See Warhol, 598 U.S. at 538–40. It could thus be freed up for the copyist to use, “promot[ing] the progress of science and the arts, without diminishing the incentive to create.” Id. at 531 (emphasis added); see U.S. CONST. art. I, § 8, cl. 8.

In short, the purpose and character of using copyrighted works to train LLMs to generate new text was quintessentially transformative. Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different. If this training process reasonably required making copies within the LLM or otherwise, those copies were engaged in a transformative use.

The first factor favors fair use for the training copies. But that is not the only use at issue.

About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, retrieved on June 25, 2025, from storage.courtlistener.com, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.
