Data Strategy

Copyright Office Part 3: AI Training on Copyrighted Works Is Not Clearly Fair Use

Jillian Bommarito

On May 9, 2025, the U.S. Copyright Office released Part 3 of its AI report, Copyright and Artificial Intelligence, Part 3: Generative AI Training. The headline is simple enough to fit on a slide, which is probably why people keep pretending the answer is simple: AI training on copyrighted works is not clearly fair use.

That is the useful part of the report. Not because it delivers a neat one-line rule. It does the opposite. It says the legal analysis is fact-specific, the market effects can be serious, and the era of “we scraped it, so it’s fine” is not a legal theory. It is a wish.

The Copyright Office is not saying every use of copyrighted material in AI training is infringing. That would be too easy, and courts dislike easy answers almost as much as litigants do.

The Office’s view is more nuanced: some training uses may qualify as fair use, and some will not. On one end of the spectrum, noncommercial research or analysis that does not reproduce protected expression in outputs is more likely to be treated favorably. On the other end, using expressive works from plainly unauthorized sources to generate commercial outputs that compete in the market is a much harder sell. Especially when licensing is available.

The report also makes an important point that too many vendors still want to skate past: the fair use analysis is not abstract. It turns on what works were used, from what source, for what purpose, and with what controls on the outputs. In other words, provenance matters. A lot.

The Office’s AI hub confirms the report is a pre-publication version released on May 9, with no substantive change expected in the final version. So this is not a stray comment from a staffer on a panel. It is the Register’s current view of the landscape.

Why this matters now

Because the market has spent the last two years behaving as if “training data” is a magical phrase that dissolves copyright into compute.

That was always a little optimistic.

The Office points to two pressure points that should make every AI team and every buyer pause. First, the copying involved in training can harm the market for copyrighted works when a model generates outputs that substitute for them. Second, even when outputs are not literally close copies, they can still dilute the market by producing stylistically similar material. That is not a theoretical harm. That is a business harm.

And then there is the licensing question. The report notes that voluntary licensing is already happening in some sectors, and that it appears likely to develop in others. Which is a polite way of saying: if a market exists, the law notices.

That matters because fair use is not a permanent escape hatch. It is a balancing test. The Office is basically telling everyone to stop asking whether the entire category is blessed and start asking whether their particular use can survive a real analysis.

That is the part people hate. It requires work.

The uncomfortable part is the data

This is where the conversation stops being ideological and starts being operational.

If you do not know what went into your model, you do not know your exposure. If you do not know your exposure, you cannot credibly say your system is compliant. And if your answer to that is “well, the vendor says it’s fine,” then congratulations, you have outsourced risk to the people with the strongest incentive to minimize it.

A crawl log is not a rights analysis. A dataset is not magic because it is large. And “publicly available” is not the same as “free to use for any purpose whatsoever.”

The Copyright Office’s report does not create a new doctrine out of thin air. It makes the existing one harder to ignore. It also reinforces a point that comes up constantly in diligence: source quality and legal provenance are now part of model quality.

That is true whether you are building an internal copilot, a customer-facing product, or a foundation model that is supposed to power a whole platform. If the training set is dirty, the legal risk is dirty too. The clever label on the front of the box does not change that.

What organizations should do

This is where data strategy stops being a presentation and starts being a control environment.

If you are building or buying AI systems, you should be doing at least four things now:

  • Inventory training sources and identify where rights were actually cleared, licensed, or asserted.
  • Separate uses by purpose, because research, internal tools, and commercial deployment do not live in the same risk bucket.
  • Preserve provenance and lineage so you can explain what was used, when, and under what terms.
  • Add output controls and monitoring so the model does not become a machine for reproducing protected expression.
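The first three items above amount to keeping a structured provenance manifest rather than a crawl log. As a minimal sketch, here is one way to start: a record per dataset with its rights basis and intended purpose, plus a crude triage function. The categories and thresholds are illustrative assumptions, not a legal taxonomy, and nothing here substitutes for an actual rights analysis.

```python
from dataclasses import dataclass
from enum import Enum


class RightsBasis(Enum):
    """On what basis do we claim the right to use this data? (Illustrative.)"""
    LICENSED = "licensed"            # negotiated or purchased license on file
    PUBLIC_DOMAIN = "public_domain"  # verified out-of-copyright or government works
    FAIR_USE_THEORY = "fair_use_theory"  # a documented, use-specific argument
    UNKNOWN = "unknown"              # scraped, inherited, or vendor-asserted


class Purpose(Enum):
    """Uses live in different risk buckets, per the checklist above."""
    RESEARCH = "research"
    INTERNAL_TOOL = "internal_tool"
    COMMERCIAL = "commercial"


@dataclass
class DatasetRecord:
    name: str
    source: str            # where the data came from (URL, vendor, archive)
    rights_basis: RightsBasis
    purpose: Purpose
    license_terms: str = ""  # cite the actual terms, not "publicly available"


def risk_bucket(record: DatasetRecord) -> str:
    """Crude triage, not legal advice: cleared rights are low risk;
    uncleared rights feeding commercial deployment are high risk."""
    if record.rights_basis in (RightsBasis.LICENSED, RightsBasis.PUBLIC_DOMAIN):
        return "low"
    if record.purpose is Purpose.COMMERCIAL:
        return "high"
    return "medium"
```

The point of even a toy manifest like this is that it forces the questions diligence will ask anyway: what was used, from where, on what legal basis, and for which deployment. A real version would add retrieval dates, hashes, and links to the license documents themselves.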

That last point matters. The Office’s analysis is not just about inputs. It is also about what the system does with them.

For companies doing acquisitions, financing, or strategic transactions, this belongs in technology diligence alongside security, privacy, and architecture review. If you are paying for an AI asset, you should care whether the model has a defensible data chain or just a confident demo. One of those survives contact with counsel.

This is also why we keep coming back to copyright-clean AI development. We built KL3M to prove that it is possible to train useful language models without acting like copyright is a rumor. That does not mean every use case can be solved the same way. It does mean the industry has no excuse for pretending the choice is between innovation and lawlessness.

The practical takeaway

The Copyright Office is not banning AI training on copyrighted works. It is saying the legal answer is not a blanket yes, and it is certainly not “fair use because scale.”

That is a meaningful shift. Not because it changes the statute, but because it changes the default assumption. The burden is moving back to the people building the systems to show why their use fits the law, not why everyone else should be impressed by the transformer stack.

And frankly, that is how it should work.

AI is not exempt from copyright because the business model is aggressive, the dataset is big, or the marketing deck is clean. If your plan depends on somebody else’s creative labor, at minimum you need to know whether you have permission, a plausible fair use theory, or a licensing strategy that can hold up in daylight.

Otherwise, you are not building a strategy. You are building a problem.

Bottom line

The Office just made the obvious harder to dodge: AI training on copyrighted works is not clearly fair use in the general sense people have been hoping for. Some uses will be defensible. Some will not. The difference will turn on provenance, purpose, market impact, and whether you bothered to license what you could have licensed.

That is a very unsexy answer. It is also the correct one.

And for organizations trying to ship AI without stepping on a rake, correctness has a better ROI than denial.
