Data Strategy

First Court Rejects AI Fair Use: What Thomson Reuters v. ROSS Means for AI Companies

Jillian Bommarito

On February 11, 2025, Judge Stephanos Bibas issued a decision that every AI builder, data team, and in-house lawyer should read twice: in Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc., he held that using Westlaw headnotes to train a competing legal search tool is not fair use.

That makes this the first U.S. federal court decision to squarely reject a fair-use defense in an AI training-data dispute on the merits. Not a side remark. Not a footnote. Not a “we’ll see later.” A real ruling.

And the facts matter. A lot.

And the facts here are narrow. Ross was not training a general-purpose chatbot that writes poetry, drafts code, and occasionally hallucinates a tax code citation at 2 a.m.; the court stresses that this is non-generative AI. Ross built a legal research tool that takes a user’s legal question and returns relevant opinions. Thomson Reuters owns Westlaw, which uses editorial headnotes and the Key Number System to help researchers find cases. Ross wanted access to that content, was denied a license because it was a competitor, and then used a third party’s “Bulk Memos” built from Westlaw headnotes to train its system anyway.

That is not a cute fact pattern. That is a compliance problem with a business model attached.

What The Court Actually Held

Bibas reverses part of his own 2023 view after renewed briefing and concludes that Thomson Reuters wins on both direct infringement and fair use. The opinion says Ross infringed 2,243 headnotes, and on fair use, the key factors line up like this: factors one and four favor Thomson Reuters; factors two and three favor Ross; overall, Ross loses.

That matters because the court is not just saying “copyright is hard.” It is saying something more specific and more uncomfortable:

  • Ross’s use is commercial
  • Ross’s use is not transformative
  • Ross and Westlaw are competitors
  • Thomson Reuters has a plausible market for AI training data
  • Ross’s copying affects both the existing market and a potential derivative market

In other words, this is not the kind of “innovation” story that gets a ribbon and a grant check. It is the kind that ends up in a docket.

Judge Bibas is unusually direct about the purpose issue. He says Ross uses Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw, and that the process resembles Westlaw’s own research function. The court also notes that the headnotes do not appear in the final product. Ross leans hard on that fact, but the opinion is not persuaded. Intermediate copying is not some magical invisibility cloak.

Why This Decision Is Different

The biggest takeaway is not “AI is illegal.” That would be lazy, and lazy law is how people end up spending six figures explaining a dataset they should have documented on day one.

The real takeaway is that provenance now has legal teeth.

For years, a lot of AI teams have operated on a very simple assumption: if the model output does not reproduce the source material verbatim, the training step is probably fine. Maybe the data came from a vendor. Maybe it was scraped. Maybe nobody wants to ask too many questions because the answer might slow down launch.

That assumption just took a hit.

This opinion is especially important because it focuses on a very ordinary commercial use case: a company using copyrighted editorial content to train a competing product. Not a research toy. Not a one-off experiment. A product designed to sit in the same market and take the same customers.

That is the kind of fact pattern that shows up in diligence, licensing negotiations, and board meetings.

The Practical Problem Is Data, Not Hype

If you are building AI systems, the immediate question is not “Can we train on it?” The question is:

Can we prove we had the right to use it?

That means the team needs answers before the model trains, not after the complaint lands.

You need to know:

  • Where the data came from
  • Whether the source had redistribution rights
  • Whether there is a license chain you can actually defend
  • Whether the corpus includes copyrighted editorial material, not just raw facts
  • Whether the dataset is being used to create a market substitute for the original
  • Whether the vendor’s representations are specific enough to survive litigation, diligence, and audit

If that list feels boring, good. Boring is what compliance looks like when it works.
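One way to make that checklist operational is to keep a provenance record per training-data source and flag the questions it cannot answer. The sketch below is a minimal illustration, not an implementation of anything in the opinion; every field name and rule here is a hypothetical example of what such a record might track.

```python
from dataclasses import dataclass

# Hypothetical provenance record for one training-data source.
# Field names are illustrative, not drawn from any standard or from the case.
@dataclass
class SourceRecord:
    source_id: str
    origin: str                  # where the data came from
    redistribution_rights: bool  # did the source have the right to pass it on?
    license_chain: list          # documents establishing the chain of rights
    contains_editorial: bool     # copyrighted editorial material vs. raw facts
    market_substitute: bool      # used to build a substitute for the original?
    vendor_reps: str             # specific contractual language, or "" if vague

def provenance_gaps(rec: SourceRecord) -> list:
    """Return the checklist items this record cannot answer."""
    gaps = []
    if not rec.origin:
        gaps.append("unknown origin")
    if not rec.redistribution_rights:
        gaps.append("no redistribution rights")
    if not rec.license_chain:
        gaps.append("no defensible license chain")
    if rec.contains_editorial and rec.market_substitute:
        gaps.append("editorial content used for a market substitute")
    if not rec.vendor_reps:
        gaps.append("vendor representations too vague")
    return gaps
```

The point of a structure like this is not the code; it is that every gap the function returns is a question a judge, an acquirer, or an auditor will eventually ask, and the record forces the team to answer it before training, not after.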

This is exactly why AI training data compliance and copyright risk assessment belong inside data strategy, not as a cleanup exercise after model development. If you cannot trace the rights to your training corpus, you do not have governance. You have optimism dressed up as a roadmap.

What AI Companies Should Do Now

The obvious reaction to this ruling is panic. That is not useful. The useful reaction is to tighten the process.

Start with three moves:

  1. Inventory the corpus.
    Know what is in it, where it came from, and what rights attach to each source.

  2. Document the chain of rights.
    If a vendor supplied the data, get more than a vague assurance that it was “legally obtained.” You want specific contractual language, provenance records, and indemnity terms that mean something.

  3. Separate raw facts from protected editorial work.
    Courts care about the difference between factual material and the creative selection, arrangement, and synthesis that make a dataset or editorial product original.
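The three moves above can be sketched as a triage pass over a corpus manifest: inventory every source, check whether rights documents are on file, and separate raw factual material from editorial content. The manifest schema and source names below are hypothetical, invented purely for illustration.

```python
import json

# Hypothetical corpus manifest: one entry per source, with whatever rights
# documents are attached to it. The schema and names are illustrative only.
MANIFEST = json.loads("""
[
  {"name": "case-opinions",    "kind": "raw_facts", "rights_docs": ["provenance-note.md"]},
  {"name": "editor-summaries", "kind": "editorial", "rights_docs": []}
]
""")

def triage(manifest):
    """Split sources into raw factual material, editorial content with a
    documented chain of rights, and editorial content with nothing on file."""
    facts, documented, undocumented = [], [], []
    for src in manifest:
        if src["kind"] == "raw_facts":
            facts.append(src["name"])
        elif src["rights_docs"]:
            documented.append(src["name"])
        else:
            undocumented.append(src["name"])
    return facts, documented, undocumented

facts, documented, undocumented = triage(MANIFEST)
print("needs rights review before training:", undocumented)
```

A script this simple obviously does not decide what counts as editorial or what a defensible license chain looks like; lawyers do that. What it does is make the undocumented bucket visible, which is the part most teams skip.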

If your AI program touches legal, financial, healthcare, or other heavily curated content, this is not optional. It is the core of the risk model.

And if you are doing diligence on an acquisition or investment, this belongs in the first sprint. A shiny AI demo without rights documentation is not an asset. It is a future witness exhibit.

Some people will read this decision and say, “Well, that is just legal research. My model is different.” Maybe. But the underlying logic is broader than legal search.

The court is looking at commercial substitution, market harm, and the use of copyrighted material to train a system that competes with the original. That framework will be cited again. And again. And again.

The industry has spent two years pretending that “AI training” is one bucket. It is not. There is a big difference between:

  • a model trained on licensed or openly available materials,
  • a model trained on proprietary editorial content,
  • a model trained on material gathered with no defensible rights story, and
  • a model trained in a way that creates a market substitute for the original work

Those are not subtle distinctions. They are the difference between a launch and a subpoena.

The Real Lesson

This opinion validates something a lot of teams have been saying quietly for a while: copyright-clean AI development is not a slogan, it is an operating discipline.

That is where data strategy, AI governance, and legal review stop being separate silos and start being the same conversation. Training data compliance. Vendor diligence. Board education. AI audits. Data governance. It all starts to look suspiciously like the same problem once a judge asks where the data came from.

And that is the point.

If you are building AI systems, especially in regulated or high-value knowledge markets, the winning move is not to hope the legal theory stays fuzzy long enough for you to ship. The winning move is to build with provenance, licensing, and compliance-by-design from the start.

Because when the court finally asks, “What exactly did you train on?” the answer had better be better than, “We found it on the internet.”
