Data Strategy

NYT v. OpenAI Survives Dismissal: The Copyright Case Moves Forward

Jillian Bommarito

On April 4, 2025, Judge Sidney H. Stein of the Southern District of New York denied most of OpenAI’s motion to dismiss the core copyright claims in The New York Times Company v. Microsoft Corporation et al., and the case is very much alive. That does not mean the Times has won on the merits. It does mean OpenAI does not get to wave the case away at the pleading stage and call it a day.

And that matters, because a lot of AI companies still seem to think the legal system will eventually shrug and bless the training-data free-for-all with a neat little “fair use” stamp. Maybe that story works in pitch decks. It is less persuasive in federal court.

What The Court Actually Said

The opinion is broader than the Times case alone. It covers consolidated actions brought by the Times, Daily News plaintiffs, and the Center for Investigative Reporting. But the headline is simple: the court denied OpenAI’s motion to dismiss the Times’s direct copyright infringement claims that reach back more than three years before the complaint was filed, and it also let contributory copyright infringement claims move forward.

That is not a trivial outcome. OpenAI had argued that the relevant copying happened in 2019 and 2020, outside the three-year limitations period. The court was not convinced. Judge Stein held that OpenAI had not met its burden of showing that the Times knew, or should have known, about the alleged infringement before December 27, 2020, which is the three-year cutoff tied to the Times’s December 27, 2023 complaint.

In plain English: the court is saying, “You do not get to assume the plaintiffs should have discovered your copying just because they were generally aware that you existed.”

That is a pretty important distinction. A company can know that AI is changing the world without knowing that its own copyrighted works were copied into a particular training set. Those are not the same thing, no matter how enthusiastically a PowerPoint slide insists otherwise.

Why OpenAI’s Dismissal Argument Missed

OpenAI’s statute-of-limitations argument leaned hard on a 2020 Times article about GPT-3 and the company’s broader public profile. Judge Stein was not having it. The opinion says OpenAI’s “sophisticated publisher” theory is a straw man, and that there is no special “sophisticated rightsholder” exception to the discovery rule.

That is the part companies should pay attention to.

The court is not saying the Times wins because it is the Times. The court is saying the legal standard is about actual or constructive knowledge of the alleged infringement, not a generic assumption that a well-resourced publisher should have connected the dots earlier. A newspaper reporting on OpenAI’s rise is not the same thing as discovering that its own copyrighted works were copied into GPT-2 or GPT-3 training datasets.

That distinction gets even sharper when you look at the record the plaintiffs pleaded. The court pointed to more than 100 pages of examples in the Times complaint, dozens more in the Daily News complaint, and allegations of widely publicized instances of infringing outputs after ChatGPT and other OpenAI products launched. In other words, this was not a case built on vibes. It was built on examples.

And that is the real legal takeaway: specificity wins at the pleading stage. Broad appeals to “the internet” and “transformative technology” do not carry the day when a complaint can point to concrete outputs, concrete works, and concrete pathways from source material to allegedly infringing results.

The Bigger Problem For AI Builders

This ruling is not just about OpenAI. It is about every company training or fine-tuning models on content it did not create.

If your business depends on third-party text, images, code, audio, or video, you need a training data provenance strategy. Not a wish. Not a folder called legal_maybe. A real strategy.

That means knowing:

  • where each dataset came from,
  • what rights you actually have to use it,
  • whether copyright management information (CMI) was preserved or stripped,
  • whether your vendor promises are backed by records,
  • and whether your model outputs can be tied back to protected works in a way that creates risk.

If you cannot answer those questions, you do not have governance. You have exposure.
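
To make those questions concrete: below is a minimal sketch, in Python, of what one machine-readable inventory record might look like. The class name, fields, and the is_exposure heuristic are illustrative assumptions, not any standard; the point is that each bullet above becomes a field someone has to back with evidence.

    from dataclasses import dataclass, field
    from datetime import date
    from enum import Enum

    class RightsBasis(Enum):
        """How we believe we are allowed to use a dataset."""
        LICENSED = "licensed"            # written license on file
        OWNED = "owned"                  # first-party content
        PUBLIC_DOMAIN = "public_domain"  # verified, not merely "publicly available"
        UNVERIFIED = "unverified"        # scraped or inherited without records

    @dataclass
    class DatasetProvenanceRecord:
        """One row in a training-data inventory; fields mirror the questions above."""
        dataset_id: str
        source: str                          # where the dataset came from (URL, vendor, crawl)
        rights_basis: RightsBasis            # what rights we actually have
        license_docs: list[str] = field(default_factory=list)  # paths to contracts and records
        cmi_preserved: bool = False          # was CMI kept intact, or stripped?
        vendor_rep_documented: bool = False  # are vendor promises backed by records?
        output_risk_reviewed: bool = False   # have outputs been tested against this source?
        last_reviewed: date | None = None

        def is_exposure(self) -> bool:
            """Flag records that are exposure, not governance."""
            return (
                self.rights_basis is RightsBasis.UNVERIFIED
                or not self.cmi_preserved
                or not self.output_risk_reviewed
            )

A record like this is not a legal opinion. It is the thing that lets a lawyer form one quickly.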

AI training data compliance belongs inside a broader data strategy program. The problem is not just legal; it is operational. The company that can document lineage, permissions, exclusions, and overrides will be in a very different position from the company that says, “We scraped responsibly, I think.”

That is not a compliance framework. That is a confession with latency.

What This Means For The Market

The court did not decide fair use on April 4. It did not decide whether every training use is infringing. It did not decide whether OpenAI ultimately wins on summary judgment or after trial. What it did do was refuse to end the case early.

That matters because motions to dismiss are where defendants often try to narrow the battlefield before the expensive part starts. If the judge lets the core copyright claims survive, then the next rounds of litigation get much more expensive, much more document-heavy, and much less theoretical.

For AI companies, the practical message is blunt: assume your training data will be questioned later. Not someday. Later. Maybe by a publisher, maybe by a customer, maybe by a regulator, maybe in diligence when you are trying to raise money or sell the business.

And once that question is asked, “we used public data” is not a complete answer. Public does not automatically mean licensed. Available does not automatically mean cleared. Internet-scale does not automatically mean rights-safe.

The market has now seen enough litigation to know that the provenance problem is not going away just because the models got better. Like most risks, this one does not disappear when we ignore it. It just compounds.

What To Do Now

If you are building or buying AI systems, this is the moment to get serious about the paper trail.

Start with a dataset inventory. Then map rights, exclusions, vendor representations, and retention policies. Then test whether your outputs are exposing copyrighted material, CMI, or other downstream risk. If the system cannot support those controls, redesign it before someone else does it for you in a complaint.
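
On the output-testing step, even a crude probe beats an honor-system checkbox. Here is a toy sketch of a verbatim-overlap check; the 12-token window and whitespace tokenization are illustrative assumptions, and a real pipeline would use proper tokenizers and fuzzier matching.

    def reproduces_source(output: str, source: str, n: int = 12) -> bool:
        """Crude memorization probe: does a model output reproduce any run
        of n whitespace-delimited tokens from a protected source verbatim?"""
        out_tokens = output.split()
        src_tokens = source.split()
        if len(src_tokens) < n or len(out_tokens) < n:
            return False
        source_ngrams = {
            tuple(src_tokens[i:i + n]) for i in range(len(src_tokens) - n + 1)
        }
        return any(
            tuple(out_tokens[i:i + n]) in source_ngrams
            for i in range(len(out_tokens) - n + 1)
        )

Run a probe like this over sampled outputs for each inventory record, and the output-risk review becomes something you can show a judge, a buyer, or a board.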

For companies doing AI governance and compliance work, this is also where board education matters. Directors do not need every technical detail of model training, but they do need to understand that training data provenance is now a first-order business issue, not a niche legal footnote. The same is true for investors performing diligence. If the model depends on unverified data sources, that risk should show up in valuation, deal terms, and reps and warranties.

Because here is the uncomfortable truth: the AI industry can either build provenance into its stack now, or wait until a judge, plaintiff, or acquirer makes the point with exhibits.

That is usually the less pleasant option.

And if your response to all of this is still “but everybody does it,” congratulations. That is also what people say right before discovery.
