AI Training Data & Copyright

Copyright lawsuits are multiplying. Disclosure laws are in effect. Training data that was fine two years ago may not be defensible today. We audit training data practices and build defensible paths forward.

We built KL3M — 132M+ copyright-clean documents, 1.35 trillion tokens — and oversaw the first Fairly Trained LLM certification. When we audit your training data, the advice comes from direct experience, not theory.

Starting at $25K | 2-8 weeks

Services

Training Data Provenance Audit

Full audit of training data sources, licensing terms, copyright status, and consent documentation. Chain-of-custody review from source through ingestion.

2-6 weeks

Fairly Trained Certification Support

End-to-end guidance through the Fairly Trained certification process — data inventory, licensing review, remediation, and application. Led by the team behind the first LLM to earn certification.

4-8 weeks

Copyright Risk Assessment

Quantified analysis of copyright exposure across your AI systems, including third-party model usage. Written risk report with litigation-informed recommendations.

2-4 weeks

Training Data Transparency Compliance

Preparation for EU AI Act GPAI training data summaries and California AB 2013 disclosure requirements. Documentation that satisfies regulators.

2-4 weeks

AI-BOM / Training Data Documentation

Structured AI Bills of Materials in CycloneDX ML-BOM or SPDX AI profile format. Documents data sources, processing steps, licensing status, and model lineage.

2-4 weeks

Why us

We built the dataset, not just the advice

KL3M wasn't a side project. It was 132M+ documents sourced, cleaned, licensed, and governed to meet Fairly Trained certification standards. When we audit your training data, we've lived every step of the process — from source identification through chain-of-custody documentation.

First Fairly Trained LLM certification

We oversaw the governance process for the first LLM to receive Fairly Trained L-Certification. We know what the assessors look for, where organizations typically fall short, and how to structure a successful application.

70+ lawsuits, tracked in real time

Thomson Reuters v. Ross. NYT v. OpenAI. The $1.5B Anthropic settlement. We track every AI copyright case, every ruling, and every settlement. Our risk assessments reflect what courts are actually deciding, not what we hope they'll decide.

Why licens.io?

Training data experience
  Big 4: Advise from theory
  licens.io: Built the largest copyright-clean dataset: 132M+ docs

Certification
  Big 4: No direct certification experience
  licens.io: Oversaw the first Fairly Trained LLM certification

Litigation context
  Big 4: General IP awareness
  licens.io: Track 70+ active copyright lawsuits

Research
  Big 4: Marketing whitepapers
  licens.io: Published the KL3M Data Project paper

Credentials
  Big 4: Legal or tech, rarely both
  licens.io: CPA + CIPP/US + CIPP/E + Certified AI Auditor

Who this is for

  • AI companies building or fine-tuning models that need training data audits before regulators or plaintiffs come knocking
  • Enterprises using third-party AI that need copyright risk assessment across their AI vendors
  • Companies seeking Fairly Trained certification for competitive differentiation and regulatory readiness
  • Organizations subject to EU AI Act GPAI requirements that need training data summaries and transparency documentation
  • Boards and investors assessing AI portfolio copyright exposure

Frequently asked questions

What is Fairly Trained certification and how do I get it?

Fairly Trained is an independent certification verifying that an AI model was trained on data obtained with the consent of copyright holders. The process involves a data inventory, licensing review, remediation of any non-compliant sources, and a third-party assessment. We guided the first LLM through this process and can do the same for your models.

Does the EU AI Act require training data disclosure?

Yes. Providers of general-purpose AI (GPAI) models must publish a sufficiently detailed summary of the content used for training. These obligations apply to new GPAI models as of August 2, 2025; models already on the market before that date have a longer transition period to comply.

What is California AB 2013?

AB 2013 requires AI developers to disclose information about the datasets used to train generative AI systems, including data sources, size, and whether the data includes personal information. The law took effect January 1, 2026 and reaches back to generative AI systems released on or after January 1, 2022.

How much copyright risk does my training data carry?

Risk depends on data sources, licensing terms, jurisdiction, and use case. The Thomson Reuters v. Ross ruling and recent settlements ($1.5B+) show courts are not uniformly accepting fair use defenses. A provenance audit quantifies your specific exposure.

What is an AI-BOM?

An AI Bill of Materials documents all data sources, processing steps, model components, and licensing terms used in an AI system. It serves a similar function to a software BOM but covers AI-specific supply chain risks including training data copyright and provenance.
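To make the structure concrete, here is a minimal sketch of a CycloneDX-style ML-BOM built in Python. The model and dataset names are hypothetical, and the fields shown are a simplified subset loosely following the CycloneDX "machine-learning-model" and "data" component types — consult the current CycloneDX specification before relying on any field names.

```python
import json

# Minimal CycloneDX-style ML-BOM sketch (illustrative subset of fields;
# names follow CycloneDX conventions but are not a complete document).
ml_bom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.5",
    "version": 1,
    "components": [
        {
            # The trained model itself (hypothetical name/version)
            "type": "machine-learning-model",
            "name": "example-model",
            "version": "0.1.0",
        },
        {
            # One training data source, with its licensing status
            "type": "data",
            "name": "example-training-corpus",
            "licenses": [{"license": {"id": "CC0-1.0"}}],
        },
    ],
}

print(json.dumps(ml_bom, indent=2))
```

Even this skeletal form captures the core idea: each data source appears as a first-class component with its own license record, so provenance questions can be answered per source rather than per model.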

Can I use Creative Commons licensed content for AI training?

It depends on the specific CC license. Some restrict commercial use or derivative works, and whether AI training constitutes a "derivative work" remains legally contested. Each license and jurisdiction requires individual analysis.
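As a first-pass screen (not a clearance), a dataset manifest can be checked for CC licenses that carry NonCommercial or NoDerivatives terms. The mapping below is an assumption for demonstration — it flags SPDX identifiers containing NC or ND markers for human review; actual clearance requires legal analysis of each license version and jurisdiction.

```python
# Illustrative sketch: flag dataset entries whose Creative Commons
# license may restrict AI training use. The NC/ND marker mapping is an
# assumption for demonstration, not legal advice.
RESTRICTED_MARKERS = ("-NC", "-ND")  # NonCommercial / NoDerivatives

def needs_review(spdx_license_id: str) -> bool:
    """Return True if a CC SPDX identifier carries NC or ND terms."""
    lic = spdx_license_id.upper()
    return lic.startswith("CC-") and any(m in lic for m in RESTRICTED_MARKERS)

corpus = [
    {"doc": "a.txt", "license": "CC-BY-4.0"},
    {"doc": "b.txt", "license": "CC-BY-NC-4.0"},
    {"doc": "c.txt", "license": "CC0-1.0"},
]

flagged = [d["doc"] for d in corpus if needs_review(d["license"])]
print(flagged)  # -> ['b.txt']
```

A screen like this only narrows the review queue; unrestricted CC licenses still raise the unresolved derivative-work question mentioned above.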

Training data risk is not going away

We'll assess your training data sources, quantify your copyright exposure, and give you a clear remediation plan — fixed price, defined timeline.