The Linux Foundation’s Community Data License Agreement
The Linux Foundation released version 2.0 of its permissive Community Data License Agreement (CDLA-Permissive-2.0), a licensing option designed to make sharing data easier for machine learning and artificial intelligence projects.
Why Data Licensing Matters
Software licensing is well established: developers choose from MIT, Apache 2.0, GPL, and dozens of other options with decades of legal precedent. Data licensing is far less mature. Creative Commons licenses were designed for creative works, not datasets. The Open Knowledge Foundation’s Open Data Commons licenses addressed some gaps but predate the current era of large-scale ML training data.
The CDLA was created specifically for data sharing in technical contexts, recognizing that datasets have different practical requirements than code or creative works.
What Changed in Version 2.0
The CDLA-Permissive-2.0 is a significant simplification of the original CDLA-Permissive-1.0. The key change: removing the attribution requirement.
Under the original CDLA-Permissive-1.0, anyone who published data received under the license had to preserve the existing credit or attribution to its source:
3.1(c) If You Publish Data You Receive, You must preserve all credit or attribution to the Data Provider(s).
This posed unforeseen problems. Datasets get combined, split, filtered, and transformed constantly during ML workflows. Tracking which individual data points came from which sources, and carrying attribution metadata through every transformation, created a logistical burden that discouraged adoption.
Version 2.0 keeps it simple: include the license text with the shared data, and you can use, share, and modify the data freely. This mirrors the approach of permissive software licenses like MIT.
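As a rough illustration of what "include the license text with the shared data" can look like in a pipeline, the Python sketch below copies data files into a package directory and writes the license text next to them. The function names `package_dataset` and `has_license` and the file name `LICENSE.txt` are hypothetical choices for this example, not anything the CDLA prescribes. The license text is passed in by the caller and should be taken from the official CDLA-Permissive-2.0 source.

```python
import shutil
from pathlib import Path

# Hypothetical file name for this sketch; the license does not mandate one.
LICENSE_FILENAME = "LICENSE.txt"


def package_dataset(data_files, license_text, out_dir):
    """Copy data files into out_dir and write the license text alongside them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in data_files:
        # Keep each file's original name inside the package directory.
        shutil.copy2(src, out / Path(src).name)
    (out / LICENSE_FILENAME).write_text(license_text, encoding="utf-8")
    return out


def has_license(package_dir):
    """Return True if the package directory contains a non-empty license file."""
    path = Path(package_dir) / LICENSE_FILENAME
    return path.is_file() and path.stat().st_size > 0
```

A check like `has_license` could run as a release gate before a dataset is published, so that a package missing its license file is caught before it is shared.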
CDLA in Context
The CDLA-Permissive-2.0 fills an important gap in the data licensing landscape:
- Creative Commons (CC-BY, CC0): Designed for creative works. CC0 is used for some datasets but lacks provisions specific to data combination and enhancement.
- Open Data Commons (ODC-By, ODbL, PDDL): Purpose-built for databases, but predates modern ML data pipelines and can be ambiguous about derived datasets.
- CDLA-Permissive-2.0: Purpose-built for data sharing in AI/ML contexts, with clear terms for combining and modifying datasets.
For organizations building or distributing training data for machine learning, understanding these licensing options and their implications for data provenance is increasingly important as copyright questions about AI training data reach courts and regulators worldwide.
If you’re evaluating licensing options for your data assets or assessing the provenance of training data in your ML pipeline, our AI & Data team can help you understand the risks and options.