The Linux Foundation’s Community Data License Agreement
The Linux Foundation released version 2.0 of its permissive Community Data License Agreement (CDLA-Permissive-2.0), a licensing option designed to make sharing data easier for machine learning and artificial intelligence projects.
Why Data Licensing Matters
Software licensing is well-established — developers choose from MIT, Apache 2.0, GPL, and dozens of other options with decades of legal precedent. Data licensing is far less mature. Creative Commons licenses were designed for creative works, not datasets. The Open Knowledge Foundation’s Open Data Commons licenses addressed some gaps but predate the current era of large-scale ML training data.
The CDLA was created specifically for data sharing in technical contexts, recognizing that datasets have different practical requirements than code or creative works.
What Changed in Version 2.0
The CDLA-Permissive-2.0 is a significant simplification of the original CDLA-Permissive-1.0. The key change: removing the attribution requirement.
Under the original CDLA-Permissive-1.0, data had to be attributed to its source. Section 3.1(c) read:
"If You Publish Data You Receive, You must preserve all credit or attribution to the Data Provider(s)."
This posed unforeseen problems. Datasets are constantly combined, split, filtered, and transformed in ML workflows. Tracking which data points came from which sources, and carrying that attribution metadata through every transformation, created a logistical burden that discouraged adoption.
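To make that burden concrete, here is a minimal Python sketch of what per-record attribution tracking can look like once two datasets are merged and then filtered. The `Record` type, its `sources` field, and the `merge` and `credits` helpers are hypothetical names invented for this illustration; nothing here is prescribed by either version of the license.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    sources: frozenset  # hypothetical per-record attribution field


def merge(a, b):
    """Combine two datasets, unioning attribution when a record appears in both."""
    by_text = {}
    for rec in list(a) + list(b):
        prev = by_text.get(rec.text)
        sources = rec.sources | prev.sources if prev else rec.sources
        by_text[rec.text] = Record(rec.text, sources)
    return list(by_text.values())


def credits(dataset):
    """List every provider still owed credit after the transformations so far."""
    return sorted(set().union(*(r.sources for r in dataset)))


a = [Record("hello", frozenset({"Provider A"})),
     Record("spam", frozenset({"Provider A"}))]
b = [Record("hello", frozenset({"Provider B"}))]

merged = merge(a, b)
kept = [r for r in merged if r.text != "spam"]  # a typical filtering step
print(credits(kept))
```

Every additional step (deduplication, sampling, augmentation) needs the same bookkeeping, and any step that drops the `sources` field silently loses the attribution.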
Version 2.0 keeps it simple: you can use, modify, and share the data freely, and if you share it, you make the license text available with the shared data. This mirrors the approach of permissive software licenses like MIT.
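In practice, making the license text available with shared data can be as simple as shipping it alongside the files. The following is a minimal sketch under stated assumptions: the `prepare_release` helper and the `LICENSE.txt` file name are hypothetical choices for this example, and the license itself does not mandate any particular file name or layout.

```python
from pathlib import Path
import shutil

# Hypothetical file name; the license does not require a specific one.
LICENSE_NAME = "LICENSE.txt"


def prepare_release(data_dir, license_file, out_dir):
    """Copy a dataset directory and place the license text alongside it."""
    out = Path(out_dir)
    shutil.copytree(data_dir, out, dirs_exist_ok=True)  # requires Python 3.8+
    shutil.copyfile(license_file, out / LICENSE_NAME)
    return out / LICENSE_NAME
```

A call such as `prepare_release("data", "cdla-permissive-2.0.txt", "release")` (paths assumed for illustration) would produce a `release` directory containing the dataset files plus the license text, ready to publish as a unit.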
CDLA in Context
The CDLA-Permissive-2.0 fills an important gap in the data licensing landscape:
- Creative Commons (CC-BY, CC0): Designed for creative works. CC0 is used for some datasets but lacks provisions specific to data combination and enhancement.
- Open Data Commons (ODC-By, ODbL, PDDL): Purpose-built for databases, but these licenses predate modern ML data pipelines and can be ambiguous about derived datasets.
- CDLA-Permissive-2.0: Purpose-built for data sharing in AI/ML contexts, with clear terms for combining and modifying datasets.
For organizations building or distributing training data for machine learning, understanding these licensing options — and their implications for data provenance — is increasingly important as AI training data copyright questions reach courts and regulators worldwide.
If you’re evaluating licensing options for your data assets or assessing the provenance of training data in your ML pipeline, our AI & Data team can help you understand the risks and options.