How Data Provenance Drives Machine Learning Risk + Value
The Provenance of Provenance
For many, provenance is a foreign term, frequently (and ironically) confused with the Provence region of France. But if you’ve ever heard the story of a rediscovered work of art, a classic car with a famous owner, or a counterfeit bottle of expensive wine, you already understand the importance of provenance.
Provenance is just knowing where something came from. You might even recognize part of its Latin root, provenire (“to come forth”), in the “veni” of Julius Caesar’s famous “veni, vidi, vici.” So, while provenance is the technical term, you can substitute “origin” or “lineage” in most conversations.
Historically, provenance has been most famously applied in the context of art. In the world of art, honest mistakes and malicious forgeries have resulted in many famous stories of long-lost masterworks and ill-gotten fortunes. And after World War II, the colossal scale of appropriation on the European continent still echoes in auction houses and private sales. Today, there are thousands of people employed around the world focused solely on the provenance of works of art.
In art, we can simplify provenance as answering two questions:
- Who created the work?
- Does the current possessor have the right to transfer?
In theory, answering the legal conveyance question – question #2 – seems like the same thing as the authorship question – question #1. Ideally, you’d prove that all prior transfers were clean, including from the creator originally. But unlike other forms of conveyance we might be used to, like stock certificates or real property, many works of art have much longer, less-documented, and sometimes informal histories.
At this point, our discussion of title and real property might have clued you in to a very similar context – real estate. Anyone who has bought or sold real estate, especially if financed, knows that title insurance and real property deeds are typically key to closing. This process – of proving the chain of legal ownership of a parcel – is conceptually identical to provenance in art.
When it comes to collectibles, the chain of ownership may create more value than the work itself! A vintage car might be worth $1M on its own, but much more (or less!) if Frank Sinatra or James Dean once owned it. Conversely, those works without provenance or with poorly-supported documentation – often stolen – are typically sold far below true market value.
Know Thy Data
First and foremost, how much value can you create from data if you don’t understand or trust it? If you aren’t clear about the entities, actors, or actions in a data model or data sample, then how are you going to make inferences or take action? And if you can’t make inferences or take action based on your data, what’s the point in collecting it? Clearly, you can create more value if you spend the time to “know thy data.”
When you collect data directly from the “source” of data – e.g., when you ask someone to rate a product they have purchased from you – you can directly record the source and examine data quality issues. But when you acquire data “second-hand” or “third-hand” from someone else, trust becomes increasingly important. In these second-hand and third-hand situations, “know thy data” really means “know thy data provider” too.
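In practice, “directly record the source” means stamping provenance metadata onto each record at collection time. The sketch below is purely illustrative – the `ProvenancedRecord` schema and its field names are our own invention, not any standard – but it shows the minimal metadata worth keeping:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenancedRecord:
    """A data record bundled with minimal provenance metadata (hypothetical schema)."""
    value: dict                     # the data itself, e.g. a product rating
    source: str                     # who produced the data ("customer", "vendor-x", ...)
    acquisition: str                # "first-hand", "second-hand", or "third-hand"
    collected_at: str               # ISO-8601 timestamp of collection
    provider: Optional[str] = None  # the intermediary, if not collected directly


def record_first_hand(value: dict, source: str) -> ProvenancedRecord:
    """Capture data straight from the source, stamping provenance at collection time."""
    return ProvenancedRecord(
        value=value,
        source=source,
        acquisition="first-hand",
        collected_at=datetime.now(timezone.utc).isoformat(),
    )


# e.g., a customer rating a product they purchased from you
rating = record_first_hand({"product_id": "SKU-123", "stars": 4}, source="customer")
```

For second-hand data, the `provider` field is where “know thy data provider” gets recorded – and where a missing value should prompt questions before the data is used.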
Contracts
Second, you may have purely contractual obligations or rights that need to be considered. For example, if a contract explicitly prohibits an organization from re-using or re-distributing data, then any “downstream” use or work products will create breach of contract risks.
Restrictions on re-use and restrictions on re-distribution, like those in the example above, are two of the most common contract terms to watch for.
Regulations
The third reason for ensuring data provenance is the most publicized and fundamental: regulation. Laws and rules are meant to be complied with, and almost all commercial contracts, financing documents, and purchase/sale agreements include fundamental representations and warranties related to compliance with applicable laws and rules.
Federal and state laws (and EU-wide regulations, for our continental readers) may require organizations to maintain specific documentation with respect to the data they gather. Generally, these requirements are limited to data that is personally identifiable; the specifics of what data is considered identifiable vary by regulation and may take additional factors into account.
As we’ve discussed in prior posts, neglecting to appropriately consider data protection regulations can have a direct and severe impact on an organization’s machine learning models, operations, and overall financial wellbeing.
While we won’t dive into the specifics of data processing regulations in this post, it’s sufficient to understand that a company’s use of consumer data is often limited by regulation to those purposes for which it has legal grounds. So how do you ensure that the data you’re using is allowed under law?
Data lineage or data provenance has become increasingly important, as technology relies more extensively on previously collected or generated data, which is itself becoming more voluminous. Without information about where data originated, how it was obtained, and what has been done to it, users of said data expose themselves and their organizations to the risk of negative financial, legal, and reputational outcomes.
What the Future Looks Like
We’re headed down an interesting path: technology is both enabling the exponential growth of data – which complicates the process of establishing provenance and lineage – and offering potential solutions to managing the very problems it is creating.
The best example of this phenomenon is the proliferation of MLOps platforms like MLflow. Fundamentally, “machine learning operations” platforms provide databases and APIs for the management of datasets and models. The original goal behind such MLOps platforms was to increase efficiency and quality for data engineers and data scientists, but over time, their value from a compliance perspective has become clear.
MLOps systems allow organizations to create and version datasets, including their provenance and lineage. These datasets can then be used to train machine learning models, which are themselves stored, versioned, and even run by these MLOps platforms. If this sounds like tracking provenance, that’s because it is!
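Under the hood, this kind of dataset versioning reduces to content addressing: fingerprint the data, then record its lineage against that fingerprint. The sketch below is not MLflow’s actual API – it’s a stdlib-only illustration of the idea, with invented names like `LineageRegistry`:

```python
import hashlib
import json
from typing import Optional


def dataset_version(rows: list) -> str:
    """Fingerprint a dataset by hashing its canonical JSON serialization."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class LineageRegistry:
    """Toy registry mapping dataset versions to their provenance records."""

    def __init__(self):
        self._records = {}

    def register(self, rows: list, origin: str, derived_from: Optional[str] = None) -> str:
        """Record where a dataset came from and, if applicable, what it was derived from."""
        version = dataset_version(rows)
        self._records[version] = {"origin": origin, "derived_from": derived_from}
        return version

    def lineage(self, version: str) -> list:
        """Walk derived_from links back to the original dataset."""
        chain = []
        while version is not None:
            chain.append(version)
            version = self._records[version]["derived_from"]
        return chain


registry = LineageRegistry()
raw_v = registry.register([{"x": 1}, {"x": 2}], origin="vendor feed")
clean_v = registry.register([{"x": 1}], origin="dedupe step", derived_from=raw_v)
```

Production MLOps platforms add storage, APIs, and model tracking on top, but the compliance value comes from exactly this: every downstream artifact can be traced back, version by version, to its origin.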
It’s 2022, which means we can’t finish this post without a paragraph on blockchain or DLT. To be fair, however, legal scholars arguably first described blockchain-like systems while solving these exact provenance problems. For example, Nick Szabo first published Secure Property Titles with Owner Authority in 1998, in which he wrote:
“The property is represented by titles: names referring to the property, and the public key corresponding to a private key held by its current owner, signed by the previous owner, along with a chain of previous such titles. Title names may “completely” describe the property, for example allocations in a namespace. (Of course, names always refer to something, the semantics, so such a description is not really complete). Or the title names might simply be labels referring to the property. Various descriptions and rules – maps, deeds, and so on – may be included.”
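Szabo’s scheme – each title naming the property and its new owner, signed by the previous owner and linked to the chain of prior titles – can be sketched as a hash chain. Real implementations use public-key signatures; the stand-in below uses per-owner HMAC keys purely for illustration, and all names are hypothetical:

```python
import hashlib
import hmac
import json
from typing import Optional


def _sign(owner_key: bytes, payload: dict) -> str:
    """Stand-in for a digital signature: HMAC over the canonical payload."""
    msg = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(owner_key, msg, hashlib.sha256).hexdigest()


def _title_hash(title: Optional[dict]) -> Optional[str]:
    if title is None:
        return None
    return hashlib.sha256(json.dumps(title, sort_keys=True).encode("utf-8")).hexdigest()


def transfer(prev_title: Optional[dict], prop: str, new_owner: str, prev_owner_key: bytes) -> dict:
    """Create a title naming the property and new owner, linked by hash to the
    previous title and signed by the previous owner (the creator self-signs first)."""
    payload = {"property": prop, "owner": new_owner, "prev_hash": _title_hash(prev_title)}
    return {**payload, "signature": _sign(prev_owner_key, payload)}


def verify_chain(titles: list, keys: dict) -> bool:
    """Check each title's hash link to its predecessor and its previous owner's signature."""
    prev = None
    prev_owner = titles[0]["owner"]  # the creator signs the first title themselves
    for t in titles:
        payload = {"property": t["property"], "owner": t["owner"], "prev_hash": t["prev_hash"]}
        if t["prev_hash"] != _title_hash(prev):
            return False
        if t["signature"] != _sign(keys[prev_owner], payload):
            return False
        prev, prev_owner = t, t["owner"]
    return True


keys = {"alice": b"alice-key", "bob": b"bob-key"}
creation = transfer(None, "Landscape, oil on canvas", "alice", keys["alice"])
sale = transfer(creation, "Landscape, oil on canvas", "bob", keys["alice"])
```

A forged transfer – say, rewriting the sale to name a different owner – breaks the signature check, which is exactly the property an art-world provenance file tries to guarantee on paper.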
Two decades later, ideas like this have matured to the point where multiple technical solutions capable of implementing these ideas are available. Whether the future of provenance will live on web3 in a public or private chain remains to be seen, as the cost of storage and adoption may outweigh benefits for at least the foreseeable future. However, given the rate of innovation and unpredictable history of technology adoption, it’s wise not to rule any possibility out completely. The future has a habit of surprising us all.
It can be daunting to play “catch up” if your organization hasn’t had a good data and software provenance strategy. Licens.io offers data protection maturity, data science, and machine learning assessments to identify strategic areas for improvement.
For investors or acquirers, who may not have a deep knowledge of an organization’s operations, Licens.io performs deep-dive analysis and maturity assessments of source code, machine learning models, and data privacy practices. This information can then be used to realign valuation or improve enterprise performance and value within your investment window.