WHY YOU REALLY NEED A DATA BOM, NOT A SOFTWARE BOM

The Bill of Materials (BOM) concept has taken over the world of software, but should most organizations be focused on Data Bills of Materials (DBOMs) instead?

A decade ago, Marc Andreessen penned his famous claim that “software is eating the world.” But the world has moved quickly since then. Today, an increasing number of thought leaders and investors are convinced that data, not software, is the concept really eating the world. So, who is right? Is data or software more important?

Software or Data?

While almost all companies use software, very few of them actually create it – or capture its intrinsic value. Under the hood, many companies that claim to be technology-driven or “software” companies are really using platforms, applications, or services owned by others. Sometimes these implementations are substantial “low-code” applications; other so-called “tech-enabled” companies are simply running on Excel and Tableau.

So, what about companies that actually build software? Well, increasingly, companies that build software don’t “distribute” that software to you. While they provide you with access to a tiny fraction of it through front-end web applications, mobile apps, or APIs, the vast majority remains locked up within the four corners of their own infrastructure. This is the nature of modern “cloud” or Software-as-a-Service (SaaS) delivery.

If most “tech” companies are just using software, and the companies that do build software no longer distribute 99% of it, why the focus on Software Bills of Materials (SBOMs)? Even more pointedly, the primary benefits of SBOMs concern critical vulnerabilities, which predominantly occur in widely adopted packages maintained by open source teams, not for-profit companies. So while SBOMs clearly offer important benefits, they address only a small percentage of software components and disproportionately burden non-commercial open source developers.

Data, on the other hand, is distributed and received by almost every organization. Documents, images, spreadsheets, and designs – in contexts ranging from marketing campaigns to medical records – are passed back and forth every day. In some cases, the records in these files make a single trip from originator to recipient. But increasingly, the data in these files has gone on a lengthy journey of annotation, enhancement, de-identification, aggregation, and subsequent redistribution.

Data, Data, Data Daily

For almost all organizations doing business today, data is not just eating their world; it is also responsible for much of their new risk. Even organizations using only Office 365 or Google Workspace run the risk of catastrophic financial or reputational loss due to data, with no other licensed or company-developed software required.

If personally identifiable information (PII), protected health information (PHI), or financial information is used inappropriately, there are serious legal risks. Companies can face private lawsuits or regulatory enforcement with material consequences. And as privacy and data protection frameworks like the GDPR and CCPA have spurred dozens of similar rules across US states and around the world, the magnitude and complexity of that risk has only grown.

Similarly, organizations often rely on data received from another party to train machine learning or AI models – and that information may itself have made a number of previous stops along the way. What happens when the distributing party didn’t actually have the rights to share that information, or when cascading contractual restrictions limited its use?

Even worse, if an organization receives sensitive information unknowingly, it may fail to classify and handle the data properly from an information security perspective. Unauthorized access and data loss become more likely, and when an incident does occur, the impact may be even more severe.

Many GRC and data governance platforms now focus on identifying sensitive information. Vendors provide tools that scan emails or shared drives to locate fields like Social Security numbers or bank account details. Even the enterprise editions of Office 365 and Google Workspace can now detect categories of private information and help prevent data loss. But much like software composition analysis and dynamic application security testing, these are reactive tools: they can only point out risks that are already present. Nor do they offer any way to track the contractual rights and obligations that follow the data. We need metadata for our data that addresses today’s pain points.
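
To make the reactive nature of these tools concrete, here is a minimal sketch in Python of the kind of pattern-based scanning they perform. The patterns, directory name, and file handling are illustrative assumptions, not any vendor’s actual implementation.

```python
import re
from pathlib import Path

# Illustrative patterns only -- real tools use far more robust detection.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_file(path: Path) -> dict[str, int]:
    """Count apparent sensitive fields in a text file."""
    text = path.read_text(errors="ignore")
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

if __name__ == "__main__":
    # "shared_drive" is a hypothetical directory for this sketch.
    for f in Path("shared_drive").glob("**/*.txt"):
        hits = scan_file(f)
        if any(hits.values()):
            # The tool can flag the risk, but it says nothing about where
            # the data came from or what rights and obligations attach to it.
            print(f, hits)
```

Note what the scan cannot tell you: a file may appear to contain Social Security numbers, but nothing here reveals where that data came from, whether consent was obtained, or what contractual restrictions travel with it. That is the gap metadata must fill.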

Shift Left, but for Data

The real solution, just as in software development, is to “shift left”: address data risk at the source by attaching DBOM metadata. A Data Bill of Materials offers a roadmap for building a safer, more efficient ecosystem. Organizations that receive data should mandate that their providers disclose and label key types of risk at the field or record level, and providers should welcome the opportunity to attest to data ownership and shift the liability downstream.
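
No DBOM standard exists yet, so the following is purely a thought experiment: one hypothetical shape that field-level DBOM metadata might take, expressed here as a Python structure that serializes to JSON. Every field name and value is an illustrative assumption.

```python
import json

# Hypothetical DBOM for a customer dataset -- the schema is an
# illustrative assumption, not an established standard.
dbom = {
    "dataset": "customer_events_2024Q1.csv",
    "supplier": "Example Data Co.",
    "provenance": ["Example Data Co.", "Acme Aggregators"],  # prior stops
    "fields": [
        {
            "name": "email",
            "classification": "PII",
            "deidentified": False,
            "license": "internal-use-only",  # contractual restriction
            "consent_basis": "opt-in",
        },
        {
            "name": "diagnosis_code",
            "classification": "PHI",
            "deidentified": True,
            "license": "no-redistribution",
            "consent_basis": "HIPAA de-identification",
        },
    ],
    "attestation": "Supplier attests it holds redistribution rights.",
}

print(json.dumps(dbom, indent=2))
```

A receiving organization could validate a manifest like this at ingestion time, before the data ever reaches a model training pipeline or an analyst’s laptop.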

As data supply chains continue to expand and evolve, the benefits of DBOM adoption will only grow. As more and more companies build machine learning models or AI use cases on top of such data, the risks of catastrophic economic or regulatory loss will increase as well.

So, what would a Data Bill of Materials look like? Let’s start with a few questions.

  • How much would a DBOM share with SBOMs or the OpenChain standard?
  • Do data scientists and machine learning engineers need their own separate standard?
  • Can we automate at least some elements of DBOM generation? (See the sketch after this list.)
  • What would contractual provisions related to DBOMs look like in common agreements?
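
On the automation question, a first pass might pair pattern-based detection with the manifest shape sketched above: classify fields mechanically and leave ownership attestations and license terms for a human to complete. Here is a rough Python sketch under those assumptions; the column-name heuristics and “TODO” placeholders are illustrative, not a proposed standard.

```python
import csv
import json
import re

# Crude heuristics for demonstration; a real generator would need far
# richer detection plus human review of the draft it produces.
HEURISTICS = {
    "PII": re.compile(r"ssn|email|phone|name|address", re.I),
    "PHI": re.compile(r"diagnosis|rx|icd|mrn", re.I),
}

def draft_dbom(csv_path: str, supplier: str) -> dict:
    """Draft a DBOM from a CSV header; a human fills in the rest."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    fields = []
    for col in header:
        cls = next(
            (c for c, rx in HEURISTICS.items() if rx.search(col)),
            "unclassified",
        )
        fields.append({"name": col, "classification": cls,
                       "license": "TODO", "consent_basis": "TODO"})
    return {"dataset": csv_path, "supplier": supplier, "fields": fields}

print(json.dumps(draft_dbom("customer_events_2024Q1.csv", "Example Data Co."), indent=2))
```

Even this crude draft would move risk disclosure to the point of data creation, rather than leaving it to discovery after the fact.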

We’ll answer these questions and more in future articles, including examples of automatically and manually generated DBOMs as well as more specific use cases. Follow along to learn how you can protect your organization’s data supply chain and help build a better, safer digital world.