Shift Left for Data: Data Processing Agreements and Data Bills of Material

It’s hard to make it very far these days without hearing the phrase “Shift Left.” While some argue that Shift Left is just following CMMI and PMBOK practices, it’s clear that the DevOps and DevSecOps communities have embraced the phrase to describe the idea that software and infrastructure problems should be found sooner, not later.

For many organizations, Shift Left means pushing as much quality control (QC) and quality assurance (QA) upstream to the developer as possible. In some cases, this means empowering developers through IDE plugins, linters, or other static analysis tools to identify quality or security issues in the lines of code they write. In other cases, Shift Left approaches focus on assessing components continuously within source control and the CI/CD pipeline. Both cases, however, tend to emphasize tasks like identifying security issues through static code analysis or checking dependency versions for CVEs.

Shift Left for Data

But what about data? Why do conversations about Shift Left never address the need to identify issues related to data protection, privacy, or other data compliance concerns? Are the Privacy by Design (PbD) folks and DevSecOps folks saying the same thing in different languages?

The theory of Shift Left is that software defects are cheaper and easier to fix when they are identified sooner. When defects become embedded in the design or behavior of an application or its users, both the effort to fix them and the impact of the change are more likely to be material.

For many organizations, this is even more true of data “defects.” When data is collected in the wrong format or stored in a poorly designed data model, subsequent migrations, application functionality, or data science activities are permanently affected. Even worse, when data is collected or used in violation of a regulatory or contractual obligation, these defects can turn into serious liability for companies.

Privacy by Design (PbD) is arguably intended to address many of these issues by enforcing a data protection mindset during the requirements and design cycle of development. Product managers and software developers are encouraged to, by default, only collect the information that is absolutely needed. But for many organizations, data subject privacy is just one of many considerations for compliance or data strategy execution.

For example, many companies sell services, software, or SaaS in B2B markets. These transactions are generally viewed as less risky than traditional B2C offerings. While this may oftentimes be true from an individual privacy perspective, it ignores the fact that companies may still have other contractual or regulatory restrictions.

Contractual Restrictions on Data

The most common examples of contractual restrictions have to do with “purpose” or confidentiality. Customers obviously have concerns about the redistribution of confidential information, and it is critical that service providers don’t resell or redistribute information verbatim in such cases. Most organizations think of this obligation as it relates to data breach or unauthorized retransmission of source material.

But what about when data is used for data science or machine learning? Can a service provider aggregate information collected from or provided by customers, train a machine learning model from it, and subsequently use or sell access to that machine learning model?

The answer to the contractual question is typically “no” – the service provider does not have this right. But in reality, many companies today sign agreements and fail to meet these obligations. They do actively use or aggregate customer information in ways or for purposes that violate their contractual obligations.

This non-compliance is often not intentional or malicious; it simply stems from a lack of understanding, oversight, and communication within the service provider. Legal is aware of the restriction, but no one in Legal explains this to the product management, software development, or data science teams.

Risk-Based Calculations

When companies are cognizant of their obligations but decide to proceed anyway, they often make risk-based calculations about the potential statutory or contractual damages – “if we were found out, would the penalty outweigh the benefits?” Historically, most unauthorized use was never discovered; if it was, harm was difficult to prove. Recent trends in statutory and civil penalties should incentivize organizations to revisit this calculus, however.

There’s often an assumption that customers don’t care about most of their data. The thinking goes that only a small subset of a customer’s information, alone or in combination, could potentially cause economic or reputational harm. But without a mechanism for communicating these preferences, it’s safer for customers to Just Say No. When large buyers procure software or SaaS services, they increasingly use Data Processing Agreements (DPAs) – required under regulations like GDPR – to express these restrictions. Combine these DPAs with the complexity of second-hand and third-hand data supply chains, and you have a very tricky situation.

Service providers who follow the letter of these DPAs are often left with fewer opportunities to monetize data strategies. When customers do not allow them to create aggregated data products or derivative machine learning models, then additional sources of revenue or margin may be limited.

Privacy Pro Jill’s Thoughts: This type of behavior opens up serious legal risk (both contractual + regulatory) in addition to reputational risk. Don’t do it!

But since many service providers assume that their competitors are also using customer data, they feel pressured into skirting their contractual obligations to keep up. As a result, some companies simply breach their DPAs under the theory that such breach is difficult to detect and quantify.

A Better Way

Is there another way? Could organizations that share data do a better job of documenting what is inside the data, along with the rights and obligations related to those attributes and records?

Previously, we discussed why organizations should be looking at not just Software Bills of Material (SBOMs), but also Data Bills of Material (DBOMs). Data Bills of Material might provide exactly the mechanism for standardizing and streamlining the negotiation and management of Data Processing Agreements.

When an organization that owns or supplies data agrees to provide it to another organization, they need to document what they are providing and what rights and obligations relate to the data. To do this, they need to include metadata about the provenance, processing, or known regulatory restrictions. This description of the data, its history, and its constraints is a DBOM – and should serve as a common addendum or artifact for Data Processing Agreements.
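To make the idea concrete, here is a minimal sketch of what a machine-readable DBOM could look like. It is not a standard schema: every class name, field name, and value below (DbomAttribute, Dbom, permitted_purposes, the example supplier and dataset) is a hypothetical chosen for illustration, and a real DBOM would follow whatever format the contracting parties agree on.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DbomAttribute:
    """One attribute (field/column) described by the DBOM. Hypothetical schema."""
    name: str
    data_type: str
    provenance: str  # where the value originates
    processing: list[str] = field(default_factory=list)          # transformations applied before delivery
    restrictions: list[str] = field(default_factory=list)        # known regulatory or contractual limits
    permitted_purposes: list[str] = field(default_factory=list)  # uses the supplier has agreed to


@dataclass
class Dbom:
    """A Data Bill of Material attached to a DPA. Hypothetical schema."""
    supplier: str
    recipient: str
    dataset: str
    version: str
    attributes: list[DbomAttribute]

    def to_json(self) -> str:
        # asdict() recurses into the nested attribute dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


dbom = Dbom(
    supplier="Example Data Co",
    recipient="Example SaaS Inc",
    dataset="customer_usage_events",
    version="1.0.0",
    attributes=[
        DbomAttribute(
            name="account_id",
            data_type="string",
            provenance="generated by supplier",
            processing=["pseudonymized"],
            restrictions=["contract:confidential"],
            permitted_purposes=["service_delivery", "billing"],
        ),
        DbomAttribute(
            name="email",
            data_type="string",
            provenance="collected from end users",
            restrictions=["GDPR", "contract:confidential"],
            permitted_purposes=["service_delivery"],
        ),
    ],
)

print(dbom.to_json())
```

Because the sketch records purpose per attribute, the question of whether a field may be used for something like model training becomes a lookup rather than a reading of the contract text.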

By including Data Bills of Material in the DPA negotiation process, both contracting parties are required to be explicit about what is being provided by whom for what purpose – something that is increasingly required by ESG programs or regulatory frameworks.

When data is regularly published or consumed via automated, machine-readable mechanisms like APIs or SFTP, DBOMs can be used to validate that the transmitted data matches the contractually-agreed-upon data. Variations or exceptions can be noted for legal or compliance purposes.
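As a rough sketch of that validation step, the function below compares the attribute names in a transmitted batch against the set agreed in the DBOM and reports any variations. The function name, the report keys, and the sample records are all assumptions made for illustration; a production check would likely also cover data types, record counts, and versioning.

```python
def validate_against_dbom(agreed_attributes: set[str], records: list[dict]) -> dict:
    """Compare transmitted records with the attribute names agreed in the DBOM."""
    unexpected: set[str] = set()  # attributes sent but never agreed to
    missing: set[str] = set()     # attributes agreed to but absent from some record
    for record in records:
        keys = set(record)
        unexpected |= keys - agreed_attributes
        missing |= agreed_attributes - keys
    return {
        "unexpected": sorted(unexpected),
        "missing": sorted(missing),
        "conforms": not unexpected and not missing,
    }


# Hypothetical agreed attribute set and a transmitted batch.
agreed = {"account_id", "event_type", "event_time"}
batch = [
    {"account_id": "a-1", "event_type": "login", "event_time": "2024-01-01T00:00:00Z"},
    {"account_id": "a-2", "event_type": "login", "event_time": "2024-01-01T00:05:00Z",
     "email": "user@example.com"},
]

report = validate_against_dbom(agreed, batch)
print(report)
# {'unexpected': ['email'], 'missing': [], 'conforms': False}
```

Here the second record carries an email field that was never agreed to, so the batch is flagged and the exception can be logged for legal or compliance review.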

Privacy Pro Jill’s Thoughts: This approach won’t work for all arrangements that involve data transfer; in those cases it’s even more essential to ensure that DPAs and other related contracts address data use.

Data providers may change their data specifications. Service providers may change what information they collect from their users. In both cases, when these changes are not adequately documented or communicated, subsequent contractual or regulatory issues may emerge. By requiring that DBOMs be kept up-to-date and incorporated into the DPA, organizations can reduce the risk of subsequent litigation or regulatory action.

If data breaches or regulatory changes later occur, organizations also have a convenient reference to understand what processes might need to be implemented or changed. Instead of paying for post-hoc, emergency review, organizations could quickly search and filter DBOM records to understand their reporting obligations or regulatory exposure.
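A search of this kind could be as simple as the sketch below, which assumes DBOM entries have been flattened into one record per attribute. The record layout, the restriction tags, and the affected_by helper are hypothetical.

```python
# Hypothetical DBOM records, flattened to one entry per attribute.
dbom_records = [
    {"dataset": "customer_usage_events", "supplier": "Example Data Co",
     "attribute": "account_id", "restrictions": ["contract:confidential"]},
    {"dataset": "customer_usage_events", "supplier": "Example Data Co",
     "attribute": "email", "restrictions": ["GDPR", "contract:confidential"]},
    {"dataset": "support_tickets", "supplier": "Another Supplier",
     "attribute": "ticket_body", "restrictions": ["GDPR"]},
    {"dataset": "support_tickets", "supplier": "Another Supplier",
     "attribute": "priority", "restrictions": []},
]


def affected_by(records: list[dict], restriction: str) -> dict[str, list[str]]:
    """Group attribute names by dataset for records carrying the given restriction tag."""
    result: dict[str, list[str]] = {}
    for record in records:
        if restriction in record["restrictions"]:
            result.setdefault(record["dataset"], []).append(record["attribute"])
    return result


print(affected_by(dbom_records, "GDPR"))
# {'customer_usage_events': ['email'], 'support_tickets': ['ticket_body']}
```

The same lookup, keyed on a dataset or supplier instead of a restriction tag, would answer the breach-response question of which obligations attach to the data that was exposed.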

Privacy Pro Jill’s Thoughts: Think of this as shifting the process of data mapping even further left.

By standardizing the inclusion of DBOMs into DPA negotiations, organizations can normalize Shift Left for Data – all the way back to the contract that governs the data. All parties involved get increased transparency during and after negotiations. Uncertainty related to data collection and data breach risks is reduced. In some cases, customers may be willing to share more data than previously agreed to, allowing service providers to explore new ways to create enterprise value through expanding data strategy frontiers. Getting there, however, will require increasing transparency and building trust.

DBOMs offer a better, more sustainable future for both consumers and producers of data. We can all agree that today’s world of DPAs and data mapping leaves much to be desired. Let’s change that by digitally transforming data itself.