It’s hard to make it very far these days without hearing the phrase “Shift Left.” While some argue that Shift Left is just following CMMI and PMBOK practices, it’s clear that the DevOps and DevSecOps communities have embraced the phrase to describe the idea that software and infrastructure problems should be found sooner, not later.
For many organizations, Shift Left means pushing as much quality control (QC) and quality assurance (QA) upstream to the developer as possible. In some cases, this means empowering developers through IDE plugins, linters, or other static analysis tools to identify quality or security issues in the lines of code they write. In other cases, Shift Left approaches focus on assessing components continuously within source control and the CI/CD pipeline. Both cases, however, tend to emphasize tasks like identifying security issues through static code analysis or checking dependency versions for CVEs.
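To make that concrete, here is a minimal sketch of the dependency-checking half of that workflow: a small Python script a CI stage could run to query the public OSV vulnerability database (api.osv.dev) for each pinned dependency. The requirements.txt location and the PyPI ecosystem are assumptions for illustration, not a prescription.

```python
import json
import sys
import urllib.request

OSV_URL = "https://api.osv.dev/v1/query"  # public OSV vulnerability API


def known_vulns(name: str, version: str) -> list[str]:
    """Return OSV vulnerability IDs affecting name==version, if any."""
    payload = json.dumps({
        "version": version,
        "package": {"name": name, "ecosystem": "PyPI"},  # assumed ecosystem
    }).encode()
    req = urllib.request.Request(
        OSV_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return [v["id"] for v in json.load(resp).get("vulns", [])]


def main() -> int:
    failures = 0
    # Assumes a simple requirements.txt of pinned "name==version" lines.
    for line in open("requirements.txt"):
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        ids = known_vulns(name, version)
        if ids:
            print(f"{name}=={version}: {', '.join(ids)}")
            failures += 1
    return 1 if failures else 0  # non-zero exit fails the CI stage


if __name__ == "__main__":
    sys.exit(main())
```

Run early in the pipeline, a check like this fails the build before a vulnerable dependency ever reaches an artifact, which is the whole point of shifting the work left.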
Shift Left for Data
But what about data? Why do conversations about Shift Left never address the need to identify issues related to data protection, privacy, or other data compliance concerns? Are the Privacy by Design (PbD) folks and DevSecOps folks saying the same thing in different languages?
The theory of Shift Left is that software defects are cheaper and easier to fix when they are identified sooner. Once defects become embedded in the design or behavior of an application or its users, both the effort to fix them and the impact of changing them are far more likely to be material.
For many organizations, this is even more true of data “defects.” When data is collected in the wrong format or stored in a poorly designed data model, subsequent migrations, application functionality, or data science activities are permanently affected. Even worse, when data is collected or used in violation of a regulatory or contractual obligation, these defects can turn into serious liability for companies.
Privacy by Design (PbD) is arguably intended to address many of these issues by enforcing a data protection mindset during the requirements and design phases of development. Product managers and software developers are encouraged to, by default, collect only the information that is absolutely needed.
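In code, that default might look like an allowlist enforced at the collection boundary, so anything not explicitly approved for a declared purpose is dropped before it is ever stored. The sketch below is hypothetical: the field names and purposes are invented for illustration, and a real system would tie the allowlist to documented requirements.

```python
# Data minimization at the collection boundary: fields not allowlisted
# for a declared purpose are discarded before storage.
# Field and purpose names here are hypothetical.

ALLOWED_FIELDS = {
    "account_signup": {"email", "display_name"},
    "billing": {"email", "billing_address", "payment_token"},
}


def minimize(payload: dict, purpose: str) -> dict:
    """Keep only the fields approved for this purpose; reject unknown purposes."""
    try:
        allowed = ALLOWED_FIELDS[purpose]
    except KeyError:
        raise ValueError(f"No collection approved for purpose: {purpose!r}")
    return {k: v for k, v in payload.items() if k in allowed}


# Extra fields submitted by a client are silently dropped, never stored:
record = minimize(
    {"email": "a@example.com", "display_name": "Ada", "birthdate": "1990-01-01"},
    purpose="account_signup",
)
assert "birthdate" not in record
```

But for many organizations, data subject privacy is just one of many considerations for compliance or data strategy execution.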
For example, many companies sell services, software, or SaaS in B2B markets. These transactions are generally viewed as less risky than traditional B2C offerings. While this may oftentimes be true from an individual privacy perspective, it ignores the fact that companies may still have other contractual or regulatory restrictions.
Contractual Restrictions on Data
The most common examples of contractual restrictions have to do with “purpose” or confidentiality. Customers obviously have concerns about the redistribution of confidential information, and it is critical that service providers not resell or retransmit that information verbatim. Most organizations think of this obligation as it relates to data breaches or the unauthorized retransmission of source material.
But what about when data is used for data science or machine learning? Can a service provider aggregate information collected from or provided by customers, train a machine learning model from it, and subsequently use or sell access to that machine learning model?
The answer to the contractual question is typically “no” – the service provider does not have this right. Yet in reality, many companies sign these agreements and then fail to honor them, actively using or aggregating customer information in ways, or for purposes, that violate their contractual obligations.
This non-compliance is often not intentional or malicious; it simply stems from a lack of understanding, oversight, and communication within the service provider. Legal is aware of the restriction, but no one in Legal explains this to the product management, software development, or data science teams.
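One Shift Left remedy is to make those restrictions machine-readable, so that pipelines rather than hallway conversations enforce them. The following sketch is hypothetical: the dataset names, purposes, and tagging scheme are assumptions, but it shows how a model-training job could fail fast when an input dataset was never licensed for that use.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: attach the purposes Legal actually negotiated to each
# dataset, so downstream jobs can check them instead of relying on hearsay.


@dataclass(frozen=True)
class Dataset:
    name: str
    source_customer: str
    permitted_purposes: frozenset = field(default_factory=frozenset)


class PurposeViolation(Exception):
    pass


def assert_permitted(datasets: list[Dataset], purpose: str) -> None:
    """Refuse to proceed if any input dataset lacks the requested purpose."""
    blocked = [d for d in datasets if purpose not in d.permitted_purposes]
    if blocked:
        names = ", ".join(d.name for d in blocked)
        raise PurposeViolation(f"{purpose!r} not permitted for: {names}")


# A training job declares its purpose up front and fails fast:
inputs = [
    Dataset("acme_tickets", "ACME", frozenset({"service_delivery"})),
    Dataset("globex_logs", "Globex", frozenset({"service_delivery", "ml_training"})),
]
assert_permitted(inputs, "service_delivery")   # passes
# assert_permitted(inputs, "ml_training")      # raises: acme_tickets not licensed
```

The design point is less the specific tagging scheme than the location of the check: the contractual restriction travels with the data, so the team that never spoke to Legal still can’t violate it silently.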
When companies are cognizant of their obligations but decide to proceed anyway, they often make risk-based calculations about the potential statutory or contractual damages – “if we were found out, would the penalty outweigh the benefits?” Historically, most unauthorized use was never discovered; if it was, harm was difficult to prove. Recent trends in statutory and civil penalties should incentivize organizations to revisit this calculus, however.
Service providers often assume that customers don’t care about most of their data, reasoning that only a small subset of their information, alone or in combination, could cause economic or reputational harm. But without a mechanism for customers to communicate those preferences, the safer course is to Just Say No. When large buyers procure software or SaaS, they increasingly use Data Processing Agreements (DPAs), required under regulations like the GDPR, to express these restrictions. Combine those DPAs with the complexity of second-hand and third-hand data supply chains, and you have a very tricky situation.