What Is Software Composition Analysis, and What Are Its Limitations?
How Does Software Composition Analysis Work?
Automation and Scale
SCA can technically be performed manually, but given the sheer volume of software components most organizations use, manual review of packages and dependencies would be costly and time-consuming. The proliferation and adoption of small open source components has only made this problem worse. Furthermore, because many organizations are constantly developing new software, updating their applications’ dependencies, or acquiring new components, manual review would significantly slow the pace of development. Most organizations therefore use at least one automated SCA tool, and because language coverage and functionality can vary significantly between tools, many use more than one.
Implementation
Software composition analysis can be implemented in a number of different ways. Some implementations work better for “compiled” languages on specific operating systems, like tools that handle statically-linked libraries for Linux applications written in C. Such tools typically rely on details of library or executable formats, like ELF or DLL files, to identify components that are statically or dynamically linked into an application, and they can identify both open and closed source components.
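As a rough illustration of the dynamically-linked case, the sketch below shells out to ldd to list the shared libraries a Linux ELF binary loads at runtime. It is a minimal sketch, not a production scanner: it assumes ldd is available on the system and it does not handle statically-linked code at all.

```python
import subprocess
import sys

def shared_libraries(binary_path: str) -> list[str]:
    """List shared libraries reported by `ldd` for an ELF binary (Linux only)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    libraries = []
    for line in result.stdout.splitlines():
        # Typical ldd output: "\tlibssl.so.3 => /usr/lib/x86_64-linux-gnu/libssl.so.3 (0x...)"
        parts = line.strip().split(" => ")
        if len(parts) == 2:
            libraries.append(parts[0].strip())
    return libraries

if __name__ == "__main__":
    for lib in shared_libraries(sys.argv[1]):
        print(lib)
```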
Other tools are designed to work only with open source package repositories like npm, PyPI, or Maven. This approach looks for common dependency declarations in an application’s source tree, like the pom.xml files used by Maven, the setup.py or requirements.txt files used with PyPI, or the package.json files used by npm. Some of these tools rely on language-native tooling to extract components from these dependency files, while others fall back on regular expressions or JSON parsing. Depending on the approach, dependencies may still be missed, such as when they are distributed from non-public repositories.
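For illustration, here is a minimal sketch of the declaration-parsing approach: JSON parsing for an npm package.json and a simple regular expression for a pip requirements.txt. Real tools handle many more cases (extras, environment markers, workspaces, lockfiles), so treat this as a toy example.

```python
import json
import re
from pathlib import Path

def npm_dependencies(package_json: str) -> dict[str, str]:
    """Read declared dependencies from an npm package.json via JSON parsing."""
    manifest = json.loads(Path(package_json).read_text())
    deps = {}
    deps.update(manifest.get("dependencies", {}))
    deps.update(manifest.get("devDependencies", {}))
    return deps

# Matches lines like "requests==2.31.0" or "flask>=2.0"; comments and pip options are skipped below.
REQ_LINE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*([=<>!~]=?.*)?$")

def pip_dependencies(requirements_txt: str) -> dict[str, str]:
    """Read declared dependencies from a requirements.txt via regular expressions."""
    deps = {}
    for line in Path(requirements_txt).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = REQ_LINE.match(line)
        if match:
            deps[match.group(1)] = (match.group(2) or "").strip()
    return deps
```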
As application distribution methods have become more varied, software composition tools have had to adapt as well. In the past, application binaries or source code were typically distributed standalone. Details of the environment, like the operating system and its version, system packages, or network configuration, were verified or configured manually based on documentation. Increasingly, applications are distributed together with their environment definitions, beginning with virtualized deployments like those on VMware’s stack and now often via containerized deployments on Docker and Kubernetes. Cloud orchestration technologies like CloudFormation or Terraform have also provided another deployment alternative for vendors and consumers.
As a result, software composition tools must be aware of many “environment specifications” for applications, including the images used for virtualization, the container or cluster definitions used for containerization, and the infrastructure-as-code specifications used for cloud orchestration. An SCA tool today may need to parse CloudFormation templates or scan Helm charts just to achieve complete recall of an application’s dependencies.
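As a small, hedged example of what this environment-aware scanning can look like, the sketch below pulls base images out of a Dockerfile and container images out of Kubernetes-style YAML manifests using simple regular expressions. Real tools resolve build arguments, multi-stage builds, and Helm templating, none of which this toy example handles.

```python
import re
from pathlib import Path

# "FROM python:3.12-slim AS builder" -> "python:3.12-slim"
FROM_LINE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE | re.MULTILINE)
# "    image: nginx:1.27" in a Kubernetes manifest or rendered Helm chart
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s]+)", re.MULTILINE)

def dockerfile_images(path: str) -> list[str]:
    """Collect base images referenced by FROM instructions in a Dockerfile."""
    return FROM_LINE.findall(Path(path).read_text())

def manifest_images(path: str) -> list[str]:
    """Collect container images referenced by image: keys in a YAML manifest."""
    return IMAGE_LINE.findall(Path(path).read_text())
```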
Scanning for Legal Notices
Some tools also scan for common legal disclaimers, like statements of copyright, license headers or license agreements, or trademarks. The types of findings generated by these scans are valuable for legal and compliance purposes, but can often be difficult to directly associate with a specific component, source file, or binary file. As a result, they are often less useful for information security considerations. In many cases, these copyright or license results are listed separately from more specific software component findings.
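To make the idea concrete, here is a minimal sketch that scans source files for copyright statements and SPDX license identifiers using regular expressions. It only illustrates the pattern-matching style of these scans; real scanners use far larger rule sets and full license-text matching.

```python
import re
from pathlib import Path

# A handful of simple patterns for common legal notices.
NOTICE_PATTERNS = {
    "copyright": re.compile(r"Copyright\s+(\(c\)|\u00a9)?\s*\d{4}.*", re.IGNORECASE),
    "spdx": re.compile(r"SPDX-License-Identifier:\s*([\w.\-+]+)"),
    "license_header": re.compile(r"Licensed under the .*", re.IGNORECASE),
}

def scan_legal_notices(root: str) -> list[tuple[str, str, str]]:
    """Return (file, notice_type, matched_text) tuples for notices found under root."""
    findings = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in NOTICE_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((str(path), name, match.group(0).strip()))
    return findings
```

Note how each finding here is tied to a file but not to a specific component, which is exactly why these results are often reported separately from component-level findings.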
What Questions Can’t SCA Answer?
Software composition analysis is a powerful technique, but there are some cases it was not designed to handle, and others where tools struggle to produce complete and accurate answers.
Transitive Dependencies and Metadata
Most SCA tools do not natively support resolving transitive dependencies – the “hidden” supply chain of dependencies. Because many dependencies have their own dependencies, which also have their own dependencies, the resulting “network” of components can be very large.
Resolving these dependencies typically requires metadata from package repository APIs or vendor databases, both of which come with their own costs or risks. In modern ecosystems like Python or Node.js, applications often have more transitive dependencies than explicit ones. As a result, the component lists produced by many tools are woefully incomplete if they omit transitive dependencies.
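As a hedged illustration of how repository metadata can be used for this, the sketch below walks declared dependencies using PyPI’s public JSON API (https://pypi.org/pypi/&lt;name&gt;/json) and its requires_dist field. It ignores version constraints, extras, and environment markers, so it is only a rough approximation of true dependency resolution.

```python
import json
import re
import urllib.request

def declared_dependencies(package: str) -> list[str]:
    """Return dependency names declared in a package's PyPI metadata (requires_dist)."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as response:
        info = json.load(response)["info"]
    names = []
    for requirement in info.get("requires_dist") or []:
        if ";" in requirement and "extra ==" in requirement:
            continue  # skip optional extras for simplicity
        match = re.match(r"[A-Za-z0-9_.\-]+", requirement)
        if match:
            names.append(match.group(0).lower())
    return names

def transitive_closure(package: str) -> set[str]:
    """Naively walk declared dependencies to approximate the transitive set."""
    seen, queue = set(), [package.lower()]
    while queue:
        current = queue.pop()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(declared_dependencies(current))
    return seen - {package.lower()}
```

Even a naive walk like this makes one network request per package, which hints at why full transitive resolution carries real costs for tools and their users.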
Even when software composition analysis tools do include transitive dependencies, they often do so using package metadata. In almost all circumstances, package metadata is self-reported by the package maintainer using a combination of machine-readable and freeform fields. In most cases, no standardization is enforced; for example, maintainers in popular ecosystems like npm and PyPI can create their own custom labels or categories. The consequence is that many packages have missing, inaccurate, or unused dependencies listed in their metadata. When SCA tools rely on this metadata alone, their findings are likewise inaccurate, incomplete, or misleading.
Versions and Dependency Pinning
A number of these issues relate specifically to package versions. For example, some common package ecosystems do not require “dependency pinning,” where developers specify exactly which version of each dependency is required. As a result, while software composition tools may identify which components are present, they may not know which versions of those components are in use, a critical detail needed to assess vulnerabilities or licensing changes. Because of this limitation, many SCA tools err on the side of warning about vulnerable packages when versions are unknown, producing large quantities of false positives.
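As a small illustrative sketch, the function below flags requirements.txt entries that are not pinned to an exact version with ==. It is a toy check: it ignores lockfiles, hashes, and the pinning mechanisms of other ecosystems.

```python
import re
from pathlib import Path

# "requests==2.31.0" is pinned; "requests>=2.0" or a bare "requests" is not.
PINNED = re.compile(r"^\s*[A-Za-z0-9_.\-]+\s*==\s*[\w.!+*]+\s*$")

def unpinned_requirements(requirements_txt: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for line in Path(requirements_txt).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```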
Recall vs. Precision
Since software composition analysis is typically used to identify and manage risks, most organizations are primarily concerned that coverage is complete. The idea is that you would rather cast a wide net and find all possible components – just in case there may be an infosec or compliance issue with one of them.
In statistics and machine learning, these ideas are captured by the complementary concepts of precision and recall: recall measures how many of the components actually present a tool reports, while precision measures how many of its reported components are actually present.
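In their standard form, with TP the number of true positives (components correctly reported), FP the number of false positives, and FN the number of false negatives:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$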
To understand real-world recall performance, a federally-funded university study examined nine common software composition tools. The research compared how the different tools identified dependencies and vulnerabilities within the same large web application. The results showed extreme variability, even though the tools all analyzed the exact same software. Below is a summary of findings from the paper:
- Within Maven (Java) components, the number of vulnerable dependencies identified was as low as 17 and as high as 332.
- Within NPM (JS) components, the number of vulnerable dependencies identified was as low as 32 and as high as 239.
- Overall, the total number of components identified varied by as much as 500-1000% between the lowest- and highest-recall tools.
While some of this variation may be due to omitted packages, some SCA tools may conversely overreport and generate false positives. The end result is that organizations face a trade-off between tools that find too many components and tools that find too few.
Data, Data, Data – The Not-So-Hugging Face
You may not be surprised to hear that software composition analysis does not identify data components. You would be forgiven for thinking that this is no big deal. Increasingly, however, data components either play material roles in the functioning of software or introduce novel compliance or legal risks related to the operation of software.
For example, many software components today function largely as “containers” for executing machine learning models or querying datasets. While the same software components may be distributed to many organizations, each organization might install different machine learning models or data sets for use by the software – thereby rendering the software’s behavior or legal obligations completely different from one organization to another.
Take, for example, platforms like Hugging Face. Today, Hugging Face makes it simple for organizations to access and utilize cutting-edge models with just a few lines of code. This is the beauty and attraction of their transformers library, which has been rapidly adopted by many academic and industrial users. But in this case, the ease of use through a single Python package is exactly what makes compliance and information security so difficult. While the first step is still identifying the presence of Hugging Face’s library, true compliance and risk management require that organizations examine the specific models and datasets pulled in through the transformers library.
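To illustrate, a few lines like the following (using the publicly documented transformers API and, purely as an example, the bert-base-uncased model) download model weights, tokenizer files, and configuration from the Hugging Face Hub at runtime. None of those downloaded artifacts show up in a typical SCA report, which would only see the transformers package itself.

```python
# A dependency scan of this project would list "transformers" (and its Python
# dependencies), but not the model data fetched below.
from transformers import AutoModel, AutoTokenizer

# Downloads configuration, tokenizer files, and weights from the Hugging Face Hub
# on first use, caching them locally (typically under ~/.cache/huggingface).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Software composition analysis does not see this model.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```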
PS: We have a whole post dedicated to technical explanations and examples of SCA shortcomings. If you’re interested in learning more about this topic, you should read that one next.