What Is Software Composition Analysis, and What Are Its Limitations?
How Does Software Composition Analysis Work?
Automation and Scale
SCA can technically be performed manually, but given the sheer volume of software components most organizations use, manual review of packages and dependencies would be costly and time-consuming. The proliferation and adoption of small open source components has only made this problem worse. Furthermore, because many organizations are constantly developing new software, updating their applications’ dependencies, or acquiring new components, manual review would significantly slow the pace of development. Most organizations therefore use at least one automated SCA tool, and because language coverage and functionality can vary significantly between tools, many use more than one.
Implementation
Software composition analysis can be implemented in a number of different ways. Some implementations work better for “compiled” languages on specific operating systems, like tools that handle statically-linked libraries for Linux applications written in C. Such tools typically rely on details of library or executable formats, like ELF or DLL files, to identify components that are statically or dynamically linked into an application, and they can identify both open and closed source components.
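As a rough illustration of the dynamically-linked case, the sketch below shells out to ldd to list the shared libraries a Linux ELF binary loads at runtime. It is a minimal sketch, not a production scanner: it assumes ldd is available on the system and it does not handle statically-linked code at all.

```python
import subprocess
import sys

def shared_libraries(binary_path: str) -> list[str]:
    """List shared libraries reported by `ldd` for an ELF binary (Linux only)."""
    result = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True, check=True
    )
    libraries = []
    for line in result.stdout.splitlines():
        # Typical ldd output: "\tlibssl.so.3 => /usr/lib/x86_64-linux-gnu/libssl.so.3 (0x...)"
        parts = line.strip().split(" => ")
        if len(parts) == 2:
            libraries.append(parts[0].strip())
    return libraries

if __name__ == "__main__":
    for lib in shared_libraries(sys.argv[1]):
        print(lib)
```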
Other tools are designed to work only with open source package repositories like npm, PyPI, or Maven. This approach looks for common dependency declarations in an application’s source tree, like the pom.xml files used by Maven, the setup.py or requirements.txt files used with PyPI, or the package.json files used by npm. Some of these tools rely on language-native tooling to extract components from these dependency files, while others fall back on regular expressions or JSON parsing. Depending on the approach, dependencies may still be missed, such as when they are distributed from non-public repositories.
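For illustration, here is a minimal sketch of the declaration-parsing approach: JSON parsing for an npm package.json and a simple regular expression for a pip requirements.txt. Real tools handle many more cases (extras, environment markers, workspaces, lockfiles), so treat this as a toy example.

```python
import json
import re
from pathlib import Path

def npm_dependencies(package_json: str) -> dict[str, str]:
    """Read declared dependencies from an npm package.json via JSON parsing."""
    manifest = json.loads(Path(package_json).read_text())
    deps = {}
    deps.update(manifest.get("dependencies", {}))
    deps.update(manifest.get("devDependencies", {}))
    return deps

# Matches lines like "requests==2.31.0" or "flask>=2.0"; comments and pip options are skipped below.
REQ_LINE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*([=<>!~]=?.*)?$")

def pip_dependencies(requirements_txt: str) -> dict[str, str]:
    """Read declared dependencies from a requirements.txt via regular expressions."""
    deps = {}
    for line in Path(requirements_txt).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = REQ_LINE.match(line)
        if match:
            deps[match.group(1)] = (match.group(2) or "").strip()
    return deps
```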
As application distribution methods have become more varied, software composition tools have had to adapt as well. In the past, application binaries or source code were typically distributed standalone. Details of the environment, like the operating system and its version, system packages, or network configuration, were verified or configured manually based on documentation. Increasingly, applications are distributed together with their environment definitions, beginning with virtualized deployments like those on VMware’s stack and now often via containerized deployments on Docker and Kubernetes. Cloud orchestration technologies like CloudFormation or Terraform have also provided another deployment alternative for vendors and consumers.
As a result, software composition tools must be aware of many “environment specifications” for applications, including the images used for virtualization, the container or cluster definitions used for containerization, and the infrastructure-as-code specifications used for cloud orchestration. An SCA tool today may need to parse CloudFormation templates or scan Helm charts just to achieve complete recall of an application’s dependencies.
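As a small, hedged example of what this environment-aware scanning can look like, the sketch below pulls base images out of a Dockerfile and container images out of Kubernetes-style YAML manifests using simple regular expressions. Real tools resolve build arguments, multi-stage builds, and Helm templating, none of which this toy example handles.

```python
import re
from pathlib import Path

# "FROM python:3.12-slim AS builder" -> "python:3.12-slim"
FROM_LINE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE | re.MULTILINE)
# "    image: nginx:1.27" in a Kubernetes manifest or rendered Helm chart
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s]+)", re.MULTILINE)

def dockerfile_images(path: str) -> list[str]:
    """Collect base images referenced by FROM instructions in a Dockerfile."""
    return FROM_LINE.findall(Path(path).read_text())

def manifest_images(path: str) -> list[str]:
    """Collect container images referenced by image: keys in a YAML manifest."""
    return IMAGE_LINE.findall(Path(path).read_text())
```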
Scanning for Legal Notices
Some tools also scan for common legal disclaimers, like statements of copyright, license headers or license agreements, or trademarks. The types of findings generated by these scans are valuable for legal and compliance purposes, but can often be difficult to directly associate with a specific component, source file, or binary file. As a result, they are often less useful for information security considerations. In many cases, these copyright or license results are listed separately from more specific software component findings.
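To make the idea concrete, here is a minimal sketch that scans source files for copyright statements and SPDX license identifiers using regular expressions. It only illustrates the pattern-matching style of these scans; real scanners use far larger rule sets and full license-text matching.

```python
import re
from pathlib import Path

# A handful of simple patterns for common legal notices.
NOTICE_PATTERNS = {
    "copyright": re.compile(r"Copyright\s+(\(c\)|\u00a9)?\s*\d{4}.*", re.IGNORECASE),
    "spdx": re.compile(r"SPDX-License-Identifier:\s*([\w.\-+]+)"),
    "license_header": re.compile(r"Licensed under the .*", re.IGNORECASE),
}

def scan_legal_notices(root: str) -> list[tuple[str, str, str]]:
    """Return (file, notice_type, matched_text) tuples for notices found under root."""
    findings = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for name, pattern in NOTICE_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((str(path), name, match.group(0).strip()))
    return findings
```

Note how each finding here is tied to a file but not to a specific component, which is exactly why these results are often reported separately from component-level findings.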
What Questions Can’t SCA Answer?
Software composition analysis is a powerful technique, but there are some cases it was not designed to handle, and others where tools struggle to produce complete and accurate answers.
Transitive Dependencies and Metadata
Most SCA tools do not natively support resolving transitive dependencies – the “hidden” supply chain of dependencies. Because many dependencies have their own dependencies, which also have their own dependencies, the resulting “network” of components can be very large.
Resolving these dependencies typically requires metadata from package repository APIs or vendor databases, both of which come with their own costs or risks. In modern ecosystems like Python or Node.js, applications often have more transitive dependencies than explicit ones. As a result, the component lists produced by many tools are woefully incomplete if they omit transitive dependencies.
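As a hedged illustration of how repository metadata can be used for this, the sketch below walks declared dependencies using PyPI’s public JSON API (https://pypi.org/pypi/&lt;name&gt;/json) and its requires_dist field. It ignores version constraints, extras, and environment markers, so it is only a rough approximation of true dependency resolution.

```python
import json
import re
import urllib.request

def declared_dependencies(package: str) -> list[str]:
    """Return dependency names declared in a package's PyPI metadata (requires_dist)."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as response:
        info = json.load(response)["info"]
    names = []
    for requirement in info.get("requires_dist") or []:
        if ";" in requirement and "extra ==" in requirement:
            continue  # skip optional extras for simplicity
        match = re.match(r"[A-Za-z0-9_.\-]+", requirement)
        if match:
            names.append(match.group(0).lower())
    return names

def transitive_closure(package: str) -> set[str]:
    """Naively walk declared dependencies to approximate the transitive set."""
    seen, queue = set(), [package.lower()]
    while queue:
        current = queue.pop()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(declared_dependencies(current))
    return seen - {package.lower()}
```

Even a naive walk like this makes one network request per package, which hints at why full transitive resolution carries real costs for tools and their users.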
Even when software composition analysis tools do include transitive dependencies, they often do so using package metadata. In almost all circumstances, package metadata is self-reported by the package maintainer using a combination of machine-readable and freeform fields. In most cases, no standardization is enforced; for example, maintainers in popular ecosystems like npm and PyPI can create their own custom labels or categories. The consequence is that many packages have missing, inaccurate, or unused dependencies listed in their metadata. When SCA tools rely on this metadata alone, their findings are likewise inaccurate, incomplete, or misleading.
Versions and Dependency Pinning
A number of these issues relate specifically to package versions. For example, some common package ecosystems do not require “dependency pinning,” where developers specify exactly which version of each dependency is required. As a result, while software composition tools may identify which components are present, they may not know which versions of those components are in use, a critical detail needed to assess vulnerabilities or licensing changes. Because of this limitation, many SCA tools err on the side of warning about vulnerable packages when versions are unknown, producing large quantities of false positives.
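As a small illustrative sketch, the function below flags requirements.txt entries that are not pinned to an exact version with ==. It is a toy check: it ignores lockfiles, hashes, and the pinning mechanisms of other ecosystems.

```python
import re
from pathlib import Path

# "requests==2.31.0" is pinned; "requests>=2.0" or a bare "requests" is not.
PINNED = re.compile(r"^\s*[A-Za-z0-9_.\-]+\s*==\s*[\w.!+*]+\s*$")

def unpinned_requirements(requirements_txt: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for line in Path(requirements_txt).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```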
Recall vs. Precision
Since software composition analysis is typically used to identify and manage risks, most organizations are primarily concerned that coverage is complete. The idea is that you would rather cast a wide net and find all possible components – just in case there may be an infosec or compliance issue with one of them.
In statistics and machine learning, these ideas are captured by the complementary concepts of precision and recall: recall measures how many of the components actually present a tool reports, while precision measures how many of its reported components are actually present.
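In their standard form, with TP the number of true positives (components correctly reported), FP the number of false positives, and FN the number of false negatives:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$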
To understand real-world recall performance, a federally-funded university study examined nine common software composition tools. The research compared how the different tools identified dependencies and vulnerabilities within the same large web application. The results showed extreme variability, even though the tools all analyzed the exact same software. Below is a summary of findings from the paper:
- Within Maven (Java) components, the number of vulnerable dependencies identified was as low as 17 and as high as 332.
- Within NPM (JS) components, the number of vulnerable dependencies identified was as low as 32 and as high as 239.
- Overall, the total number of components identified varied by as much as 500-1000% between the lowest- and highest-recall tools.
While some of this variation may be due to omitted packages, some SCA tools may conversely overreport and generate false positives. The end result is that organizations face a trade-off between tools that find too many components and tools that find too few.
Data, Data, Data – The Not-So-Hugging Face
You may not be surprised to hear that software composition analysis does not identify data components. You would be forgiven for thinking that this is no big deal. Increasingly, however, data components either play material roles in the functioning of software or introduce novel compliance or legal risks related to the operation of software.
For example, many software components today function largely as “containers” for executing machine learning models or querying datasets. While the same software components may be distributed to many organizations, each organization might install different machine learning models or data sets for use by the software – thereby rendering the software’s behavior or legal obligations completely different from one organization to another.
Take, for example, platforms like Hugging Face. Today, Hugging Face makes it simple for organizations to access and utilize cutting-edge models with just a few lines of code. This is the beauty and attraction of their transformers library, which has been rapidly adopted by many academic and industrial users. But in this case, the ease of use through a single Python package is exactly what makes compliance and information security so difficult. While the first step is still identifying the presence of Hugging Face’s library, true compliance and risk management require that organizations examine the specific models and datasets pulled in through the transformers library.
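To illustrate, a few lines like the following (using the publicly documented transformers API and, purely as an example, the bert-base-uncased model) download model weights, tokenizer files, and configuration from the Hugging Face Hub at runtime. None of those downloaded artifacts show up in a typical SCA report, which would only see the transformers package itself.

```python
# A dependency scan of this project would list "transformers" (and its Python
# dependencies), but not the model data fetched below.
from transformers import AutoModel, AutoTokenizer

# Downloads configuration, tokenizer files, and weights from the Hugging Face Hub
# on first use, caching them locally (typically under ~/.cache/huggingface).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Software composition analysis does not see this model.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```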
PS: We have a whole post dedicated to technical explanations and examples of SCA shortcomings. If you’re interested in learning more about this topic, you should read that one next.