How Does Software Composition Analysis Work?
Automation and Scale
SCA can technically be performed manually, but given the sheer volume of software components most organizations use, manual review of packages and dependencies would be costly and time-consuming. The proliferation and adoption of small open source components has only exacerbated the problem. Furthermore, since many organizations are constantly developing new software, updating their applications' dependencies, or acquiring new components, manual review would significantly slow the pace of development. For these reasons, most organizations use at least one automated SCA tool, and because language coverage and functionality vary significantly between tools, many use several.
Software composition analysis can be implemented in a number of different ways. Some implementations work best for compiled languages on specific operating systems, such as tools that handle statically linked libraries for Linux applications written in C. These tools typically use detailed knowledge of library and executable formats, like ELF binaries on Linux or DLLs on Windows, to identify components statically or dynamically linked into an application. Because they operate on binaries rather than package manifests, such tools can identify both open and closed source components.
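As a minimal sketch of the binary-analysis approach, the snippet below checks for the ELF magic bytes and then greps the binary for embedded shared-library names. This is a deliberately naive heuristic for illustration; a production tool would parse the ELF `.dynamic` section's `DT_NEEDED` entries rather than matching strings.

```python
import re

ELF_MAGIC = b"\x7fELF"

def is_elf(data: bytes) -> bool:
    """Check the four-byte ELF magic at the start of a binary."""
    return data[:4] == ELF_MAGIC

# Matches names like "libssl.so.3" embedded in the binary's string data.
SONAME = re.compile(rb"lib[\w+-]+\.so(?:\.\d+)*")

def naive_shared_libs(data: bytes) -> list[str]:
    """Heuristically list shared-library names referenced by an ELF binary.

    A real SCA tool would walk the .dynamic section's DT_NEEDED entries
    instead of scanning for strings; this sketch only illustrates the idea.
    """
    if not is_elf(data):
        return []
    return sorted({name.decode() for name in SONAME.findall(data)})

# Synthetic example: a fake ELF header followed by two library names.
fake = ELF_MAGIC + b"\x00" * 12 + b"libssl.so.3\x00libc.so.6\x00"
print(naive_shared_libs(fake))
```

The same identification step, run against every executable in a build artifact, yields the component inventory that later stages (vulnerability matching, license lookup) consume.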
Other tools are designed to work only with open source package repositories like npm, PyPI, or Maven. This approach looks for common dependency declarations associated with those ecosystems, like the pom.xml files used by Maven, the setup.py or requirements.txt files used with PyPI, or the package.json files used by npm. Some of these tools rely on language-native tooling to resolve components from the dependency files; others fall back on regular expressions or JSON parsing. Depending on the approach used, dependencies may still be missed, such as when they are distributed from non-public repositories.
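To make the manifest-scanning approach concrete, here is a hedged sketch of both parsing styles mentioned above: JSON parsing for an npm package.json, and a regular-expression fallback for a pip requirements.txt. The regex is simplified for illustration and handles only plain `name` plus version-specifier lines, not extras, URLs, or `-r` includes.

```python
import json
import re

def parse_package_json(text: str) -> dict[str, str]:
    """Extract dependency names and version ranges from an npm package.json."""
    manifest = json.loads(text)
    deps: dict[str, str] = {}
    for section in ("dependencies", "devDependencies"):
        deps.update(manifest.get(section, {}))
    return deps

# Simplified pattern: a package name, optionally followed by a version
# specifier such as ==1.0 or >=2.0 (extras, URLs, and includes are ignored).
REQ_LINE = re.compile(r"^([A-Za-z0-9_.\-]+)\s*([=<>!~]=?\S*)?$")

def parse_requirements(text: str) -> dict[str, str]:
    """Naively extract dependencies from a pip requirements.txt."""
    deps: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        m = REQ_LINE.match(line)
        if m:
            deps[m.group(1)] = (m.group(2) or "").strip()
    return deps

print(parse_package_json('{"dependencies": {"left-pad": "^1.3.0"}}'))
print(parse_requirements("requests==2.31.0\n# a comment\nflask>=2.0"))
```

The trade-off the text describes shows up directly here: the JSON path is exact but ecosystem-specific, while the regex path is portable but can silently miss declarations it does not anticipate.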
As application distribution methods have become more varied, software composition tools have had to adapt as well. In the past, application binaries or source code were typically distributed standalone. Details of the environment, like the operating system, operating system version, system packages, or network configuration, were verified or configured manually based on documentation. Increasingly, applications are distributed together with their environment definitions – beginning with virtualized deployments like those on VMware's stack and now often via containerized deployments like those on Docker and Kubernetes. Cloud orchestration technologies like CloudFormation or Terraform have also provided another deployment alternative for vendors and consumers.
As a result, software composition tools must understand many “environment specifications” for applications, including images used for virtualization, container or cluster definitions used for containerization, and infrastructure-as-code specifications for cloud orchestration. To capture the full set of an application's dependencies, SCA tools today may need to parse CloudFormation templates or scan Helm charts.
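One small, concrete instance of environment-specification scanning is pulling base-image references out of a Dockerfile, since every package in the base image becomes part of the application's dependency surface. The sketch below, using a simplified regex (it does not handle `ARG` substitution or parser directives), illustrates the idea.

```python
import re

# Matches "FROM <image>" lines; build-stage aliases ("AS build") and
# ARG-substituted image names are ignored in this simplified sketch.
FROM_LINE = re.compile(r"^\s*FROM\s+(\S+)", re.IGNORECASE | re.MULTILINE)

def base_images(dockerfile_text: str) -> list[str]:
    """List the base-image references declared in a Dockerfile so their
    packages can be folded into the component inventory."""
    return FROM_LINE.findall(dockerfile_text)

dockerfile = (
    "FROM python:3.12-slim AS build\n"
    "RUN pip install .\n"
    "FROM gcr.io/distroless/python3\n"
)
print(base_images(dockerfile))
```

An SCA tool would then resolve each image reference, enumerate its layers, and scan the packages inside, which is how container contents end up alongside library dependencies in the final report.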
Scanning for Legal Notices
Some tools also scan for common legal disclaimers, like statements of copyright, license headers or license agreements, or trademarks. The types of findings generated by these scans are valuable for legal and compliance purposes, but can often be difficult to directly associate with a specific component, source file, or binary file. As a result, they are often less useful for information security considerations. In many cases, these copyright or license results are listed separately from more specific software component findings.
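A notice scan of this kind can be sketched with a few indicative phrases. The patterns below are illustrative examples only; a real scanner would match files against a curated database such as the SPDX license texts rather than a handful of regexes.

```python
import re

# Phrases that commonly signal a copyright or license notice.
# This short list is a stand-in for a full license-text database.
NOTICE_PATTERNS = [
    re.compile(r"Copyright\s+(?:\(c\)|\u00a9)?\s*\d{4}", re.IGNORECASE),
    re.compile(r"Licensed under the Apache License", re.IGNORECASE),
    re.compile(r"Permission is hereby granted", re.IGNORECASE),  # MIT-style
]

def find_notices(text: str) -> list[str]:
    """Return the first matching snippet for each notice pattern found."""
    hits = []
    for pattern in NOTICE_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits

source = "# Copyright (c) 2024 Example Corp\n# Permission is hereby granted..."
print(find_notices(source))
```

Note what this scan cannot tell you, which is exactly the limitation described above: a match says a notice exists somewhere in a file, but not which component or binary the notice actually governs.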