
Syft cycloneDX: create sBOM data from source packages instead of binary packages (e.g. debian packages) ? #1700

Open
ericbl opened this issue Mar 28, 2023 · 11 comments
Labels
enhancement New feature or request needs-research

Comments

@ericbl

ericbl commented Mar 28, 2023

tl;dr: could syft offer an option to generate a cycloneDX sBOM for OS packages that considers only the source package (currently put in the upstream part of the purl, and also in metadata:source with the version in metadata:sourceVersion for some package managers) and not the binaries?

Hello,
let's start with the business background: every piece of software delivered by our company needs a proper clearance of open source software (OSS clearing).
Each team must generate an sBOM and get all software components analyzed on the shared SW360 platform: components must be properly identified and the source code provided.
A dedicated team will go through the source code to check the licenses.

Each team can use the tool of its choice to create the components on SW360. Some even take the path of doing it manually.
In our team, we create software that will eventually be deployed as a container image (docker for now): we use debian bullseye slim as the base image, and our software can include further packages, either built from source or installed from a package manager (debian, pip, npm, nuget), depending on the language (go, python, nodejs, ruby, c#, etc).
Therefore, in my team, we want to use Syft to generate a CycloneDX BOM and eventually transform it to get the components uploaded to our SW360.

Syft is already providing the list of licenses but this is unfortunately not considered (yet) in our process.

Considering debian packages, the internal team dealing with the debian OS (let's call it DebT) insists on using only the source package and not the binary.
DebT starts with the list of debian components from this command:
dpkg-query -f '${source:Package}|${source:Version}|${binary:Package}|${Version}\n' -W

DebT eventually only takes ${source:Package}|${source:Version}

Currently, however, the syft command generates a cycloneDX bom based on the binaries. The source is sometimes set as a metadata property and then attached as the upstream part of the purl. This is particularly true for libraries, which produces duplicate components alongside the non-lib variant (e.g. curl and libcurl both pointing to the same source).
I've seen this upstream= addition only for debian packages, not yet for other package providers.

This, however, creates a purl with an upstream extension that is not defined in the standard:
https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#deb

Let's take a real example. This is one line from the dpkg-query output above.

util-linux|2.36.1-8+deb11u1|bsdutils|1:2.36.1-8+deb11u1

DebT is only interested in scanning the source files, so it considers this package as
name: util-linux
version: 2.36.1-8+deb11u1
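
As a rough illustration of the DebT view, the sketch below reduces `dpkg-query` output in the four-field format above to unique source packages. The function name `source_packages` and the second sample line are assumptions made for illustration; they are not taken from DebT's actual tooling.

```python
def source_packages(dpkg_output: str) -> list[tuple[str, str]]:
    """Return the sorted, unique (source name, source version) pairs."""
    seen = set()
    for line in dpkg_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:
            continue  # skip blank lines or lines in an unexpected format
        src_name, src_version, _bin_name, _bin_version = fields
        seen.add((src_name, src_version))
    return sorted(seen)


# The first line is the example from this issue; the second is illustrative.
sample = (
    "util-linux|2.36.1-8+deb11u1|bsdutils|1:2.36.1-8+deb11u1\n"
    "util-linux|2.36.1-8+deb11u1|mount|2.36.1-8+deb11u1\n"
)
packages = source_packages(sample)
```

Both binary packages collapse into a single source entry, which is the kind of de-duplication a binary-based sBOM does not provide on its own.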

Syft generates the following in the cycloneDX sBOM:

  "bom-ref": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11&package-id=677e6ace24dce684",
  "type": "library",
  "publisher": "util-linux packagers <[email protected]>",
  "name": "bsdutils",
  "version": "1:2.36.1-8+deb11u1",
  "cpe": "cpe:2.3:a:bsdutils:bsdutils:1\\:2.36.1-8\\+deb11u1:*:*:*:*:*:*:*",
  "purl": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11",
  "properties": [
    {
      "name": "syft:metadata:installedSize",
      "value": "394"
    },
    {
      "name": "syft:metadata:source",
      "value": "util-linux"
    },
    {
      "name": "syft:metadata:sourceVersion",
      "value": "2.36.1-8+deb11u1"
    }
  ]

Source and sourceVersion are set as properties, as well as in the upstream part.
For us, the correct package data would be

  "name": "util-linux",
  "version": "2.36.1-8+deb11u1",
  "purl": "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source"

(according to the purl spec, arch should be set to source when we refer to the source package)

We are working on our own transformation of the syft output, but I wonder whether this could better be a special output from syft directly.
What do you think?

@ericbl ericbl added the enhancement New feature or request label Mar 28, 2023
@gernot-h

I would be very interested in the background of the upstream qualifier. I couldn't find much about it besides the initial PR #769. Is this somehow aligned with other sBOM scanning tools, CycloneDX team etc? I guess this was a workaround for the restriction of CycloneDX to one package-url, right?

Note that I once requested CycloneDX support for specifying source information via externalReferences, but additional URLs won't allow source references to be specified in a unique way (think about mirrors, .zip vs. .tar.gz links etc.).

So these days we prefer the arch=source qualifier in Debian purls; here's the background discussion: package-url/purl-spec#57.

cc: @wagoodman

@ericbl
Author

ericbl commented Mar 30, 2023

The following jq query does the job AFTER the syft scan of all layers, transforming the "binary based" CycloneDX into a "source based" CycloneDX:

   jq '"syft:metadata:source" as $srcName | "syft:metadata:sourceVersion" as $srcVersion 
        |.components[] 
        |= . + ( ( 
          .version as $componentVersion 
          | .properties//[] 
          | from_entries 
          | select(has($srcName)) 
          | (.[$srcVersion]//$componentVersion) as $version 
          | .[$srcName] as $name 
          | { $name, $version, purl: "pkg:deb/debian/\($name)@\($version)?arch=source" } 
        ) // {}
      )' syft_cyclonedx_bom.json > dx_bom_src.json

Thanks to StackOverflow for helping me implement this query!

@wagoodman
Contributor

@ericbl -- that's one heck of a jq command! (adding mental note to work on my jq chops... ). Let me see if I can answer a few questions.

I would be very interested in the background of the upstream qualifier.

We hesitated on adding this for a long time, specifically because the upstream param is out-of-spec, as pointed out. Using the pURL in this way has been very useful from a vulnerability matching point-of-view in grype, probably for the same reason that your internal DebT wants the SBOM results oriented with a source purl instead of being aligned with the binary: for vulnerability matching the source package matters most, since vulnerabilities tend to be written against the source package and not downstream packages.

Syft supports multiple SBOM formats, and the goal is to allow for grype to interop with these SBOMs in a way where vulnerability matching will not differ just because you've decided to use a different SBOM format. We explored multiple options for both SPDX and CycloneDX to express a source package clearly for the purposes of vulnerability matching but also wanted to ensure that it was clear to the SBOM consumer that these source packages were not found to be installed. At the time the methods we explored couldn't check all the boxes (the boxes were roughly: 1) be clear to the user what's being expressed, 2) be able to show what's installed vs upstream relationships, and 3) be interoperable with multiple formats).

Grype also supports being able to perform vulnerability matching when only specifying a pURL or set of pURLs. This, combined with the other efforts, made me lean towards adding an out-of-spec qualifier onto the pURL. upstream aligned nicely with multiple OS ecosystems that have these vulnerability matching requirements.

I've seen this upstream= addition only for debian packages, not yet on other package providers.

All OS catalogers tend to have this feature: https://github.com/search?q=repo%3Aanchore%2Fsyft%20PURLQualifierUpstream&type=code (alpm, apk, deb, rpm).

we now prefer the arch=source qualifier in Debian purls these days

Correct, no dispute here about the source qualifier 👍 I agree that using the source qualifier is the right thing to do when writing a pURL for a source package.

However, this did not fulfill the needs of what we're trying to convey, which is "here is the [binary] package we found, and this is the package which it came from (the source package)". A pURL representing the source package alone only answers half of what was needed, and providing multiple pURLs is confusing for something that should be used as an identity (so should be singular).

@ericbl
Author

ericbl commented Apr 3, 2023

Thanks for your answer.
OK, so you prefer the current purl for the integration with Grype.
I suggested a special option to generate a source-based output, not a change to your default output :)

Another way would indeed be a 2nd purl, but as you pointed out, it should not be named "purl" since that one should be unique. But we could name it differently!

I found the cycloneDX spec a bit imprecise on this point: I did not find any rule about using either a "source purl" or a "binary purl".
Adding a 2nd purl to the cycloneDX spec could be an option...

@ericbl
Author

ericbl commented Apr 3, 2023

All OS catalogers tend to have this feature: https://github.com/search?q=repo%3Aanchore%2Fsyft%20PURLQualifierUpstream&type=code (alpm, apk, deb, rpm).

Thanks, I only tried debian, npm, python, etc., but not yet other linux distributions. I'll try alpine / apk asap.

It means my jq command above is not correct and will need to be even more complex, with a regex to rebuild the purl!

@gernot-h

gernot-h commented Apr 4, 2023

I found the cycloneDX spec a bit imprecise on this point: I did not find any rule about using either a "source purl" or a "binary purl".
Adding a 2nd purl to the cycloneDX spec could be an option...

@ericbl, a while back, I raised a similar topic with the CycloneDX team. It was not about a source purl, but about adding a specific type for external source references. The CycloneDX team however argued that it's not easy/possible to distinguish between "source" and "binary" references throughout all ecosystems: CycloneDX/specification#98. I guess the same arguments would apply to source purls, so I wouldn't expect this to happen soon...

Also taking the point of @wagoodman into consideration, that an SBOM should express what is "installed" in an image, a "source purl" would somehow be inconsistent in the default SBOM.

But still, the feature as requested by @ericbl here – adding a syft --upstream mode to produce an "upstream relationship SBOM" – would be very helpful for us, and I think it wouldn't be in contradiction with the CycloneDX spec or the other goals of Syft.

@ericbl
Author

ericbl commented Apr 4, 2023

The extraction of the source purl differs between package managers.
For instance with alpine, I just got this component (I removed some data irrelevant to the current discussion)

{
	 "bom-ref": "pkg:apk/alpine/busybox-binsh@1.35.0-r29?arch=x86_64&upstream=busybox&distro=alpine-3.17.2&package-id=256fc96b4a8c4da8",
	  "type": "library",
	  "publisher": "Sören Tempel <[email protected]>",
	  "name": "busybox-binsh",
	  "version": "1.35.0-r29",
	  "purl": "pkg:apk/alpine/busybox-binsh@1.35.0-r29?arch=x86_64&upstream=busybox&distro=alpine-3.17.2",
	  "externalReferences": [
		{
		  "url": "https://busybox.net/",
		  "type": "distribution"
		}
	  ],
	  "properties": [
		{
		  "name": "syft:package:foundBy",
		  "value": "apkdb-cataloger"
		},
		{
		  "name": "syft:package:metadataType",
		  "value": "ApkMetadata"
		},
		{
		  "name": "syft:package:type",
		  "value": "apk"
		},
		{
		  "name": "syft:metadata:originPackage",
		  "value": "busybox"
		},
		{
		  "name": "syft:metadata:size",
		  "value": "1547"
		}
	  ]
}

So the upstream part is built from "syft:metadata:originPackage" on Alpine, instead of from "syft:metadata:source" as on Debian.

This means my proposed jq command above is wrong: I should parse the upstream part of the purl rather than rely on metadata that differs between package managers.

Having an '--upstream' mode as proposed by Gernot would help us a lot and avoid getting crazy with jq :)
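
To make the difference concrete, here is a minimal sketch that reads the source name and version from the upstream qualifier of a Syft-style purl using only the Python standard library. The helper name `split_upstream` is hypothetical, the Alpine purl is reconstructed from the name and version fields of the component above, and falling back to the binary version when upstream carries no version is an assumption, not documented Syft behavior.

```python
from urllib.parse import unquote


def split_upstream(purl: str):
    """Return (source_name, source_version), or None if there is no upstream."""
    base, _, query = purl.partition("?")
    query = query.partition("#")[0]  # drop an optional subpath
    qualifiers = {}
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        if sep:
            # unquote() keeps "+" intact, which matters for Debian versions
            qualifiers[key] = unquote(value)
    upstream = qualifiers.get("upstream")
    if not upstream:
        return None
    # The binary version is whatever follows the last "@" before the "?"
    _, at, binary_version = base.rpartition("@")
    name, sep, version = upstream.partition("@")
    if sep:
        return name, version
    return name, (unquote(binary_version) if at else None)


deb = ("pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1"
       "?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11")
apk = ("pkg:apk/alpine/busybox-binsh@1.35.0-r29"
       "?arch=x86_64&upstream=busybox&distro=alpine-3.17.2")
deb_source = split_upstream(deb)
apk_source = split_upstream(apk)
```

The Debian purl carries an explicit source version after the encoded `@`, while the Alpine purl only names the origin package, so the version has to come from somewhere else.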

@wagoodman
Contributor

There are two paths forward:

  • The path you suggested: add an application configuration option to replace the binary purls for source purls on OS packages.
  • Another path: add an application configuration option to add additional packages that represent the source packages (derived from information we find from the binary packages) and add package-to-package relationships between them (e.g. build-dependency-of).

These aren't mutually exclusive, so both in theory could be done, but I'm interested in hearing folks' thoughts on which might be more useful (or if there are any other ideas here).
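
For comparison, the second path could look roughly like the sketch below when applied to a CycloneDX JSON document after the scan. It is only one possible representation and not something Syft does today: the helper names are hypothetical, the source lookup uses the Debian-specific properties shown earlier in this issue, and the CycloneDX `dependencies` graph only offers a generic `dependsOn` link rather than a typed relationship such as build-dependency-of.

```python
SOURCE_KEY = "syft:metadata:source"
VERSION_KEY = "syft:metadata:sourceVersion"


def debian_source(component: dict):
    """Return (name, version, purl) of the source package, or None."""
    props = {p["name"]: p["value"] for p in component.get("properties", [])}
    name = props.get(SOURCE_KEY)
    if not name:
        return None
    version = props.get(VERSION_KEY, component.get("version"))
    return name, version, f"pkg:deb/debian/{name}@{version}?arch=source"


def add_source_components(bom: dict, to_source=debian_source) -> dict:
    """Append one component per source package and link each binary to it."""
    components = bom.setdefault("components", [])
    dependencies = bom.setdefault("dependencies", [])
    known_refs = {c.get("bom-ref") for c in components}
    deps_by_ref = {d["ref"]: d for d in dependencies}
    for comp in list(components):  # snapshot: new components are not revisited
        source = to_source(comp)
        if source is None or "bom-ref" not in comp:
            continue
        name, version, purl = source
        if purl not in known_refs:
            components.append({"bom-ref": purl, "type": "library",
                               "name": name, "version": version, "purl": purl})
            known_refs.add(purl)
        entry = deps_by_ref.get(comp["bom-ref"])
        if entry is None:
            entry = {"ref": comp["bom-ref"], "dependsOn": []}
            dependencies.append(entry)
            deps_by_ref[entry["ref"]] = entry
        if purl not in entry.setdefault("dependsOn", []):
            entry["dependsOn"].append(purl)
    return bom


def _binary(ref: str, name: str) -> dict:
    # Illustrative binary component with the Debian source properties set.
    return {"bom-ref": ref, "name": name, "version": "2.36.1-8+deb11u1",
            "properties": [{"name": SOURCE_KEY, "value": "util-linux"},
                           {"name": VERSION_KEY, "value": "2.36.1-8+deb11u1"}]}


bom = add_source_components(
    {"components": [_binary("bin-1", "bsdutils"), _binary("bin-2", "mount")]})
```

With this shape the binary components stay in the document, so a consumer interested only in sources would still need to filter on `arch=source` or on the dependency links.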

@ericbl
Author

ericbl commented May 8, 2023

Your 2nd path proposal seems a bit more complex. And how could I then filter out the packages listing binary information I am not interested in?

@ericbl
Author

ericbl commented May 8, 2023

As I wrote above, my jq query is specific to Debian and difficult to maintain.
Therefore, I replaced it in our pipeline with the following python script.

import argparse
import json

from packageurl import PackageURL


def transform_json(import_json, export_json):
    # read the cycloneDX json produced by syft
    with open(import_json, encoding="utf-8") as file:
        image_sbom = json.load(file)

    for comp in image_sbom.get('components', []):
        if 'purl' not in comp:
            continue
        # extract the purl items.
        syft_purl = PackageURL.from_string(comp['purl'])
        qualifiers = syft_purl.qualifiers or {}
        # check the parsed qualifiers rather than the raw string, so that a
        # package merely named "...upstream..." is not matched by mistake
        if 'upstream' not in qualifiers:
            continue
        # extract the upstream and then the name and version of the source package.
        # example: "purl": "pkg:deb/debian/bsdutils@1:2.36.1-8+deb11u1?arch=amd64&upstream=util-linux%402.36.1-8+deb11u1&distro=debian-11"
        upstream = qualifiers['upstream']
        if "@" in upstream:
            # split only once, in case the version itself contains an "@"
            name, version = upstream.split("@", 1)
        else:
            # no source version given (e.g. apk): keep the binary version
            name, version = upstream, syft_purl.version
        # build a source purl from the purl items, keeping the distro if present.
        src_purl = f"pkg:{syft_purl.type}/{syft_purl.namespace}/{name}@{version}?arch=source"
        if 'distro' in qualifiers:
            src_purl += f"&distro={qualifiers['distro']}"
        # replace the purl
        comp['purl'] = src_purl
        # example: "purl": "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source&distro=debian-11"
        # update component's name and version
        comp['name'] = name
        comp['version'] = version

    # write the output json
    with open(export_json, "w", encoding="utf-8") as file:
        json.dump(image_sbom, file, ensure_ascii=False)  # unicode output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--import_json", type=str,
                        help="cycloneDX json-file out of syft", default="syft-cyclone-dx_sbom.json")
    # an empty default would make open() fail, so the output path is required
    parser.add_argument("--export_json", type=str, required=True,
                        help="cycloneDX json-file out of current transformation")
    args = parser.parse_args()

    transform_json(args.import_json, args.export_json)

My pipeline script is then:

    - /usr/local/bin/syft $SCAN_CONTAINER_IMAGE --scope all-layers -o cyclonedx-json=syft-cyclone-dx_sbom.json
    - python syft.transform_sbom-bin-purl-to-source-purl.py --import_json syft-cyclone-dx_sbom.json --export_json image_sbom.json
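
Once binary purls are replaced by source purls, several components can end up identical (for example every binary built from util-linux). A possible follow-up step, sketched here under the assumption that the first occurrence can simply be kept, is to drop the repeats. The function name `dedupe_components` is hypothetical, and the sketch does not rewrite `dependencies` entries that still point at the `bom-ref` of a dropped component.

```python
def dedupe_components(components: list[dict]) -> list[dict]:
    """Keep the first component seen for each purl (or name/version pair)."""
    seen = set()
    unique = []
    for comp in components:
        # Fall back to (name, version) for components without a purl
        key = comp.get("purl") or (comp.get("name"), comp.get("version"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(comp)
    return unique


src = "pkg:deb/debian/util-linux@2.36.1-8+deb11u1?arch=source&distro=debian-11"
transformed = [
    {"name": "util-linux", "version": "2.36.1-8+deb11u1", "purl": src},
    {"name": "util-linux", "version": "2.36.1-8+deb11u1", "purl": src},
    {"name": "left-pad", "version": "1.3.0", "purl": "pkg:npm/left-pad@1.3.0"},
]
deduped = dedupe_components(transformed)
```

In the pipeline above this would run on `image_sbom['components']` just before the JSON is written.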

@kzantow
Contributor

kzantow commented Jun 8, 2023

There seem to be a couple of paths forward here. Although this isn't a priority at the moment, we've promoted this to our backlog; we welcome pull requests and would be happy to help.

Projects
Status: Backlog
Development

No branches or pull requests

4 participants