Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose option to scan without indexing #3360

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

adammcclenaghan
Copy link
Contributor

@adammcclenaghan adammcclenaghan commented Oct 21, 2024

Description

There was some discussion recently regarding supporting scanning without indexing in this issue

In that issue, the main complaint is related to the time taken to scan.

The most promising solution presented was:

[most promising] Allow the user to opt into non-** globs only and use the unindexed resolver. We'd want to be able to be loud in terms of logging/UI when a ** glob is used when in an "unindexed" mode. This has a few variants, such as:
a. for select catalogers (primarily OS ones) have hard coded absolute paths to common DB locations (inflexible but simple)
b. expose glob-level configuration for all catalogers, allowing the user to extend or override the globs that are searched for (flexible but potentially dangerous)

However, on memory constrained systems with large disks, file system indexing can use considerable memory.

I am raising this PR as I'd like to propose exposing the unindexed resolver, as a form of preferring slower scan times in favour of lower memory usage.

As a result, I would like to propose getting this change in without the added complexity of Allow the user to opt into non-** globs. Perhaps adding that is a reasonable next step as part of a follow up PR?

This partially addresses #3145 but does not necessarily fix as it will not improve their scan times, however it will allow for reduced memory imapct on memory constrained systems.

As part of this change, I've changed the behaviour of RelativeFileByPath and added a unit test, the existing implementation didn't seem to work.

When package.go::fetchMd5Contents ran with the unindexed resolver it did not resolve correctly because the location field passed into the function points at the DPKG status file, so the existing code which tried to do path.Clean(path.Join(l.RealPath, p)) wound up trying to join something like /var/lib/dpkg/status with /info to search /var/ib/dpkg/status/info which is incorrect as we want to search in /var/lib/dpkg/info.

One other thing I've noticed while testing this change is that I get more 'location' files for DPKG artifacts when scanning with the unindexed resolver than with the indexed resolver, however I'm not sure this is a bug with the unindexed resolver, perhaps there is a bug with the existing indexed resolver?

With the indexed resolver, I get these locations for one of my DPKG artifacts:

      "locations": [
        {
          "path": "var/lib/dpkg/info/acpid.conffiles",
          "accessPath": "var/lib/dpkg/info/acpid.conffiles",
          "annotations": {
            "evidence": "supporting"
          }
        },
        {
          "path": "var/lib/dpkg/info/acpid.md5sums",
          "accessPath": "var/lib/dpkg/info/acpid.md5sums",
          "annotations": {
            "evidence": "supporting"
          }
        },
        {
          "path": "var/lib/dpkg/status",
          "accessPath": "var/lib/dpkg/status",
          "annotations": {
            "evidence": "primary"
          }
        }
      ],

With the unindexed resolver, I get these locations for the same DPKG artifact:

      "locations": [
        {
          "path": "var/lib/dpkg/info/acpid.conffiles",
          "accessPath": "var/lib/dpkg/info/acpid.conffiles",
          "annotations": {
            "evidence": "supporting"
          }
        },
        {
          "path": "var/lib/dpkg/info/acpid.list",
          "accessPath": "var/lib/dpkg/info/acpid.list"
        },
        {
          "path": "var/lib/dpkg/info/acpid.md5sums",
          "accessPath": "var/lib/dpkg/info/acpid.md5sums",
          "annotations": {
            "evidence": "supporting"
          }
        },
        {
          "path": "var/lib/dpkg/info/acpid.postinst",
          "accessPath": "var/lib/dpkg/info/acpid.postinst"
        },
        {
          "path": "var/lib/dpkg/info/acpid.postrm",
          "accessPath": "var/lib/dpkg/info/acpid.postrm"
        },
        {
          "path": "var/lib/dpkg/info/acpid.preinst",
          "accessPath": "var/lib/dpkg/info/acpid.preinst"
        },
        {
          "path": "var/lib/dpkg/info/acpid.prerm",
          "accessPath": "var/lib/dpkg/info/acpid.prerm"
        },
        {
          "path": "var/lib/dpkg/status",
          "accessPath": "var/lib/dpkg/status",
          "annotations": {
            "evidence": "primary"
          }
        }
      ],

The difference in behaviour relates to parse_dpkg_db.go::findDpkgInfoFiles and this call in particular:

        // look for /var/lib/dpkg/info/NAME.*
	locations, err := resolver.FilesByGlob(path.Join(searchPath, name+".*"))

Here we pass something like:

  • searchPath = /var/lib/dpkg/info/
  • name + .* = acpid.*

Giving us a FilesByGlob path of var/lib/dpkg/info/acpid.*

On my system, I believe we should be matching all of these:

$ find ./var/lib/dpkg/info -name 'acpid.*'
./var/lib/dpkg/info/acpid.postinst
./var/lib/dpkg/info/acpid.md5sums
./var/lib/dpkg/info/acpid.postrm
./var/lib/dpkg/info/acpid.prerm
./var/lib/dpkg/info/acpid.preinst
./var/lib/dpkg/info/acpid.conffiles
./var/lib/dpkg/info/acpid.list

However, with the indexed directory resolver the pattern var/lib/dpkg/info/acpid.* gets converted in the r.chroot.ToNativeGlob(pattern) call to /var/lib/dpkg/info/acpid./* which returns no matches. This explains why the indexed resolver doesn't find locations like acpid.prerm in the output. The md5sums and conffiles are found by package.go::fetchMd5Contents and package.go::fetchConffileContents respectively and they use RelativeFileByPath which explains why they still work.

Maybe I'm unclear on what FilesByGlob should be returning here, and perhaps the issue is with the existing unindexed_directory impl, but I'd like to get some clarification on what we expect the behaviour to be here.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

- Implements some new functions on unindexed directory which previously had 'panic' for their impl
- Fixed RelativeFileByPath for unindexed directory resolver

Signed-off-by: adammcclenaghan <[email protected]>
@@ -228,6 +229,9 @@ func (cfg *Catalog) AddFlags(flags clio.FlagSet) {

flags.StringVarP(&cfg.Source.BasePath, "base-path", "",
"base directory for scanning, no links will be followed above this directory, and all paths will be reported relative to this directory")

flags.BoolVarP(&cfg.Unindexed, "unindexed", "",
"whether to index the file system or not, indexing can improve scan times but incurs a memory overhead, default false.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose a CLI option for this? I feel like this is more of a low-level option, which hints at leaving it only in the config (and overridable via env vars too).

if err != nil {
return nil, fmt.Errorf("unable to create directory resolver: %w", err)
if s.config.Unindexed {
s.resolver = fileresolver.NewFromRootedUnindexedDirectory(s.config.Path, s.config.Base)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to modify this resolver to accept exclusions -- I think it would be surprising to users to have different exclusion behavior based off of changing only an index related configuration. If this is too difficult or infeasible then we should at least warn the user with log.Warn().

@wagoodman wagoodman added the enhancement New feature or request label Oct 23, 2024
@@ -10,20 +10,22 @@ import (
"github.com/anchore/syft/syft/source"
)

func NewSourceProvider(path string, exclude source.ExcludeConfig, alias source.Alias, basePath string) source.Provider {
func NewSourceProvider(path string, exclude source.ExcludeConfig, alias source.Alias, basePath string, unindexed bool) source.Provider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're in v1.0 would this be considered a breaking change for user's who are using this method for syft as a library?

@anchore/tools before we go to merge this sans other comments is this something we want to put more 👀 on before merge, release, and then potentially break some people's usage?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants