Make grype-db download smaller by switching compression methods #367

willmurphyscode · 2024-08-21T18:27:31Z

What would you like to be added:

Grype should download a smaller file during its database update, probably by using .zstd compression on the current database schema.

Why is this needed:

The Grype database has grown over the years, to the point where it is now 184 MB as a gzipped tar. This puts load on the CDN and gives many users a poor experience.

Tasks:

- update yardstick to be able to handle zstd files: https://github.com/anchore/yardstick/blob/63765d429e650787df05711cd204fef04cb8efc8/src/yardstick/tool/grype.py#L260
- update grype-db-manager config shape to include a suffix per schema version, and pass that suffix as -e to grype-db package
- update that configuration to pass -e tar.zstd for v4 and v5 schemas
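The second and third tasks could be sketched roughly as follows. This is only an illustration, not the actual grype-db-manager config: the names `ARCHIVE_SUFFIX_BY_SCHEMA`, `DEFAULT_SUFFIX`, and `package_command` are hypothetical, the `tar.gz` fallback is an assumption based on the current gzipped tar, and any other arguments that `grype-db package` needs are omitted.

```python
# Hypothetical config shape: one archive suffix per schema version.
# Only the v4/v5 -> "tar.zstd" mapping comes from the task list above.
ARCHIVE_SUFFIX_BY_SCHEMA = {
    4: "tar.zstd",
    5: "tar.zstd",
}

# Assumed fallback for schema versions without an explicit entry.
DEFAULT_SUFFIX = "tar.gz"


def package_command(schema_version: int) -> list[str]:
    """Build the argument list that passes the suffix as -e to grype-db package."""
    suffix = ARCHIVE_SUFFIX_BY_SCHEMA.get(schema_version, DEFAULT_SUFFIX)
    return ["grype-db", "package", "-e", suffix]


if __name__ == "__main__":
    for version in (3, 4, 5):
        print(version, " ".join(package_command(version)))
```

Keeping the suffix in per-schema configuration means older schemas can stay on their existing archive format while newer ones switch.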


wagoodman · 2024-09-17T19:58:36Z

We may be able to use Xz in a performant way instead if we use https://github.com/xi2/xz. This appears to be an order of magnitude faster than https://github.com/ulikunitz/xz for decompression. It would mean we'd need to shell out to compress within grype-db, which seems like an alright tradeoff (I think the ulikunitz library yields larger archives than the native xz utils).

Another consideration is on the compression side: I'm seeing that golang-only implementations are not achieving the best compression ratios compared to native tooling. That implies we might want to shell out to native tooling when creating archives.
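The split described here (shell out to native tooling to compress, decompress in-process) can be illustrated with a small sketch. It is written in Python rather than Go purely for illustration, it assumes the `xz` command-line tool is installed on the build machine, and the helper names are hypothetical; it is not how grype-db or grype actually does this.

```python
import lzma
import subprocess
from pathlib import Path


def native_xz_command(src: Path, level: int = 9) -> list[str]:
    """Arguments for the native xz CLI: keep the input, write to stdout."""
    return ["xz", f"-{level}", "--keep", "--stdout", str(src)]


def compress_native(src: Path, dest: Path, level: int = 9) -> None:
    """Compress by shelling out to native xz (assumes xz is on PATH)."""
    with dest.open("wb") as out:
        subprocess.run(native_xz_command(src, level), stdout=out, check=True)


def decompress_in_process(data: bytes) -> bytes:
    """Decompress with the in-process library, no external tool needed."""
    return lzma.decompress(data)
```

The producer side can afford a dependency on external tooling, while the consumer side stays self-contained.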

wagoodman · 2024-09-17T20:53:24Z

Prototype for grype is here: anchore/grype@main...fast-xz. Decompression is down from 80 seconds with ulikunitz to 16 seconds. Before continuing: is this acceptable? With v6 the DB will be much smaller than what was tested here; assuming the trend is linear, it looks like it will take ~10 seconds to decompress.

What's missing is removing some of the copied untar code from go-getter and leveraging the stereoscope tar utils (may require some refactoring in stereoscope).

popey · 2024-09-18T16:51:12Z

While busy doing other things, I ran a compression benchmark against today's grype vuln database. I don't know if it's valuable data to you, but I am posting here anyway. I ran it on my ThinkPad Z13, so it's 1-2-year-old commodity hardware.

Summary

Algorithm             Time(U+S)(s)  Time(E)(M:s)  ComprRatio  SpaceSave(%)
xz                    309.47        5:06.46       14.15       92.94
gzip                  20.79         0:19.93       7.79        87.18
bzip2                 119.72        1:58.66       11.26       91.13
lzip                  265.31        4:24.80       13.19       92.42
lzma                  312.10        5:09.29       14.13       92.93
lzop                  2.68          0:02.30       4.90        79.64
zstd                  6.22          0:03.89       9.01        88.92
lzip                  261.30        4:21.17       13.17       92.42
7z                    553.75        0:55.57       13.87       92.80
zip                   19.71         0:19.96       7.78        87.17
zstd -T0 -1           8.20          0:01.46       8.18        87.79
zstd -T0 -3 (def.)    13.97         0:01.85       9.01        88.92
zstd -T0 -5           33.25         0:04.06       9.58        89.57
zstd -T0 -10          68.36         0:08.60       10.94       90.87
zstd -T0 -15          236.82        0:32.13       11.23       91.10
zstd -T0 -19          1756.01       3:57.46       13.27       92.47
zstd -T0 --ultra -22  2193.86       13:03.73      17.52       94.30

Full results

(csv format)

Algorithm,Time(U+S)(s),Time(E)(M:s),ComprRatio,SpaceSave(%),T-Start(UT),T-End(UT),S-Start(b),S-End(b),Command
xz,309.47,5:06.46,14.15,92.94,1726672883,1726673189,1445834752,102163132,tar --absolute-names --xz -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.xz /home/alan/.cache/grype/db/5/vulnerability.db
gzip,20.79,0:19.93,7.79,87.18,1726673189,1726673209,1445834752,185461612,tar --absolute-names --gzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.gz /home/alan/.cache/grype/db/5/vulnerability.db
bzip2,119.72,1:58.66,11.26,91.13,1726673209,1726673328,1445834752,128338698,tar --absolute-names --bzip2 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.bz2 /home/alan/.cache/grype/db/5/vulnerability.db
lzip,265.31,4:24.80,13.19,92.42,1726673328,1726673593,1445834752,109612679,tar --absolute-names --lzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lz /home/alan/.cache/grype/db/5/vulnerability.db
lzma,312.10,5:09.29,14.13,92.93,1726673593,1726673902,1445834752,102269096,tar --absolute-names --lzma -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lzma /home/alan/.cache/grype/db/5/vulnerability.db
lzop,2.68,0:02.30,4.90,79.64,1726673902,1726673905,1445834752,294485866,tar --absolute-names --lzop -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lzop /home/alan/.cache/grype/db/5/vulnerability.db
zstd,6.22,0:03.89,9.01,88.92,1726673905,1726673909,1445834752,160330902,tar --absolute-names --zstd -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
lzip,261.30,4:21.17,13.17,92.42,1726673909,1726674170,1445834752,109719253,tar --absolute-names --lzip -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.lz /home/alan/.cache/grype/db/5/vulnerability.db
7z,553.75,0:55.57,13.87,92.80,1726674170,1726674225,1445834752,104210534,7z a -bso0 -bsp0 /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.7z /home/alan/.cache/grype/db/5/vulnerability.db
zip,19.71,0:19.96,7.78,87.17,1726674225,1726674245,1445834752,185629104,zip -q -r /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.zip /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -1,8.20,0:01.46,8.18,87.79,1726674245,1726674247,1445834752,176662743,tar --absolute-names -I zstd -T0 -1 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -3 (def.),13.97,0:01.85,9.01,88.92,1726674247,1726674249,1445834752,160330902,tar --absolute-names -I zstd -T0 -3 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -5,33.25,0:04.06,9.58,89.57,1726674249,1726674253,1445834752,150815387,tar --absolute-names -I zstd -T0 -5 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -10,68.36,0:08.60,10.94,90.87,1726674253,1726674262,1445834752,132079524,tar --absolute-names -I zstd -T0 -10 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -15,236.82,0:32.13,11.23,91.10,1726674262,1726674294,1445834752,128697206,tar --absolute-names -I zstd -T0 -15 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 -19,1756.01,3:57.46,13.27,92.47,1726674294,1726674531,1445834752,108899112,tar --absolute-names -I zstd -T0 -19 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
zstd -T0 --ultra -22,2193.86,13:03.73,17.52,94.30,1726674531,1726675315,1445834752,82520379,tar --absolute-names -I zstd -T0 --ultra -22 -cf /home/alan/Source/Misairuzame/compression-benchmark/tmp/tmparch.tar.zst /home/alan/.cache/grype/db/5/vulnerability.db
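For readers checking the numbers: the ComprRatio and SpaceSave(%) columns appear to follow from the S-Start(b) and S-End(b) byte counts. A minimal sketch, assuming the usual definitions (ratio = original / compressed, saving = 1 - compressed / original); the function name is hypothetical and not part of the benchmark script:

```python
def compression_stats(original_bytes: int, compressed_bytes: int) -> tuple[float, float]:
    """Return (compression ratio, space saving in percent)."""
    ratio = original_bytes / compressed_bytes
    space_saving_pct = (1 - compressed_bytes / original_bytes) * 100
    return ratio, space_saving_pct


if __name__ == "__main__":
    # S-Start(b) and S-End(b) from the xz row of the CSV above.
    ratio, saving = compression_stats(1445834752, 102163132)
    print(f"ratio={ratio:.2f} saving={saving:.2f}%")
```

For the xz row this gives roughly 14.15 and 92.93. The table reports 92.94, so the benchmark script presumably rounds or truncates slightly differently in the last digit.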

jonjohnsonjr · 2024-11-13T00:15:14Z

I would highly recommend using zstandard over xz!

wagoodman · 2024-11-20T17:14:10Z

Me too! While evaluating, I've been trying to minimize file size without impacting decompression time in grype. Something that threw a wrench into the evaluation was deciding when to use golang implementations of these methods versus shelling out to native tooling. I've found that compressing with a golang implementation tends to give less ideal compression ratios and decompression times. The lesson learned here is: compress with native tooling (for the best archives), decompress with golang implementations (which lets us easily keep grype as a portable static binary).

I also found that the compression ratio is pretty sensitive to what is being compressed, so while prototyping a new schema we ended up changing a lot of the details based on the ratios we were getting with those designs. For instance, a more normalized DB design tended to give a smaller DB file but not a great compression ratio when compressing for distribution, whereas relaxing normalization and leaning more towards a JSON blob store maximized the ratio.

So! Where are we at today with all of the feedback incorporated? In terms of distribution sizes:

original:
711M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar

archives:
 81M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-19.tar.zst    # zstd -19
 63M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-22.tar.zst    # zstd -22 --ultra
 58M    build/vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar.xz        # xz -9

Xz -9 and Zstd -22 are comparable enough in size to both be candidates here.

And timing (after trying out / swapping some decompression libs... I'll spare folks the details here):

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774.tar.xz
grype db import   6.32s user 0.34s system 89% cpu 7.488 total

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-19.tar.zst
grype db import   1.21s user 0.58s system 168% cpu 1.065 total

❯ time grype db import vulnerability-db_v6.0.0_2024-11-14T01:32:00Z_1732070774-22.tar.zst
grype db import   2.89s user 0.46s system 130% cpu 2.574 total

From a timing perspective Zstd wins here.

edit: --ultra impacts the memory used in decompression. For archives around these sizes, -22 uses ~130 MB of memory while -21 uses ~70 MB, so we might still be tweaking some of the final values here.

Overall, the final verdict is Zstd 🎉

wagoodman · 2024-11-27T22:10:35Z

Adding changelog-ignore: although this is implemented in #437, it won't be usable until v6 is enabled as the default schema (probably in a couple of months). We don't want to pick this up in the next release notes.

willmurphyscode added the enhancement New feature or request label Aug 21, 2024

anchoretoolsops added this to OSS Aug 21, 2024

wagoodman added this to the DB v6 milestone Sep 17, 2024

wagoodman changed the title ~~Make grype-db download smaller by using zstd compression~~ Make grype-db download smaller by switching compression methods Sep 17, 2024

wagoodman moved this to Ready in OSS Sep 17, 2024

wagoodman mentioned this issue Sep 20, 2024

Use faster xz lib for decompression anchore/grype#2139

Merged

wagoodman self-assigned this Sep 26, 2024

wagoodman moved this from Ready to In Progress in OSS Sep 26, 2024

wagoodman linked a pull request Nov 21, 2024 that will close this issue

Enable packaging v6 DBs #437

Draft

wagoodman added the changelog-ignore Don't consider when generating the changelog label Nov 27, 2024

wagoodman linked a pull request Nov 27, 2024 that will close this issue

Enable packaging v6 DBs #437

Draft
