Commit

docs: Further improve formatting and docs of H2 database caching strategies

Signed-off-by: Chad Wilson <[email protected]>
chadlwilson committed Jul 4, 2024
1 parent 51f84ff commit 12b5238
Showing 1 changed file with 54 additions and 27 deletions.
81 changes: 54 additions & 27 deletions src/site/markdown/data/cacheh2.md
Caching ODC's H2 Database
=========================================

Many users of dependency-check ensure that ODC runs as fast as possible by caching
the entire `data` directory, including the H2 database (`odc.mv.db`).

The location of the `data` directory is different for each integration (cli, maven, gradle, etc.); however, each
integration allows users to configure this location.

There are two primary strategies used:

Single node database updater with multiple node "readers"
---------------------------------------------------------

Use a single node to build the database using the integration in "update only" mode
(e.g., `--updateOnly` for the cli) and specify the data directory location (see
the configuration documentation for each integration's configuration).
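
As a minimal sketch of this step using the CLI (the `/opt/odc/data` path is an arbitrary example, and `--data` is assumed to be the CLI's data directory option):

```sh
# Run on the single "updater" node: refresh the vulnerability database only,
# without scanning anything. /opt/odc/data is an arbitrary example path;
# --data is assumed to be the CLI option that sets the data directory.
dependency-check.sh --updateOnly --data /opt/odc/data
```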

The `data` directory is then archived somewhere accessible to all nodes, usually using one of two common caching
strategies:
1. **Use shared disk storage (e.g. network mounted)**

   Subsequent nodes point directly to the same mounted storage being written to by the single node updater.

2. **Use a common artifact storage location/repository**

   Subsequent nodes will download the archived `data` directory before scanning and unpack it to the relevant location.

The "reader" nodes are configured with `--noupdate` (or the related configuration to disable the updates in each
integration) so they are not reliant on outgoing calls.
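
For the artifact-storage variant, a reader node's build step might look roughly like the following sketch (the URL, archive name, and paths are hypothetical, and `--data`, `--scan`, and `--out` are assumed CLI options):

```sh
# Fetch the archived data directory published by the updater node
# (URL, file name, and paths are examples only; the archive is assumed to
# contain the `data` directory at its top level).
curl -sSf -o odc-data.tar.gz https://artifacts.example.com/odc/odc-data.tar.gz
tar -xzf odc-data.tar.gz -C /opt/odc

# Scan with updates disabled so this node never makes outgoing update calls.
dependency-check.sh --noupdate --data /opt/odc/data --scan ./my-project --out ./odc-report
```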

The cached `data` directory (and H2 database) is generally updated by the single node/process daily in this use
case, but it could be designed with a more frequent update.
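
For example, a daily refresh on the updater node could be scheduled with cron; `update-odc-data.sh` here is a hypothetical wrapper script around the update-only run and the archive/upload step:

```
# Hypothetical crontab entry on the single updater node: refresh and
# re-publish the ODC data directory once a day at 03:00.
# update-odc-data.sh is an assumed wrapper around the --updateOnly run
# and the archive/upload step.
0 3 * * * /opt/odc/update-odc-data.sh
```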

This approach is often used when:
- updating of the database/cache needs to be centrally controlled/co-ordinated
- internet access is not available to all nodes/users, and is perhaps only available centrally or difficult to
  configure (e.g. proxied environments)
- nodes/users of ODC data cannot safely collaborate on a shared cache without affecting one another

Multiple node database updaters collaborating on a common cache location
------------------------------------------------------------------------

Instead of having only a single update node, all nodes are allowed to update the database if necessary. However,
the entire `data` directory is zipped and stored in a common location, including the H2 database, `cache`, and in
some cases cached data from multiple upstream sources.

There are two common caching strategies here:
1. **Use shared disk storage (e.g. network mounted)**

   Every node is pointed to writeable shared storage, e.g. network mounted. ODC creates an update lock file within
   the shared storage when any individual node is updating, and other nodes will wait for the lock to be released.

2. **Use a common artifact storage location/repository**

   Prior to running ODC, each node downloads the latest version of the archived `data` directory from the shared
   artifact storage and unpacks it to the relevant location.

   Each node then executes a scan (with updates enabled) and, if successful, the updated `data` directory is archived
   and uploaded to the common location for use by the next node (a rough sketch of this cycle follows the list).
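
A rough sketch of that cycle using the CLI and an artifact repository reachable over HTTP (all URLs, file names, and paths are hypothetical):

```sh
# 1. Fetch and unpack the most recent shared data directory, if one has been
#    published yet (URL, file names, and paths are examples only).
curl -sSf -o odc-data.tar.gz https://artifacts.example.com/odc/odc-data.tar.gz \
  && tar -xzf odc-data.tar.gz -C /opt/odc

# 2. Scan with updates enabled, so ODC refreshes the database when it is stale.
dependency-check.sh --data /opt/odc/data --scan ./my-project --out ./odc-report

# 3. On success, archive and re-upload the (possibly updated) data directory
#    so the next node benefits from the refreshed cache.
tar -czf odc-data.tar.gz -C /opt/odc data
curl -sSf -T odc-data.tar.gz https://artifacts.example.com/odc/odc-data.tar.gz
```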

Since this strategy relies on all nodes updating the common cache to be effective:
- it does not help if nodes download from the common cache, but don't share the updated cache with others by uploading it
- it requires some degree of consistency in how all nodes configure ODC to ensure the cache is not corrupted by others

This approach is usually used when:
- ensuring data is updated more deterministically after validity period expiry is desirable (e.g. `nvdValidForHours`)
- configuring ODC with single writer and multiple reader strategies adds excessive friction
- reliance on a centralised updater is undesirable and a more de-centralised approach is useful

Additional Notes
----------------

The `data` directory may also contain cached data from other upstream sources, depending
on which analyzers are enabled. Ensuring that file modification times are retained during
archiving and un-archiving will make these safe to cache, which is especially important in
a multi-node update strategy.
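
For example, `tar` preserves modification times by default both when creating and when extracting archives, so a round trip like the following sketch (paths and file names are examples only) is safe, provided timestamp-resetting options such as GNU tar's `-m`/`--touch` are avoided on extraction:

```sh
# tar stores and restores modification times by default, so a round trip like
# this keeps the cached upstream data valid (avoid -m/--touch on extraction,
# which would reset the times). Paths and file names are examples only.
tar -czf odc-data.tar.gz -C /opt/odc data
tar -xzf odc-data.tar.gz -C /opt/odc
```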
