Store path provenance tracking #11749
base: master
Conversation
This looks like a cool idea. How does it help me determine which expression (which line of which file) in the checkout of some repository defines the …? Like, you implied this would support tracking the store path back to the expression. And, in the …
Nix historically has been bad at being able to answer the question "where did this store path come from", i.e. to provide traceability from a store path back to the Nix expression from which it was built. Nix tracks the "deriver" of a store path (the `.drv` file that built it), but that's pretty useless in practice, since it doesn't link back to the Nix expressions.

So this PR adds a "provenance" field (a JSON object) to the `ValidPaths` table and to `.narinfo` files that describes where the store path came from and how it can be reproduced. There are currently 3 types of provenance:

* `copied`: Records that the store path was copied or substituted from another store (typically a binary cache). Its "from" field is the URL of the origin store. Its "provenance" field propagates the provenance of the store path on the origin store.
* `derivation`: Records that the store path is the output of a `.drv` file. This is equivalent to the "deriver" field, but it has a nested "provenance" field that records how the `.drv` file was created.
* `flake`: Records that the store path was created during the evaluation of a flake output.

Example:

    $ nix path-info --json /nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0
    {
      "/nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0": {
        "provenance": {
          "from": "https://cache.example.org",
          "provenance": {
            "drv": "rlabxgjx88bavjkc694v1bqbwslwivxs-patchelf-0.18.0.drv",
            "output": "out",
            "provenance": {
              "flake": {
                "lastModified": 1729856604,
                "narHash": "sha256-obmE2ZI9sTPXczzGMerwQX4SALF+ABL9J0oB371yvZE=",
                "owner": "NixOS",
                "repo": "patchelf",
                "rev": "689f19e499caee8e5c3d387008bbd4ed7f8dc3a9",
                "type": "github"
              },
              "output": "packages.x86_64-linux.default",
              "type": "flake"
            },
            "type": "derivation"
          },
          "type": "copied"
        },
        ...
      }
    }

This specifies that the store path was copied from the binary cache https://cache.example.org and that it's the `out` output of a store derivation that was produced by evaluating the flake output `packages.x86_64-linux.default` of some revision of the patchelf GitHub repository.

Depends on #11668.
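As an illustration (not part of the PR), here is a minimal sketch of how a consumer could walk such a nested provenance object, assuming the JSON shape shown in the example above and using nlohmann::json, which the PR's code already depends on:

```cpp
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Sketch: walk a nested provenance record (shaped like the example above)
// from the outermost "copied" record down to the flake that produced the path.
void printProvenanceChain(const nlohmann::json & prov, int depth = 0)
{
    std::string indent(depth * 2, ' ');
    auto type = prov.value("type", std::string("unknown"));

    if (type == "copied")
        std::cout << indent << "copied from " << prov.value("from", std::string("?")) << "\n";
    else if (type == "derivation")
        std::cout << indent << "output '" << prov.value("output", std::string("?"))
                  << "' of " << prov.value("drv", std::string("?")) << "\n";
    else if (type == "flake")
        std::cout << indent << "flake output " << prov.value("output", std::string("?")) << "\n";

    // Each record may carry the provenance of whatever it was made from.
    if (prov.contains("provenance"))
        printProvenanceChain(prov.at("provenance"), depth + 1);
}
```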
It doesn't currently, since that information wouldn't be enough to reproduce the store derivation (i.e. a package function in Nixpkgs requires arguments to be able to reproduce its output, not to mention stuff like overrides). But storing the top-level flake + flake output name that caused the store derivation to be created does allow the store derivation to be reproduced.
The problem there is that evaluation of non-flake expressions is not hermetic, so we really do need something like flakes for provenance.
It will be less likely that you can verify the provenance, but something could be recorded nonetheless.
(I haven't read the whole diff yet, so apologies for questions I could have answered myself, but these will need to be documented anyway, so also you're welcome :) )
Many evaluations will produce the same paths. How do we deal with that? I suppose we only need a … Another solution is to only store the first provenance, but this is too arbitrary IMO, and can also be achieved with a "first referrer" field if we feel like storing all referrer edges is too expensive or impractical for "non-enumerating" stores like the binary cache stores.

Putting new appendable data into the stores, including the binary cache stores, is quite a step. Do we really need this to be in the binary cache? A lot of the value of this feature could instead be produced by a local database, since that's where evaluation and realisation ultimately happen anyway.

Some questions: …

Things to be documented and/or implemented: …
    struct ProvFlake
    {
        std::shared_ptr<nlohmann::json> flake; // FIXME: change to Attrs
        std::string flakeOutput;
Suggested change:

    -        std::string flakeOutput;
    +        std::vector<std::string> flakeOutput;
     * derivation input source) that was produced by the evaluation of
     * a flake.
     */
    struct ProvFlake
This is a layer violation. We could define something like

    struct ProvOther { std::string type; nlohmann::json value; };

at the store layer and refine this in upper layers.
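A rough sketch of that idea (the names and shapes here are hypothetical, not taken from the PR): the store layer keeps an opaque typed record, and an upper layer refines records whose type it understands.

```cpp
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical store-layer type: provenance is just "some type plus an
// opaque payload"; the store doesn't need to know what a flake is.
struct ProvOther
{
    std::string type;      // e.g. "flake", "copied", ...
    nlohmann::json value;  // opaque payload, interpreted by upper layers
};

// Hypothetical upper-layer refinement (roughly what ProvFlake carries in the PR).
struct ProvFlake
{
    nlohmann::json flake;    // flake reference attributes
    std::string flakeOutput; // e.g. "packages.x86_64-linux.default"
};

// The flake/eval layer interprets records it knows about and ignores the rest.
std::optional<ProvFlake> asFlakeProvenance(const ProvOther & prov)
{
    if (prov.type != "flake") return std::nullopt;
    return ProvFlake{
        .flake = prov.value.at("flake"),
        .flakeOutput = prov.value.at("output").get<std::string>(),
    };
}
```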
I'm thinking about getting rid of all the `Prov*` types and just passing provenance around as a JSON value.
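In that direction, provenance records would presumably just be built and nested as plain JSON objects matching the shape of the example in the PR description. A hedged sketch (the helper name is made up for illustration):

```cpp
#include <string>
#include <utility>
#include <nlohmann/json.hpp>

// Hypothetical helper: wrap the provenance obtained from an origin store in a
// "copied" record, with no Prov* class hierarchy involved.
nlohmann::json makeCopiedProvenance(
    const std::string & fromStoreUrl,
    nlohmann::json upstreamProvenance)
{
    return {
        {"type", "copied"},
        {"from", fromStoreUrl},
        {"provenance", std::move(upstreamProvenance)},
    };
}
```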
Indeed provenance doesn't need to be hermetic or reproducible, so we could certainly have a provenance type for non-flake evaluations.
The provenance is the evaluation that produced the store path, i.e. the first one. There can of course be many other evaluations that produce the same store path, but those are not the provenance for that particular store / binary cache. (The same applies to other types of provenance like substitution: a path can be substituted from many binary caches, but we only record the one we actually used.) Recording other provenances makes the metadata for a store path potentially grow without bounds. And in the case of .narinfo files, we really don't want to update them after creation due to caching etc. This is the same semantics as the deriver field BTW.
I think so, because without that you can't query the ultimate provenance of a store path in a binary cache like cache.nixos.org.
I do not like this PR as a solution to the problem of provenance tracking. I think this approach is something that could be implemented in any ecosystem, while Nix is in the unique position that it could really do so much better.

On the origin of build outputs
The signatures that we already use for transport security in binary caches do provide this kind of information already, because they are evidence of where you got an output, and I think it is a mistake to design a solution to this problem that just sidesteps them. Extending the signing scheme would give us an actual cryptographic basis on which we can attribute the outputs of individual build steps to their actual builders, while this PR only propagates second-hand information down the chain with a kind of attribution that is not really trustworthy, because any link in the chain can just alter it. Because signed information is attributable to the signer, it works much better across systems.

On attributing build outputs to derivations, and derivations to flakes

Attributing build outputs to derivations, and derivations to flakes, is a problem that can be solved better locally.
This naturally makes it difficult to keep track of derivations themselves, because the derivation hash is computed from the derivation, and so you already have to know the derivation to look up anything. Instead of attaching flake references to the destination of this mapping, we can view flakes as a higher-level mapping, which summarizes and tracks subtrees of build steps, including the derivations involved.
It would also be possible to record dependencies between flakes that way, by making a DB entry whenever we cross a boundary to another upstream flake while building or substituting. Similarly, during evaluation we can record "kind of but not really" the inverse of ….
Based on all three of these relations, you can start at the NAR hash of any output and walk through its reverse dependencies until you hit a flake output, or continue walking until you find them all. One side benefit of attributing paths to flakes in any way would be that it makes the contents of the local store of a system less opaque.

I did read through the code in this PR a few days ago. I hope that I have understood the gist of it correctly, and I hope this makes sense to you. In any case, I would really appreciate it if you could give me the benefit of the doubt and we could further discuss this/my work on issues like this somewhere.
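For concreteness, here is a hedged sketch (entirely hypothetical, not from the PR or from Nix's actual database schema) of what local SQLite tables for the three relations described above might look like, written as an embedded schema string in C++:

```cpp
// Hypothetical schema, for illustration only; table and column names are
// made up and do not correspond to Nix's real local database.
static const char * provenanceSchema = R"sql(
    -- (1) which derivation output produced which store path
    create table if not exists BuiltFrom (
        outputPath text primary key,
        drvPath    text not null,
        outputName text not null
    );

    -- (2) which flake output an evaluation attributed a derivation to
    create table if not exists EvaluatedFrom (
        drvPath     text primary key,
        flakeRef    text not null,
        flakeOutput text not null
    );

    -- (3) dependencies between flakes, recorded when building or substituting
    -- crosses the boundary into an upstream flake
    create table if not exists FlakeDependsOn (
        flakeRef         text not null,
        upstreamFlakeRef text not null,
        primary key (flakeRef, upstreamFlakeRef)
    );
)sql";
```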
🎉 All dependencies have been resolved!