Store path provenance tracking #11749
base: master
Conversation
This looks like a cool idea. How does it help me determine which expression (which line of which file) in the checkout of some repository defines the …? Like, you implied this would support tracking the store path back to the expression. And, in the …
Nix historically has been bad at being able to answer the question "where did this store path come from", i.e. to provide traceability from a store path back to the Nix expression from which it was built. Nix tracks the "deriver" of a store path (the `.drv` file that built it), but that's pretty useless in practice, since it doesn't link back to the Nix expressions.

So this PR adds a "provenance" field (a JSON object) to the `ValidPaths` table and to `.narinfo` files that describes where the store path came from and how it can be reproduced. There are currently 3 types of provenance:

* `copied`: Records that the store path was copied or substituted from another store (typically a binary cache). Its "from" field is the URL of the origin store. Its "provenance" field propagates the provenance of the store path on the origin store.
* `derivation`: Records that the store path is the output of a `.drv` file. This is equivalent to the "deriver" field, but it has a nested "provenance" field that records how the `.drv` file was created.
* `flake`: Records that the store path was created during the evaluation of a flake output.

Example:

    $ nix path-info --json /nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0
    {
      "/nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0": {
        "provenance": {
          "from": "https://cache.example.org",
          "provenance": {
            "drv": "rlabxgjx88bavjkc694v1bqbwslwivxs-patchelf-0.18.0.drv",
            "output": "out",
            "provenance": {
              "flake": {
                "lastModified": 1729856604,
                "narHash": "sha256-obmE2ZI9sTPXczzGMerwQX4SALF+ABL9J0oB371yvZE=",
                "owner": "NixOS",
                "repo": "patchelf",
                "rev": "689f19e499caee8e5c3d387008bbd4ed7f8dc3a9",
                "type": "github"
              },
              "output": "packages.x86_64-linux.default",
              "type": "flake"
            },
            "type": "derivation"
          },
          "type": "copied"
        },
        ...
      }
    }

This specifies that the store path was copied from the binary cache https://cache.example.org and that it's the `out` output of a store derivation that was produced by evaluating the flake output `packages.x86_64-linux.default` of some revision of the patchelf GitHub repository.

Depends on #11668.
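As an illustration (not part of the PR), here is a minimal sketch of how a consumer could walk such a nested provenance object, assuming the JSON shape shown in the example above and using nlohmann::json, which the PR's code already depends on:

```cpp
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Sketch: walk a nested provenance record (shaped like the example above)
// from the outermost "copied" record down to the flake that produced the path.
void printProvenanceChain(const nlohmann::json & prov, int depth = 0)
{
    std::string indent(depth * 2, ' ');
    auto type = prov.value("type", std::string("unknown"));

    if (type == "copied")
        std::cout << indent << "copied from " << prov.value("from", std::string("?")) << "\n";
    else if (type == "derivation")
        std::cout << indent << "output '" << prov.value("output", std::string("?"))
                  << "' of " << prov.value("drv", std::string("?")) << "\n";
    else if (type == "flake")
        std::cout << indent << "flake output " << prov.value("output", std::string("?")) << "\n";

    // Each record may carry the provenance of whatever it was made from.
    if (prov.contains("provenance"))
        printProvenanceChain(prov.at("provenance"), depth + 1);
}
```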
It doesn't currently, since that information wouldn't be enough to reproduce the store derivation (i.e. a package function in Nixpkgs requires arguments to be able to reproduce its output, not to mention stuff like overrides). But storing the top-level flake + flake output name that caused the store derivation to be created does allow the store derivation to be reproduced.
The problem there is that evaluation of non-flake expressions is not hermetic, so we really do need something like flakes for provenance.
It will be less likely that you can verify the provenance, but something could be recorded nonetheless.
(I haven't read the whole diff yet, so apologies for questions I could have answered myself, but these will need to be documented anyway, so also you're welcome :) )
Many evaluations will produce the same paths. How do we deal with that? I suppose we only need a … Another solution is to only store the first provenance, but this is too arbitrary IMO, and can also be achieved with a "first referrer" field if we feel like storing all referrer edges is too expensive or impractical for "non-enumerating" stores like the binary cache stores.

Putting new appendable data into the stores, including the binary cache stores, is quite a step. Do we really need this to be in the binary cache? A lot of the value of this feature could instead be produced by a local database, since that's where evaluation and realisation ultimately happen anyway.

Some questions: …

Things to be documented and/or implemented: …
    struct ProvFlake
    {
        std::shared_ptr<nlohmann::json> flake; // FIXME: change to Attrs
        std::string flakeOutput;
Suggested change:

    -        std::string flakeOutput;
    +        std::vector<std::string> flakeOutput;
     * derivation input source) that was produced by the evaluation of
     * a flake.
     */
    struct ProvFlake
This is a layer violation. We could define something like

    struct ProvOther { std::string type; nlohmann::json value; };

at the store layer and refine this in upper layers.
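A rough sketch of that idea (the names and shapes here are hypothetical, not taken from the PR): the store layer keeps an opaque typed record, and an upper layer refines records whose type it understands.

```cpp
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

// Hypothetical store-layer type: provenance is just "some type plus an
// opaque payload"; the store doesn't need to know what a flake is.
struct ProvOther
{
    std::string type;      // e.g. "flake", "copied", ...
    nlohmann::json value;  // opaque payload, interpreted by upper layers
};

// Hypothetical upper-layer refinement (roughly what ProvFlake carries in the PR).
struct ProvFlake
{
    nlohmann::json flake;    // flake reference attributes
    std::string flakeOutput; // e.g. "packages.x86_64-linux.default"
};

// The flake/eval layer interprets records it knows about and ignores the rest.
std::optional<ProvFlake> asFlakeProvenance(const ProvOther & prov)
{
    if (prov.type != "flake") return std::nullopt;
    return ProvFlake{
        .flake = prov.value.at("flake"),
        .flakeOutput = prov.value.at("output").get<std::string>(),
    };
}
```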
I'm thinking about getting rid of all the `Prov*` types and just passing provenance around as a JSON value.
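In that direction, provenance records would presumably just be built and nested as plain JSON objects matching the shape of the example in the PR description. A hedged sketch (the helper name is made up for illustration):

```cpp
#include <string>
#include <utility>
#include <nlohmann/json.hpp>

// Hypothetical helper: wrap the provenance obtained from an origin store in a
// "copied" record, with no Prov* class hierarchy involved.
nlohmann::json makeCopiedProvenance(
    const std::string & fromStoreUrl,
    nlohmann::json upstreamProvenance)
{
    return {
        {"type", "copied"},
        {"from", fromStoreUrl},
        {"provenance", std::move(upstreamProvenance)},
    };
}
```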
Indeed provenance doesn't need to be hermetic or reproducible, so we could certainly have a provenance type for non-flake evaluations.
The provenance is the evaluation that produced the store path, i.e. the first one. There can of course be many other evaluations that produce the same store path, but those are not the provenance for that particular store / binary cache. (The same applies to other types of provenance like substitution: a path can be substituted from many binary caches, but we only record the one we actually used.) Recording other provenances makes the metadata for a store path potentially grow without bounds. And in the case of .narinfo files, we really don't want to update them after creation due to caching etc. This is the same semantics as the deriver field BTW.
I think so, because without that you can't query the ultimate provenance of a store path in a binary cache like cache.nixos.org.
I do not like this PR as a solution to the problem of provenance tracking. I think this approach is something that could be implemented in any ecosystem, while Nix is in the unique position that it could really do so much better.

On the origin of build outputs
The signatures that we already use for transport security in binary caches do provide this kind of information already, because they are evidence of where you got an output, and I think it is a mistake to design a solution to this problem that just sidesteps them. Extending the signing scheme would give us an actual cryptographic basis on which we can attribute the outputs of individual build steps to their actual builders, while this PR only propagates second-hand information down the chain with a kind of attribution that is not really trustworthy, because any link in the chain can just alter it. Because signed information is attributable to the signer, it works much better across systems.

On attributing build outputs to derivations, and derivations to flakes

Attributing build outputs to derivations, and derivations to flakes, is a problem that can be solved better locally.
This naturally makes it difficult to keep track of derivations themselves, because the derivation hash is computed from the derivation, and so you already have to know the derivation to look up anything. Instead of attaching flake references to the destination of this mapping, we can view flakes as a higher-level mapping, which summarizes and tracks subtrees of build steps, including the derivations involved.
It would also be possible to record dependencies between flakes that way, by making a DB entry whenever we cross a boundary to another upstream flake while building or substituting. Similarly, during evaluation we can record "kind of but not really" the inverse of ….
Based on all three of these relations, you can start at the NAR hash of any output and walk through its reverse dependencies until you hit a flake output, or continue walking until you find them all. One side benefit of attributing paths to flakes in any way would be that it makes the contents of the local store of a system less opaque.

I did read through the code in this PR a few days ago. I hope that I have understood the gist of it correctly, and I hope this makes sense to you. In any case, I would really appreciate it if you could give me the benefit of the doubt and we could further discuss this/my work on issues like this somewhere.
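For concreteness, here is a hedged sketch (entirely hypothetical, not from the PR or from Nix's actual database schema) of what local SQLite tables for the three relations described above might look like, written as an embedded schema string in C++:

```cpp
// Hypothetical schema, for illustration only; table and column names are
// made up and do not correspond to Nix's real local database.
static const char * provenanceSchema = R"sql(
    -- (1) which derivation output produced which store path
    create table if not exists BuiltFrom (
        outputPath text primary key,
        drvPath    text not null,
        outputName text not null
    );

    -- (2) which flake output an evaluation attributed a derivation to
    create table if not exists EvaluatedFrom (
        drvPath     text primary key,
        flakeRef    text not null,
        flakeOutput text not null
    );

    -- (3) dependencies between flakes, recorded when building or substituting
    -- crosses the boundary into an upstream flake
    create table if not exists FlakeDependsOn (
        flakeRef         text not null,
        upstreamFlakeRef text not null,
        primary key (flakeRef, upstreamFlakeRef)
    );
)sql";
```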
🎉 All dependencies have been resolved!