feat: Cache build environments by content hash #420
base: main
Write an archive consisting of the build environment provided by pip, and then extract it at a path containing the SHA256 hash of its contents. This prevents unnecessary rebuilds triggered by CMake thinking that packages provided by pip in the build environment have been changed, when it's only the temporary path that changed. Because the path contains the hash of its contents, this is perfectly safe since a changed build environment will change the hash.
Signed-off-by: Tobias Markus <[email protected]>
I haven't added any tests or documentation yet because I didn't want to waste any work in case this approach is not a good fit for scikit-build-core.
I haven't had time to look into it yet - the question I would answer first, does the current mechanism not work? We record the paths, then if that recording exists from a previous run, scikit-build-core/src/scikit_build_core/cmake.py Lines 89 to 106 in 5aa2d28
Ah, no, I don't think we do, I think for now we just remove everything. That was the original plan though - would that work? We do a search and replace on the cache file to put the new paths in. The benefit of storing the actual environment is you could then work without pip restoring it (say in editable mode), but it's pretty unusual and unexpected to grab a copy of the build environment. I think modifying the paths (and/or times) based on a stored value (we could also store hashes) and relying on the provided build environments would be better?
I'm not entirely sure what role the mechanism you linked plays. I just checked a To clarify, what I mean is that CMake might reference paths from the build environment in the actual compile commands. To give some context (perhaps others use scikit-build-core differently): For the projects I'm using scikit-build-core on, setting
I agree that it is not the prettiest solution. I've thought about somehow asking pip to provide a persistent path for the build requirements. But I guess that sort of goes against the spirit of build isolation.
BTW: If you're wondering about why the PR first creates, hashes and then extracts an archive instead of just hashing files and copying them directly: The idea is that creating the archive provides a defined wire format on which we can calculate a hash. If we just copied files directly while calculating their hash value, we'd have to combine the hash values in a defined format in order to come up with a hash for directories and finally the entire build environment - which is essentially reinventing what an archive file is. Also, by extracting the temporary archive, we can be sure that we only provide files which we also included in the hash, reducing the chance for bugs due to forgotten files or changed metadata.
(FYI, this is still on my roadmap to investigate!)
To clarify the issue from the cmake side:
Rationale of this PR
Reusing build directories already speeds up repeated pip install invocations significantly because most build artifacts can be reused. However, when using CMake packages provided by pip, as commonly used for pybind11 and friends, CMake will always consider them out of date when building with PEP 517 build isolation enabled (because the temporary build folder changes on every invocation).
Implementation
Write an archive consisting of the build environment provided by pip, and then extract it at a path containing the SHA256 hash of its contents. The archive contents are filtered: __pycache__ files are ignored (see below), and mtime is set to 0 so that it is essentially ignored - CMake should only look at the file path (containing the hash), and completely ignore mtime.
Because the path contains the hash of its contents, this is perfectly safe since a changed build environment will change the hash.
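The steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the names snapshot_env and _normalize are hypothetical, a plain uncompressed tar archive is assumed as the wire format, and ownership fields are zeroed in addition to mtime so that they cannot influence the hash.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def _normalize(info: tarfile.TarInfo):
    """Hypothetical filter: skip __pycache__ and neutralize volatile metadata."""
    if "__pycache__" in Path(info.name).parts:
        return None  # returning None excludes the member from the archive
    info.mtime = 0  # mtime must not influence the hash
    info.uid = info.gid = 0  # assumption: ownership should not matter either
    info.uname = info.gname = ""
    return info


def snapshot_env(env_dir: Path, cache_root: Path) -> Path:
    """Archive env_dir in memory, hash the archive, extract to cache_root/<sha256>."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Sorted traversal keeps the member order (and thus the hash) stable.
        for path in sorted(env_dir.rglob("*")):
            arcname = path.relative_to(env_dir).as_posix()
            tar.add(path, arcname=arcname, recursive=False, filter=_normalize)

    digest = hashlib.sha256(buf.getvalue()).hexdigest()
    target = cache_root / digest
    if not target.exists():
        buf.seek(0)
        # Use the "data" extraction filter where this Python version offers it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(fileobj=buf, mode="r") as tar:
            tar.extractall(target, **kwargs)
    return target
```

Two build environments with identical contents at different temporary paths would then map to the same directory, so CMake keeps seeing the same path. A real implementation would probably also need to guard against concurrent extraction into the same target, which this sketch ignores.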
Open Questions
Why are __pycache__ files changing on every invocation? Is it about the file metadata, or does pip somehow change the files themselves?
I hacked this PR together when I got the idea to do hash-based caching earlier today. I'm looking forward to any comments, especially whether this is a good fit for scikit-build-core and whether it is a good idea.
I'm not an expert on pip internals so perhaps there are some details about the temporary build environment that I missed. Perhaps there's a simpler way to solve all this.
TODO
hashlib.file_digest, it is only available in Python 3.11+
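One possible way to handle this item, sketched under the assumption that the goal is to keep working on Python versions older than 3.11: use hashlib.file_digest where it exists and fall back to chunked reads otherwise. The helper name sha256_of_file and the chunk size are hypothetical and not taken from the PR.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, with a fallback for Python < 3.11."""
    with open(path, "rb") as f:
        if sys.version_info >= (3, 11):
            # hashlib.file_digest was added in Python 3.11.
            return hashlib.file_digest(f, "sha256").hexdigest()
        # Fallback: feed the file to the hash object in fixed-size chunks.
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
        return h.hexdigest()
```

Both branches produce the same digest, so the resulting cache path would not depend on the interpreter version.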