Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Cache build environments by content hash #420

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hesiod
Copy link

@hesiod hesiod commented Jul 3, 2023

Rationale of this PR

Reusing build directories already speeds up repeated pip install invocations significantly because most build artifacts can be reused. However, when using CMake packages provided by pip, as commonly used for pybind11 and friends, CMake will always consider them out of date when building with PEP 517 build isolation enabled (because the temporary build folder changes on every invocation).

Implementation

Write an archive consisting of the build environment provided by pip, and then extract it at a path containing the SHA256 hash of its contents. The archive contents are filtered: __pycache__ files are ignored (see below), and mtime is set to 0 so that it is essentially ignored - CMake should only look at the file path (containing the hash), and completely ignore mtime.
Because the path contains the hash of its contents, this is perfectly safe since a changed build environment will change the hash.

Open Questions

  • Should this be an experimental feature?
  • How should enabling the feature work? Using an enable setting, or perhaps by setting a cache path?
  • Why are hashes of __pycache__ files changing on every invocation? Is it about the file metadata, or does pip somehow change the files themselves?
  • What about very large build environments? The archival and extraction is currently run on every invocation. With larger archive sizes, performance would likely be quite degraded.
  • Should the archive be split per package? This would allow reusing existing cache directories more often, while complicating the code somewhat.

I hacked this PR together when I got the idea to do hash-based caching earlier today. I'm looking forward to any comments, especially whether this is a good fit for scikit-build-core and whether is a good idea.
I'm not an expert on pip internals so perhaps there are some details about the temporary build environment that I missed. Perhaps there's a simpler way to solve all this.

TODO

  • Replace use of hashlib.file_digest, it is only available in Python 3.11+

Write an archive consisting of the build environment provided by pip,
and then extract it at a path containing the SHA256 hash of its
contents.
This prevents unnecessary rebuilds triggered by CMake thinking that
packages provided by pip in the build environment have been changed,
when it's only the temporary path that changed.
Because the path contains the hash of its contents, this is perfectly
safe since a changed build environment will change the hash.

Signed-off-by: Tobias Markus <[email protected]>
@hesiod hesiod force-pushed the build-env-caching branch from 1de6289 to 79ecb0d Compare July 3, 2023 21:28
@hesiod
Copy link
Author

hesiod commented Jul 5, 2023

I haven't added any tests or documentation yet because I didn't want to waste any work in case this approach is not a good fit for scikit-build-core.
@henryiii (or other maintainers) If you don't mind, I'd love to hear some feedback as to whether this idea is viable and whether this approach is the way forward. Thanks!

@henryiii
Copy link
Collaborator

henryiii commented Jul 5, 2023

I haven't had time to look into it yet - the question I would answer first, does the current mechanism not work? We record the paths, then if that recording exists from a previous run, we swap it out for the current path in the cache. IIRC that was to address this issue.

skbuild_info = self.build_dir / ".skbuild-info.json"
# If building via SDist, this could be pre-filled, so delete it if it exists
with contextlib.suppress(FileNotFoundError):
with skbuild_info.open("r", encoding="utf-8") as f:
info = json.load(f)
cached_source_dir = Path(info["source_dir"])
if cached_source_dir.resolve() != self.source_dir.resolve():
logger.warning(
"Original src {} != {}, wiping build directory",
cached_source_dir,
self.source_dir,
)
shutil.rmtree(self.build_dir)
self.build_dir.mkdir()
with skbuild_info.open("w", encoding="utf-8") as f:
json.dump(self._info_dict(), f, indent=2)

Ah, no, I don't think we do, I think for now we just remove everything. That was the original plan though - would that work? We do a search and replace on the cache file to put the new paths in.

The benefit of storing the actual environment is you could then work without pip restoring it (say in editable mode), but it's pretty unusual and unexpected to grab a copy of the build environment. I think modifying the paths (and/or times) based on a stored value (we could also store hashes) and relying on the provided build environments would be better?

@hesiod
Copy link
Author

hesiod commented Jul 5, 2023

I haven't had time to look into it yet - the question I would answer first, does the current mechanism not work? We record the paths, then if that recording exists from a previous run, we swap it out for the current path in the cache. IIRC that was to address this issue.

I'm not entirely sure what role the mechanism you linked plays. I just checked a .skbuild-info.json file in one of my projects, and its source_dir is simply equal to the real source_dir, so no caching is taking place. But the contents of the source directory should not have any impact on the build environment path issue I'm trying to fix here.

To clarify, what I mean is that CMake might reference paths from the build environment in the actual compile commands.
For example, when including pybind11 using build-system.requires, a typical compile command line might look like this: /usr/bin/c++ ... -isystem /tmp/pip-build-env-0ew390zl/overlay/lib/python3.11/site-packages/pybind11/include ...
Obviously, the /tmp/pip-build-env-0ew390zl part of the include directory will change on every run of pip install, and even if its contents stay the same, Ninja will still consider all files which contain that include flag as out of date.

To give some context (perhaps others use scikit-build-core differently): For the projects I'm using scikit-build-core on, setting tool.scikit-build.build-dir eliminates 90% of the build time since all object files that don't reference pybind11 can be reused. But the files that use pybind11 are still rebuild every time. With this PR, running pip install . on an unchanged source tree results in near-instant builds (it's mostly just pip doing dependency resolution).

The benefit of storing the actual environment is you could then work without pip restoring it (say in editable mode), but it's pretty unusual and unexpected to grab a copy of the build environment. I think modifying the paths (and/or times) based on a stored value (we could also store hashes) and relying on the provided build environments would be better?

I agree that it is not the prettiest solution. I've thought about somehow asking pip to provide a persistent path for the build requirements. But I guess that sort of goes against the spirit of build isolation.
Modifying times would not be enough since the rebuilds are triggered by the path itself changing.
Modifying paths based on a stored hash is more or less what this PR does, or did you have something else in mind?

BTW: If you're wondering about why the PR first creates, hashes and then extracts an archive instead of just hashing files and copying them directly: The idea is that creating the archive provides a defined wire format on which we can calculate a hash. If we just copied files directly while calculating their hash value, we'd have to combine the hash values in a defined format in order to come up with a hash for directories and finally the entire build environment - which is essentially reinventing what an archive file is. Also, by extracting the temporary archive, we can be sure that we only provide files which we also included in the hash, reducing the chance for bugs due to forgotten files or changed metadata.

@henryiii
Copy link
Collaborator

henryiii commented Oct 6, 2023

(FYI, this is still on my roadmap to investigate!)

@LecrisUT
Copy link
Collaborator

LecrisUT commented Oct 6, 2023

To clarify the issue from the cmake side:

  • this issue still occurs even when the build directory is specified statically in the configuration? Does Pep517 overwrite the build folder when built in isolation?
  • this is about cmake build artifacts and not the source files that might trigger a re-configure?
  • this workflow reusing artifacts is specifically for development or also for some CI/CD?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants