Efficient way to relocate files in runfiles #23728

Open
groodt opened this issue Sep 24, 2024 · 4 comments
Labels
team-Core Skyframe, bazel query, BEP, options parsing, bazelrc team-Rules-Python Native rules for Python type: feature request untriaged

Comments

@groodt
Contributor

groodt commented Sep 24, 2024

Description of the feature request:

There are situations where having more granular control over the layout of files and directories in runfiles is beneficial.

Python context

Specifically, for rules_python, when we import third_party wheels (archives) from indexes like PyPI (Python Package Index), we would ideally like to expand the archives into a directory called site-packages. This is because outside the bazel context, packages are expanded as siblings inside site-packages.

This creates a major impedance mismatch between bazel and archives or foreign packages brought in from other ecosystems, and it breaks assumptions that Python packages rely on. One example (there are many others): NVIDIA wheels expect to be installed as siblings inside site-packages, and their .so files even contain relative RPATHs. There are no built-in mechanisms to solve this cleanly in Python or bazel at the moment. See bazelbuild/rules_python#2156

Here is a typical example of the contents of site-packages for pip install requests where you can see the standard layout when not using bazel.

tree -L 1 .venv/lib/python3.9/site-packages/
.venv/lib/python3.9/site-packages/
├── _distutils_hack
├── certifi
├── certifi-2024.8.30.dist-info
├── charset_normalizer
├── charset_normalizer-3.3.2.dist-info
├── distutils-precedence.pth
├── idna
├── idna-3.10.dist-info
├── pip
├── pip-22.0.4.dist-info
├── pkg_resources
├── requests
├── requests-2.32.3.dist-info
├── setuptools
├── setuptools-58.1.0.dist-info
├── urllib3
└── urllib3-2.2.3.dist-info
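
To make the sibling assumption concrete, here is a minimal sketch in plain Python. The package names `pkg_a` and `pkg_b` are made up for illustration. It shows that a single `sys.path` entry is enough for everything expanded under one site-packages directory, and that a package can locate its neighbour by a relative path (similar in spirit to a relative RPATH).

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_site_packages(root, packages):
    """Create root/site-packages with one directory per package name."""
    site = Path(root) / "site-packages"
    for name, source in packages.items():
        pkg_dir = site / name
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text(source)
    return site


with tempfile.TemporaryDirectory() as tmp:
    site = make_site_packages(tmp, {
        "pkg_a": "VALUE = 'a'\n",
        # pkg_b finds its sibling relative to its own location on disk.
        "pkg_b": (
            "from pathlib import Path\n"
            "SIBLING = Path(__file__).parent.parent / 'pkg_a'\n"
        ),
    })
    sys.path.insert(0, str(site))
    importlib.invalidate_caches()
    try:
        pkg_a = importlib.import_module("pkg_a")
        pkg_b = importlib.import_module("pkg_b")
        # Only true because both packages were expanded side by side.
        sibling_found = pkg_b.SIBLING.is_dir()
    finally:
        sys.path.remove(str(site))
```

If each package lived in its own repository directory instead, the relative lookup in `pkg_b` would point at a directory that does not exist.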

Other contexts

This is more than just a Python problem. Node has similar challenges with node_modules.

cc @rickeylev @fmeum

Which category does this issue belong to?

No response

What underlying problem are you trying to solve with this feature?

No response

Which operating system are you running Bazel on?

All

What is the output of bazel info release?

N/A

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

N/A

What's the output of git remote get-url origin; git rev-parse HEAD ?

N/A

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added the team-Configurability platforms, toolchains, cquery, select(), config transitions label Sep 24, 2024
@sgowroji sgowroji added the team-Rules-Python Native rules for Python label Sep 24, 2024
@fmeum
Collaborator

fmeum commented Sep 24, 2024

Have you looked into symlinks and root_symlinks on ctx.runfiles? That should allow you to realize any layout you want.
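
For reference, the difference between the two attributes can be modelled in a few lines of plain Python. This is only a sketch of where entries land, not Bazel code: `symlinks` entries are placed under the main workspace directory inside the runfiles tree, while `root_symlinks` entries are placed directly under the runfiles root. The workspace name `_main` and the helper name `materialized_path` are assumptions for the example.

```python
import posixpath


def materialized_path(runfiles_root, workspace_name, kind, link_path):
    """Return where a symlink entry lands inside a runfiles tree.

    kind is "symlinks" (placed under the main workspace directory) or
    "root_symlinks" (placed directly under the runfiles root).
    """
    if kind == "symlinks":
        return posixpath.join(runfiles_root, workspace_name, link_path)
    if kind == "root_symlinks":
        return posixpath.join(runfiles_root, link_path)
    raise ValueError("unknown kind: " + kind)


under_workspace = materialized_path(
    "bin.runfiles", "_main", "symlinks", "site-packages/foo/foo.py")
under_root = materialized_path(
    "bin.runfiles", "_main", "root_symlinks", "site-packages/foo/foo.py")
```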

@rickeylev
Contributor

Thanks for filing this, Greg. I'll post some of my notes from thinking about this topic a while back. Sorry, this will be a bit long.

Additional use case:

Sphinx: It requires all the files it's going to process to be under a single directory.
This makes it difficult to have documentation artifacts spread across the repo (i.e. allowing the docs for a subproject to be colocated in that subproject's code tree).

#15486 is sort of related, insofar as: the main reason to pass runfiles into an action is to respect the root_symlinks type of transforms that runfiles can express and that Bazel materializes into paths on disk (as seen by tools). If there was some other way to express that sort of relocation, then (presumably), that could be passed into an action instead.

runfiles symlinks, root_symlinks

Unfortunately, these don't work well for a few reasons:

  1. They're "global" to the final, merged runfiles directory
  2. They don't compose well with intra-library dependencies
  3. They're brittle.
  4. They're fundamentally incompatible with some upcoming rules_python features (precompiling)

For (1), it starts to break down when multiple binaries are involved with different transitive closures. For example, you might have two binaries (outer and inner) that use the same library, but different versions, and one is a data dep of the other, outer->inner. This can easily happen if the two use different requirements files (or if they use the same requirements file, but with e.g. platform/version conditions choosing different versions).

If a library, foo, tries to do e.g. root_symlinks={"site-packages/foo/foo.py": <File: foo.py>}, then the above situation no longer works. The multiple versions of foo are going to try and set that same root_symlink path. Ultimately, we need a runfiles structure that looks approximately like this:

outer.runfiles/
  X-site-packages/foo/foo.py -> symlink to pypi_foo_1_0/foo.py
  Y-site-packages/foo/foo.py -> symlink to pypi_foo_2_0/foo.py

But the library can't know what that X/Y prefix is supposed to be. And, really, the library shouldn't care about any part of the X-site-packages string. The library only cares about a relative structure ("put all my files in a directory named foo"); it doesn't care about the name of the directory that foo is placed within.
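
A small plain-Python model of the collision, under the assumption that each library contributes a `{runfiles_path: target}` dict. The helper names `merge_root_symlinks` and `with_prefix` are made up, and how Bazel itself reports such a conflict is not modelled here. The point is only that one path cannot hold two targets, while a per-binary prefix (the placeholder X and Y from above) lets both versions coexist.

```python
def merge_root_symlinks(*contributions):
    """Merge {runfiles_path: target} dicts, refusing conflicting targets."""
    merged = {}
    for contribution in contributions:
        for path, target in contribution.items():
            existing = merged.setdefault(path, target)
            if existing != target:
                raise ValueError(
                    f"{path!r} is claimed by both {existing!r} and {target!r}")
    return merged


def with_prefix(prefix, contribution):
    """Rewrite every path so it starts with '<prefix>-'."""
    return {f"{prefix}-{path}": target for path, target in contribution.items()}


foo_v1 = {"site-packages/foo/foo.py": "pypi_foo_1_0/foo.py"}
foo_v2 = {"site-packages/foo/foo.py": "pypi_foo_2_0/foo.py"}

# Both versions want the same path, so a plain merge cannot succeed.
try:
    merge_root_symlinks(foo_v1, foo_v2)
    conflict = None
except ValueError as error:
    conflict = str(error)

# With a prefix chosen by the consumer, both versions fit.
both = merge_root_symlinks(with_prefix("X", foo_v1), with_prefix("Y", foo_v2))
```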

For (2), what I mean is: Making root_symlinks work requires everything to use it. While e.g. py_library could use root_symlinks, a filegroup in its data deps won't. This means, when materialized to runfiles, the py code is in one place, and its data files are in another place. Its only option is to flatten the runfiles to try and figure things out.

For (3), what I mean is: Both the library and consuming binary have to agree about the "site package prefix" used. If the library uses root_symlinks={"site-packages/foo": ...}, then the binary has to assume (or otherwise know) that "site-packages/" is the "site package prefix". This also ties into point (1).

For (4), what I mean is: part of the precompiling feature is a binary-level attribute to use/not-use precompiled files. In order for this to work, a library can't put its py code into runfiles -- if it did, its files would always be included, and a binary could no longer opt-out.

Back to OP

Fundamentally, what we want is a way to efficiently relocate files (i.e. change their materialized runfiles path).

  • Files are spread across 1 or more repos and/or across multiple locations within a repo
  • We want to materialize them all under a directory within runfiles.
  • The location may be binary-specific. In practice, there would only be a handful
    (essentially one per unique transitive closure; duplicate ones can be deduped based on a hash of their contents).
  • A binary wants to choose the location for that "top level" directory.
  • Meanwhile, libraries want to choose a "sub-location" within that "top level" location.
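
One way the "one directory per unique transitive closure" idea could work is to derive the top-level name from a hash of the closure, so that binaries with identical dependencies share a directory. A rough sketch in plain Python follows. The function name `closure_prefix`, the `(name, source)` pair representation, and the 8-character digest are all assumptions for illustration.

```python
import hashlib


def closure_prefix(closure):
    """Derive a short, stable directory name from (name, source) pairs."""
    digest = hashlib.sha256()
    # Sort so that the order in which deps were listed does not matter.
    for name, source in sorted(closure):
        digest.update(f"{name}={source}\n".encode("utf-8"))
    return digest.hexdigest()[:8] + "-site-packages"


app = [("myutil", "main/src/myutil"), ("numpy", "pypi_numpy_1_0")]
helper = [
    ("more_itertools", "main/third_party/more_itertools"),
    ("numpy", "pypi_numpy_2_0"),
]
# A different binary with the same closure, listed in another order.
other = list(reversed(app))
```

Here `closure_prefix(app)` and `closure_prefix(other)` are identical, so those two binaries could share one materialized directory, while `helper` gets its own.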

So given something like:

main/
  WORKSPACE(name=main)
  src/myapp/
    app.py  # py_binary(app.py, deps=[myutil, numpy1.0], data=[helper])
    helper.py # py_binary(helper.py, deps=[more_itertools, numpy2.0])
  src/myutil/
    util.py # py_library(site_package_name="myutil")
  third_party/
    more_itertools/... # py_library(site_package_name="more_itertools")
pypi_numpy_1_0/
  WORKSPACE(name="numpy_1_0")
  BUILD: py_library(srcs=*.py, site_package_name="numpy")
  numpy.py
pypi_numpy_2_0/
  WORKSPACE(name="numpy_2_0")
  BUILD: py_library(srcs=*.py, site_package_name="numpy")
  numpy.py

An ideal output looks something like:

bazel-bin/src/myapp.runfiles/
  src/
    app: executable: adds $runfiles/X-site-packages to sys.path
    helper: executable: adds $runfiles/Y-site-packages to sys.path
  X-site-packages/
    myutil/util.py
    numpy/numpy.py -> symlink to pypi_numpy_1_0
  Y-site-packages/
    more_itertools/... -> symlinks to main/third_party/more_itertools/...
    numpy/numpy.py -> symlink to pypi_numpy_2_0
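
The mapping from inputs to that output is simple to state. Here is a plain-Python sketch of it, assuming each library is a `(site_package_name, files)` pair and the binary supplies the top-level directory. The helper name `relocate` is made up, and keeping only the file's basename is a simplification that ignores subdirectories inside a package.

```python
import posixpath


def relocate(top_level, libraries):
    """Map <top_level>/<site_package_name>/<basename> to each source file."""
    layout = {}
    for site_package_name, files in libraries:
        for source in files:
            dest = posixpath.join(
                top_level, site_package_name, posixpath.basename(source))
            layout[dest] = source
    return layout


# The binary picks the top-level directory; each library only names its
# own sub-location.
app_layout = relocate("X-site-packages", [
    ("myutil", ["main/src/myutil/util.py"]),
    ("numpy", ["pypi_numpy_1_0/numpy.py"]),
])
helper_layout = relocate("Y-site-packages", [
    ("numpy", ["pypi_numpy_2_0/numpy.py"]),
])
```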

Some various ideas and design notes I have:


This can sort of be modeled using TreeArtifacts, however, this has a few drawbacks:

  1. The analysis phase loses insight into what is in a build, as it can't see the specific files anymore; it just sees an opaque directory. So something like type checkers or aspects become much harder (impossible?) to make work.
  2. TreeArtifacts don't compose well. e.g., if you have a py_library for each .py file, and are using granular targets, then trying to "merge" all those TreeArtifacts together won't go well.
  3. Creating a tree artifact requires running a subprocess, which in turn means we are required to perform file copies. That's a lot of waste and additional overhead simply to change the path something materializes to.

This can sort of be modeled by having e.g. depset[tuple[str prefix, depset[File] files]]. This is similar to the SymlinkEntry object used by runfiles.root_symlinks . The catch is:

  1. I'm not sure how well depset[tuple[str, depset]] will compose and scale. A depset of depsets seems a bit weird. Conversely, repeating the same prefix string multiple times, as in depset[tuple[str, File]], seems wasteful (e.g. if you have 100 files, you'll have the same prefix string repeated 100 times), but IDK the intricacies of the memory model, so perhaps that's premature to worry about.
  2. At the binary level, we have to flatten the depset to feed it into runfiles.root_symlinks. This approximately doubles the amount of work: O(N) for depset -> list, O(N) again for root_symlinks construction.

I do something like this for rules_python's docgen. It requires flattening the depset and then declare_file/symlink for everything in the depset. It works, but it's one thing to do it for O(dozens) of files vs O(tens-of-thousands) of files.
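
To illustrate the two passes, here is a toy model in plain Python. The `Depset` class below is a tiny stand-in written for this example, not Bazel's implementation, and it uses the flat depset[tuple[str, File]] shape with files represented as path strings.

```python
import posixpath


class Depset:
    """Tiny stand-in for a depset: direct items plus transitive children."""

    def __init__(self, direct=(), transitive=()):
        self.direct = tuple(direct)
        self.transitive = tuple(transitive)

    def to_list(self):
        """Flatten in postorder, visiting each node and item only once."""
        seen_nodes, seen_items, out = set(), set(), []

        def visit(node):
            if id(node) in seen_nodes:
                return
            seen_nodes.add(id(node))
            for child in node.transitive:
                visit(child)
            for item in node.direct:
                if item not in seen_items:
                    seen_items.add(item)
                    out.append(item)

        visit(self)
        return out


def to_root_symlinks(prefixed_files, top_level):
    """Pass 1 flattens the depset; pass 2 builds the path -> file dict."""
    entries = prefixed_files.to_list()
    return {
        posixpath.join(top_level, prefix, posixpath.basename(path)): path
        for prefix, path in entries
    }


numpy = Depset(direct=[("numpy", "pypi_numpy_1_0/numpy.py")])
myutil = Depset(direct=[("myutil", "main/src/myutil/util.py")],
                transitive=[numpy])
app = Depset(transitive=[myutil, numpy])
symlinks = to_root_symlinks(app, "X-site-packages")
```

Both passes touch every file once, which is fine for dozens of files and is where the cost comes from at tens of thousands.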


An alternative idea I had was to have something like a "relative file" object. The idea being: wherever it's put, it'll always be relative to the path it indicates. Consumers of relative files have a way to specify what they're relative to (perhaps another relative file, or some runfiles-root path or something). So you'd have something like:

def library(ctx):
  # ctx.relative_file is the proposed API; it does not exist today.
  srcs = [ctx.relative_file('foo/', file=src) for src in ctx.files.srcs]
  return [LibInfo(files=srcs)]

def binary(ctx):
  site_packages = ctx.relative_file(
      '{}.site-packages'.format(ctx.label.name),
      files=[d[LibInfo].files for d in ctx.attr.deps])
  return [DefaultInfo(runfiles=ctx.runfiles(site_packages))]

# output: of //pkg:bin
bazel-bin/pkg
  bin.runfiles/
    pkg/
      bin -> executable
      bin.site-packages/
        foo/foo.py

...or something like that. But the basic idea is, at the library level, the (prefix, File) information is given, and then a higher level is able to easily/efficiently "move" it. All it's really doing is affecting the path that Bazel will materialize. Notably, it's not having to e.g. invoke a custom tool via a build action just to copy/paste a bunch of files to new paths.
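
The nesting behaviour can be prototyped outside Bazel. Below is a plain-Python sketch of such a "relative file" value. The class name `RelativeFile` and its `materialize` method are hypothetical, and files are represented as path strings with only the basename kept.

```python
import posixpath


class RelativeFile:
    """Hypothetical value: children are always placed under `prefix`."""

    def __init__(self, prefix, children):
        self.prefix = prefix
        # Each child is a source path (str) or a nested RelativeFile.
        self.children = tuple(children)

    def materialize(self, base=""):
        """Yield (runfiles_path, source_path) pairs rooted at `base`."""
        here = posixpath.join(base, self.prefix)
        for child in self.children:
            if isinstance(child, RelativeFile):
                yield from child.materialize(here)
            else:
                yield posixpath.join(here, posixpath.basename(child)), child


# The library only states its own sub-location ...
lib = RelativeFile("foo", ["pkg/foo.py"])
# ... and the binary decides where that sub-location ends up.
binary = RelativeFile("bin.site-packages", [lib])
layout = dict(binary.materialize("pkg"))
```

Nothing is copied here: the paths are only computed when the final consumer asks for them, which is the property the idea is after.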


A less well-thought out idea is to allow some sort of transform function. So a binary would do e.g.

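def binary(ctx):
  def xform(path):
    return ctx.label.name + '.site-packages/' + path
  # xform= is a proposed parameter, not an existing ctx.runfiles argument.
  return [DefaultInfo(runfiles=ctx.runfiles(xform=xform))]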

Where the xform function returns the path to materialize. I'm not sure how this would compose, though.
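
One possible answer to the composition question is ordinary function composition: the library's transform runs first, then the binary's. This is a plain-Python sketch of that idea only, with made-up helper names `prefix_with` and `compose`; it says nothing about how Bazel would store or evaluate such functions.

```python
def prefix_with(directory):
    """Return a transform that places a path under `directory`."""
    def xform(path):
        return directory + "/" + path
    return xform


def compose(*xforms):
    """Chain transforms, applying the first one given first."""
    def composed(path):
        for xform in xforms:
            path = xform(path)
        return path
    return composed


library_xform = prefix_with("foo")               # chosen by the library
binary_xform = prefix_with("bin.site-packages")  # chosen by the binary
relocate = compose(library_xform, binary_xform)
materialized = {relocate(p): p for p in ["foo.py", "data.txt"]}
```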

@rickeylev
Contributor

I came across another use case where efficient relocation would help: avoiding overlap when generated files are in the same directory as a binary. e.g. given

foo/BUILD
py_binary(name="bin", srcs=["bin/main.py"])

We get an error because we want to generate foo/bin (the executable) and foo/bin/main.pyc (the generated pyc file).

However, we don't have to put the pyc in a subdirectory. Python has a feature called an "alt pyc root", which lets us specify an entirely different location to act as the "root" of pyc searching. Thus, we could generate e.g. foo/_pyc_outs/bin/main.pyc, then relocate it to e.g. $runfiles/site-packages/foo/bin/main.pyc. (or replace site-packages with pyc_root -- I just use site-packages so it's clear the same categories of problems come up trying to relocate things with today's APIs).
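
Assuming the "alt pyc root" mentioned above is CPython's `sys.pycache_prefix` (also settable through `PYTHONPYCACHEPREFIX`, available since Python 3.8), the lookup path can be inspected with `importlib.util.cache_from_source`. The paths below are made up for illustration, and the helper name `pyc_path_under_root` is not part of any library.

```python
import importlib.util
import sys


def pyc_path_under_root(source_path, pyc_root):
    """Return where CPython looks for the .pyc of source_path.

    pyc_root=None gives the default __pycache__ location; any other value
    is used as sys.pycache_prefix for the duration of the call.
    """
    previous = sys.pycache_prefix
    sys.pycache_prefix = pyc_root
    try:
        return importlib.util.cache_from_source(source_path)
    finally:
        sys.pycache_prefix = previous


default_pyc = pyc_path_under_root("/work/foo/bin/main.py", None)
relocated_pyc = pyc_path_under_root("/work/foo/bin/main.py", "/tmp/pyc_root")
```

With a prefix set, the pyc path mirrors the source directory tree under the prefix instead of sitting in a `__pycache__` directory next to the source.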

@aranguyen aranguyen added team-Local-Exec Issues and PRs for the Execution (Local) team and removed team-Configurability platforms, toolchains, cquery, select(), config transitions labels Oct 4, 2024
@zhengwei143 zhengwei143 added team-Core Skyframe, bazel query, BEP, options parsing, bazelrc and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Oct 8, 2024
@alexeagle
Contributor

You might want to take a look at rules_js, which creates a pnpm-compatible node_modules tree using existing Bazel support, with significant use of symlinks. I believe as @fmeum said earlier that any site-packages tree you desire may be created this way.
