Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate one Bazel repo per dependency #2370

Open
mering opened this issue Nov 4, 2024 · 10 comments
Open

Generate one Bazel repo per dependency #2370

mering opened this issue Nov 4, 2024 · 10 comments
Labels
need: repro Needs a minimal reproduction

Comments

@mering
Copy link

mering commented Nov 4, 2024

🚀 feature request

Relevant Rules

pip.parse() module extension

Description

When the same pip package is (transitively) depended upon multiple times from different requirements.txt files, it is undefined which version will be chosen (both are added to the PYTHONPATH and the order of deps decides which one takes precedence).
This is especially problematic for transitive dependencies of module dependencies where one only has limited to no influence of which version is chosen. This is especially problematic for fundamental libraries like Protobuf which gets pulled in transitively in many requirements.txt.
The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

Describe the solution you'd like

Generate one Bazel repo per pip dependency.

Describe alternatives you've considered

For the transition period, rules_python could generate both, one repo for all of requirements.txt and one repo per dependency. This way, users can easily migrate from one approach to another. In a future major version, the old functionality could be removed.

@groodt
Copy link
Collaborator

groodt commented Nov 4, 2024

I think this is working as expected. To confirm, can you please create a reproduction repository to demonstrate the issue?

@mering
Copy link
Author

mering commented Nov 4, 2024

@groodt What do you mean by "working as expected"? The way it currently works what is expected from the current architecture, but the current architecture is problematic.

@mering
Copy link
Author

mering commented Nov 4, 2024

Example:

Module A:

  • requirements.txt: protobuf==5.26.0
  • py_library(name="a", deps=["@a_pip/protobuf"])

Module B:

  • requirements.txt: protobuf==5.27.0
  • py_library(name="b", deps=["@b_pip//protobuf"])

Module `C:

  • requirements.txt: protobuf==5.28.0
  • bazel_dep(name="A"), bazel_dep(name="B")
  • py_library(name="c", deps=["@A//:a", "@B//:b", "@c_pip//protobuf"])

The Protobuf version available in @C//:c depends on the order of its dependencies. This becomes even more confusing when there are additional levels of transitive dependencies involved.

@groodt
Copy link
Collaborator

groodt commented Nov 4, 2024

The recommendation is to produce a single lockfile for your dependencies and register that in your hub repository. See https://rules-python.readthedocs.io/en/latest/pypi-dependencies.html

Mixing dependencies across dependency closures as you have in your example is indeed problematic and not supported. You may wish to explore other rules that might offer the behaviour that you desire.

@mering
Copy link
Author

mering commented Nov 4, 2024

The recommendation is to produce a single lockfile for your dependencies and register that in your hub repository. See https://rules-python.readthedocs.io/en/latest/pypi-dependencies.html

Mixing dependencies across dependency closures as you have in your example is indeed problematic and not supported. You may wish to explore other rules that might offer the behaviour that you desire.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies. I can find at least 170 such package versions currently: https://github.com/search?q=repo%3Abazelbuild%2Fbazel-central-registry%20pip.parse&type=code. This problem will grow every day as more packages migrate to Bzlmod and are uploaded to BCR.

@rickeylev
Copy link
Collaborator

rickeylev commented Nov 4, 2024

Unfortunately, a canonical target for every dependency doesn't Just Work, either.

Going back to the Module A/B/C example, lets pretend there is a single repo for each dependency. We know that there are different requirements.txt files, which means there are different "concrete" implementions for a dependency (in the example, protobuf).

Which concrete implementation to use depends on the final consumer (e.g. the binary / requirements.txt), not the point of need (module A's target needing protobuf). However, an upstream target can't know what its downstream targets's constraints are. i.e., module A's target can't know whether it's module A, module B, or module C that is consuming it. Thus, the point-of-need can't specify what concrete implementation to use. Similarly, the final consumer can't arbitrarily override what every point-of-need says.

re: "[libraries] takes part in version resolution": part of how requirements.txt works is you can arbitrarily pin a dependency to a version. It's then up to the resolver to find a closure with that constraint. Solving that is outside the scope and ability of Bazel (solving a dependency closure's constraints can take several minutes sometimes). My point here is that picking which concrete implementation to use isn't simply decided based on python version or OS. It can be arbitrary and out-of-band information.

The only way to make something resembling this work is something like:

  1. Every dependency has a flag to select which concrete implementation to use
  2. Binaries accept an arbitrary number of flags they transition on to affect (1)

This isn't a very tenable design at scale, though. It's fragile, unwieldy, and leads to configuration state explosion.

The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

I think this is the basic point of misunderstanding -- rules_python supports multiple dependency closures in a single build. This is based on user request -- not every repo is a monorepo with a strict one-version policy as enforced within Google.

As a general rule, you can't mix "bazel dependency management" (e.g. implicit deps on library rules) with "external dependency management" (e.g. requirements.txt) when it comes to user dependencies. The two methods have separate dependency closures and aren't aware of each other, and thus can't safely interoperate.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies.

In this situation, the best a BCR module can do is:

  • Expose a e.g. py_library that contains the module's code
  • These targets do not depend on (e.g. have py_library.deps edges) to code not owned by the module
  • Expose a requirements.txt snippet, or document the pypi packages needed, that a consumer needs to put in their requirements.txt

A mediocre middle ground might be to do something like:

  • Always generate e.g. an @pypi repo.
  • Allow e.g. pip.override to "register" aliases into it that point elsewhere.
  • Something like grpc then depends on @pypi//protobuf, as a way to say "I need this, but let someone else point the label to something".

This is, effectively, similar to the deprecated bind() function from workspace; it's a form of late binding. However, as I described above, given a repo with multiple requirements.txt, there can be conflicting answers for what the correct and neccesary binding is. (I'll note this is somewhat already possible today by having the implicit dep point to a user-defined flag, or code generated by workspace/bzlmod configuration).

@aignas
Copy link
Collaborator

aignas commented Nov 4, 2024

Related issue - #2094

@groodt
Copy link
Collaborator

groodt commented Nov 5, 2024

Yes, this comes up from time to time.

I'd still value a realistic reproduction in a realistic repo if thats possible please @mering? Most of the examples you've highlighted in the BCR search seem to be related to grpc? So perhaps there is a nicer way to solve this in rules_proto or similar that we can then document. The only requirement for grpc generated code and the runtime in a userspace build is that the protobuf versions are compatible. It doesn't really require that the runtime code is using a grpc dependency provided by the grpc BCR module. This is why Im continuing to ask for a realistic example that demonstrates the concrete problem, rather than discussing problems in an abstract way.

The confusion and conflation of build dependencies that are used by bzlmod modules vs application dependencies that become part of a build output is often a point of confusion. The way that build dependencies that happen to use python as part of their extensions, is that they should be isolated as @aignas points out in #2094

It often gets compared to golang dependencies, but thats really just because golang and bzlmod happen to use the same dependency resolution algorithms. Java and Python application dependency resolvers do differ from golang and indeed there isn't a reason for all languages to adopt the golang approach.

@mering
Copy link
Author

mering commented Nov 5, 2024

Which concrete implementation to use depends on the final consumer (e.g. the binary / requirements.txt), not the point of need (module A's target needing protobuf). However, an upstream target can't know what its downstream targets's constraints are. i.e., module A's target can't know whether it's module A, module B, or module C that is consuming it. Thus, the point-of-need can't specify what concrete implementation to use. Similarly, the final consumer can't arbitrarily override what every point-of-need says.

I agree but would argue that the final consumer is in the best position to decide as it has all information available.

re: "[libraries] takes part in version resolution": part of how requirements.txt works is you can arbitrarily pin a dependency to a version. It's then up to the resolver to find a closure with that constraint. Solving that is outside the scope and ability of Bazel (solving a dependency closure's constraints can take several minutes sometimes). My point here is that picking which concrete implementation to use isn't simply decided based on python version or OS. It can be arbitrary and out-of-band information.

While it is true that the Bazel and the pip dependency solvers use slightly different mechanics, they are compatible enough in most cases. While not being a perfect fit, I see more problems solved when generating one Bazel repo per dependency as opposed to one Bazel repo per dependency closure.

The way this is handled in golang, is that every dependency is its own Bazel repo and as such takes part in version resolution. This makes sure that every package only exists once within a given Bazel workspace.

I think this is the basic point of misunderstanding -- rules_python supports multiple dependency closures in a single build. This is based on user request -- not every repo is a monorepo with a strict one-version policy as enforced within Google.

One of the main reasons to use Bazel in the first place is to support big (mono-)repos. Otherwise, the overhead is usually not worth it. So I would argue that most users of rules_python (including us) actually want a single dependency closure in a single build. Why would you want multiple dependency closures within a single workspace? Why don't you just create another workspace if you want separate ones?

As a general rule, you can't mix "bazel dependency management" (e.g. implicit deps on library rules) with "external dependency management" (e.g. requirements.txt) when it comes to user dependencies. The two methods have separate dependency closures and aren't aware of each other, and thus can't safely interoperate.

This is not true, see for example the golang example. People are usually using Bazel for a reason and explicitly chose to follow Bazels way of managing dependencies but then need some way to add dependencies from other ecosystems. So they should merge into a single dependency closure in order to safely interoperate.

This is not possible for dependencies added from BCR which add their own Python code and are not available on pypi.org. We do not have control over these dependencies.

In this situation, the best a BCR module can do is:

  • Expose a e.g. py_library that contains the module's code
  • These targets do not depend on (e.g. have py_library.deps edges) to code not owned by the module
  • Expose a requirements.txt snippet, or document the pypi packages needed, that a consumer needs to put in their requirements.txt

I would argue that any target should completely list its dependencies. Requiring addition of transitive packages into an unrelated consumers is an anti-pattern.

A mediocre middle ground might be to do something like:

  • Always generate e.g. an @pypi repo.
  • Allow e.g. pip.override to "register" aliases into it that point elsewhere.
  • Something like grpc then depends on @pypi//protobuf, as a way to say "I need this, but let someone else point the label to something".

This is, effectively, similar to the deprecated bind() function from workspace; it's a form of late binding. However, as I described above, given a repo with multiple requirements.txt, there can be conflicting answers for what the correct and neccesary binding is. (I'll note this is somewhat already possible today by having the implicit dep point to a user-defined flag, or code generated by workspace/bzlmod configuration).
Related issue - #2094

This would require the root workspace to know about all the dependencies to manually pick a specific version (similar to the WORKSPACE model). People didn't like it and created Bzlmod as a result.


I'd still value a realistic reproduction in a realistic repo if thats possible please @mering? Most of the examples you've highlighted in the BCR search seem to be related to grpc? So perhaps there is a nicer way to solve this in rules_proto or similar that we can then document. The only requirement for grpc generated code and the runtime in a userspace build is that the protobuf versions are compatible. It doesn't really require that the runtime code is using a grpc dependency provided by the grpc BCR module. This is why Im continuing to ask for a realistic example that demonstrates the concrete problem, rather than discussing problems in an abstract way.

This problem comes up every other week in our big monorepo and usually requires days to figure out the right order of dependencies to work around if possible at all.
The last occurrences have been about a transitive dependency to protobuf which hasn't even been specified in our requirements.in file but only generated via pip-compile. One dependency closure or a transitive dependency defined an older version of Protobuf which wasn't compatible with what other parts of our code base assumed. This problem appeared out of nowhere just by adding or removing a dependency or re-ordering dependencies.

The confusion and conflation of build dependencies that are used by bzlmod modules vs application dependencies that become part of a build output is often a point of confusion. The way that build dependencies that happen to use python as part of their extensions, is that they should be isolated as @aignas points out in #2094

One can argue at length on what people should be doing but I guess we have to face reality and see how users are actually using it. The more closely we align to the way Bazel handles things, the more intuitive it is for users and the more likely it is that users will use the library the way it is intended.

It often gets compared to golang dependencies, but thats really just because golang and bzlmod happen to use the same dependency resolution algorithms. Java and Python application dependency resolvers do differ from golang and indeed there isn't a reason for all languages to adopt the golang approach.

I think it doesn't bring value to discuss the advantages and disadvantages of various dependency resolvers here. But maybe the Bazel one is good enough for most use cases and the few exceptions could be covered by override mechanisms.


Why not having both?
rules_python could continue to generate fixed dependency closures in a repo per dependency closure. Additionally, it could create one repo per dependency which could then use Bazel version resolution scheme which is not perfect but probably good enough for most cases. Users could then pick which repo they want to depend on. WDYT?

@groodt groodt added the need: repro Needs a minimal reproduction label Nov 5, 2024
@aignas
Copy link
Collaborator

aignas commented Nov 7, 2024

Just to add my personal take here as I have worked a lot on 3rd party dependency handling.

  • rules_python does not have a dependency resolver that would do resolution in a way that bzlmod does. We merely convert the requirements.txt file into a BUILD.bazel file that allows users to include 3rd party repos. Attempting to do resolution could be doable, but I haven't looked at it and since rules_python is a volunteer project I have limited time to take on new things.
  • Having both - i.e. allowing the root module to override hub repos has been proposed feat: add ability to override hub name #1829, but the original author has not finished it, maybe a pip.override or something of sorts that allows the root module to provide the requirements.txt files could give the user that control. @mering, would this essentially solve what you are after, i.e. having the control?

As @groodt mentioned, I think having a small reproducible example which highlights the short comings of rules_python would be good. So far I've only heard about short comings of protobuf bazel integration - the way python sources are provided could be the main problem here. Is there an XY problem here ref?

EDIT: there is also work by aspect in aspect_rules_py here where they have a way to resolve the dependency tree based on what the terminal targets (i.e. py_binary and py_test) specify, so this is another knob that is missing from rules_python right now, but could technically be implemented if there is willingness from the community to help drive this. See the resolutions attribute in the said rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need: repro Needs a minimal reproduction
Projects
None yet
Development

No branches or pull requests

4 participants