Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix per-file config interaction with one py_binary per main #1664

Merged
merged 3 commits into from
Jan 9, 2024

Conversation

adzenith
Copy link
Contributor

@adzenith adzenith commented Jan 2, 2024

This previous PR added the ability to make a py_binary target per file if if __name__ == "__main__" tokens were found in the file. This works great in the default case, but when python_generation_mode is set to file, the plugin now attempts to make both a py_binary and a py_library target for each main file, which results in an error.

This PR modifies the behavior to work properly with per-file target generation, and adds tests for this case.

@adzenith adzenith requested a review from f0rmiga as a code owner January 2, 2024 20:14
@aignas
Copy link
Collaborator

aignas commented Jan 3, 2024

What would happen if we have lib_and_main.py, which has code usable as library code (e.g. in lib2.py) but also should have a py_binary gerated for that because of an if name main statement?

@adzenith
Copy link
Contributor Author

adzenith commented Jan 3, 2024

A py_binary can show up in another py_binary's deps just as a py_library can. The py_binary just also affords the ability to use bazel run. I would think it would be best practice to extract library code to a py_library target and not have binaries depend on each other, but it is possible and the rules have no problem with it.

I've added a new lib_and_main.py and it creates a py_binary. I've also added a dependency on it from another py_binary, and Gazelle works as expected.

Let me know your thoughts!

@siddharthab
Copy link
Contributor

+1 for the need for this change.

Without this change, per-file generation mode does not really work for us if we are not strict about having the if __name__ == "__main__" line only in files named __main__.py.

Copy link
Collaborator

@aignas aignas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I would like this to be merged, but not sure about the usage of py_binary as a py_library. Maybe @rickeylev could chime in and advise us here? Should we use py_binary as a convenient py_library implementation when a file supports both, being used as a lib and as a script, are there trade offs in doing so?

@rickeylev
Copy link
Collaborator

In general I would like this to be merged, but not sure about the usage of py_binary as a py_library. Maybe @rickeylev could chime in and advise us here? Should we use py_binary as a convenient py_library implementation when a file supports both, being used as a lib and as a script, are there trade offs in doing so?

A binary shouldn't be used as a library, insofar as Bazel dependencies are concerned. The main reasons are:

  1. The inner binary executable is still built and included into the runfiles, even though it isn't used. This means additional data in the binary, but also the additional analysis time cost of the binary.
  2. It's assumed that a binary is a "top level" node in "the graph". This lets binaries assume its OK to perform more expensive operations like flattening depsets or performing transitions. Another example is binaries are a natural place for decisions that can only be made by the user who knows the final runtime environment can be made (e.g., linking in an alternative malloc, choosing some alternative version of a dependency, specifying the python version to use, etc)
  3. Questions around visibility become more complicated. For example, one might want to expose a binary to be used only as a tool, but not as a library, as the interface for the former is much more narrow and easy to define and control. Similar questions arise for other attributes.

@siddharthab
Copy link
Contributor

Thanks @rickeylev.

Some basic thoughts that come to mind:

  1. The cleanest separation is at the file level - keep the executable part in a separate file. This means having lib.py and lib_bin.py instead of one lib_and_bin.py file. This can ensure that Gazelle can generate two separate py_library and py_binary targets automatically. If the user is not doing this separation (i.e. has elected to keep lib_and_bin.py), then they should be ready to take some performance hit and some messiness regarding visibility. This can be documented.
  2. Some of the concerns, like the executable being built and included in runfiles, can/should be addressed within starlark. When dependency is through the deps attribute, do not do these things, but do so when the dependency is through data or tools attributes, etc.
  3. Setting a restrictive visibility on py_library does not really help if an alternative is available through py_binary which can provide py_library like behavior.

With the current behavior, unless the file is named __main__.py, Gazelle gets confused if there is an if __name__ == "__main__" line in the file (in principle, this line check can also be replaced with an explicit directive).

Options I can think of:

  1. Generate a py_binary target instead of py_library. Assume that either the module is not depended upon, or if it is depended upon, then the user understands the performance implications.
  2. Generate two separate identical targets for the same file - one py_library and one py_binary, and stop Gazelle from using py_binary targets for import resolution, thus enforcing that py_binary targets are not used in deps attribute.

Copy link
Collaborator

@aignas aignas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the option 2 as it is the option with the least surprising behaviour.

We should genurate a py_library per file with additional py_binary if there is ifmain block in there. We could improve things by not generating py_library if the only code in the file is the said ifmain block or have directives to not generate a library for a particular file.

@adzenith, what do you think?

@rickeylev
Copy link
Collaborator

Some of the concerns, like the executable being built and included in runfiles, can/should be addressed within starlark. When dependency is through the deps attribute, do not do these things, but do so when the dependency is through data or tools attributes, etc.

Unfortunately, it's not so simple. I held that same opinion until a couple years ago (internally, there was a FR for exactly this, which I originally argued in favor of, but as time passed, it became more clear keeping the two separate is better architecturally).

The issue is that the expected behavior of rules is that deps-like attributes also include the runfiles of a target. And where is the executable file? It's part of the runfiles. Filtering it out of the runfiles can't be done without flattening the runfiles. So then the only other answer is a special provider with a different set of runfiles, and thus, we have re-invented the deprecated data_runfiles feature, except now we're calling to deps_runfiles (the reason data_runfiles is deprecated and discouraged is because it creates surprising behavior -- we don't expect that changing the name of the dependency edge changes the set of files received. I've been confused by that behavior several times and seen it bite others).

We're still left with the other major problem of unnecessary analysis time logic, though, as well as the other parts of points (2) and (3) I posted. It starts to turn into many special cases and special py-rule-specific behavior.

Predominately, we (within Google) have found that binaries only need libraries to make testing easier -- a unit test typically wants to import some of the functions of the binary to test them. To serve that end, the popular py_binary-wrapping macros have an opt-in argument to generate a py_library that is testonly=true. If the same source file is supposed to be both a library and binary, people have to purposefully do that to consciously face the decisions it entails. I originally argued that this testonly lib be opt-out instead, but data indicated that it wouldn't be used much (it was something unequivocal like 25%)

Generate two separate identical targets for the same file - one py_library and one py_binary

This is fine. The same source file can be in the srcs of multiple targets. I'd recommend that, if both a binary and library are generated, that the binary depend on the library. This just reduces the amount of repetition in the definitions and looks a bit cleaner more than anything, so not strictly necessary.

I don't know how advanced of analysis gazelle does, but if it can look around at other files, it could look for a test-looking file that imports the binary. Another heuristic might be to look if a file looks "entry point like", e.g., if the file that has if name==main imports main from another location to run it, it probably isn't meant to be a library itself.

@siddharthab
Copy link
Contributor

siddharthab commented Jan 5, 2024

An extension of option 2 could be to give the user some control through an in-file directive that says whether the target type should be library, binary, or both. Hopefully, this will obviate the need for intelligence within Gazelle.

Without this directive, the ifmain line check can be used to generate both types of targets, or always library targets regardless of the file content (unless the file name is __main__.py). This behavior can also be controlled through a global directive.

The in-file directive can look like:

# gazelle:py_target_type=binary
# gazelle:py_target_type=library,binary

Unfortunately, all of this makes the configurability of Gazelle more complex. But because the Python language spec itself is not strict, and Gazelle is not positioned high enough in the ecosystem for enforcing opinions, this is unavoidable.

@siddharthab
Copy link
Contributor

Unfortunately, it's not so simple.

Sigh. For this reason, for better or worse, I did not create separate r_library and r_binary when writing rules for R (my thoughts at that time). The language spec does not distinguish between what can be executed and what can not be, the build system should not either. The additional infrastructure around self-contained executable archives, etc. can be considered separately. Anyway, what's done is done.

@adzenith
Copy link
Contributor Author

adzenith commented Jan 6, 2024

This turned out to be a much more controversial change than what I was expecting! I was trying to fix what I perceived as a bug in the current behavior. Thank you all for all the comments. I read through them all but it looks like there's not yet an agreement on the best path forward - so I'll throw my hat in the ring as well.

Options I can think of:

  1. Generate a py_binary target instead of py_library. Assume that either the module is not depended upon, or if it is depended upon, then the user understands the performance implications.
  2. Generate two separate identical targets for the same file - one py_library and one py_binary, and stop Gazelle from using py_binary targets for import resolution, thus enforcing that py_binary targets are not used in deps attribute.

I would prefer the option 2 as it is the option with the least surprising behaviour.

I personally would prefer option 1. I think option 2 is actually more surprising. Why am I ending up with two targets? What are the implications of that? I think it's pushing an implementation detail onto the user, whereas with option 1 everything "just works" from the user perspective.

I also am not sure how often this will actually come up. Do people frequently have py_library targets where the source files have if __name__ == "__main__" that this change would convert to py_binary targets? If so, why do they have them if they aren't able to run them? Note also that this change only affects people who have configured Gazelle for per-file target generation. Are we trying to over-optimize for a rare case?

A binary shouldn't be used as a library, insofar as Bazel dependencies are concerned. The main reasons are:

  1. The inner binary executable is still built and included into the runfiles, even though it isn't used. This means additional data in the binary, but also the additional analysis time cost of the binary.
  2. It's assumed that a binary is a "top level" node in "the graph". This lets binaries assume its OK to perform more expensive operations like flattening depsets or performing transitions. Another example is binaries are a natural place for decisions that can only be made by the user who knows the final runtime environment can be made (e.g., linking in an alternative malloc, choosing some alternative version of a dependency, specifying the python version to use, etc)
  3. Questions around visibility become more complicated. For example, one might want to expose a binary to be used only as a tool, but not as a library, as the interface for the former is much more narrow and easy to define and control. Similar questions arise for other attributes.

I feel like users generally fall into two camps:

  1. People who just want to write Python, and don't really care about the inner workings of Bazel (most devs). They don't know what runfiles are and they aren't flattening depsets or doing anything fancy. They just want to bazel run their code.
  2. Bazel devs, who understand things like analysis time cost and complications around visibility.

I have the feeling that the first camp doesn't really mind the negative effects of depending on a *_binary target, and the second camp is knowledgeable enough to make things work the way they want in any case. I'm not sure if there's a set of people who would be upset or surprised about slightly more memory or analysis time when having a py_binary depend on a py_binary without also having the skills to remedy the situation.

The in-file directive can look like:

# gazelle:py_target_type=binary
# gazelle:py_target_type=library,binary

My preference would be simply to document the following:

  • if you have if __name__ == "__main__" and per-file generation enabled, a py_binary will be created
  • if you import that source file from another file, your target will depend on that py_binary, with possible negative effects on analysis time and runfiles size
  • to avoid this, extract library code into a separate file and don't import any files that contain if __name__ == "__main__".

Then users who just want things to work can just work, and for anybody who is looking into performance they will find that the best fix is to structure their code slightly differently (arguably better, perhaps).

Thoughts?

@siddharthab
Copy link
Contributor

+1 to all @adzenith said above.

@adzenith adzenith force-pushed the per-file-binary-target branch from 7de137b to bc90924 Compare January 8, 2024 20:16
@adzenith adzenith requested a review from rickeylev as a code owner January 8, 2024 20:16
@adzenith
Copy link
Contributor Author

adzenith commented Jan 8, 2024

I added documentation about the negative effects of depending on a py_binary as well as how to avoid them. Let me know your thoughts!

@adzenith adzenith force-pushed the per-file-binary-target branch from bc90924 to 957345e Compare January 8, 2024 20:17
Copy link
Collaborator

@aignas aignas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the documentation, I think as is it is good enough for now.

@aignas aignas added this pull request to the merge queue Jan 9, 2024
Merged via the queue into bazelbuild:main with commit 3d68f6c Jan 9, 2024
4 checks passed
@adzenith
Copy link
Contributor Author

adzenith commented Jan 9, 2024

Thank you! I appreciate it

@adzenith adzenith deleted the per-file-binary-target branch January 9, 2024 17:29
michael-christen added a commit to michael-christen/toolbox that referenced this pull request Jan 22, 2024
Changes:
- Establish a mechanism for defining external python requirements via a
`requirements.in`, generating updates to a lock file with `bazel run
//:requirements.update`
- Configure gazelle python  and protobufs
- Configures via gazelle directives to use a file based approach, stick
with the BUILD configuration, disable go, default visibility to private
and resolve py imports
- Update dev_setup to install openjdk (java), tree, ranger, ag

New Commands:
```
bazel run //:requirements.update
bazel run //:gazelle_python_manifest.update
bazel run //:gazelle
```

New Tests:
```
//:requirements_test
//::gazelle_python_manifest.test
```

Notes:
- One of the recurring issues I have every time I deal with updating
WORKSPACE is slight incompatibilities with various repos that result in
incomprehensible output errors.
- stackb/rules_proto is one of the only things that seems to support
what I'm looking for, and while it is getting steady contributions
hasn't pubilshed a release in about a year.
https://github.com/rules-proto-grpc/rules_proto_grpc. Additionally, it
could be handy to look into what it would take to make my own "shitty"
version of these rules to learn more about what's going on in gazelle's
/ bazel's internals
- bzlmod migration didn't go well for me, most tutorials out there still
use WORKSPACE, might be good to look into in the future / figure out how
to migrate piecemeal, but bypassed this and 7.0.1 bazel upgrade for now

Related Issues:
- Noting the issue with requiring a gazelle directive for proto
libraries: bazelbuild/rules_python#1703
- bazelbuild/rules_python#1664 fixes the reason
I had to "disable" my py_binary, wait for that to release and then
update. I could probably patch this in, but don't quite know how 🤷
- Sounds like there's some talk around gazelle c++
bazel-contrib/bazel-gazelle#910
- I'm helping ;) bazelbuild/rules_python#1712
(doc update)

References:
- https://github.com/bazelbuild/bazel-gazelle
- https://github.com/stackb/rules_proto
- https://github.com/bazelbuild/rules_python/tree/main/gazelle

Future Things to Look Into:
- Setup a gazelle test that checks output of `//:gazelle -- -mode diff`
is zero
- Configure rust with gazelle: https://github.com/Calsign/gazelle_rust
- A really cool / general parser / generator (what github uses for
"semantic" tool): https://tree-sitter.github.io/tree-sitter/ Could use
to make small code generator / modification? Or maybe more general
utilities?
- General compile command completion:
https://github.com/hedronvision/bazel-compile-commands-extractor
- Maybe play with https://github.com/rules-proto-grpc/rules_proto_grpc
as an alternative to stackb
- If I can't use gazelle for some things, mess around with
https://github.com/bazelbuild/buildtools/tree/master/buildozer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants