Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add a non-touching ak.zip, called 'ak.zip_no_broadcast' #3390

Merged
merged 7 commits into from
Jan 31, 2025

Conversation

pfackeldey
Copy link
Collaborator

@pfackeldey pfackeldey commented Jan 28, 2025

This PR adds a non-touching version of ak.zip called "ak.unsafe_zip". This is useful to re-arrange record arrays in coffea without touching any contents.

The price we're paying to achieve non-touching zipping behavior is safety, i.e. broken arrays can be constructed. ak.unsafe_zip assumes that all input arrays have the same layouts (and lengths!), and currently can only work on ak.contents.NumpyArray or ak.contents.ListOffsetArray. Nesting record arrays can be achieved by nesting ak.unsafe_zip calls, similar to how ak.zip works currently.

For usage with coffea, this needs to be daskified similar to ak.zip, see: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/structure.py#L1272-L1344

cc @nsmith-

See also: dask-contrib/dask-awkward#536


non-touching show-case:

import awkward as ak

tracer = ak.Array([[1,2], [], [3]], backend="typetracer")

tracer, report = ak.typetracer.typetracer_with_report(tracer.layout.form_with_key(), highlevel=True)

ak.unsafe_zip({"foo": tracer, "bar": tracer})
# <Array-typetracer [...] type='## * var * {foo: int64, bar: int64}'>

print(report)
# <TypeTracerReport with 1 shape_touched, 0 data_touched>

No data is touched. However, touching shape can not be avoided, but for this we have unknown_length already.

@pfackeldey pfackeldey marked this pull request as ready for review January 28, 2025 22:08
@pfackeldey pfackeldey requested review from jpivarski and ianna January 28, 2025 22:08
@pfackeldey
Copy link
Collaborator Author

pfackeldey commented Jan 28, 2025

I think we need to be careful here with what we can check without touching data. I could add e.g. checking that all lengths of all given layouts are the same, which would touch all shapes instead of one, but is significantly more safe. I'm curious where we should draw the boundary of how "unsafe" this function should be.
I'm happy to get any feedback about where we should draw this boundary.

edit: I added checking for the same lengths as otherwise this could be really dangerous. Touching lengths is anyway ok with unknown_length.

@agoose77
Copy link
Collaborator

@pfackeldey my feeling for an unsafe_zip like function was that we should perform a non-broadcasting assertion at runtime. This would be cheaper than ak.zip, because it only copies buffers incidentally when e.g. comparing offsets vs starts. This would be safer because it won't permit invalid arrays; it will fail with an error instead.

To implement such a function, we would build the result form using the type of e.g. the first array (or to be smarter, whichever array has the most "well-known" type above the record).

I think this would require the dak.unsafe_zip overload to perform the runtime check, which will need to e.g. handle placeholders in the zip-like call.

Just food for thought :)

@pfackeldey
Copy link
Collaborator Author

@pfackeldey my feeling for an unsafe_zip like function was that we should perform a non-broadcasting assertion at runtime. This would be cheaper than ak.zip, because it only copies buffers incidentally when e.g. comparing offsets vs starts. This would be safer because it won't permit invalid arrays; it will fail with an error instead.

To implement such a function, we would build the result form using the type of e.g. the first array (or to be smarter, whichever array has the most "well-known" type above the record).

I think this would require the dak.unsafe_zip overload to perform the runtime check, which will need to e.g. handle placeholders in the zip-like call.

Just food for thought :)

Hi @agoose77, thanks for your comment! I assumed you gave this already some thoughts a while ago in dask-contrib/dask-awkward#536, so I'm really happy to have your input.

I'm not sure what kind of assertion you have in mind. The only way this ak.unsafe_zip can be dangerous (as far as I can see right now) would be if the offsets are different in the ListOffsetArray case. At runtime I could compare them, but only the ones that are not PlaceholderArrays, because otherwise I would have to touch all of them, which is what we want to avoid in the first place.

I could simply add this assertion here to enforce the same values for non-PlaceholderArray offsets at runtime, i.e. when the backend is not "typetracer".

What else would be dangerous? I'm not sure I understood your suggestion fully yet.

Copy link
Collaborator

@ianna ianna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfackeldey - looks good to me! Thanks! I wonder if a test where this operation fails would be useful?

src/awkward/operations/ak_unsafe_zip.py Outdated Show resolved Hide resolved
@agoose77
Copy link
Collaborator

@pfackeldey I'm replying to

The price we're paying to achieve non-touching zipping behavior is safety, i.e. broken arrays can be constructed. ak.unsafe_zip assumes that all input arrays have the same layouts (and lengths!), and currently can only work on ak.contents.NumpyArray or ak.contents.ListOffsetArray. Nesting record arrays can be achieved by nesting ak.unsafe_zip calls, similar to how ak.zip works currently.

we can 100% avoid constructing invalid arrays at runtime. We just need to ensure that enforcing the check doesn't trigger touching. The way to do this would be:

  1. Rename ak.unsafe_zip be ak.zip_no_broadcast (we don't need a different name, it's just to illustrate the intention)
  2. Have ak.zip_no_broadcast check lengths during zipping, and error if they don't match
  3. Have dak.zip_no_broadcast use a different code path (e.g. running ak.zip_no_broadcast on the meta that has lost its report) to figure out the final form, and then ensure that the task-graph runs the ak.zip_no_broadcast function.

Maybe that's what you're proposing, I'm not totally sure!

@pfackeldey
Copy link
Collaborator Author

I'll respond to both of you directly @agoose77 and @ianna:

  1. Initially, I was lacking 2 checks that allowed to construct invalid arrays: (1) checking the lengths; (2) checking that the offsets are the same at runtime for non-Placeholder and non-Typetracer arrays. (1) is already there in this PR; (2) will be added soon.
  2. I'm happy to renamed it, especially given that the two aforementioned checks prohibit constructing invalid arrays - that was not the case initially!
  3. the dak implementation would just be the more-or-less a copy of the dak.zip implementation, I don't see a reason why there's a difference (but that's something we can follow up on in dask-awkward)

Let me add the second check and rename the function, then I think it's good 👍

@pfackeldey
Copy link
Collaborator Author

Ok, @ianna and @agoose77 can you have another look at it?
I think I covered all cases:

  • all layouts are NumpyArrays -> check that their lengths are equal
  • all layouts are ListOffsetArrays -> check that their contents lengths are equal and check (only in non-typetracing mode) that the offsets (that have numerical values) are equal
  • otherwise raise an error

@pfackeldey pfackeldey changed the title feat: add a non-touching ak.zip, called 'ak.unsafe_zip' feat: add a non-touching ak.zip, called 'ak.zip_no_broadcast' Jan 30, 2025
@ianna ianna merged commit 5aa9698 into main Jan 31, 2025
40 checks passed
@ianna ianna deleted the pfackeldey/feat_add_unsafe_zip branch January 31, 2025 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants