Fix for awkward 2.3 #1040
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1040      +/-   ##
==========================================
+ Coverage   84.28%   85.10%   +0.82%
==========================================
  Files          35       35
  Lines        4932     5258     +326
==========================================
+ Hits         4157     4475     +318
- Misses        775      783       +8
@@ -273,7 +273,7 @@ def as_view_awkarray(array, view_args):
             "Please open an issue in the AnnData repo and describe your use-case."
         )
     array = ak.with_parameter(array, _PARAM_NAME, (parent_key, attrname, keys))
-    array = ak.with_parameter(array, "__array__", "AwkwardArrayView")
+    array = ak.with_parameter(array, "__list__", "AwkwardArrayView")
This is correct, although we should set this only on list-types. We don't yet have a function to do this, so we probably need to introduce a transform function here for now:
def with_list(array, name):
    def apply(layout, **kwargs):
        if layout.is_list:
            return layout.with_parameter("__list__", name)
    return ak.transform(apply, array)
The type string could be better here, but it will be OK for now I think.
I'm not sure if we need to bother applying this recursively, as we can only mutate the top level array, right?
I've given an example of this and asked a little more here: #1040 (comment)
@@ -273,7 +273,7 @@ def as_view_awkarray(array, view_args):
             "Please open an issue in the AnnData repo and describe your use-case."
         )
     array = ak.with_parameter(array, _PARAM_NAME, (parent_key, attrname, keys))
-    array = ak.with_parameter(array, "__array__", "AwkwardArrayView")
+    array = ak.with_parameter(array, "__list__", "AwkwardArrayView")
     return array


 ak.behavior["AwkwardArrayView"] = AwkwardArrayView
- ak.behavior["AwkwardArrayView"] = AwkwardArrayView
+ ak.behavior["AwkwardArrayView"] = AwkwardArrayView
+ ak.behavior["*", "AwkwardArrayView"] = AwkwardArrayView
This would make the definition recursive, right?
(a) Would ak.with_parameter(array, "__list__", None) still remove that behavior from the children of this array?
(b) What behavior would this change allow? E.g. does __setitem__ even work for the children of this array?
For example, with current behavior (e.g. at commit: 220d2e2):
import anndata as ad, numpy as np, awkward as ak

a = ad.AnnData(
    np.ones((3, 2)),
    obsm={
        "awk": ak.Array(
            [
                {"a": {"b": [1, 2, 3]}},
                {"a": {"b": [1, 2]}},
                {"a": {"b": [5, 6]}},
            ]
        )
    },
)
v = a[:2]

# This doesn't "work"
v.obsm["awk"]["a"]["c"] = [1, 2]
assert "c" not in v.obsm["awk"]["a"].fields

# But neither does:
a.obsm["awk"]["a"]["c"] = [1, 2, 3]
assert "c" not in a.obsm["awk"]["a"].fields

# These both work as expected:
v.obsm["awk"]["a", "c"] = [1, 2]
a.obsm["awk"]["a", "c"] = [1, 2, 3]
assert "c" in v.obsm["awk"]["a"].fields
assert "c" in a.obsm["awk"]["a"].fields
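One way to read the example above is that indexing with ["a"] hands back a new object, so a field assigned on it never reaches the parent, whereas a tuple key is a single assignment on the parent itself. The sketch below imitates that with a toy class; ToyRecords and _assign are hypothetical names made up for illustration and are not part of Awkward or AnnData.

```python
class ToyRecords:
    """Toy container whose indexing returns a fresh wrapper each time.

    Hypothetical illustration only; not Awkward's or AnnData's implementation.
    """

    def __init__(self, fields):
        self._fields = fields  # nested dict: field name -> dict or values

    @property
    def fields(self):
        return list(self._fields)

    def __getitem__(self, key):
        # Return a new wrapper around a shallow copy: edits to it stay local.
        return ToyRecords(dict(self._fields[key]))

    def __setitem__(self, key, value):
        path = key if isinstance(key, tuple) else (key,)
        # Rebuild the nested dicts along the path, then rebind on *this* object.
        self._fields = _assign(self._fields, path, value)


def _assign(fields, path, value):
    """Return a copy of `fields` with `value` stored at the nested `path`."""
    head, *rest = path
    new = dict(fields)
    new[head] = _assign(fields[head], tuple(rest), value) if rest else value
    return new


awk = ToyRecords({"a": {"b": [1, 2, 3]}})

awk["a"]["c"] = [1, 2, 3]      # mutates a temporary, which is then discarded
lost = "c" in awk["a"].fields

awk["a", "c"] = [1, 2, 3]      # one assignment on the container itself
kept = "c" in awk["a"].fields
```

In this toy model the chained form cannot succeed, which matches the asserts in the example and suggests the limitation sits in chained indexing rather than in the view class.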
(a) Would ak.with_parameter(array, "__list__", None)
No, the same function with_list should be used, with a value of None.
(b) What behavior would this change allow? E.g. does __setitem__ even work for the children of this array?
The main change in Awkward is to replace the powerful __array__ with a much tighter-scoped __list__ parameter. This means that if your top-level layout is not an is_list type, it won't have any effect upon the array class (and in future, would error). The with_list function behaves like with_name, in that it descends until it finds a list, sets the parameter, and exits.
Now that I think about it, the most comprehensive solution would set either __list__ or __record__, whichever appears first. That would look like
def with_any_name(array, name):
    def apply(layout, **kwargs):
        if layout.is_list:
            return layout.with_parameter("__list__", name)
        elif layout.is_record:
            return layout.with_parameter("__record__", name)
    return ak.transform(apply, array)
The reason that we need to recurse is that __setitem__ is recursive; it will find a record at any level of nesting, and update the layout accordingly.
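To make the "descend until the first list or record, set the parameter, and stop" rule concrete without depending on Awkward itself, here is a minimal sketch on a toy tree. Node and with_any_name_toy are hypothetical stand-ins invented for this illustration; the sketch assumes that returning a replacement node ends the descent on that branch, which is how I understand ak.transform to behave.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a layout node; NOT an Awkward class."""
    kind: str                                   # "list", "record", "option" or "leaf"
    children: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)


def with_any_name_toy(node: Node, name: str) -> Node:
    """Name the first list or record found on each branch, then stop descending."""
    if node.kind == "list":
        return Node(node.kind, node.children, {**node.parameters, "__list__": name})
    if node.kind == "record":
        return Node(node.kind, node.children, {**node.parameters, "__record__": name})
    # Any other node is rebuilt around its (possibly renamed) children.
    kids = [with_any_name_toy(child, name) for child in node.children]
    return Node(node.kind, kids, dict(node.parameters))


# option -> list -> record -> leaf: the list is reached first, so only it is named.
masked = Node("option", [Node("list", [Node("record", [Node("leaf")])])])
named = with_any_name_toy(masked, "AwkwardArrayView")

# record -> list -> leaf: here the record is reached first.
records = Node("record", [Node("list", [Node("leaf")])])
named_records = with_any_name_toy(records, "AwkwardArrayView")
```

The option node at the top stays unnamed in the first case, which is the point of descending instead of setting the parameter on whatever node happens to be outermost.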
Hmm, actually, I suppose that you are only worried about mutability from fields (which is the only kind Awkward Arrays support). Therefore, we should just use with_name here: if the layout doesn't have any records, it can't be mutated. So that's the proper solution after all.
That said, I suppose that if you only worry about mutability from fields (which is the only kind Awkward Arrays support), then we should just use with_name here: if the layout doesn't have any records, it can't be mutated.
👍
Makes sense to me. So, something like:
ak.with_name(array, "AwkwardArrayView")
instead of:
ak.with_parameter(array, "__list__", "AwkwardArrayView")
?
This does seem to be causing some test failures, but I haven't looked too deeply at that yet.
Btw, the current approach works even if there are no lists. E.g.:
a = ad.AnnData(
    np.ones((3, 2)),
    obsm={
        "awk": ak.Array(
            [
                {"a": {"b": 1}},
                {"a": {"b": 2}},
                {"a": {"b": 3}},
            ]
        )
    },
)
Is that intentional?
Setting the record name means that users won't be able to set their own record names without losing the anndata view. I think that's actually OK - behavior classes are somewhat at odds with anndata's needs here, and I think it's fine for a user to need to write
This was true before (I think?)
Would it be true with the ak.with_parameter(array, "__list__", "AwkwardArrayView") approach?
Playing around a bit:
two = ak.Array([[{"x": 0.9, "y": 1}, {"x": 2, "y": 2.2}, {"x": 2.9, "y": 3}],
[],
[{"x": 3.9, "y": 4}, {"x": 5, "y": 5.5}],
[{"x": 5.9, "y": 6}],
[{"x": 6.9, "y": 7}, {"x": 8, "y": 8.8}, {"x": 8.9, "y": 9}]],
with_name="point")
ak.with_name(two, "AwkwardArrayView").typestr
# '5 * var * AwkwardArrayView[x: float64, y: float64]'
ak.with_parameter(two, "__list__", "AwkwardArrayView").typestr
# '5 * [var * point[x: float64, y: float64], parameters={"__list__": "AwkwardArrayView"}]'
I think the with_parameter result better fits my mental model of what we're trying to do.
__list__ is defined only for is_list types. Here's an example of how fragile that is if we only set the top-most parameter:
>>> array = ak.Array([[1,2,3], [4]])
>>> result = array.mask[[True, False]]
>>> ak.with_parameter(result, "__list__", "View").layout
<ByteMaskedArray valid_when='true' len='2'>
    <parameter name='__list__'>'View'</parameter>
    <mask><Index dtype='int8' len='2'>
        [1 0]
    </Index></mask>
    <content><ListOffsetArray len='2'>
        <offsets><Index dtype='int64' len='3'>
            [0 3 4]
        </Index></offsets>
        <content><NumpyArray dtype='int64' len='4'>[1 2 3 4]</NumpyArray></content>
    </ListOffsetArray></content>
</ByteMaskedArray>
This currently "works" (i.e. Awkward looks up an array class for this layout), but it won't forever; __list__ is defined only for is_list types (i.e. the ListOffsetArray here, not the ByteMaskedArray). We're just gradually introducing these rules, so the sanity checks will come next.
But, you can ensure that we only set the parameter on __list__ nodes using the recursive transform above. The question is what else breaks?
If you support single-records in AnnData, i.e. ak.record.Record objects that wrap an array, then these are top-level nodes that do not have is_list=True, so these would not be supported. Records can be mutated just like record arrays.
To my mind, this should be a record-feature; these are the only mutable components after all.
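The kind of sanity check hinted at above can be pictured with a small toy: walk the tree and report every node that carries "__list__" without being a list. This is only a sketch under assumptions; Node and misplaced_list_names are hypothetical names, and nothing here reflects how Awkward actually implements or will implement its checks.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a layout node; NOT an Awkward class."""
    kind: str                                   # e.g. "option", "list", "record", "leaf"
    children: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)


def misplaced_list_names(node: Node, path: str = "root") -> list:
    """Return the paths of non-list nodes that carry a "__list__" parameter."""
    found = []
    if "__list__" in node.parameters and node.kind != "list":
        found.append(path)
    for i, child in enumerate(node.children):
        found.extend(misplaced_list_names(child, f"{path}/{i}"))
    return found


# Mirrors the layout shown above: the parameter sits on the masking node.
top_level = Node("option", [Node("list", [Node("leaf")])], {"__list__": "View"})

# The same structure with the parameter moved down onto the list node.
on_list = Node("option", [Node("list", [Node("leaf")], {"__list__": "View"})])

problems_top = misplaced_list_names(top_level)
problems_list = misplaced_list_names(on_list)
```

Under this toy rule the first tree is flagged at its root and the second passes, which is the distinction between setting the top-most parameter and using the recursive transform.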
Just catching up here...
That said, I suppose that if you only worry about mutability from fields (which is the only kind Awkward Arrays support), then we should just use with_name here: if the layout doesn't have any records, it can't be mutated.
this makes sense to me.
I understand the final discussion is whether to use with_name or with_parameter here?
- Using with_name doesn't have any obvious downsides except that the name parameter is lost? In that case we'd probably want to update this code to raise a NotImplementedError when a name is detected instead.
- Using with_parameter is bound to break when there is no ListType in the array. Is this something that can actually happen, as we require at least one dimension to be aligned to AnnData (except for adata.uns ofc.)?
- Any updates on xarray-style "attrs", global and per-record field scikit-hep/awkward#1391 (comment)? Back then we discussed this as the ideal solution once it is ready.
I see way more weird behavior when I use with_name. E.g. subsets carrying around the behavior.
I prefer the idea of the top level object carrying around the behavior rather than applying it recursively, since we need to be much more careful about removing this behavior. Right now, with_parameter seems to be working for cases where @agoose77 says it shouldn't. E.g. I don't think:
ak.Array([
    {"a": 1},
    {"a": 2},
    {"a": 3},
])
is a list_type, but it seems to have the correct behavior using with_parameter but not with with_name.
I think I'm just going to pin awkward for now, since this is breaking our CI and it's really not obvious what the solution here is going to be.
As mentioned above, __list__ is only supposed to work with list-types; we defined it such. It works for other nodes for now, because it's a gradual addition.
__record__ will definitely always work. Whilst we can't scope it to the top-level node, we can scope the view-metadata there, and ensure that the view class doesn't trigger a copy if self changes.
The metadata PR linked above won't address the class resolution, but will let us store this metadata in a better location!
We may need a new record/list-class parameter: __array__ was too multi-purpose. Yet, record classes do work here, and can easily be made to only copy when bound to the top-level array.
I'm on holiday at the moment, so tagging @jpivarski to think about whether __record__ is sufficient.
@ilan-gold could you give this a look?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@agoose77 @grst @jpivarski @ilan-gold Had a discussion about this, with the following conclusions:
Co-authored-by: Isaac Virshup <[email protected]>
TODO
added refactored with_parameter, with_name