Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for awkward 2.3 #1040

Merged
merged 6 commits into from
Jul 24, 2023
Merged

Fix for awkward 2.3 #1040

merged 6 commits into from
Jul 24, 2023

Conversation

ivirshup
Copy link
Member

@ivirshup ivirshup commented Jul 6, 2023

TODO

@codecov
Copy link

codecov bot commented Jul 6, 2023

Codecov Report

Merging #1040 (0390f36) into main (bd47cf9) will increase coverage by 0.82%.
The diff coverage is 100.00%.

❗ Current head 0390f36 differs from pull request most recent head 9c21be0. Consider uploading reports for the commit 9c21be0 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1040      +/-   ##
==========================================
+ Coverage   84.28%   85.10%   +0.82%     
==========================================
  Files          35       35              
  Lines        4932     5258     +326     
==========================================
+ Hits         4157     4475     +318     
- Misses        775      783       +8     
Impacted Files Coverage Δ
anndata/_core/views.py 88.26% <100.00%> (ø)

... and 4 files with indirect coverage changes

@@ -273,7 +273,7 @@ def as_view_awkarray(array, view_args):
"Please open an issue in the AnnData repo and describe your use-case."
)
array = ak.with_parameter(array, _PARAM_NAME, (parent_key, attrname, keys))
array = ak.with_parameter(array, "__array__", "AwkwardArrayView")
array = ak.with_parameter(array, "__list__", "AwkwardArrayView")
Copy link

@agoose77 agoose77 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, although we should set this only on list-types. We don't yet have a function to do this, so we probably need to introduce a transform function here for now:

def with_list(array, name):
    def apply(layout, **kwargs):
        if layout.is_list:
            return layout.with_parameter("__list__", name)
    return ak.transform(apply, array)

The type string could be better here, but it will be OK for now I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need to bother applying this recursively, as we can only mutate the top level array, right?

I've given an example of this an asked a little more here: #1040 (comment)

@@ -273,7 +273,7 @@ def as_view_awkarray(array, view_args):
"Please open an issue in the AnnData repo and describe your use-case."
)
array = ak.with_parameter(array, _PARAM_NAME, (parent_key, attrname, keys))
array = ak.with_parameter(array, "__array__", "AwkwardArrayView")
array = ak.with_parameter(array, "__list__", "AwkwardArrayView")
return array

ak.behavior["AwkwardArrayView"] = AwkwardArrayView
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ak.behavior["AwkwardArrayView"] = AwkwardArrayView
ak.behavior["AwkwardArrayView"] = AwkwardArrayView
ak.behavior["*", "AwkwardArrayView"] = AwkwardArrayView

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make the definition recursive, right?

(a) Would ak.with_parameter(array, "__list__", None) still remove that behavior from the children of this array?
(b) What behavior would this change allow? E.g. does __setitem__ even work for the children of this array?

For example, with current behavior (e.g. at commit: 220d2e2):

import anndata as ad, numpy as np, awkward as ak

a = ad.AnnData(
    np.ones((3, 2)),
    obsm={
        "awk": ak.Array(
            [
                {"a": {"b": [1, 2, 3]}},
                {"a": {"b": [1, 2]}},
                {"a": {"b": [5, 6]}}
            ]
        )
    }
)

v = a[:2]

# This doesn't "work"
v.obsm["awk"]["a"]["c"] = [1, 2]
assert "c" not in v.obsm["awk"]["a"].fields

# But neither does:
a.obsm["awk"]["a"]["c"] = [1, 2, 3]
assert "c" not in a.obsm["awk"]["a"].fields

# These both work as expected:
v.obsm["awk"]["a", "c"] = [1, 2]
a.obsm["awk"]["a", "c"] = [1, 2, 3]

assert "c" in v.obsm["awk"]["a"].fields
assert "c" in a.obsm["awk"]["a"].fields

Copy link

@agoose77 agoose77 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(a) Would ak.with_parameter(array, "__list__", None)

No, the same function with_list should be used, with a value of None.

(b) What behavior would this change allow? E.g. does setitem even work for the children of this array?

The main change in Awkward is to replace the powerful __array__ with a much tighter-scoped __list__ parameter. This means that if your top-level layout is not an is_list type, it won't have any effect upon the array class (and in future, would error). The with_list function behaves like with_name, in that it descends until if finds a list, sets the parameter, and exits.

Now that I think about it, the most comprehensive solution would set either __list__ or __record__, whichever appears first. That would look like

def with_any_name(array, name):
    def apply(layout, **kwargs):
        if layout.is_list:
            return layout.with_parameter("__list__", name)
        elif layout.is_record:
            return layout.with_parameter("__record__", name)
    return ak.transform(apply, array)

The reason that we need to recurse is that __setitem__ is recursive; it will find a record at any level of nesting, and update the layout accordingly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually, I suppose that you are only worried about mutability from fields (which is the only kind Awkward Arrays support). Therefore, we should just use with_name here: if the layout doesn't have any records, it can't be mutated. So that's the proper solution after all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I suppose that if you only worry about mutability from fields (which is the only kind Awkward Arrays support), then we should just use with_name here: if the layout doesn't have any records, it can't be mutated.

👍

Make sense to me. So, something like:

ak.with_name(array, "AwkwardArrayView")

instead of:

ak.with_parameter(array, "__list__", "AwkwardArrayView")

?

This does seem to be causing some test failures, but I haven't looked too deeply at that yet.


Btw, the current approach works even if there are no lists. E.g.:

a = ad.AnnData(
    np.ones((3, 2)),
    obsm={
        "awk": ak.Array(
            [
                {"a": {"b": 1}},
                {"a": {"b": 2}},
                {"a": {"b": 3}}
            ]
        )
    }
)

Is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the record name means that users won't be able to set their own record names without losing the anndata view. I think that's actually OK - behavior classes are somewhat at odds with anndata's needs here, and I think it's fine for a user to need to write

This was true before (I think?)

Would it be true with the ak.with_parameter(array, "__list__", "AwkwardArrayView") approach?

Playing around a bit:

two = ak.Array([[{"x": 0.9, "y": 1}, {"x": 2, "y": 2.2}, {"x": 2.9, "y": 3}],
                [],
                [{"x": 3.9, "y": 4}, {"x": 5, "y": 5.5}],
                [{"x": 5.9, "y": 6}],
                [{"x": 6.9, "y": 7}, {"x": 8, "y": 8.8}, {"x": 8.9, "y": 9}]],
               with_name="point")

ak.with_name(two, "AwkwardArrayView").typestr
#  '5 * var * AwkwardArrayView[x: float64, y: float64]'

ak.with_parameter(two, "__list__", "AwkwardArrayView").typestr
# '5 * [var * point[x: float64, y: float64], parameters={"__list__": "AwkwardArrayView"}]'

I think the with_parameter result better fits my mental model of what we're trying to do.

Copy link

@agoose77 agoose77 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__list__ is defined only for is_list types. Here's an example of how fragile that is if we only set the top-most parameter:

>>> array = ak.Array([[1,2,3], [4]])
>>> result = array.mask[[True, False]]
>>> ak.with_parameter(result, "__list__", "View").layout
<ByteMaskedArray valid_when='true' len='2'>
    <parameter name='__list__'>'View'</parameter>
    <mask><Index dtype='int8' len='2'>
        [1 0]
    </Index></mask>
    <content><ListOffsetArray len='2'>
        <offsets><Index dtype='int64' len='3'>
            [0 3 4]
        </Index></offsets>
        <content><NumpyArray dtype='int64' len='4'>[1 2 3 4]</NumpyArray></content>
    </ListOffsetArray></content>
</ByteMaskedArray>

This currently "works" (i.e. Awkward looks up an array class for this layout), but it won't forever; __list__ is defined only for is_list types (i.e. the ListOffsetArray here, not the ByteMaskedArray). We're just gradually introducing these rules, so the sanity checks will come next.

But, you can ensure that we only set the parameter on __list__ nodes using the recursive transform above. The question is what else breaks?

If you support single-records in AnnData, i.e. ak.record.Record objects that wrap an array, then these are top-level nodes that do not have is_list=True, so these would not be supported. Records can be mutated just like record arrays.

To my mind, this should be a record-feature; these are the only mutable components after all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just catching up here...

That said, I suppose that if you only worry about mutability from fields (which is the only kind Awkward Arrays support), then we should just use with_name here: if the layout doesn't have any records, it can't be mutated.

this makes sense to me.

I understand the final discussion is whether to use with_name, or with_parameter here?

  • using with_name doesn't have any obvious downsides except that the name parameter is lost? In that case we'd probably want to update this code to raise a NotImplementedError when a name is detected instead.
  • Using with_parameter is bound to break when there is no ListType in the array. Ist this something that can actually happen, as we require at least one dimension to be aligned to AnnData (except for adata.uns ofc.).
  • Any updates on xarray-style "attrs", global and per-record field scikit-hep/awkward#1391 (comment)? Back then we discussed this as the ideal solution once it is ready.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see way more weird behavior when I use with_name. E.g. subsets carrying around the behavior.

I prefer the idea of the top level object carrying around the behavior rather than applying it recursivley, since we need to be much more careful about removing this behavior. Right now, with_parameter seems to be working for cases where @agoose77 says it shouldn't. E.g. I don't think:

ak.Array([
    {"a": 1},
    {"a": 2},
    {"a": 3},
])

Is a list_type, but seems to have the correct behavior using with_parameter but not with with_name.


I think I'm just going to pin awkward for now, since this is breaking our CI and it's really not obvious what the solution here is going to be.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, __list__ is only supposed to work with list-types; we defined it such. It works for other nodes for now, because it's a gradual addition.

__record__ will definitely always work. Whilst we can't scope it to the top-level node, we can scope the view-metadata there, and ensure that the view class doesn't trigger a copy if self changes.

The metadata pr linked above won't address the class resolution, but will let us store this metadata in a better location!

We may need a new record/list-class parameter - array was too multi purpose. Yet, record classes do work here, and can easily be made to only copy when bound to the top-level array.

I'm on holiday at the moment, so tagging @jpivarski to think about whether __record__ is sufficient.

@ivirshup ivirshup mentioned this pull request Jul 10, 2023
@ivirshup
Copy link
Member Author

@ilan-gold could you give this a look?

@ivirshup ivirshup requested a review from ilan-gold July 20, 2023 13:13
@ivirshup
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ivirshup
Copy link
Member Author

@agoose77 @grst @jpivarski @ilan-gold

Had a discussion about this, with the following conclusions:

  • We'll merge as is
    • So we don't put an upper bound on awkward
  • @grst will try to see if he can figure out with_name later in the week
  • @jpivarski @agoose77, if we can't figure out with_name, would you be up for making a release candidate before making a change that changes this?

@ivirshup ivirshup enabled auto-merge (squash) July 24, 2023 13:08
@ivirshup ivirshup merged commit 6c09341 into scverse:main Jul 24, 2023
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Jul 24, 2023
ivirshup added a commit that referenced this pull request Jul 24, 2023
@grst grst mentioned this pull request Jul 24, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Future changes to Awkward Array behavior class resolution
3 participants