Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add field validation on structs #52

Merged
merged 18 commits into from
Apr 26, 2024

Conversation

dsgibbons
Copy link
Contributor

@dsgibbons dsgibbons commented Apr 1, 2024

Closes #43.

This fix takes advantage of the existing validation logic by recursively calling _find_errors whenever pl.Struct or pl.List(pl.Struct) columns are detected.

This fix exposes a (possible) bug with one of the pre-existing test cases test_model.test_missing_date_struct. I've currently added a @pytest.mark.skip to the failing test. Is this actually a bug, and if so, can I continue to skip this test case and raise a new issue to fix it? Do you have any insights into why Test.model_fields['c'].annotation.columns does not exist?

@dsgibbons dsgibbons force-pushed the fix/validate-fields-on-structs branch from c27ced8 to 8e6ccec Compare April 3, 2024 20:06
@dsgibbons dsgibbons marked this pull request as ready for review April 3, 2024 21:27
@thomasaarholt
Copy link
Collaborator

thomasaarholt commented Apr 4, 2024

I think this is a seriously elegant solution, and I really hope we can make it work for the general case.

I think the skipped test error is related to the following example:

Consider a nullable struct column:

import patito as pt

class XY(pt.Model):
    x: int
    y: int

class Coord(pt.Model):
    id: int
    xy: Optional[XY] # <-- nullable

What we are saying here is that we allow the following values for xy:

  • {"x":1, y:1}
  • null

But not

  • {"x":1, y:null}

Now let's take a look at how polars materializes the above three column values:

import polars as pl

schema = {"id": pl.Int64, "xy": pl.Struct({"x": pl.Int64, "y": pl.Int64})}
# note the difference between the two last elements in the struct:
data = {"id": [0, 1, 2], "xy": [{"x": 1, "y": 1}, None, {"x": None, "y": None}]}

df = pl.DataFrame(data, schema=schema)
print(df)
---
shape: (3, 2)
┌─────┬─────────────┐
│ idxy          │
│ ------         │
│ i64struct[2]   │
╞═════╪═════════════╡
│ 0   ┆ {1,1}       │
│ 1   ┆ {null,null} │
│ 2   ┆ {null,null} │
└─────┴─────────────┘

I think I was expecting null for the second row, that is, looking only at the last two rows:

shape: (3, 2)
┌─────┬─────────────┐
│ idxy          │
│ ------         │
│ i64struct[2]   │
╞═════╪═════════════╡
│ 1null        │
│ 2   ┆ {null,null} │
└─────┴─────────────┘ 

I'm going to have to have a little think about how we do this one 😊

@dsgibbons
Copy link
Contributor Author

dsgibbons commented Apr 6, 2024

@thomasaarholt I've updated the PR to correctly select Type[Model] when dealing with Optional and have applied filters on all-null subrows to encode the nullable requirement, as discussed above.

Since polars coerces None -> {null, null, ...} when dealing with structs, I think this fix is valid. The only way I can see a more "legitimate" way of implementing this would be if polars itself either:

  • Provided a pl.Optional datatype for marking structs as optional.
  • Changed its default coercion logic for handling None in the context of structs. Like @thomasaarholt, I was surprised to discover how polars currently coerces None for structs, as every other datatype I have come across (including pl.List) is simply encoded as null... there must be some good reason for doing this that is unlikely to change.

I'll likely update this PR to include a few more test cases for testing optional lists of structs and optional deeply nested structs, but I don't think the current fix will need to change. What do you think?

@thomasaarholt
Copy link
Collaborator

Super! I overall agree with you - I've asked here on the discord just to make sure the implementation is correct. This is kinda an edge case, but someone might have an opinion.

@chainyo
Copy link
Contributor

chainyo commented Apr 26, 2024

Hi any news on this PR?

@thomasaarholt
Copy link
Collaborator

Let's go ahead and merge this, as well make a new release. @dsgibbons, if you want to make a few more test cases, especially demonstrating null behaviour, in another PR, then I'd appreciate it.

@thomasaarholt thomasaarholt merged commit ef2d95c into JakobGM:main Apr 26, 2024
2 checks passed
@dsgibbons
Copy link
Contributor Author

Yes, I'm still looking at adding additional test cases @thomasaarholt. I've been a bit busy lately but hopefully I'll be able to add more test cases over the next few weeks. Thanks for accepting the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Field constraints not evaluated on structs
3 participants