-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Dataset.eval works with >2 dims
#11064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This commit removes the dependency on pandas.eval() and implements a native expression evaluator in Dataset.eval() using Python's ast module. The new implementation provides better support for multi-dimensional arrays and maintains backward compatibility with deprecated operators through automatic transformation. Key changes: - Remove pd.eval() call and replace with custom _eval_expression() method - Add _LogicalOperatorTransformer to convert deprecated operators (and/or/not) to bitwise operators (&/|/~) that work element-wise on arrays - Implement automatic transformation of chained comparisons to explicit bitwise AND operations - Add security validation to block lambda expressions and private attributes - Emit FutureWarning for deprecated constructs (logical operators, chained comparisons, parser= argument) - Support assignment statements (target = expression) in eval() - Make data variables and coordinates take priority in namespace resolution - Provide safe builtins (abs, min, max, round, len, sum, pow, any, all, type constructors, iteration helpers) while blocking __import__, open, etc. - Add comprehensive test coverage including edge cases, error messages, dask compatibility, and security validation
- Use pd.isna(ds["a"].values) instead of pd.isna(ds["a"]) since pandas type stubs don't have overloads for DataArray - Use abs() instead of np.abs() to get DataArray return type Co-authored-by: Claude <[email protected]>
The lambda and dunder restrictions emulate pd.eval() behavior rather than providing security guarantees. Pandas explicitly doesn't claim these as security measures. Co-authored-by: Claude <[email protected]>
Extract AST-based expression evaluation code to xarray/core/eval.py: - EVAL_BUILTINS dict - LogicalOperatorTransformer class - validate_expression function This addresses the review feedback to keep the Dataset class focused. Co-authored-by: Claude <[email protected]>
Extract eval tests from test_dataset.py to test_eval.py: - 35 tests covering basic functionality, error messages, edge cases, and dask - Mirrors the implementation structure (core/eval.py <-> tests/test_eval.py) - Reduces test_dataset.py by 574 lines Co-authored-by: Claude <[email protected]>
| __bool__(), which is ambiguous for multi-element arrays. | ||
| """ | ||
|
|
||
| def visit_BoolOp(self, node: ast.BoolOp) -> ast.AST: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not snake_case? I'm surprised we don't have a linter that catches this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a required convention from Python's ast.NodeTransformer. The visitor methods must be named visit_<NodeType> where <NodeType> matches the AST node class name exactly (e.g., BoolOp, UnaryOp, Compare). Using snake_case like visit_bool_op would break the visitor pattern - Python's ast module wouldn't find the methods.
[This is Claude Code on behalf of max-sixty]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the implementation, the only documentation I managed to understand:
https://github.com/python/cpython/blob/d0e9f4445a0d9039e1a2367ecee376b4b3ba7593/Lib/ast.py#L502-L506
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right — I think it needs to be visit_Foo, we can't change that. which the link seems to support?
method = 'visit_' + node.__class__.__name__
(this is Max himself!)
Address review feedback: - Convert TestEvalErrorMessages class to test_eval_error_* functions - Convert TestEvalEdgeCases class to test_eval_* functions - Convert TestEvalDask class to test_eval_dask_* functions This follows xarray's preference for standalone test functions over classes. Co-authored-by: Claude <[email protected]>
this ended up being a much bigger effort than expected
Dataset.evalbecause it's limited to 2 dims.query, because we should be more careful about changing that, it usesnumexprwhich is fast, and doesn't have a requirement for > 2 dims.queryinterface; e.g.and&or, etcI added some similar constraints that pandas has around limiting what
evalcan do. I'm not that confident that it's robust. and not sure how valuable it is.most of the added code is tests
Commentary from Claude below (+ Claude wrote the code, for transparency, albeit with lots of oversight)
This commit removes the dependency on pandas.eval() and implements a native expression evaluator in Dataset.eval() using Python's ast module. The new implementation provides better support for multi-dimensional arrays and maintains backward compatibility with deprecated operators through automatic transformation.
Key changes:
whats-new.rstapi.rst