-
Notifications
You must be signed in to change notification settings - Fork 4
implement where api #298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
implement where api #298
Conversation
Codecov Report
@@ Coverage Diff @@
## master #298 +/- ##
==========================================
+ Coverage 83.24% 83.26% +0.01%
==========================================
Files 22 22
Lines 6149 6287 +138
Branches 1247 1273 +26
==========================================
+ Hits 5119 5235 +116
- Misses 734 749 +15
- Partials 296 303 +7
Continue to review full report at Codecov.
|
|
Here is the matrix that represent the field type result of
|
|
The first part of the TODO was to accept list, tuple, and any sort of ndarray, not just bool arrays. Can we make that change? |
exetera/core/fields.py
Outdated
| raise NotImplementedError("Where does not support condition on indexed string fields at present") | ||
| cond = cond.data[:] | ||
| elif callable(cond): | ||
| raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: mehthod -> method
exetera/core/fields.py
Outdated
| a = a.data[:] | ||
| if isinstance(b, Field): | ||
| b = b.data[:] | ||
| return np.where(cond, a, b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still returning a numpy array rather than a field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still returning a numpy array rather than a field
The logic of module-level where API will be almost same as instance-level where API. Think we can focus on one first, e.g. instance-level where API.
exetera/core/fields.py
Outdated
| if not self._valid_reference: | ||
| raise ValueError("This field no longer refers to a valid underlying field object") | ||
|
|
||
| def where(self, cond:Union[list, tuple, np.ndarray, Field], b, inplace=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the callable signature to cond's type information
exetera/core/fields.py
Outdated
| result_mem_field.data.write(result_ndarray) | ||
|
|
||
| elif isinstance(self, (IndexedStringField, FixedStringField)) or isinstance(b, (IndexedStringField, FixedStringField)): | ||
| result_mem_field = IndexedStringMemField(self._session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right. Why are we causing an operation with fixed string field to output an indexed string field?
It doesn't make the logic much more complicated. Also, I would make that a separate method probably, because I can imagine us needing it elsewhere in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For FixedStringField, you can refer to the matrix I listed above. Only when two FixedStringField will generate FixedStringField, otherwise it will be IndexedStringField.
|
|
|
Sorry, accidental close |
|
Fixed string type promotion should not result in indexed strings, I think. Here is a revised version of the table below: |
|
exetera/core/fields.py
Outdated
|
|
||
| result_mem_field = None | ||
|
|
||
| if isinstance(self, IndexedStringField) and isinstance(b, IndexedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When doing the type checking need to check that it's one of two types: isinstance(self, (IndexedStringField, IndexedStringMemField)).
exetera/core/fields.py
Outdated
| cond = cond(self.data[:]) | ||
| else: | ||
| raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we could just do return where(cond, self, b) and then the rest of the body of this method can be put into the global where function.
exetera/core/fields.py
Outdated
| other_field_row_count = len(other_field.data[:]) | ||
| data_converted_to_str = np.where([True]*other_field_row_count, other_field.data[:], [""]*other_field_row_count) | ||
| maxLength = 0 | ||
| re_match = re.findall(r"<U(\d+)|S(\d+)", str(data_converted_to_str.dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U can be <U or >U
tests/test_fields.py
Outdated
| np.testing.assert_array_equal(expected, result) | ||
|
|
||
|
|
||
| WHERE_BOOLEAN_COND = RAND_STATE.randint(0, 2, 20).tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing tests for when cond is a field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently unittest for cond is a field is missing. I'm trying to add one.
So for the indexedstringfield, we will throw out the exception.
How should we deal with the FixedStringField? As we can't use string as boolean value directly, so which case should be considered True for fixedstringfield, and which case is False?
| WHERE_FIXED_STRING_TESTS = [ | ||
| (lambda f: f > 5, "create_numeric", {"nformat": "int8"}, WHERE_NUMERIC_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA), | ||
| (lambda f: f > 2, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA), | ||
| (WHERE_BOOLEAN_COND, "create_fixed_string", {"length": 3}, WHERE_FIXED_STRING_FIELD_DATA, "create_categorical", {"nformat": "int32", "key": {"a": 1, "b": 2, "c": 3}}, WHERE_CATEGORICAL_FIELD_DATA), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2300: can we also do this for float32?
tests/test_fields.py
Outdated
| np.testing.assert_array_equal(result.data[:], expected_result) | ||
|
|
||
| # reload to test FixedStringMemField | ||
| a_mem_field, b_mem_field = a_field, b_field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to before the first subtest
tests/test_fields.py
Outdated
|
|
||
| expected_result = where_oracle(cond, a_field_data, b_field_data) | ||
|
|
||
| with self.subTest(f"Test instance where method: a is {type(a_field)}, b is {type(b_field)}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to after the mem fields are created
tests/test_fields.py
Outdated
|
|
||
| # reload to test FixedStringMemField | ||
| a_mem_field, b_mem_field = a_field, b_field | ||
| if isinstance(a_field, fields.FixedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condition can be removed
tests/test_fields.py
Outdated
| a_mem_field = fields.FixedStringMemField(self.s, a_kwarg["length"]) | ||
| a_mem_field.data.write(np.array(a_field_data)) | ||
|
|
||
| if isinstance(b_field, fields.FixedStringField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
condition can be removed
tests/test_fields.py
Outdated
| b_mem_field = fields.FixedStringMemField(self.s, b_kwarg["length"]) | ||
| b_mem_field.data.write(np.array(b_field_data)) | ||
|
|
||
| with self.subTest(f"Test instance where method: a is {type(a_mem_field)}, b is {type(b_mem_field)}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all four combinations:
a_field, b_field
a_field, b_mem_field
a_mem_field, b_field
a_mem_field, b_mem_field
|
|
||
|
|
||
| @parameterized.expand(WHERE_INDEXED_STRING_TESTS) | ||
| def test_instance_field_where_return_indexed_string_mem_field(self, cond, a_creator, a_kwarg, a_field_data, b_creator, b_kwarg, b_field_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here with combinations of hdf5 and mem fields
exetera/core/fields.py
Outdated
| if isinstance(cond, (list, tuple, np.ndarray)): | ||
| cond = cond | ||
| elif isinstance(cond, Field): | ||
| if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking for both hdf5 and mem field types
exetera/core/fields.py
Outdated
| if isinstance(cond, (list, tuple, np.ndarray)): | ||
| cond = cond | ||
| elif isinstance(cond, Field): | ||
| if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking both hdf5 and mem field types
exetera/core/fields.py
Outdated
| if isinstance(cond, (list, tuple, np.ndarray)): | ||
| cond = cond | ||
| elif isinstance(cond, Field): | ||
| if isinstance(cond, (NumericField, CategoricalField)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still not checking hdf5 and mem field types
atbenmurray
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these exception handling messages are fixed, I think we are good to go
exetera/core/fields.py
Outdated
| else: | ||
| raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") | ||
| elif callable(cond): | ||
| raise NotImplementedError("module method `fields.where` doesn't support callable cond, please use instance mehthod `where` for callable cond.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"module method fields.where doesn't support callable cond parameter, please use the instance method where if you need to use a callable cond parameter"
exetera/core/fields.py
Outdated
| if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
| cond = cond.data[:] | ||
| else: | ||
| raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
| if l: | ||
| maxLength = int(l) | ||
| else: | ||
| raise ValueError("The return dtype of instance method `where` doesn't match '<U(\d+)' or 'S(\d+)' when one of the field is FixedStringField") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"The return dtype of instance method where doesn't match '<U(\d+)' or 'S(\d+)' when one of the fields is a fixed string field"
exetera/core/fields.py
Outdated
| if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
| cond = cond.data[:] | ||
| else: | ||
| raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
| if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
| cond = cond.data[:] | ||
| else: | ||
| raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
| elif callable(cond): | ||
| cond = cond(self.data[:]) | ||
| else: | ||
| raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
| if isinstance(cond, (NumericField, NumericMemField, CategoricalField, CategoricalMemField)): | ||
| cond = cond.data[:] | ||
| else: | ||
| raise NotImplementedError("Where only support condition on numeric field and categorical field at present.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
exetera/core/fields.py
Outdated
| elif callable(cond): | ||
| cond = cond(self.data[:]) | ||
| else: | ||
| raise TypeError("'cond' parameter needs to be either callable lambda function, or array like, or NumericMemField.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo, please replace with:
"where only supports callables, python sequences, numpy ndarrays, and numeric field and categorical field types for the cond parameter at present."
fix #224
TODO:
(1) Accept types Field, np.ndarray, list, tuple, for where function and methods
(2) where always returns a MemField, in case of inplace return self after changing internals to match type if necessary, otherwise returning fresh field
(3) categorical is typed down to integer, treat timestamp as float64 fields
For checking for multiple types: isinstance(cond, (list, tuple, np.ndarray))