-
Notifications
You must be signed in to change notification settings - Fork 4
Dfview #297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Dfview #297
Conversation
add get_spans() in Field class, similar to get_spans() in Session class
to add CSVDataset file as the import required in module init
…d string, indexed string, etc.
…ction in core.operations. Provide get_spans methods in fields using data attribute.
Modify the get_spans functions in Session to call field method and operation method.
exetera/core/fields.py
Outdated
| if self._filter_wrapper is None: # poential returns: raise error or return a full-index array | ||
| return None | ||
| else: | ||
| return self._filter_wrapper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to return a read-only field array for this.
exetera/core/fields.py
Outdated
| """ | ||
| Return if the dataframe's name matches the field h5group path; if not, means this field is a view. | ||
| """ | ||
| if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name: | |
| return self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name |
exetera/core/fields.py
Outdated
| def __getitem__(self, item): | ||
| return self._dataset[item] | ||
| if self._field_instance.filter is not None: | ||
| mask = self._field_instance.filter[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apply the item to select indices from the mask first then use these to get values from self._dataset:
idx = mask[item]
exetera/core/fields.py
Outdated
| return self._dataset[item] | ||
| if self._field_instance.filter is not None and not isinstance(self._field_instance, IndexedStringField): | ||
| mask = self._field_instance.filter[:] | ||
| data = self._dataset[:][mask] # as HDF5 does not support unordered mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same idea here, want to get indices from mask rather than resolve all mask indices then select from result.
|
|
||
| class ReadOnlyIndexedFieldArray: | ||
| def __init__(self, field, indices, values): | ||
| def __init__(self, chunksize, indices, values, field): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: check this is right?
| start, stop = self._indices[item:item + 2] | ||
| if start == stop: | ||
| return '' | ||
| value = self._values[start:stop].tobytes().decode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, slice the mask not the data.
| def __init__(self, chunksize, indices, values): | ||
| def __init__(self, chunksize, indices, values, field): | ||
| """ | ||
| :param: chunksize: Size of each chunk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Samesies.
exetera/core/dataframe.py
Outdated
| # add filter | ||
| if filter is not None: | ||
| nformat = 'int32' | ||
| if len(filter) > 0 and np.max(filter) >= 2**31 - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils.INT64_INDEX_LENGTH
| for name, field in self._columns.items(): | ||
| newfld = field.create_like(ddf, name) | ||
| field.apply_filter(filter_to_apply_, target=newfld) | ||
| ddf._add_view(field, filter_to_apply_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if the same dataset
exetera/core/dataframe.py
Outdated
| filter_to_apply_ = val.validate_filter(filter_to_apply) | ||
|
|
||
| if ddf is not None: | ||
| if ddf is not None and ddf is not self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ddf = self if ddf is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if ddf not in (None, self)
exetera/core/dataframe.py
Outdated
| if ddf is not None: | ||
| if not isinstance(ddf, DataFrame): | ||
| raise TypeError("The destination object must be an instance of DataFrame.") | ||
| ddf._write_filter(np.where(filter_to_apply_ == True)[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest that you return the filter reference here so you can directly assign it during add_view (line 608)
exetera/core/fields.py
Outdated
| return None | ||
| else: | ||
| return self._filter_wrapper[:] | ||
| return self._filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dead code
exetera/core/fields.py
Outdated
|
|
||
| @property | ||
| def filter(self): | ||
| if self._filter_wrapper is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return filter field rather than dereferencing
exetera/core/fields.py
Outdated
| """ | ||
| self._references.remove(field) | ||
|
|
||
| def concreate_all_fields(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: concrete_all_fields
exetera/core/fields.py
Outdated
| Replaces current dataset with empty dataset. | ||
| :return: None | ||
| """ | ||
| if len(self._references) > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this check inside the notification method
| view.update(self, msg) | ||
|
|
||
| def update(self, subject, msg=None): | ||
| if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is being notified by its own field array
# It needs to notify other fields that it is about to change before the change goes ahead
| self.notify(msg) | ||
| self.detach() | ||
|
|
||
| if isinstance(subject, HDF5Field): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is being notified by the field that owns the data that it has a view of
# At present, the behavior is that it copies the data and then detaches from the view that notified it, as it
# no longer has an observation relationship with that field
exetera/core/fields.py
Outdated
| def update(self, subject, msg=None): | ||
| if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)): | ||
| self.notify(msg) | ||
| self.detach() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Detach should be the responsibility of the observer, not the subject, as the subject could instead do something clever that maintains the relationship
| def attach(self, view): | ||
| self._view_refs.append(view) | ||
|
|
||
| def detach(self, view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This detach is actually notify_deletion. This is a standard part of subject_observer, but so is detach, which is for the observer to detach from a given subject
| def attach(self, view): | ||
| self._view_refs.append(view) | ||
|
|
||
| def detach(self, view=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also deletion function (subject del observer)
| if utils.is_sorted(mask): | ||
| return self._dataset[mask] | ||
| else: | ||
| return self._dataset[np.sort(mask)][mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an issue on benchmarking: filtering/item -> hdf5 or hdf5/numpy -> filter.
exetera/core/fields.py
Outdated
| bytestr[index[ir] - np.int64(startindex): | ||
| index[ir + 1] - np.int64(startindex)].tobytes().decode() | ||
| return results | ||
| if self._field_instance.filter is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# This field is not a view so no filtered_index to deal with
| else: | ||
| mask = self._field_instance.filter[item] | ||
| if utils.is_sorted(mask): | ||
| index_s = self._indices[mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# the filtered indices represent a filter operation
We need to evaluate whether this saves anything. H5py is horrifically slow and the risk is that all of this work is lost when loading slices through its api. I would suggest just loading all of the data and applying the filtered index rather than trying to gain time here, unless we do a series of detailed benchmarks that can give us a heuristic to decide whether to do this or not
1) has hdf5 group but no dataset 2) can be re-recognized in a new session
add document
Codecov Report
@@ Coverage Diff @@
## master #297 +/- ##
==========================================
- Coverage 83.39% 82.63% -0.77%
==========================================
Files 22 22
Lines 6191 6478 +287
Branches 1247 1324 +77
==========================================
+ Hits 5163 5353 +190
- Misses 733 802 +69
- Partials 295 323 +28
Continue to review full report at Codecov.
|
df.view() to create a view from the df, contains all fields as a reference
df.apply_filter() to set filter