Dfview #297

deng113jie · 2022-05-09T14:42:33Z

df.view() creates a view of the dataframe that contains all of its fields by reference.
df.apply_filter() sets the filter.
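As a rough illustration of the idea (not the ExeTera implementation), a view can be thought of as a reference to the source field plus an optional array of row indices. `ToyField` and `ToyFieldView` below are hypothetical names used only for this sketch, with a NumPy array standing in for the HDF5 dataset.

```python
import numpy as np


class ToyField:
    """Hypothetical stand-in for a field: owns a NumPy array of values."""

    def __init__(self, values):
        self.data = np.asarray(values)


class ToyFieldView:
    """Hypothetical view: keeps a reference to the source field plus an
    optional array of row indices, and copies nothing until it is read."""

    def __init__(self, source, index_filter=None):
        self.source = source              # reference, not a copy
        self.index_filter = index_filter  # None means "all rows"

    def __getitem__(self, item):
        if self.index_filter is None:
            return self.source.data[item]
        # Select from the filter first, then fetch only those rows.
        return self.source.data[self.index_filter[item]]

    def __len__(self):
        if self.index_filter is None:
            return len(self.source.data)
        return len(self.index_filter)


ages = ToyField([31, 45, 27, 62, 19])
keep = np.array([True, False, True, True, False])

# Store the filter as row indices, similar to np.where(...)[0] in the diff.
view = ToyFieldView(ages, np.where(keep)[0])
print(view[:])    # rows 0, 2 and 3 of the source
print(len(view))
```

Because the view only holds a reference, a later change to the source data is visible through the view, which is why the thread discusses notifying views before the source changes.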

add get_spans() in Field class, similar to get_spans() in Session class

to add CSVDataset file as the import required in module init

…d string, indexed string, etc.

…ction in core.operations. Provide get_spans methods in fields using data attribute.

Modify the get_spans functions in Session to call field method and operation method.

…view

ericspod · 2022-05-23T14:11:55Z

exetera/core/fields.py

+        if self._filter_wrapper is None:  # poential returns: raise error or return a full-index array
+            return None
+        else:
+            return self._filter_wrapper


Need to return a read-only field array for this.

ericspod · 2022-05-23T14:12:24Z

exetera/core/fields.py

+        """
+        Return if the dataframe's name matches the field h5group path; if not, means this field is a view.
+        """
+        if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name:


Suggested change

if self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name:

return self._field.name[1:1+len(self.dataframe.name)] != self.dataframe.name

ericspod · 2022-05-23T14:16:21Z

exetera/core/fields.py

    def __getitem__(self, item):
-        return self._dataset[item]
+        if self._field_instance.filter is not None:
+            mask = self._field_instance.filter[:]


Apply the item to select indices from the mask first then use these to get values from self._dataset:

idx = mask[item]
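A small sketch of the difference, assuming the filter is stored as an array of row indices and using a NumPy array as a stand-in for `self._dataset` (the variable names are illustrative only):

```python
import numpy as np

dataset = np.arange(100, 110)   # stand-in for self._dataset
mask = np.array([1, 4, 5, 8])   # filter stored as row indices
item = slice(1, 3)              # the caller asked for view[1:3]

# Resolve everything first, then select twice.
resolved_all = dataset[:][mask][item]

# Narrow the mask first, then read only the rows that are needed.
idx = mask[item]
resolved_few = dataset[idx]

print(resolved_all, resolved_few)
```

Both give the same values, but the second form only touches the rows named in `idx`.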

ericspod · 2022-05-23T14:17:48Z

exetera/core/fields.py

-        return self._dataset[item]
+        if self._field_instance.filter is not None and not isinstance(self._field_instance, IndexedStringField):
+            mask = self._field_instance.filter[:]
+            data = self._dataset[:][mask]  # as HDF5 does not support unordered mask


Same idea here: take the indices from the mask first, rather than resolving all mask indices and then selecting from the result.

ericspod · 2022-05-23T14:19:09Z

exetera/core/fields.py


 class ReadOnlyIndexedFieldArray:
-    def __init__(self, field, indices, values):
+    def __init__(self, chunksize, indices, values, field):


TODO: check this is right?

ericspod · 2022-05-23T14:25:25Z

exetera/core/fields.py

+                start, stop = self._indices[item:item + 2]
+                if start == stop:
+                    return ''
+                value = self._values[start:stop].tobytes().decode()


Same here, slice the mask not the data.

ericspod · 2022-05-23T14:25:46Z

exetera/core/fields.py

-    def __init__(self, chunksize, indices, values):
+    def __init__(self, chunksize, indices, values, field):
        """
        :param: chunksize: Size of each chunk


deng113jie · 2022-05-25T09:42:02Z

exetera/core/dataframe.py

+        # add filter
+        if filter is not None:
+            nformat = 'int32'
+            if len(filter) > 0 and np.max(filter) >= 2**31 - 1:


utils.INT64_INDEX_LENGTH
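The comment above appears to point at a constant in `utils`; its definition is not shown in this thread, so the sketch below uses an assumed local value (`INT32_LIMIT`) and a hypothetical helper name (`index_dtype_for`) to show the width check from the diff in isolation.

```python
import numpy as np

# Assumed value for illustration only; the real constant lives in utils
# and its definition is not shown in this thread.
INT32_LIMIT = 2**31 - 1


def index_dtype_for(filter_indices):
    """Return 'int32' when every index is below the limit, else 'int64'."""
    filter_indices = np.asarray(filter_indices)
    if filter_indices.size > 0 and filter_indices.max() >= INT32_LIMIT:
        return 'int64'
    return 'int32'


print(index_dtype_for([0, 5, 12]))
print(index_dtype_for([2**31 - 1]))
```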

deng113jie · 2022-05-25T09:59:33Z

exetera/core/dataframe.py

            for name, field in self._columns.items():
-                newfld = field.create_like(ddf, name)
-                field.apply_filter(filter_to_apply_, target=newfld)
+                ddf._add_view(field, filter_to_apply_)


check if the same dataset

deng113jie · 2022-05-25T10:02:00Z

exetera/core/dataframe.py

        filter_to_apply_ = val.validate_filter(filter_to_apply)
-
-        if ddf is not None:
+        if ddf is not None and ddf is not self:


ddf = self if ddf is None

if ddf not in (None, self)

atbenmurray · 2022-05-12T13:13:41Z

exetera/core/dataframe.py

        if ddf is not None:
            if not isinstance(ddf, DataFrame):
                raise TypeError("The destination object must be an instance of DataFrame.")
+            ddf._write_filter(np.where(filter_to_apply_ == True)[0])


I would suggest that you return the filter reference here so you can directly assign it during add_view (line 608)

atbenmurray · 2022-05-12T13:23:28Z

exetera/core/fields.py

+            return None
+        else:
+            return self._filter_wrapper[:]
+        return self._filter


atbenmurray · 2022-05-12T13:23:46Z

exetera/core/fields.py

+
+    @property
+    def filter(self):
+        if self._filter_wrapper is None:


Return the filter field rather than dereferencing it.

atbenmurray · 2022-05-12T13:24:29Z

exetera/core/fields.py

+        """
+        self._references.remove(field)
+
+    def concreate_all_fields(self):


typo: concrete_all_fields

atbenmurray · 2022-05-12T13:43:21Z

exetera/core/fields.py

        Replaces current dataset with empty dataset.
        :return: None
        """
+        if len(self._references) > 0:


You can do this check inside the notification method

atbenmurray · 2022-05-25T10:16:45Z

exetera/core/fields.py

+            view.update(self, msg)
+
+    def update(self, subject, msg=None):
+        if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)):


# This field is being notified by its own field array
# It needs to notify other fields that it is about to change before the change goes ahead

atbenmurray · 2022-05-25T10:18:24Z

exetera/core/fields.py

+            self.notify(msg)
+            self.detach()
+
+        if isinstance(subject, HDF5Field):


# This field is being notified by the field that owns the data that it has a view of
# At present, the behavior is that it copies the data and then detaches from the view that notified it, as it
# no longer has an observation relationship with that field

atbenmurray · 2022-05-25T10:21:55Z

exetera/core/fields.py

+    def update(self, subject, msg=None):
+        if isinstance(subject, (WriteableFieldArray, WriteableIndexedFieldArray)):
+            self.notify(msg)
+            self.detach()


Detaching should be the responsibility of the observer, not the subject, as the subject could instead do something clever that maintains the relationship.

atbenmurray · 2022-05-25T10:24:08Z

exetera/core/fields.py

+    def attach(self, view):
+        self._view_refs.append(view)
+
+    def detach(self, view=None):


This detach is actually notify_deletion. That is a standard part of the subject-observer pattern, but so is detach, which is for the observer to detach from a given subject.
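A minimal sketch of that split of responsibilities, using generic `Subject` and `Observer` classes rather than the field classes in this PR (all names here are illustrative):

```python
class Subject:
    """Owns the data and keeps a list of observers (views)."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def detach(self, observer):
        # Called by an observer that no longer wants notifications.
        self._observers.remove(observer)

    def notify(self, msg=None):
        # Iterate over a copy: observers may detach while being notified.
        for observer in list(self._observers):
            observer.update(self, msg)

    def notify_deletion(self):
        # The subject only announces; each observer decides how to react.
        self.notify('deleted')


class Observer:
    def __init__(self, subject):
        self.subject = subject
        self.log = []
        subject.attach(self)

    def update(self, subject, msg=None):
        self.log.append(msg)
        if msg == 'deleted':
            # The observer, not the subject, ends the relationship.
            subject.detach(self)
            self.subject = None


s = Subject()
a = Observer(s)
b = Observer(s)
s.notify('changing')
s.notify_deletion()
```

Keeping `detach` on the observer side leaves the subject free to handle a change in some other way that preserves the relationship.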

deng113jie · 2022-05-25T10:27:39Z

exetera/core/fields.py

+    def attach(self, view):
+        self._view_refs.append(view)
+
+    def detach(self, view=None):


also deletion function (subject del observer)

deng113jie · 2022-05-25T10:34:30Z

exetera/core/fields.py

+            if utils.is_sorted(mask):
+                return self._dataset[mask]
+            else:
+                return self._dataset[np.sort(mask)][mask]


Add an issue on benchmarking: apply the filter/item first and then read from HDF5, or read from HDF5 into NumPy and then apply the filter.
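The two strategies to compare could be sketched as below. This assumes integer row indices and uses a NumPy array as a stand-in for the HDF5 dataset; the function names are hypothetical. h5py documents that fancy-indexing coordinates must be given in increasing order, so the first variant sorts and de-duplicates the request and then restores the caller's order.

```python
import numpy as np


def filter_then_read(dataset, indices):
    """Read only the requested rows, passing sorted unique indices to the
    store, then restore the original order (and any duplicates)."""
    unique_sorted, inverse = np.unique(indices, return_inverse=True)
    return np.asarray(dataset[unique_sorted])[inverse]


def read_then_filter(dataset, indices):
    """Load the whole dataset into memory, then index it with NumPy."""
    return np.asarray(dataset[:])[indices]


dataset = np.arange(10) * 10
indices = np.array([7, 2, 2, 5])   # unsorted, with a repeat

print(filter_then_read(dataset, indices))
print(read_then_filter(dataset, indices))
```

Both return the same rows; which one is faster against a real HDF5 file is exactly what the proposed benchmark would need to establish.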

atbenmurray · 2022-05-25T10:42:46Z

exetera/core/fields.py

-                    bytestr[index[ir] - np.int64(startindex):
-                            index[ir + 1] - np.int64(startindex)].tobytes().decode()
-            return results
+            if self._field_instance.filter is None:


# This field is not a view so no filtered_index to deal with

atbenmurray · 2022-05-25T10:45:42Z

exetera/core/fields.py

+            else:
+                mask = self._field_instance.filter[item]
+                if utils.is_sorted(mask):
+                    index_s = self._indices[mask]


# the filtered indices represent a filter operation
We need to evaluate whether this saves anything. h5py is horrifically slow, and the risk is that all of this work is lost when loading slices through its API. I would suggest just loading all of the data and applying the filtered index rather than trying to gain time here, unless we do a series of detailed benchmarks that can give us a heuristic to decide whether to do this or not.

1) has hdf5 group but no dataset 2) can be re-recognized in a new session

add document

codecov-commenter · 2022-05-26T09:14:27Z

Codecov Report

Merging #297 (3153f2b) into master (bc66b84) will decrease coverage by 0.76%.
The diff coverage is 73.04%.

@@            Coverage Diff             @@
##           master     #297      +/-   ##
==========================================
- Coverage   83.39%   82.63%   -0.77%     
==========================================
  Files          22       22              
  Lines        6191     6478     +287     
  Branches     1247     1324      +77     
==========================================
+ Hits         5163     5353     +190     
- Misses        733      802      +69     
- Partials      295      323      +28

| Impacted Files | Coverage Δ |
|---|---|
| exetera/core/abstract_types.py | `63.15% <57.14%> (-0.30%)` ⬇️ |
| exetera/core/fields.py | `87.82% <69.59%> (-3.22%)` ⬇️ |
| exetera/core/dataframe.py | `85.81% <84.93%> (-1.55%)` ⬇️ |
| exetera/core/dataset.py | `94.91% <100.00%> (+0.32%)` ⬆️ |
| exetera/core/utils.py | `78.32% <100.00%> (+0.62%)` ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bc66b84...3153f2b.

jie and others added 30 commits March 11, 2021 16:17

fixing issue #86 from upstream:

a2d7008

add get_spans() in Field class, similar to get_spans() in Session class

add unit test for Field get_spans() function

62925bb

remove unuseful line comments

0e313dc

add dataset, datafreame class

e211371

Merge remote-tracking branch 'upstream/master'

39e4535

closing issue 92, reset the dataset when call field.data.clear

329a7cc

closing issue 92, reset the dataset when call field.data.clear

d9d8b02

Merge branch 'master' into patch92

f7ba342

to add CSVDataset file as the import required in module init

add unittest for field.data.clear function

21f0fa9

recover the dataset file to avoid merge error when fixing issue 92

c9363ef

fix end_of_file char in dataset.py

14fc1f3

add get_span for index string field

2d13342

unittest for get_span functions on different types of field, eg. fixe…

666073e

…d string, indexed string, etc.

Merge remote-tracking branch 'upstream/master'

73aa50e

Merge remote-tracking branch 'upstream/master' into dataframe

689cc3f

dataframe basic methods and unittest

8ba818f

more dataframe operations

abb3337

fix upstream merge conflict

3180cbd

minor fixing

9b9c420

update get_span to field subclass

55989d6

solve conflict

cd69d04

intermedia commit due to test pr 118

f2136d5

Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera

30953e3

Merge remote-tracking branch 'upstream/master' into dataframe

0dccc6e

Implementate get_spans(ndarray) and get_spans(ndarray1, ndarray2) fun…

000463d

…ction in core.operations. Provide get_spans methods in fields using data attribute.

Merge branch 'dataframe'

37972b5

Move the get_spans functions from persistence to operations.

74c1dad

Modify the get_spans functions in Session to call field method and operation method.

Merge branch 'dataframe'

bf210c4

minor edits for pull request

95c1645

Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera

5db42d2

Merge branch 'master' of https://github.com/KCL-BMEIS/ExeTera into df…

5c93b43

…view

ericspod reviewed May 23, 2022

deng113jie added 4 commits May 23, 2022 16:39

update Eric's comments

fe9cee8

minor update

c1ad9ba

update unittests for dataframe view

80c0339

minor update

6cb1d3e

deng113jie marked this pull request as ready for review May 24, 2022 16:51

deng113jie requested review from atbenmurray and ericspod May 24, 2022 16:51

KCL-BMEIS deleted a comment from codecov-commenter May 24, 2022

deng113jie commented May 25, 2022

atbenmurray reviewed May 25, 2022

deng113jie commented May 25, 2022

atbenmurray reviewed May 25, 2022

deng113jie added 2 commits May 25, 2022 17:23

add persistence over view so that view

e8cf7f2

1) has hdf5 group but no dataset 2) can be re-recognized in a new session

add unittest for view presistence

3153f2b

add document

documents on future work

135260e

6 participants
