Feature/time series dataframe #431

xingularity · 2024-10-13T14:35:19Z

This PR contains the first version of a prototype, TimeSeriesDataFrame, which preserves and extracts time series data for analysis. This feature is related to #380. The current features are as follows:

Create the timeseries_dataframe.py module under modmesh.
Create the TimeSeriesDataFrame class in the module.
TimeSeriesDataFrame uses a columnar format to store data.
It uses SimpleArrayUint64 to store the index column, which is expected to be an epoch or an incrementing unsigned integer.
It uses a list of SimpleArrayFloat64 to store every column (except the epoch column) in the CSV.
It provides a "columns" attribute, which returns a list of strings containing the column names.
It provides a get_column method to extract the data in a particular column. The return type is an ndarray.
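
The columnar layout described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: the class name ColumnarFrameSketch, the read method, and the sample data are hypothetical, and plain Python lists stand in for SimpleArrayUint64, SimpleArrayFloat64, and the ndarray return value.

```python
import csv
import io


class ColumnarFrameSketch:
    """Hypothetical stand-in for the prototype; not the PR's actual code."""

    def __init__(self):
        self._columns = []  # column names, index column excluded
        self._index = []    # stand-in for the SimpleArrayUint64 index column
        self._data = []     # one list per column, stand-in for SimpleArrayFloat64

    def read(self, text, index_name):
        # Parse CSV text and split it into an index column and data columns.
        rows = list(csv.reader(io.StringIO(text)))
        header, body = rows[0], rows[1:]
        idx = header.index(index_name)
        self._columns = [n for i, n in enumerate(header) if i != idx]
        self._index = [int(r[idx]) for r in body]
        self._data = [
            [float(r[i]) for r in body]
            for i in range(len(header)) if i != idx
        ]

    @property
    def columns(self):
        # Return a copy so callers cannot mutate the internal name list.
        return list(self._columns)

    def get_column(self, name):
        # Each column is stored contiguously, so lookup is a single index.
        return self._data[self._columns.index(name)]


frame = ColumnarFrameSketch()
frame.read("EPOCH,A,B\n1,0.5,2.0\n2,1.5,4.0\n", index_name="EPOCH")
```

Storing each column as its own array keeps a per-column read such as frame.get_column("B") from touching the other columns.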

yungyuc

Please leave in-line comments for what to review. Have all CI passing before requesting review. (You can turn on GitHub Actions in your fork. It is also productive to run the linter locally.)

For a simple change like this, please use a single commit. Squash properly.
timeseries_dataframe.py:L31: Use relative import inside modmesh.
timeseries_dataframe.py:L39: Move simple prototype to a test file.
timeseries_dataframe.py:L50: Do not use type annotation in Python.

yungyuc · 2024-10-13T22:13:11Z

modmesh/timeseries_dataframe.py

+import os
+import numpy as np
+
+from modmesh import SimpleArrayUint64, SimpleArrayFloat64


When inside modmesh, use relative import. You need the .core module for the relative import.

The latest commit contains this modification

yungyuc · 2024-10-13T22:14:30Z

modmesh/timeseries_dataframe.py

+]
+
+
+class TimeSeriesDataFrame(object):


For the very early stage prototype code, put it in unit tests. It also allows you to add basic tests for your prototype.

Test file and data have been added into the repo.

This class needs to be put in the test file because it is a prototype. The prototype is not mature enough to be put in modmesh namespace.

yungyuc · 2024-10-13T22:15:56Z

modmesh/timeseries_dataframe.py

+
+    def read_from_text_file(
+        self,
+        txt_path: str,


Don't use type annotations in Python. It's a bad smell. If something needs typing information, it should go to C++.

Removed; the change is already in my latest commit.

modmesh/timeseries_dataframe.py

xingularity

The spaces are removed in my latest commit. A similar change is also made at line 96 in my latest commit.

yungyuc

You have not addressed all points in the previous review. When requesting the next review, please address all points.

When requesting the next review, a global comment should be added to summarize the changes. Inline comments should be added to highlight what to review.

It is ideal for each review to use a single commit. Sometimes a couple of commits are OK. But so far you are using 16. That is too many. Please squash this batch. Be kind to your reviewer.

Additional points:

Since the time series prototype should go to test file, no change should be made in __init__.py.

yungyuc · 2024-10-19T08:37:06Z

modmesh/timeseries_dataframe.py

+]
+
+
+class TimeSeriesDataFrame(object):


This class needs to be put in the test file because it is a prototype. The prototype is not mature enough to be put in modmesh namespace.

yungyuc · 2024-10-19T08:38:21Z

modmesh/__init__.py

@@ -41,6 +41,8 @@
 from . import onedim  # noqa: F401
 from . import system  # noqa: F401
 from . import toggle  # noqa: F401
+from . import timeseries_dataframe  # noqa: F401, F403


Do not suppress F403.

Noted. This line is removed since I moved prototype into test file.

modmesh/timeseries_dataframe.py

yungyuc · 2024-10-19T08:47:04Z

modmesh/timeseries_dataframe.py

+                    SimpleArrayFloat64(array=nd_arr[:, i].copy())
+                )
+
+    def __getitem__(self, column_name):


Is it a convention of Pandas or any popular package to name the argument this long?

It should only be the name for column so name alone seems to suffice. But I can be wrong.

This is a good point. Since the "[]" operator of a DataFrame usually extracts a column, we can just use "name" instead of column_name. I will change it in my next commit.

j8xixo12 · 2024-10-21T15:45:56Z

tests/test_timeseries_dataframe.py

+            os.path.join(self.DATADIR, "dlc_trimmed.csv")
+        )
+
+        one_column_data = tsdf['DATA_DELTA_VEL[1] ']


I’m wondering if the user needs to include a space when entering the column name?

Hi @j8xixo12
My design is to preserve whatever there is in the CSV. I do not want to assume what the data "should be". If there is a space in the column name in the original file, I will read as is. This is also the behavior of Pandas if you use it to read my test CSVs.

It is not user friendly to require users to remember the whitespace in a text file. It is especially difficult when the whitespace is at the end of a line. Trailing whitespace is invisible in many editors and web browsers.

In this early prototype it's OK to keep the space in key names. But I don't think it's a good design to force keeping whitespaces. We need to think about it in the future.

Hi @yungyuc

I think we can discuss this further when we have F2F time. I still want to keep the trailing whitespace, since the user can access the list of columns through the "columns" property. I think there could be a scenario where the user wants to keep the column names as is. Say the user writes an automated script that reads the column names from another file (maybe the config of the data source instruments), and the trailing spaces are there from the beginning. In that scenario the user may want the column names to stay the same.

You can keep it for now.
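
The two policies discussed in this thread (preserve the header as written, or strip surrounding whitespace) can be compared with a small sketch. The function parse_header and its strip flag are hypothetical and not part of the PR; only the name 'DATA_DELTA_VEL[1] ' comes from the test shown above, and the rest of the header line is made up for illustration.

```python
def parse_header(line, delimiter=",", strip=False):
    """Split a CSV header line into column names (hypothetical helper)."""
    # Remove only the line terminator so spaces inside names survive.
    names = line.rstrip("\r\n").split(delimiter)
    if strip:
        # Optional policy: drop leading and trailing whitespace per name.
        names = [n.strip() for n in names]
    return names


# Assumed header; the trailing spaces mimic the CSV used in the test.
header = "EPOCH,DATA_DELTA_VEL[1] ,DATA_DELTA_VEL[2] \n"

preserved = parse_header(header)
stripped = parse_header(header, strip=True)
```

With the preserving policy a lookup must use 'DATA_DELTA_VEL[1] ' including the trailing space, which is the usability concern raised by the reviewers.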

Do not mark conversation resolved. Doing it makes it harder to track.

xingularity · 2024-10-31T08:21:56Z

Hi @yungyuc and reviewers

My latest commit makes the following changes and should therefore address the issues above.

1. Moved the TimeSeriesDataFrame prototype from the modmesh module into the test file.
2. Removed timeseries_dataframe.py in the modmesh module from the repo.
3. Since 1 is completed, the __init__.py of the modmesh module is restored.
4. The previous 16 commits are squashed into one and force-pushed onto the GitHub branch.
5. Changed the parameter name in TimeSeriesDataFrame, using "name" instead of "column_name".
6. Fixed indentation in the docstring.

All CI checks have passed; please help review this PR.

yungyuc

Thanks for the update, @xingularity . Additional changes are requested:

Rebase instead of duplicating changes for the CI update.
Move the fixture contents to the test (code) file because there are not yet many lines.

yungyuc · 2024-10-31T13:49:06Z

.github/workflows/devbuild.yml

Just for this time. Please rebase; do not duplicate the change. The former is clearer than the latter.

yungyuc · 2024-10-31T13:52:07Z

tests/data/dlc_trimmed.csv

Please put the contents of the two short fixture CSV files, dlc_trimmed.csv and dlc_trimmed_header_changed.csv, in the test file test_timeseries_dataframe.py. They are short, and separate files are harder to read and maintain.

Separate fixture files can be created later, after the fixture (data) becomes larger/longer.
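
Keeping a short fixture inside the test file might look like the sketch below. The constant DLC_TRIMMED, its contents, and the test class name are hypothetical; the real contents of dlc_trimmed.csv are not shown in this thread.

```python
import io
import unittest

# Hypothetical fixture text; it stands in for the real dlc_trimmed.csv data.
DLC_TRIMMED = """\
EPOCH,DATA_DELTA_VEL[1]
1,0.25
2,0.50
"""


class InlineFixtureTC(unittest.TestCase):

    def test_rows(self):
        # Wrap the inline text in StringIO so it behaves like an open file.
        lines = io.StringIO(DLC_TRIMMED).read().splitlines()
        self.assertEqual(lines[0].split(","), ["EPOCH", "DATA_DELTA_VEL[1]"])
        self.assertEqual(len(lines) - 1, 2)


# Run the test case programmatically so the sketch is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(InlineFixtureTC)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The fixture and the assertions that depend on it then sit in one file, which is the readability benefit the review asks for.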

Hi @yungyuc

I found that the numpy function I use to generate data from text supports StringIO. I will change my function to support streaming text input, and I will also put the CSV data into the test file. I estimate this can be done on 11/3.

I need more time to verify my new implementation, so I am pushing the estimate to 11/6.

yungyuc · 2024-10-31T13:53:14Z

tests/test_timeseries_dataframe.py

+            os.path.join(self.DATADIR, "dlc_trimmed.csv")
+        )
+
+        one_column_data = tsdf['DATA_DELTA_VEL[1] ']


Do not mark conversation resolved. Doing it makes it harder to track.

xingularity · 2024-10-31T08:14:26Z

modmesh/__init__.py

__init__.py in the modmesh module is restored since the TimeSeriesDataFrame class has been moved to the test file.

xingularity · 2024-10-31T11:06:03Z

modmesh/__init__.py

@@ -41,6 +41,8 @@
 from . import onedim  # noqa: F401
 from . import system  # noqa: F401
 from . import toggle  # noqa: F401
+from . import timeseries_dataframe  # noqa: F401, F403


Noted. This line is removed since I moved prototype into test file.

xingularity

Moved the test data into the test file and removed the CSV files.
Refactored the read_from_text_file method to accept a file path (as a string), an iterable of strings, or a StringIO.
Shortened several variable names to make the code cleaner.
Column names are now stripped of whitespace.
Class members are re-initialized when reading a new file.
Modified all tests to read CSV using StringIO in the test file.
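
One way to accept a file path, an iterable of strings, or a StringIO through a single entry point is sketched below. The helper iter_lines is hypothetical and uses only the standard library; it is not the read_from_text_file implementation from the PR, which relies on numpy.

```python
import io
import os
import tempfile


def iter_lines(source):
    """Yield text lines from a path, a text stream, or an iterable of strings."""
    if isinstance(source, (str, os.PathLike)):
        # A string is treated as a file path, never as file contents.
        with open(source, "r") as fobj:
            yield from fobj
    else:
        # StringIO objects and lists of strings are both iterable by line.
        yield from source


text = "EPOCH,A\n1,0.5\n2,1.5\n"

from_stream = list(iter_lines(io.StringIO(text)))
from_list = list(iter_lines(text.splitlines(keepends=True)))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv")
    with open(path, "w") as fobj:
        fobj.write(text)
    from_path = list(iter_lines(path))
```

All three inputs produce the same sequence of lines, so the parsing code after this step does not need to know where the text came from.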

xingularity · 2024-11-04T16:21:14Z

tests/test_timeseries_dataframe.py

+            ]
+            nd_arr = np.genfromtxt(fhd, delimiter=delimiter)
+
+            self._init_members()


Re-initialize object members when reading a new file with the same object.


yungyuc

The code looks much cleaner. Thank you, @xingularity .

Please rebase to get rid of the change in the directory of .github/workflows.

yungyuc · 2024-11-04T22:52:09Z

tests/test_timeseries_dataframe.py

+class TimeSeriesDataFrame(object):
+
+    def __init__(self):
+        self._init_members()


The code would be more readable with the member data listed in __init__(), but keeping them in a separate function is OK for now in the prototype.
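
A sketch of the alternative mentioned here: list the members directly in __init__() and still avoid stale state when the same object reads a second input. The class ReaderSketch and its members are hypothetical and much simpler than the prototype in the PR.

```python
class ReaderSketch:
    """Hypothetical reader whose members are all declared in __init__."""

    def __init__(self):
        # Listing the members here shows the object's full state at a glance.
        self._columns = []
        self._rows = []

    def read(self, lines):
        # Build the new state in local variables first, then assign it,
        # so nothing from a previous read can leak into the result.
        lines = [ln.rstrip("\r\n") for ln in lines]
        columns = lines[0].split(",")
        rows = [[float(v) for v in ln.split(",")] for ln in lines[1:] if ln]
        self._columns, self._rows = columns, rows


reader = ReaderSketch()
reader.read(["A,B\n", "1,2\n"])
reader.read(["C\n", "3\n", "4\n"])  # second read replaces the first entirely
```

After the second call the object holds only the column C, which matches the effect of calling _init_members() before each read.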

yungyuc

It's rebased. I'll merge once CI finishes running.

xingularity · 2024-11-05T03:57:14Z

It's rebased. I'll merge once CI finishes running.

It looks like one of them has failed, but all the CI jobs under devbuild have passed in my fork. Quite weird.

yungyuc · 2024-11-05T04:30:40Z

It's a sensitive profiling test and passed after rerun. Merged.

yungyuc requested changes Oct 13, 2024

yungyuc assigned xingularity Oct 13, 2024

yungyuc added the enhancement (New feature or request) and array (Multi-dimensional array implementation) labels Oct 14, 2024

tigercosmos reviewed Oct 18, 2024

xingularity commented Oct 19, 2024

yungyuc requested changes Oct 19, 2024

xingularity force-pushed the feature/time-series-dataframe branch from 8769873 to 745ea15 October 20, 2024 13:02

j8xixo12 reviewed Oct 21, 2024

yungyuc requested changes Oct 31, 2024

xingularity commented Nov 2, 2024

xingularity commented Nov 4, 2024

Zong-han, Xie added 3 commits November 5, 2024 10:05

0c20a38
1. Added timeseries_dataframe module into modmesh
2. Added test data into tests/data folder
3. Added test file for timeseries_dataframe

3beac36
1. Moved TimeSeriesDataFrame prototype into test file
2. Restore __init__.py in modmesh module
3. Fixed indentation in docstring
4. Changed parameter name in the TimeSeriesDataFrame.

b5da7ff
1. Moved test data into the test file
2. Modified the implementation of read_from_text_file to support streaming and iterable text input.
3. Shorten various variable names.

xingularity force-pushed the feature/time-series-dataframe branch from b5024ae to b5da7ff November 5, 2024 02:06

yungyuc requested changes Nov 5, 2024

yungyuc approved these changes Nov 5, 2024

yungyuc merged commit ad32dd9 into solvcon:master Nov 5, 2024
12 checks passed
