Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/time series dataframe #431

Merged
merged 3 commits into from
Nov 5, 2024

Conversation

xingularity
Copy link

This PR contains the first version of a prototype, TimeSeriesDataFrame which is to preserve and extract time series data for analysis. This feature is related to #380. Current features as following:

  1. Create timeseries_dataframe.py module under modmesh.
  2. Create TimeSeriesDataFrame class in the module
  3. The TimeSeriesDataFrame utilize columnar format to store data.
  4. It uses SimpleArrayUint64 to store index column which is expected to be epoch or incremental unsigned integer.
  5. It uses a list of SimpleArrayFloat64 to store every column (except epoch column) in the CSV.
  6. It provides "columns" attribute which returns a list of string contains column names.
  7. It provides get_column method to extract data in a particular column. The return type is a ndarray.

Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave in-line comments for what to review. Have all CI passing before requesting for review. (You can turn on Github Actions in your fork. It is also productive to run linter locally.)

import os
import numpy as np

from modmesh import SimpleArrayUint64, SimpleArrayFloat64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When inside modmesh, use relative import. You need the .core module for the relative import.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit contains this modification

]


class TimeSeriesDataFrame(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the very early stage prototype code, put it in unit tests. It also allows you to add basic tests for your prototype.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test file and data have been added into the repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class needs to be put in the test file because it is a prototype. The prototype is not mature enough to be put in modmesh namespace.


def read_from_text_file(
self,
txt_path: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use type annotation in Python. It's bad smell. If something needs typing information, it should go to C++.

Copy link
Author

@xingularity xingularity Oct 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and already contained in my latest commit

@yungyuc yungyuc added enhancement New feature or request array Multi-dimensional array implementation labels Oct 14, 2024
Copy link
Author

@xingularity xingularity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spaces are removed in my latest commit. Similar change is also made to Line No. 96 in my latest commit.

Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have not addressed all points in the previous review. When requesting for the next review, please address all points.

When requesting for the next review, a global comment should be added and summarize the changes. Inline comments should be added to highlight what to review.

It is ideal for each review to use a commit. Sometimes a couple of commits are OK. But so far you are using 16. They are too many. Please do squash for this batch. Be kind to your reviewer.

Additional points:

  • Since the time series prototype should go to test file, no change should be made in __init__.py.

]


class TimeSeriesDataFrame(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class needs to be put in the test file because it is a prototype. The prototype is not mature enough to be put in modmesh namespace.

@@ -41,6 +41,8 @@
from . import onedim # noqa: F401
from . import system # noqa: F401
from . import toggle # noqa: F401
from . import timeseries_dataframe # noqa: F401, F403
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not surpress F403.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. This line is removed since I moved prototype into test file.

modmesh/timeseries_dataframe.py Outdated Show resolved Hide resolved
SimpleArrayFloat64(array=nd_arr[:, i].copy())
)

def __getitem__(self, column_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a convention of Pandas or any popular package to name the argument this long?

It should only be the name for column so name alone seems to suffice. But I can be wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. Since in the normal practice of a DataFrame, the "[]" operator usually extracts a column, we can just use "name" instead of column_name. I will change in my next commit.

os.path.join(self.DATADIR, "dlc_trimmed.csv")
)

one_column_data = tsdf['DATA_DELTA_VEL[1] ']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m wondering if the user needs to include a space when entering the column name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @j8xixo12
My design is to preserve whatever there is in the CSV. I do not want to assume what the data "should be". If there is a space in the column name in the original file, I will read as is. This is also the behavior of Pandas if you use it to read my test CSVs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not user friendly to require them remember the whitespaces in text file. It is especially difficult when the whitespaces are at the end of a line. Trailing whitespaces are invisible in many editors and web browsers.

In this early prototype it's OK to keep the space in key names. But I don't think it's a good design to force keeping whitespaces. We need to think about it in the future.

Copy link
Author

@xingularity xingularity Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yungyuc

I think we can discuss further when we have F2F time, and I still want to keep the trailing white spaces since user can access the list of columns through "columns" property. I think there could be scenario where user wants to keep the column names as is. Let's say user compose an automatic script to read the column names from another file (maybe the config of the data source instruments) and the trailing spaces is there in the beginning. Under this scenario the user may want the column name stays the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can keep it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not mark conversation resolved. Doing it makes it harder to track.

@xingularity
Copy link
Author

Hi @yungyuc and reviewers

My latest commit should have made following changes, and therefore, should have already addressed above issues.

  1. I have moved the TimeSeriesDataFrame prototype from modmesh module into the test file.
  2. Removed timeseries_dataframe.py in modmesh module from repo.
  3. Since 1 is completed, the init.py of the modmesh module is restored.
  4. Previous 16 commits are squashed into one, and force pushed onto github branch.
  5. Changed parameter name in the TimeSeriesDataFrame. using "name" instead of column name.
  6. Fixed indentation in docstring.

All the CIs are passed, please help to review this PR.

Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, @xingularity . Additional changes are requested:

  • Rebase instead of duplicate changes for CI update.
  • Move the fixture contents to the test (code) file because there are not yet many lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for this time. Please rebase, do not duplicate change. The former is clearer than the latter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the contents of the two short fixture csv files dlc_trimmed.csv and dlc_trimmed_header_changed.csv in the test file test_timeseries_dataframe.py. They are short and distinct files are harder to read and maintain.

Distinct fixture files can be created later after the fixture (data) become larger/longer.

Copy link
Author

@xingularity xingularity Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @yungyuc

I found the numpy function I used to generate data from text supported StringIO. I will change my function to support streaming text input, and I will also put the CSV data into test file. I estimate this can be done on 11/3.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need more time to verify my new implementation, push the estimated time to 11/6.

os.path.join(self.DATADIR, "dlc_trimmed.csv")
)

one_column_data = tsdf['DATA_DELTA_VEL[1] ']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not mark conversation resolved. Doing it makes it harder to track.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init.py in modmesh module is restored since TimeSeriesDataFrame class has been moved to test file.

@@ -41,6 +41,8 @@
from . import onedim # noqa: F401
from . import system # noqa: F401
from . import toggle # noqa: F401
from . import timeseries_dataframe # noqa: F401, F403
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. This line is removed since I moved prototype into test file.

Copy link
Author

@xingularity xingularity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Moved test data into the test file and removed CSV files.
  2. Refactored read_from_text_file method to accept file path (as string), iterable of strings or StringIO.
  3. Shorten several variable names to make code cleaner.
  4. column name now strips white spaces.
  5. re-initiate class members when reading a new file.
  6. Modify all tests to read CSV using StringIO in test file.

]
nd_arr = np.genfromtxt(fhd, delimiter=delimiter)

self._init_members()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-initiate object members when reading a new file using same object.

Zong-han, Xie added 3 commits November 5, 2024 10:05
2. Added test data into tests/data folder
3. Added test file for timeseries_dataframe
2. Restore __init__.py in modmesh module
3. Fixed indentation in docstring
4. Changed parameter name in the TimeSeriesDataFrame.
2. Modified the implementation of read_from_text_file to support streaming and iterable text input.
3. Shorten various variable names.
Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks much cleaner. Thank you, @xingularity .

  • Please rebase to get rid of the change in the directory of .github/workflows.

class TimeSeriesDataFrame(object):

def __init__(self):
self._init_members()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code will be more readable to have member data listed in __init__(), but keeping them in a separate function is OK for now in the prototype.

Copy link
Member

@yungyuc yungyuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's rebased. I'll merge once CI finishes running.

@xingularity
Copy link
Author

It's rebased. I'll merge once CI finishes running.

It looks like one of them has failed. But the all CI under devbuild have passed in my fork repo. Quiet weird.

@yungyuc yungyuc merged commit ad32dd9 into solvcon:master Nov 5, 2024
12 checks passed
@yungyuc
Copy link
Member

yungyuc commented Nov 5, 2024

It's a sensitive profiling test and passed after rerun. Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
array Multi-dimensional array implementation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants