-
Hi @TheFilS,
I would recommend first using the efficient parameter set (EfficientFCParameters, if you don't need the high-computational-cost features) or reducing the chunk size, and only then, if it still takes too long, trying parallelization in the cloud.
-
Hi all,
This is my first go at using tsfresh, and although everything seems to be working, it's quite slow, so I'm looking for tips on how to speed up the process. Ultimately, I want to run feature extraction on wav files (sampling rate ~48 kHz, 20-second files) from a growing database that currently holds about 50 files. Because of the number of data points, I'm reducing the streams into chunks using:
Unfortunately, this takes a lot of time: 2+ hours per file.
Running feature extraction without splitting like this fails with memory errors because of the matrix size.
Does anyone have a suggestion for speeding this up? Should I do some sort of parallelization with Azure/AWS, or is my strategy just terrible?
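For context, one common way to do the chunking described above is to split each signal into fixed-size windows and stack them in the long (id, time, value) layout that tsfresh's extract_features expects. The helper name, chunk size, and column names below are illustrative assumptions, not code from the thread:

```python
import numpy as np
import pandas as pd

def chunk_to_long(signal: np.ndarray, chunk_size: int, file_id: str) -> pd.DataFrame:
    """Split a 1-D signal into fixed-size chunks and stack them in the
    long (id, time, value) layout used by tsfresh. Trailing samples that
    do not fill a whole chunk are dropped."""
    n_chunks = len(signal) // chunk_size
    trimmed = signal[: n_chunks * chunk_size]
    chunks = trimmed.reshape(n_chunks, chunk_size)
    return pd.DataFrame({
        # one unique id per chunk, repeated for every sample in that chunk
        "id": np.repeat([f"{file_id}_{i}" for i in range(n_chunks)], chunk_size),
        # sample index restarts at 0 within each chunk
        "time": np.tile(np.arange(chunk_size), n_chunks),
        "value": chunks.ravel(),
    })

# 20 s at 48 kHz would be 960_000 samples; a toy signal keeps this runnable.
sig = np.arange(10, dtype=float)
long_df = chunk_to_long(sig, chunk_size=4, file_id="file0")
print(long_df.shape)  # (8, 3): two full chunks of 4, last 2 samples dropped
```

Smaller chunks mean more rows in the id column but a much smaller per-chunk matrix, which is what trades memory pressure against total runtime here.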