Commit e2dfc6f

Extra notes on parallelization efficiencies (#1046)
1 parent 3ec96dc commit e2dfc6f

2 files changed

+26
-0
lines changed

docs/text/faq.rst

+4
@@ -46,3 +46,7 @@ FAQ
4646
Beyond sorting, tsfresh does not use the timestamp in calculations.
4747
While many features do not need a timestamp (or only need it for ordering), others will assume that observations are evenly spaced in time (e.g., one second between each observation).
4848
Since tsfresh ignores spacing, care should be taken when selecting features to use with a highly irregular series.
49+
50+
6. **Even when extracting only the features in :class:`tsfresh.feature_extraction.settings.EfficientFCParameters`, tsfresh takes a long time to run. Is there anything further I can do to speed up the processing?**
51+
52+
If you are using parallelization (the default option), check that you are not over-provisioning your available CPU cores. See :ref:`notes-for-efficient-parallelization-label` for steps to avoid this, which can speed up processing significantly.

docs/text/tsfresh_on_a_cluster.rst

+22
@@ -207,3 +207,25 @@ If you want to use other framework instead of Dask, you will have to write your
207207
To construct your custom Distributor, you need to define an object that inherits from the abstract base class
208208
:class:`tsfresh.utilities.distribution.DistributorBaseClass`.
209209
The :mod:`tsfresh.utilities.distribution` module contains more information about what you need to implement.
210+
211+
Notes for efficient parallelization
212+
'''''''''''''''''''''''''''''''''''
213+
214+
By default, tsfresh uses parallelization to distribute the single-threaded Python code across the multiple cores available on the host machine.
215+
216+
However, this can create an issue known as over-provisioning. Many of the underlying Python libraries (e.g. numpy) used in the feature calculators have C implementations for their low-level processing. Those *also* try to spread their workload across as many cores as are available, which conflicts with the parallelization done by tsfresh.
217+
218+
Over-provisioning is inefficient because more threads compete for the cores than the cores can run at once, so time is lost to repeated context switching.
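The scale of the problem can be estimated with a little arithmetic. The following is only an illustrative sketch: the helper ``worst_case_threads`` is hypothetical (it is not part of tsfresh), and it assumes one worker process per core and that each native library sizes its thread pool to the number of cores, which may not match your setup.

```python
import os


def worst_case_threads(n_workers: int, threads_per_worker: int) -> int:
    """Upper bound on runnable threads if every worker uses its full native thread pool."""
    return n_workers * threads_per_worker


cores = os.cpu_count() or 1

# Assumption: one worker process per core, each native library using `cores` threads.
unconstrained = worst_case_threads(cores, cores)
# With the native libraries limited to a single thread per worker.
constrained = worst_case_threads(cores, 1)

print(f"{cores} cores: up to {unconstrained} threads unconstrained, "
      f"{constrained} with one thread per worker")
```

Under these assumptions the number of competing threads grows with the square of the core count, which is consistent with the problem being worse on larger machines.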
219+
220+
This issue can be solved by constraining the C libraries to a single thread each, using the following environment variables:
221+
222+
.. code:: python
223+
224+
   import os
225+
   os.environ['OMP_NUM_THREADS'] = "1"
226+
   os.environ['MKL_NUM_THREADS'] = "1"
227+
   os.environ['OPENBLAS_NUM_THREADS'] = "1"
228+
229+
Put these lines at the beginning of your notebook or Python script, before you call any tsfresh code or import any other module.
230+
231+
The more cores your host machine has, the greater the speed-up from these environment changes. Speed increases of between 6x and 26x have been observed, depending on the type of host machine.
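Because the ordering matters, it can help to wrap the settings in a small helper that also reports when they were applied too late. This is a minimal sketch, not part of tsfresh: ``limit_native_threads`` and ``THREAD_ENV_VARS`` are hypothetical names, and the check assumes that numpy is the first library to load the native backends, which typically read these variables only once when they are initialized.

```python
import os
import sys

# The three variables named above; each caps the thread pool of one native backend.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_native_threads(n_threads: int = 1) -> bool:
    """Set the thread-limit variables; return False if numpy was already imported."""
    too_late = "numpy" in sys.modules
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    return not too_late


in_time = limit_native_threads(1)
if not in_time:
    print("numpy was imported first; restart the interpreter for the limits to apply")

# Only after this point import the numerical stack and tsfresh, for example:
# from tsfresh import extract_features
```

In a notebook, a ``False`` return value usually means the kernel needs to be restarted and the helper called in the very first cell.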
