Merge pull request #275 from hosseinmoein/Hossein/ThreadPoolFin
Added parallel computing logic to the following visitors: Return, Dou…
hosseinmoein authored Dec 31, 2023
2 parents 0f2711d + 4bb2b59 commit d95f3a8
Showing 7 changed files with 2,686 additions and 706 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -68,19 +68,19 @@ The maximum dataset I could load into Polars was 300m rows per column. Any bigge
I ran each test 4 times and took the best time. Polars numbers varied a lot from one run to another, especially calculation and selection times. C++ DataFrame numbers were significantly more consistent.

```text
-Polars:
-Data generation/load time: 28.468640 secs
-Calculation time: 4.876561 secs
-Selection time: 3.876561 secs
-Overall time: 36.876345 secs
C++ DataFrame:
Data generation/load time: 26.9459 secs
Calculation time: 1.26015 secs
Selection time: 0.742493 secs
Overall time: 28.9486 secs
-For comparison, Pandas numbers running the same test:
+Polars:
+Data generation/load time: 28.468640 secs
+Calculation time: 4.876561 secs
+Selection time: 3.876561 secs
+Overall time: 36.876345 secs
+Pandas, for comparison:
Data generation/load time: 36.678976 secs
Calculation time: 40.326350 secs
Selection time: 8.326350 secs
```
5 changes: 3 additions & 2 deletions docs/HTML/DataFrame.html
@@ -1444,13 +1444,14 @@ <H2><font color="blue">Multithreading</font></H2>
DataFrame uses multithreading extensively and provides granular tools to adjust your environment. Let’s divide the multithreading subject in DataFrame into two categories:<BR>

<H4>1. User Multithreading</H4>
If you use multithreading, you are responsible for synchronization of shared resources. Generally speaking, DataFrame is not thread-safe. DataFrame has static data and per-instance data, both of which need protection when shared across threads. Keep reading below:<BR>
<UL>
<LI>DataFrame uses static containers to achieve type heterogeneity. By default, these static containers are unprotected. This is by design, so by default there is no locking overhead. If you use DataFrame in a multithreaded program, you must provide a <I>SpinLock</I>, defined in the <I>ThreadGranularity.h</I> file. DataFrame will use your <I>SpinLock</I> to protect the containers.<BR>Please see above, <I>set_lock()</I>, <I>remove_lock()</I>, and <I>dataframe_tester.cc#3767</I> for a code example.</LI>
<LI>In addition, instances of DataFrame are not thread-safe either. In other words, a single instance of DataFrame must not be used in multiple threads without protection, unless it is used as read-only.</LI>
</UL>
So, in general, if you as the user of DataFrame utilize multithreading, you must protect the DataFrame with a synchronization tool (e.g. SpinLock).
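The rule for per-instance data can be illustrated with a minimal standard-library sketch. It does not use the DataFrame API: <I>SharedTable</I>, <I>append_rows()</I> and <I>fill_from_threads()</I> are hypothetical names, a plain std::vector stands in for a DataFrame instance, and std::mutex is swapped in for the library's SpinLock purely to keep the example self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one DataFrame instance shared by several threads.
struct SharedTable {
    std::vector<double> column;  // the shared, mutable data
    std::mutex          guard;   // user-provided protection (std::mutex here, not SpinLock)
};

// Every writer takes the lock before touching the shared container.
void append_rows(SharedTable &table, std::size_t count, double value) {
    for (std::size_t i = 0; i < count; ++i) {
        std::lock_guard<std::mutex> lock(table.guard);
        table.column.push_back(value);
    }
}

// Starts thread_count writers against the same table and returns the final row count.
std::size_t fill_from_threads(std::size_t thread_count, std::size_t rows_per_thread) {
    SharedTable              table;
    std::vector<std::thread> workers;

    workers.reserve(thread_count);
    for (std::size_t t = 0; t < thread_count; ++t)
        workers.emplace_back(append_rows, std::ref(table), rows_per_thread,
                             static_cast<double>(t));
    for (auto &worker : workers)
        worker.join();
    return table.column.size();
}
```

With the lock in place, <I>fill_from_threads(4, 1000)</I> always ends with 4000 rows. Without it, the concurrent push_back calls would be a data race. Read-only access from several threads needs no such lock, matching the exception noted above.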
<H4>2. DataFrame Internal Multithreading</H4>
-Whether or not you, as the user, use multithreading, DataFrame utilizes a versatile thread-pool to employ parallel computing extensively in almost all its functionalities, when appropriate -- currently, most parallel algorithms trigger when number of items exceeds 250k and number of threads exceeds 2. DataFrame also gives you the interface to control and tweak that. You do not need to worry about synchronization for DataFrame internal multithreading.<BR>
+Whether or not you, as the user, use multithreading, DataFrame utilizes a versatile thread-pool to employ parallel computing extensively in almost all its functionalities. By default, there is no multithreading. All algorithms execute their single-threaded version. To enable multithreading, call either <I>ThreadGranularity::set_optimum_thread_level()</I> (recommended) or <I>ThreadGranularity::set_thread_level(n)</I>. When multithreading is enabled, most parallel algorithms trigger when the number of data points exceeds 250k and the number of threads exceeds 2.<BR>
+You do not need to worry about synchronization for DataFrame internal multithreading. It is done behind the scenes and unbeknown to you.<BR>
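The trigger condition described above (more than 250k data points and more than 2 threads) can be sketched as a small dispatch function. This is an illustration under stated assumptions, not DataFrame's implementation: <I>use_parallel_path()</I>, <I>sum_values()</I> and the two constants are hypothetical names, and std::async stands in for the library's thread-pool.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Thresholds taken from the text above; the constant names are hypothetical.
constexpr std::size_t kParallelItemThreshold   = 250'000;
constexpr std::size_t kParallelThreadThreshold = 2;

// True only when both the data size and the thread level exceed their thresholds.
constexpr bool use_parallel_path(std::size_t item_count, std::size_t thread_level) noexcept {
    return item_count > kParallelItemThreshold &&
           thread_level > kParallelThreadThreshold;
}

// Sums the values, choosing the serial or the chunked parallel version.
// A thread_level of 0 models the default, where no multithreading is enabled.
long long sum_values(const std::vector<int> &values, std::size_t thread_level) {
    if (!use_parallel_path(values.size(), thread_level))
        return std::accumulate(values.begin(), values.end(), 0LL);

    const std::size_t                   chunk = values.size() / thread_level;
    std::vector<std::future<long long>> parts;

    parts.reserve(thread_level);
    for (std::size_t i = 0; i < thread_level; ++i) {
        const auto first = values.begin() + static_cast<std::ptrdiff_t>(i * chunk);
        // The last chunk also takes any remainder left by the integer division.
        const auto last  = (i + 1 == thread_level)
                               ? values.end()
                               : first + static_cast<std::ptrdiff_t>(chunk);

        parts.push_back(std::async(std::launch::async, [first, last] {
            return std::accumulate(first, last, 0LL);
        }));
    }

    long long total = 0;
    for (auto &part : parts)
        total += part.get();
    return total;
}
```

Both paths return the same result; only the execution strategy differs. A call such as <I>sum_values(data, 0)</I> always runs serially, which mirrors the default behavior before a thread level is set.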
<UL>
<LI> There are asynchronous versions of some methods. For example, you have sort()/sort_async(), visit()/visit_async(), ... The latter versions return a std::future and would execute in parallel.<BR>If you choose to use DataFrame async interfaces, it is highly recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I>, so your thread-pool is populated with an optimal number of threads. Otherwise, if the thread-pool is empty, async interfaces will add one thread to it. Having only one thread in the thread-pool could be suboptimal and hinder performance.</LI>
<LI>As mentioned above, DataFrame uses parallel computing extensively. But by default, DataFrame is single-threaded, because by default its thread-pool is empty. If you want to take full advantage of DataFrame parallel computing, it is recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I> at the beginning of your program. Alternatively, you could call <I>ThreadGranularity::set_thread_level(n)</I> to add a custom number of threads to the thread-pool. But you had better have a good reason for that.<BR>Thread-pool and thread level are static properties of DataFrame. Once the thread level is set, it applies to all DataFrame instances.</LI>