Merge pull request #275 from hosseinmoein/Hossein/ThreadPoolFin

Added parallel computing logic to the following visitors: Return, Dou…
hosseinmoein · Dec 31, 2023 · d95f3a8
2 parents 0f2711d + 4bb2b59
commit d95f3a8

Showing 7 changed files with 2,686 additions and 706 deletions.
diff --git a/README.md b/README.md
@@ -68,19 +68,19 @@ The maximum dataset I could load into Polars was 300m rows per column. Any bigge
 I ran each test 4 times and took the best time. Polars numbers varied a lot from one run to another, especially calculation and selection times. C++ DataFrame numbers were significantly more consistent.
 
 ```text
-Polars:
-    Data generation/load time: 28.468640 secs
-    Calculation time:           4.876561 secs
-    Selection time:             3.876561 secs
-    Overall time:              36.876345 secs
-
 C++ DataFrame:
     Data generation/load time: 26.9459   secs
     Calculation time:           1.26015  secs
     Selection time:             0.742493 secs
     Overall time:              28.9486   secs
 
-For comparison, Pandas numbers running the same test:
+Polars:
+    Data generation/load time: 28.468640 secs
+    Calculation time:           4.876561 secs
+    Selection time:             3.876561 secs
+    Overall time:              36.876345 secs
+
+Pandas, for comparison:
     Data generation/load time: 36.678976 secs
     Calculation time:          40.326350 secs
     Selection time:             8.326350 secs

diff --git a/docs/HTML/DataFrame.html b/docs/HTML/DataFrame.html
@@ -1444,13 +1444,14 @@ <H2><font color="blue">Multithreading</font></H2>
 DataFrame uses multithreading extensively and provides granular tools to adjust your environment. Let’s divide the multithreading subject in DataFrame into two categories:<BR>
 
   <H4>1. User Multithreading</H4>
+  If you use multithreading, you are responsible for synchronizing shared resources. Generally speaking, DataFrame is not thread-safe. DataFrame has both static data and per-instance data, and both need protection when accessed from multiple threads:<BR>
     <UL>
       <LI>DataFrame uses static containers to achieve type heterogeneity. By default, these static containers are unprotected. This is by design, so by default there is no locking overhead. If you use DataFrame in a multithreaded program, you must provide a <I>SpinLock</I>, defined in the <I>ThreadGranularity.h</I> file. DataFrame will use your <I>SpinLock</I> to protect the containers.<BR>Please see above, <I>set_lock()</I>, <I>remove_lock()</I>, and <I>dataframe_tester.cc#3767</I> for a code example.</LI>
       <LI>In addition, instances of DataFrame are not thread-safe either. In other words, a single instance of DataFrame must not be used in multiple threads without protection, unless it is used as read-only.</LI>
     </UL>
-    So, In general if you as the user of DataFrame utilize multithreading, you must protect the DataFrame with a synchronization tool (i.e. SpinLock)
   <H4>2. DataFrame Internal Multithreading</H4>
-  Whether or not you, as the user, use multithreading, DataFrame utilizes a versatile thread-pool to employ parallel computing extensively in almost all its functionalities, when appropriate -- currently, most parallel algorithms trigger when number of items exceeds 250k and number of threads exceeds 2. DataFrame also gives you the interface to control and tweak that. You do not need to worry about synchronization for DataFrame internal multithreading.<BR>
+  Whether or not you, as the user, use multithreading, DataFrame can use a versatile thread-pool to run almost all of its functionality in parallel. By default, multithreading is off and all algorithms execute their single-threaded versions. To enable multithreading, call either <I>ThreadGranularity::set_optimum_thread_level()</I> (recommended) or <I>ThreadGranularity::set_thread_level(n)</I>. When multithreading is enabled, most parallel algorithms trigger when the number of data points exceeds 250k and the number of threads exceeds 2.<BR>
+  You do not need to worry about synchronization for DataFrame internal multithreading. It is handled behind the scenes.<BR>
   <UL>
     <LI> There are asynchronous versions of some methods. For example, you have sort()/sort_async(), visit()/visit_async(), ... The latter versions return a std::future and execute in parallel.<BR>If you choose to use the DataFrame async interfaces, it is highly recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I>, so that your thread-pool is populated with the optimal number of threads. Otherwise, if the thread-pool is empty, the async interfaces will add one thread to it. Having only one thread in the thread-pool could be suboptimal and hinder performance.</LI>
     <LI>As mentioned above, DataFrame uses parallel computing extensively. But by default, DataFrame is single-threaded, because by default its thread-pool is empty. If you want to take full advantage of DataFrame parallel computing, it is recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I> at the beginning of your program. Alternatively, you could call <I>ThreadGranularity::set_thread_level(n)</I> to add a custom number of threads to the thread-pool, but you should have a good reason for doing so.<BR>Thread-pool and thread level are static properties of DataFrame. Once the thread level is set, it applies to all DataFrame instances.</LI>