Added parallel computing logic to the following visitors: Sum, Mean, … #272

Merged (7 commits), Dec 19, 2023
8 changes: 4 additions & 4 deletions README.md
@@ -75,10 +75,10 @@ Polars:
 Overall time: 36.876345 secs
 
 C++ DataFrame:
-Data generation/load time: 28.8234 secs
-Calculation time: 2.30939 secs
-Selection time: 0.762463 secs
-Overall time: 31.8952 secs
+Data generation/load time: 26.9459 secs
+Calculation time: 1.26015 secs
+Selection time: 0.742493 secs
+Overall time: 28.9486 secs
 
 For comparison, Pandas numbers running the same test:
 Data generation/load time: 36.678976 secs
12 changes: 7 additions & 5 deletions benchmarks/dataframe_performance.cc
@@ -36,15 +36,17 @@ using namespace hmdf;
 using namespace std::chrono;
 
 constexpr std::size_t ALIGNMENT = 64;
-// constexpr std::size_t SIZE = 300000000;
-constexpr std::size_t SIZE = 10000000;
+constexpr std::size_t SIZE = 300000000;
+// constexpr std::size_t SIZE = 10000000;
 
 typedef StdDataFrame64<time_t> MyDataFrame;
 
 // -----------------------------------------------------------------------------
 
 int main(int, char *[]) {
 
+    MyDataFrame::set_optimum_thread_level();
+
     const auto first = high_resolution_clock::now();
     MyDataFrame df;
 
@@ -64,9 +66,9 @@ int main(int, char *[]) {
     VarVisitor<double, time_t> ln_vv;
     CorrVisitor<double, time_t> e_ln_cv;
 
-    auto mean = df.visit_async<double>("normal", n_mv);
-    auto var = df.visit_async<double>("log_normal", ln_vv);
-    auto corr = df.visit_async<double, double>("exponential", "log_normal", e_ln_cv);
+    auto mean = df.single_act_visit_async<double>("normal", n_mv);
+    auto var = df.single_act_visit_async<double>("log_normal", ln_vv);
+    auto corr = df.single_act_visit_async<double, double>("exponential", "log_normal", e_ln_cv);
 
     std::cout << mean.get().get_result() << ", "
               << var.get().get_result() << ", "
4 changes: 2 additions & 2 deletions benchmarks/polars_performance.py
@@ -4,8 +4,8 @@
 
 # ------------------------------------------------------------------------------
 
-# SIZE: int = 300000000
-SIZE: int = 10000000
+SIZE: int = 300000000
+# SIZE: int = 10000000
 
 first = datetime.datetime.now()
 df = pl.DataFrame({"normal": np.random.normal(size=SIZE),
6 changes: 3 additions & 3 deletions docs/HTML/DataFrame.html
@@ -1441,16 +1441,16 @@ <H2><font color="blue">API Reference with code samples</font></H2>
 
 <H2><font color="blue">Multithreading</font></H2>
 In general, multithreading can be very tricky. Often you think that using multithreading enhances the performance of your program when, in fact, you are hindering it. It requires measuring and careful adjustment. It is recommended to start with a single-threaded version and, when that is <I>working correctly</I>, take measurements and adjust to move to a multithreaded version.<BR>
-DataFrame uses multithreading extensively and provides granular tools to adjust your program. Let’s divide the multithreading subject in DataFrame into two categories:<BR>
+DataFrame uses multithreading extensively and provides granular tools to adjust your environment. Let’s divide the multithreading subject in DataFrame into two categories:<BR>
 
-<H4>1. User Utilizing Multithreading</H4>
+<H4>1. User Multithreading</H4>
 <UL>
 <LI>DataFrame uses static containers to achieve type heterogeneity. By default, these static containers are unprotected. This is by design, so by default there is no locking overhead. If you use DataFrame in a multithreaded program, you must provide a <I>SpinLock</I>, defined in the <I>ThreadGranularity.h</I> file, and DataFrame will use your <I>SpinLock</I> to protect the containers.<BR>Please see <I>set_lock()</I> and <I>remove_lock()</I> above, and <I>dataframe_tester.cc#3767</I>, for a code example.</LI>
 <LI>In addition, instances of DataFrame are not thread-safe either. In other words, a single instance of DataFrame must not be used from multiple threads without protection, unless it is used as read-only.</LI>
 </UL>
 So, in general, if you as the user of DataFrame utilize multithreading, you must protect the DataFrame with a synchronization tool (i.e. SpinLock).
 <H4>2. DataFrame Internal Multithreading</H4>
-Whether or not you, as the user, use multithreading, DataFrame utilizes a versatile thread-pool to employ parallel computing extensively in almost all its functionalities, when appropriate. DataFrame also gives you the interface to control and tweak that. You do not need to worry about synchronization for DataFrame internal multithreading.<BR>
+Whether or not you, as the user, use multithreading, DataFrame utilizes a versatile thread-pool to employ parallel computing extensively in almost all its functionalities, when appropriate. Currently, most parallel algorithms kick in when the number of items exceeds 250k and there are more than two threads. DataFrame also gives you the interface to control and tweak that. You do not need to worry about synchronization for DataFrame internal multithreading.<BR>
 <UL>
 <LI>There are asynchronous versions of some methods. For example, you have sort()/sort_async(), visit()/visit_async(), and so on. The latter versions return a std::future and execute in parallel.<BR>If you choose to use the DataFrame async interfaces, it is highly recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I> first, so your thread-pool is populated with the optimal number of threads. Otherwise, if the thread-pool is empty, the async interfaces will add one thread to it, and having only one thread in the thread-pool could be suboptimal and hinder performance.</LI>
 <LI>As mentioned above, DataFrame uses parallel computing extensively. But by default, DataFrame is single-threaded, because by default its thread-pool is empty. If you want to take full advantage of DataFrame parallel computing, it is recommended to call <I>ThreadGranularity::set_optimum_thread_level()</I> at the beginning of your program. Alternatively, you can call <I>ThreadGranularity::set_thread_level(n)</I> to add a custom number of threads to the thread-pool, but you should have a good reason for that.<BR>The thread-pool and thread level are static properties of DataFrame; once the thread level is set, it applies to all DataFrame instances.</LI>