Add scripts for plotting throughput measurements #375
Conversation
As a side note, I did first try using pyGnuplot directly, but the whole interface is a bit weird, and the package doesn't seem to get much support. Given that the user would have to install an additional package either way, I didn't see how installing gnuplot would be much different. Either way, I'm open to criticism.
Excellent initiative, I will take a look at this later today. 👍
First, let me make it clear that this is a very solid attempt at creating a plotting script, and it works well! However, I think the Python code is a little over-complicated, and the gnuplot could be made more robust. The following is provided as constructive comments which will hopefully help you in the future.
Starting on the Python side, I would strongly recommend not creating directories from your Python script unless absolutely necessary. It is usually a good idea to rely on the user to make the appropriate directories for you, and then you simply write to those directories. This also allows you to make the output directory configurable.
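For example, a minimal sketch of that pattern (the argument handling here is just illustrative):

import pathlib
import sys

out_dir = pathlib.Path(sys.argv[1])

# Rely on the user to have created the directory; fail early and loudly
# instead of silently creating it with os.makedirs().
if not out_dir.is_dir():
    sys.exit(f"error: output directory '{out_dir}' does not exist")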
I would also recommend not globbing the CSV files in the Python script. Rather, rely on the user's shell to do this for you. This gives the user more control over what they do and do not include. If the user has a CSV file of the wrong format in their working directory (which is not uncommon), the script breaks, and there is no way for the user to fix this. It is more elegant to use the shell's globbing, such that the user can write:
python plot_data.py *.csv
Or, if they only want to plot a subset of files:
python plot_data.py a.csv b.csv my_prefix*.csv
Thankfully, parsing arguments like this in Python is super easy. You can import the argparse module and add code like this:
import argparse
import pathlib

parser = argparse.ArgumentParser(
    description="Guilherme's data prep"
)
parser.add_argument(
    "inputs",
    nargs="+",
    type=pathlib.Path,
    help="input CSV files",
)
parser.add_argument(
    "output",
    type=pathlib.Path,
    help="output directory",
)
args = parser.parse_args()
Then, you should get into the habit of using pathlib to work with paths in Python, as this is more robust than simply working with string representations of paths. For example, instead of...
output_csv = "clean/{}".format(input_csv)
You could write...
output_csv = pathlib.Path("clean") / input_csv
...which is a little more robust.
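pathlib also gives you filename manipulation for free, which comes in handy below (the file name here is just an example):

import pathlib

input_csv = pathlib.Path("results/cuda_a100.csv")

print(input_csv.name)    # cuda_a100.csv
print(input_csv.stem)    # cuda_a100, no manual string slicing needed
print(input_csv.suffix)  # .csv
print(input_csv.with_suffix(".pdf"))  # results/cuda_a100.pdf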
There are a few comments on the use of Pandas, but I have put those inline.
Your Python script should really not be invoking gnuplot via a subprocess. Rather, leave that to the user to do via their shell. It is possible, for example, that someone might want to use your data processing script, but then plot with something other than gnuplot. Here, I recommend you follow the Unix philosophy of "do one thing and do it well".
I would also recommend putting the code that you want to run during the execution of your script in a top-level if __name__ == "__main__": block. This prevents your code from being executed if someone imports the script elsewhere. In this case, that is unlikely, but it is considered good practice in Python.
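A minimal sketch of the pattern:

def main():
    # The actual work of the script goes here.
    ...

if __name__ == "__main__":
    # This only runs when the file is executed as a script,
    # not when it is imported as a module.
    main()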
Finally, I would recommend you adopt Black as your Python formatter. It is, in my opinion, by far the superior choice and it has a variety of useful features when developing code collaboratively. Again, not super important but if you ever want (or need) to do Python this is a good habit to build up.
All in all, here is what I would write for the Python code:
#!/usr/bin/env python3

import argparse
import pathlib

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Guilherme's data prep")
    parser.add_argument(
        "inputs",
        nargs="+",
        type=pathlib.Path,
        help="input CSV files",
    )
    parser.add_argument(
        "output",
        type=pathlib.Path,
        help="output directory",
    )
    args = parser.parse_args()

    peaks = pd.DataFrame()

    for input_csv in args.inputs:
        # Create the output filename in the user-provided output directory
        output_csv = args.output / input_csv.name

        # Read the raw input data
        raw = pd.read_csv(input_csv)

        # Extract the integer mu value from the input directory name
        raw["mu"] = raw["directory"].str.extract(r"(\d+)", expand=False).astype(int)

        # Calculate the throughput
        raw["throughput (events/s)"] = (
            raw["processed_events"] * 1e9 / raw["processing_time"]
        )

        # Calculate the mean throughput and its standard deviation for each
        # (mu, threads) pair
        clean = (
            raw.groupby(["mu", "threads"])
            .agg({"throughput (events/s)": [("stddev", "std"), ("mean", "mean")]})
            .fillna(0.0)
        )

        final = clean.unstack(level=0)
        final.columns = [f"{x}_{y}" for _, x, y in final.columns.to_flat_index()]

        # Print to the output CSV file
        final.to_csv(output_csv)

        # Add the maximum values to the global comparison table
        peak = clean["throughput (events/s)"].groupby("mu").max()
        peaks[input_csv.stem] = peak["mean"]
        peaks["stddev_{}".format(input_csv.stem)] = peak["stddev"]

    # Write the peak comparison table to the output directory
    peaks.to_csv(args.output / "peaks.csv")
I understand that some of this might be hard to grasp, so please do not feel discouraged; learning Python takes time. Adopt as much or as little of this advice as you want, I am happy to put your code into the repository regardless.
Concerning the gnuplot code, I like that a lot! The only strong recommendation I would make is to refer to columns by name rather than by number. For example, you're in a situation where the column holding the mean throughput for mu 20 is some index 2i and the corresponding deviation is 2i+1, but if you insert a new column (say, mu 10), this ordering shifts completely. I would recommend writing column("throughput_mu20") or something similar instead.
extras/plot_throughput/plot_data.py (outdated)

    print("Printed", input_csv, " output to ", output_csv)

    # Add maximum values to global comparison table
    idx = clean.groupby('mu')['throughput (events/s)'].transform(max) == clean['throughput (events/s)']
This can be written more simply:
    peak = clean["throughput (events/s)"]["mean"].groupby("mu").max()
I went for something a bit different here because I wanted the stddev as well, so I needed to keep the index:

    peak = clean.loc[
        clean["throughput (events/s)"]["mean"].groupby("mu").idxmax()
    ].reset_index("threads")
Thank you for the thorough comments, I'll get to it soon enough.
As I mentioned, feel free to incorporate as much or as little of this as you want; I left this here as a learning exercise, not as a request for changes, so if you can't be bothered I will gladly approve this.
Let me ping you guys on this one.
Similar to #370, I would prefer the "main script" to be put into the extras/ directory, with a name something like extras/traccc_throughput_mt_gnuplotter.py. 😛 (Feel free to rename extras/tracccc_throughput_mt_plotter.py to extras/traccc_throughput_mt_rootplotter.py along the way. 😉)
Structurally I would also just auto-generate the .gp file in the .py script, but if you want to use the pre-made file, I can live with that. I'm even fine with that file staying in a sub-directory inside of extras/, even if I myself would do it a bit differently. 😛
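To illustrate what I mean, a rough sketch (the file names and gnuplot commands here are made up for the example):

import pathlib

def write_gnuplot_script(csv_file: pathlib.Path, gp_file: pathlib.Path) -> None:
    # Emit a self-contained gnuplot script next to the cleaned CSV, so the
    # user can tweak and re-run it without touching the Python code.
    gp_file.write_text(
        "set datafile separator ','\n"
        "set key autotitle columnhead\n"
        "set xlabel 'threads'\n"
        "set ylabel 'throughput (events/s)'\n"
        f"plot '{csv_file}' using 'threads':'mean_20' with linespoints\n"
    )

write_gnuplot_script(pathlib.Path("clean/cuda.csv"), pathlib.Path("cuda.gp"))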
Force-pushed from b7d02f8 to a771c0c.
Nooooo
I did not mean to close it, GitHub was being smarter than I wanted it to be.
Okay, give a shout if you want another review on this.
I ended up deciding not to use gnuplot but matplotlib instead, because gnuplot was limiting the functionality I wanted this to have.
@stephenswat I would like another review, yes. This still needs the visuals cleaned up so that it produces plots which aren't an eyesore, but functionality-wise it's where I wanted it to be. Of course, the primitive user interface is very easy to break, but I'm not really concerned with that.
Force-pushed from 3a3afa6 to 4d81492.
This is now ready to be reviewed.
This serves as an alternative to the plotting script using pyROOT in #344, using matplotlib instead, for users who might not have ROOT or who prefer this interface. It also works well with the script in that same PR for actually running lots of throughput measurements, which takes care of creating all of these results in an automated fashion.

It reads the logs that our throughput executables dump with the --log_results option and turns them into a series of different plots:
- throughput vs number of CPU threads
- throughput vs number of CPU threads for all devices in the same plot
- throughput vs mu for all devices, using their peak performance