Add scripts for plotting throughput measurements #375
Conversation
As a side note, I did first try using pyGnuplot directly, but the whole interface is a bit weird, and the package doesn't seem to get much support. Given that the user would have to install an additional package either way, I didn't see how installing gnuplot would be much different. Either way, I'm open to criticism.
Excellent initiative, I will take a look at this later today. 👍
First, let me make it clear that this is a very solid attempt at creating a plotting script, and it works well! However, I think the Python code is a little over-complicated, and the gnuplot could be made more robust. The following is provided as constructive comments which will hopefully help you in the future.
Starting on the Python side, I would strongly recommend not creating directories from your Python script unless absolutely necessary. It is usually a good idea to rely on the user to make the appropriate directories for you, and then you simply write to those directories. This also allows you to make the output directory configurable.
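For example, a minimal sketch of that pattern (the argument handling here is just illustrative):

import pathlib
import sys

out_dir = pathlib.Path(sys.argv[1])

# Rely on the user to have created the directory; fail early and loudly
# instead of silently creating it with os.makedirs().
if not out_dir.is_dir():
    sys.exit(f"error: output directory '{out_dir}' does not exist")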
I would also recommend not globbing the CSV files in the Python script. Rather, rely on the user's shell to do this for you. This gives the user more control over what they do and do not include. If the user has a CSV file of the wrong format in their working directory (which is not uncommon), the script breaks, and there is no way for the user to fix this. It is more elegant to use the shell's globbing, such that the user can write:
python plot_data.py *.csv
Or, if they only want to plot a subset of files:
python plot_data.py a.csv b.csv my_prefix*.csv
Thankfully, parsing arguments like this in Python is super easy. You can import the argparse module and add code like this:
import argparse
import pathlib

parser = argparse.ArgumentParser(
    description="Guilherme's data prep"
)
parser.add_argument(
    "inputs",
    nargs="+",
    type=pathlib.Path,
    help="input CSV files",
)
parser.add_argument(
    "output",
    type=pathlib.Path,
    help="output directory",
)
args = parser.parse_args()
Then, you should get into the habit of using pathlib to work with paths in Python, as this is more robust than simply working with string representations of paths. For example, instead of...
output_csv = "clean/{}".format(input_csv)
You could write...
output_csv = pathlib.Path("clean") / input_csv
...which is a little more robust.
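pathlib also gives you filename manipulation for free, which comes in handy below (the file name here is just an example):

import pathlib

input_csv = pathlib.Path("results/cuda_a100.csv")

print(input_csv.name)    # cuda_a100.csv
print(input_csv.stem)    # cuda_a100, no manual string slicing needed
print(input_csv.suffix)  # .csv
print(input_csv.with_suffix(".pdf"))  # results/cuda_a100.pdf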
There are a few comments on the use of Pandas, but I have put those inline.
Your Python script should really not be invoking gnuplot via a subprocess. Rather, leave that to the user to do via their shell. It is possible, for example, that someone might want to use your data processing script, but then plot with something other than gnuplot. Here, I recommend you follow the Unix philosophy of "do one thing and do it well".
I would also recommend putting the code that you want to run during the execution of your script in a top-level if __name__ == "__main__": block. This prevents your code from being executed if someone imports the script elsewhere. In this case, that is unlikely, but it is considered good practice in Python.
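A minimal sketch of the pattern:

def main():
    # The actual work of the script goes here.
    ...

if __name__ == "__main__":
    # This only runs when the file is executed as a script,
    # not when it is imported as a module.
    main()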
Finally, I would recommend you adopt Black as your Python formatter. It is, in my opinion, by far the superior choice and it has a variety of useful features when developing code collaboratively. Again, not super important but if you ever want (or need) to do Python this is a good habit to build up.
All in all, here is what I would write for the Python code:
#!/usr/bin/env python3

import argparse
import pathlib

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Guilherme's data prep")
    parser.add_argument(
        "inputs",
        nargs="+",
        type=pathlib.Path,
        help="input CSV files",
    )
    parser.add_argument(
        "output",
        type=pathlib.Path,
        help="output directory",
    )
    args = parser.parse_args()

    peaks = pd.DataFrame()

    for input_csv in args.inputs:
        # Create the output filename in the user-provided output directory
        output_csv = args.output / input_csv.name

        # Read the raw input data
        raw = pd.read_csv(input_csv)

        # Extract the integer mu value from the input directory name
        raw["mu"] = raw["directory"].str.extract(r"(\d+)", expand=False).astype(int)

        # Calculate the throughput
        raw["throughput (events/s)"] = (
            raw["processed_events"] * 1e9 / raw["processing_time"]
        )

        # Calculate the mean throughput and its standard deviation for each
        # (mu, threads) pair
        clean = (
            raw.groupby(["mu", "threads"])
            .agg({"throughput (events/s)": [("stddev", "std"), ("mean", "mean")]})
            .fillna(0.0)
        )

        final = clean.unstack(level=0)
        final.columns = [f"{x}_{y}" for _, x, y in final.columns.to_flat_index()]

        # Print to the output CSV file
        final.to_csv(output_csv)

        # Add the maximum values to the global comparison table
        peak = clean["throughput (events/s)"].groupby("mu").max()
        peaks[input_csv.stem] = peak["mean"]
        peaks["stddev_{}".format(input_csv.stem)] = peak["stddev"]

    # Write the peak comparison table to the output directory
    peaks.to_csv(args.output / "peaks.csv")
I understand that some of this might be hard to grasp, so please do not feel discouraged; learning Python takes time. Adopt as much or as little of this advice as you want, I am happy to put your code into the repository regardless.
Concerning the gnuplot code, I like that a lot! The only strong recommendation I would make is to refer to columns by name rather than by number. For example, you're in a situation where the column holding the mean throughput for mu 20 is some index 2i and the corresponding deviation is 2i+1, but if you insert a new column (say, mu 10), this ordering shifts completely. I would recommend writing column("throughput_mu20") or something similar instead.
extras/plot_throughput/plot_data.py (outdated)

    print("Printed", input_csv, " output to ", output_csv)

    # Add maximum values to global comparison table
    idx = clean.groupby('mu')['throughput (events/s)'].transform(max) == clean['throughput (events/s)']
This can be written more simply:
    peak = clean["throughput (events/s)"]["mean"].groupby("mu").max()
I went for something a bit different here because I wanted the stddev as well, so I needed to keep the index:

    peak = clean.loc[
        clean["throughput (events/s)"]["mean"].groupby("mu").idxmax()
    ].reset_index("threads")
Thank you for the thorough comments, I'll get to it soon enough.
As I mentioned, feel free to incorporate as much or as little of this as you want; I left this here as a learning exercise, not as a request for changes, so if you can't be bothered I will gladly approve this.
Let me ping you guys on this one.
Similar to #370, I would prefer the "main script" to be put into the extras/ directory, with a name something like extras/traccc_throughput_mt_gnuplotter.py. 😛 (Feel free to rename extras/tracccc_throughput_mt_plotter.py to extras/traccc_throughput_mt_rootplotter.py along the way. 😉)
Structurally I would also just auto-generate the .gp file in the .py script, but if you want to use the pre-made file, I can live with that. I'm even fine with that file staying in a sub-directory inside of extras/, even if I myself would do it a bit differently. 😛
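To illustrate what I mean, a rough sketch (the file names and gnuplot commands here are made up for the example):

import pathlib

def write_gnuplot_script(csv_file: pathlib.Path, gp_file: pathlib.Path) -> None:
    # Emit a self-contained gnuplot script next to the cleaned CSV, so the
    # user can tweak and re-run it without touching the Python code.
    gp_file.write_text(
        "set datafile separator ','\n"
        "set key autotitle columnhead\n"
        "set xlabel 'threads'\n"
        "set ylabel 'throughput (events/s)'\n"
        f"plot '{csv_file}' using 'threads':'mean_20' with linespoints\n"
    )

write_gnuplot_script(pathlib.Path("clean/cuda.csv"), pathlib.Path("cuda.gp"))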
Force-pushed from b7d02f8 to a771c0c.
Nooooo
I did not mean to close it, GitHub was being smarter than I wanted it to be.
Okay, give a shout if you want another review on this.
I ended up deciding not to use gnuplot but matplotlib instead, because gnuplot was limiting the functionality I wanted this to have.
@stephenswat I would like another review, yes. This still needs the visuals cleaned up so that it produces plots which aren't an eyesore, but functionality-wise it's where I wanted it to be. Of course, the primitive user interface is very easy to break, but I'm not really concerned with that.
Force-pushed from 3a3afa6 to 4d81492.
This is now ready to be reviewed.
This serves as an alternative to the plotting script using pyROOT in #344, using matplotlib instead, for users who might not have ROOT or who prefer this interface. It also works well with the script in that same PR for actually running lots of throughput measurements, which takes care of creating all of these results in an automated fashion.

It reads the logs that our throughput executables dump with the --log_results option and turns them into a series of different plots:
- throughput vs number of CPU threads
- throughput vs number of CPU threads for all devices in the same plot
- throughput vs mu for all devices, using their peak performance