Expand section on profilers (perf and VTune) #381
base: master
Commit: c416958
Is this useful? I think I would drop it.
I use a slide similar to this one to give a general overview of perf in my own presentations, mentioning that there are more commands than the ones I cover. If you don't want to go into details, this could be a useful slide for that; otherwise, it's probably fine to drop. I did have to shorten the descriptions of the commands to fit them on the slide anyway, so this is not quite what you'd get by running perf without arguments.
At first I also found this too much. On second thought, why shouldn't we leave an overview here?
The point is that this slide would be systematically skipped when you present. So if it's a pure reference, then let's put it in a reference section at the very end. Otherwise, let's drop it.
Useful indeed, but then I would mention that there are a lot of commands rather than list them.
It seems that most people don't think it's useful, so I will drop this slide.
What is an `on-CPU` function? Does this relate to heterogeneous computing, in the sense that you don't profile GPU functions? I just tried that command and it counted `cycles`. So maybe:
`perf` cannot take samples when the process is not running. That's why this is usually referred to as on-CPU sampling: samples are taken only while threads are scheduled on some CPU. However, you can also trace scheduling events to see what is going on while threads are off-CPU (i.e., between being scheduled out and being scheduled back in). See https://www.brendangregg.com/offcpuanalysis.html for more information.
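A minimal sketch of the on-CPU/off-CPU distinction, assuming a Linux system with `perf` installed. The helper names `record_on_cpu` and `record_sched_switches` are hypothetical. System-wide recording and tracepoints may need root or a relaxed `kernel.perf_event_paranoid` setting. The first helper does ordinary on-CPU sampling of a command. The second records the `sched:sched_switch` tracepoint on all CPUs, which is one way to see when threads are scheduled out and back in.

```shell
# Hypothetical helper names; only the perf command lines inside matter.

# On-CPU sampling of a command: samples (with call stacks, -g) are only
# taken while the command's threads are running on some CPU.
record_on_cpu() {
  perf record -g -- "$@"
}

# Off-CPU hints: record every context switch on all CPUs (-a), with call
# stacks (-g). The sleep only sets how long the recording lasts (default 10 s).
record_sched_switches() {
  perf record -e sched:sched_switch -a -g -- sleep "${1:-10}"
}
```

For example, `record_sched_switches 5` followed by `perf script` lists the recorded context switches.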
I am starting to wonder whether it's worth keeping examples that cannot be understood simply. The explanation you just gave is already far beyond the expected knowledge of the people attending the course. To explain it, you would need a whole set of slides, starting with "thread scheduling", "sampling", etc.
Is the `sleep 10` here the command to be profiled, or a trick to profile something system-wide? Sorry for my limited knowledge.
Ah, good catch. I did intend to have `-a` to capture things system-wide, but the command as written records data only for the `sleep` command.
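A minimal sketch of the difference discussed here, assuming a Linux system with `perf`. The helper names are hypothetical, and `-a` may need root or a relaxed `kernel.perf_event_paranoid`. Without `-a`, `sleep` is the profiled command. With `-a`, every CPU is sampled and `sleep` only determines how long the recording lasts.

```shell
# Hypothetical helper names; only the perf command lines inside matter.

# Profiles the `sleep` process itself, which yields almost no samples
# because sleep spends its time off-CPU.
record_sleep_only() {
  perf record -- sleep "${1:-10}"
}

# Samples all CPUs (-a) for every process; `sleep` is just the timer.
record_system_wide() {
  perf record -a -- sleep "${1:-10}"
}
```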
Here it is even more surprising to me. The PID should give the process to profile, so what does the `sleep 10` do? Is there no flag to tell perf to count for 10 s? The current command line is surprising to me.
Yes, the `sleep` command is only used to give perf the start/stop timings (using `sleep` is a very common idiom with perf, as there's no other easy way to tell perf when to stop). The profiled process is actually the one given by `<PID>`.
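A minimal sketch of the idiom just described, assuming a Linux system with `perf` and permission to attach to the target process. The helper name `stat_pid` is hypothetical. `-p` selects the already running process to measure, and the trailing `sleep` only sets the measurement window.

```shell
# Hypothetical helper name. Usage: stat_pid PID [SECONDS]
# Counts perf's default set of events for the process given by PID;
# perf stays attached until `sleep` exits (default 10 s).
stat_pid() {
  perf stat -p "${1:?usage: stat_pid PID [SECONDS]}" -- sleep "${2:-10}"
}
```

Without a trailing command, `perf stat -p <PID>` keeps counting until it is interrupted with Ctrl-C.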
And here we assume that people are at ease with frame pointers (previous line) and DWARF. That would require another set of slides by itself. I am more and more convinced that we should simplify drastically and give only one slide of examples, with one line for each of list/stat/record/report.
I tend to agree with @sponce. Maybe I'm assuming too much prior knowledge that the average student doesn't (or won't) have. In that case, showing just the simplest case, which is to collect and view a report using the default `cycles` event, is good enough for the course, and we can point people to other sets of slides when more advanced material is needed.
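A minimal sketch of that simplest case, assuming a Linux system with `perf`. The helper name `profile_simple` is hypothetical. It records samples for a command with perf's default event (usually `cycles`), which writes `perf.data` to the current directory. It then opens the report.

```shell
# Hypothetical helper name. Usage: profile_simple ./my_program [ARGS...]
# Record with the default event, then browse the results.
profile_simple() {
  perf record -- "$@" && perf report
}
```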
But I'm sure HSF people would love to create a full course dedicated to perf. And I promise I would be one of your first students :-)
I've given a few talks here and there, so I have many slides on perf (not using LaTeX, though). I could think about converting the material I have into a course on performance analysis, including other lesser-known tools like bpftrace, uftrace, bcc, etc. That said, perf by itself is more than enough for a full course, as I doubt many people have used `perf data`, `perf c2c`, `perf mem`, and other less well-known commands. Plus there is the post-processing and data visualization, which is also interesting (gprof2dot, flamegraph, d3js).
What is an `on-CPU stack trace`? And what is `skid`? And what's `PEBS`? :) I am asking because a future presenter of these slides might not know this. Is all of this information relevant?
Maybe we need a slide introducing some terms of art and defining the acronyms. Or a glossary.
I explained on-CPU above. Basically, there is a margin of error when attributing samples to instructions, because many instructions are in flight in parallel on the CPU at any given time. This error is called the skid of the sampling (see more information here). PEBS stands for Precise Event-Based Sampling, a feature of Intel CPUs that allows sampling with low or no skid. The rough equivalent on AMD CPUs is IBS, or Instruction-Based Sampling.
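A minimal sketch of requesting low-skid sampling, assuming a Linux system with `perf` on a CPU that supports precise sampling. The helper name `record_precise` is hypothetical. It uses the `p` event modifier documented in `man perf-list`, where each added `p` requests a higher precision level and `:pp` requests zero skid. On Intel CPUs this is served by PEBS.

```shell
# Hypothetical helper name. Usage: record_precise ./my_program [ARGS...]
# cycles:pp asks perf for precise (ideally zero-skid) samples.
record_precise() {
  perf record -e cycles:pp -- "$@"
}
```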
I hope that someone presenting `perf` to others will read the manual pages and understand these examples ahead of time. I tried to give a general overview of how to do several different things with each of the most important commands, so of course I think what I added is relevant information for people trying to use `perf`. Maybe this is all too complicated for a C++ course, and we should really just point people to the actual documentation or other material instead. I'm starting to think that would be easier.
Do we need a tools section in the expert part? That could be a solution.
Maybe a tools course, separate from a C++ course. VTune, perf, valgrind, can all be used for much more than just C++, so we can bundle this together with bash, coreutils, and some other command line tools that are used very often and make a new course.
Isn't `-a` system-wide sampling? Why do I need a `<command>` then? What is `IBS`?
IBS is explained above. System-wide sampling is a hardware requirement when using IBS on AMD CPUs. This is also explained in `perf`'s documentation (see `man perf-list`). I added this example to show how to use event modifiers and to remind people that IBS requires system-wide sampling to work.
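A minimal sketch in the spirit of the IBS examples in `man perf-list`, assuming an AMD CPU with IBS, a Linux system with `perf`, and permission for system-wide recording. The helper name `record_ibs_cycles` is hypothetical. On AMD, the precise modifier on `cpu-cycles` maps to IBS op sampling, which needs `-a`. The trailing `sleep` answers the question above: it is not profiled on its own and only sets how long the system-wide recording lasts.

```shell
# Hypothetical helper name. Usage: record_ibs_cycles [SECONDS]
# System-wide (-a) IBS-backed cycle sampling for the given duration.
record_ibs_cycles() {
  perf record -a -e cpu-cycles:p -- sleep "${1:-10}"
}
```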
Does the picture bring anything for people who don't know the tool? I would maybe replace it with one bullet highlighting the things it can do that perf cannot (if any) and another giving the downsides.
Since VTune is a graphical tool, I thought it would be nice to show what it looks like when you open it. You can use the picture to show the types of analyses that VTune is able to do instead of a bullet list, and just tell people when presenting about the extra features it has over perf. For detailed usage information, I'd point people to the online docs. One thing I'd mention while presenting is the Top-Down Microarchitecture Analysis, which is a very good method to find bottlenecks. While perf can also do it, it cannot show you detailed information for each symbol like VTune does, and the annotation of source code by VTune is also a lot easier to use than perf's.
We could also link a talk from Ahmad Yasin, who was behind the creation of the Top-Down Microarchitecture Analysis Method at Intel. It's a very nice talk.
I would even like to have more pictures. E.g., I love the microarchitecture analysis with the pipeline visualization. Or what a general hierarchical profile looks like. Or the pane showing contention between threads. Or, even better, a live demonstration :)
I do not care about pictures themselves. I care that if there is a picture, it's understandable, that is, that we explain what appears there. In this case, a LOT of explanation is missing, and I'm not sure we actually want to include it.