Expand section on profilers (perf and VTune) #381

Open
wants to merge 1 commit into
base: master
239 changes: 214 additions & 25 deletions talk/tools/profiling.tex
@@ -4,39 +4,228 @@
\frametitle{Profiling}
\begin{block}{Conceptually}
\begin{itemize}
\item take a measurement of a performance aspect of a program
\item Take a measurement of a performance aspect of a program
\begin{itemize}
\item where in my code is most of the time spent?
\item is my program compute or memory bound?
\item does my program make good use of the cache?
\item is my program using all cores most of the time?
\item how often are threads blocked and why?
\item which API calls are made and in which order?
\item Where in my code is most of the time spent?
\item Is my program compute or memory bound?
\item Does my program make good use of the cache?
\item Is my program using all cores most of the time?
\item How often are threads blocked and why?
\item Which API calls are made and in which order?
\item ...
\end{itemize}
\item the goal is to find performance bottlenecks
\item is usually done on a compiled program, not on source code
\item The goal is to find performance bottlenecks
\item Usually done on a compiled program, not on source code
\end{itemize}
\end{block}
\end{frame}

\begin{frame}[fragile]
\frametitle{perf, VTune and uProf}
\begin{block}{perf}
\frametitle{\mintinline{bash}{perf} -- Performance analysis tools for Linux}
\setlength{\leftmargini}{0pt}
\begin{itemize}
\item perf is a powerful command line profiling tool for linux
\item compile with \mintinline{bash}{-g -fno-omit-frame-pointer}
\item \mintinline{bash}{perf stat -d <prg>} gathers performance statistics while running \mintinline{bash}{<prg>}
\item \mintinline{bash}{perf record -g <prg>} starts profiling \mintinline{bash}{<prg>}
\item \mintinline{bash}{perf report} displays a report from the last profile
\item More information in \href{https://perf.wiki.kernel.org/index.php/Main_Page}{this wiki}, \href{https://www.brendangregg.com/linuxperf.html}{this website} or \href{https://indico.cern.ch/event/980497/contributions/4130271/attachments/2161581/3647235/linux-systems-performance.pdf}{this talk}.
\item Powerful command line profiling tool for Linux
\item Not portable: the source code is part of the Linux kernel itself
\item Much lower overhead compared with \mintinline{bash}{valgrind}
\item In order to profile your code, make sure to compile with
\texttt{CXXFLAGS="-O2 -g -fno-omit-frame-pointer"}
sponce marked this conversation as resolved.
\item Counting and sampling
\begin{itemize}
\item Counting -- count occurrences of a given event (e.g.\ cache misses)
\item Time-based sampling -- sample the stack at regular time intervals
\item Event-based sampling -- take samples when event counter overflows
\item Instruction-based sampling -- sample instructions and precisely count events they create
\end{itemize}
amadio marked this conversation as resolved.
\item Static and dynamic tracing
\begin{itemize}
\item Static -- pre-defined tracepoints in software (e.g.\ scheduling events)
\item Dynamic -- tracepoints created dynamically with \mintinline{bash}{perf probe}
\end{itemize}
\end{itemize}
\end{block}
\begin{block}{Intel VTune and AMD uProf}
\begin{itemize}
\item Graphical profilers from CPU vendors with rich features
\item Needs vendor's CPU for full experience
\item More information on \href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{Intel's website} and \href{https://developer.amd.com/amd-uprof/}{AMD's website}
\end{itemize}
\end{block}
\end{frame}

\begin{frame}[fragile]
\frametitle{\mintinline{bash}{perf} commands}
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ perf
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data and display annotated code
c2c Shared Data C2C/HITM Analyzer.
config Get and set variables in a configuration file.
diff Read perf.data and display the differential profile
evlist List the event names in a perf.data file
list List all symbolic event types
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data and display trace output
stat Run command and gather performance counter statistics
top System profiling tool.
version display the version of perf binary
probe Define new dynamic tracepoints
trace strace inspired tool
See 'perf help COMMAND' for more information on a specific command.
\end{minted}
\end{block}
}
\end{frame}
Comment on lines +51 to +75
Contributor

Is this useful? I think I would drop it.

Contributor Author (@amadio, Nov 4, 2022)

I use a similar slide to this to give a general overview of perf in my own presentations, mentioning that there are more commands than the ones I cover. If you don't want to go into details, this could be a useful slide for that. However, other than that, it's probably fine to drop. I did have to shorten the description of the commands to fit in the slide anyway, so this is not quite what you'd get by running perf without arguments.

Contributor

On first thought I also found this too much. On second thought, yeah, why shouldn't we leave an overview here.

Contributor

The point is that this slide would be systematically skipped when you present. So if it's a pure reference, then let's put it in a reference section at the very end. Otherwise, let's drop it.

mentioning that there are more commands than the ones I cover

Useful indeed, but then I would mention that there are a lot of commands, not list them

Contributor Author

It seems that most people don't think it's useful, so I will drop this slide.


\begin{frame}[fragile]
\frametitle{Listing events with \mintinline{bash}{perf list}}
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # List main hardware events
$ perf list hw

List of pre-defined events (to be used in -e):

branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]

$ # List main software/cache events
$ perf list sw
$ perf list cache

$ # List all pre-defined metrics
$ perf list metric

$ # List all currently known events:
$ perf list
\end{minted}
\end{block}
}
\end{frame}

\begin{frame}[fragile]
\frametitle{Counting events with \mintinline{bash}{perf stat}}
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # Standard CPU counter statistics for the specified command:
$ perf stat <command>

$ # Detailed CPU counter statistics for the specified command:
$ perf stat -d <command>
$ perf stat -dd <command>

$ # Top-down microarchitecture analysis for the entire system, for 10s:
$ perf stat -a --topdown -- sleep 10

$ # L1 cache hit rate reported every 1000 ms for the specified command:
$ perf stat -e L1-dcache-loads,L1-dcache-load-misses -I 1000 <command>

$ # Instructions per cycle and instruction-level parallelism for the command:
$ perf stat -M IPC,ILP -- <command>

$ # Measure GFLOPs system-wide, until Ctrl-C is used to stop:
$ perf stat -M GFLOPs

$ # Measure cycles and instructions 10 times, report results with stddev:
$ perf stat -e cycles,instructions -r 10 -- <command>
\end{minted}
\end{block}
}
\end{frame}
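
The L1 example above reports raw counts for `L1-dcache-loads` and `L1-dcache-load-misses`; the hit rate itself still has to be derived from them. The following is a minimal sketch of that arithmetic, assuming the CSV output of `perf stat -x,` with the counter value in field 1 and the event name in field 3 (the layout described in the perf-stat manual page when `-I` is not used). The function name `l1_hit_rate` and the file name `stat.csv` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the L1 data-cache hit rate (in percent)
# from `perf stat -x,` CSV lines read on stdin.
# Assumed layout without -I: field 1 = counter value, field 3 = event name.
l1_hit_rate() {
    awk -F, '
        $3 ~ /^L1-dcache-loads/       { loads  = $1 + 0 }
        $3 ~ /^L1-dcache-load-misses/ { misses = $1 + 0 }
        END {
            # A missing or "<not supported>" counter leaves loads at 0.
            if (loads > 0) printf "%.2f\n", 100 * (1 - misses / loads)
            else           print "n/a"
        }'
}

# Possible use (-o makes perf write the CSV to stat.csv instead of stderr):
#   perf stat -x, -o stat.csv -e L1-dcache-loads,L1-dcache-load-misses -- <command>
#   l1_hit_rate < stat.csv
```

The helper prints `n/a` rather than dividing by zero when the load counter is unavailable, which can happen on machines or virtual machines that do not expose these events.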


\begin{frame}[fragile]
\frametitle{Recording profiling information with \mintinline{bash}{perf record}}
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # Sample on-CPU functions for the specified command, at 100 Hertz:
Contributor

What is an on-CPU function? Does this relate to heterogeneous computing? In the sense that you don't profile GPU functions?

I just tried that command and it counted cycles. So maybe:

Suggested change
$ # Sample on-CPU functions for the specified command, at 100 Hertz:
$ # Sample cycles for the specified command, at 100 Hertz:

Contributor Author (@amadio, Nov 7, 2022)

perf cannot take samples when the process is not running, which is why this is usually referred to as on-CPU sampling: samples are taken only when threads are scheduled on some CPU. However, you can also trace scheduling events to see what is going on when threads are off-CPU (i.e. being scheduled out, then back in). See https://www.brendangregg.com/offcpuanalysis.html for more information.

Contributor

I am starting to wonder whether it's worth keeping examples that cannot be understood simply. The explanation you just gave is already far above the expected knowledge of the people attending the course. In order to explain that, you would need a whole set of slides starting with "thread scheduling", "sampling", etc.

$ perf record -F 100 -- <command>
sponce marked this conversation as resolved.

$ # Sample CPU stack traces (via frame pointers), at 100 Hertz, for 10s:
$ perf record -F 100 -g -- sleep 10
Comment on lines +148 to +149
Contributor

Is the sleep 10 here the command to be profiled or a trick to profile something systemwide? Sorry for my limited knowledge.

Contributor Author

Ah, good catch, I did intend to have -a to capture things system-wide, but the command as is records data only for the sleep command.


$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s:
$ perf record -p <PID> --call-graph=dwarf -- sleep 10
Comment on lines +151 to +152
Contributor

Here, it is even more surprising for me. The PID should give the process to profile. What does the sleep 10 do? Is there no flag to tell perf to count 10s? The current command line is surprising to me.

Contributor Author

Yes, the sleep command is only used to give perf the start/stop timings (it's a very common thing to do with perf to use sleep, as there's no other easy way to tell perf to stop otherwise). The profiled process is actually the one given by <PID>.

Contributor

And here we suppose that people are at ease with frame pointers (previous line) and DWARF. That would require another set of slides by itself. I am more and more convinced that we should simplify drastically and give only one slide of examples, with one line each for list/stat/record/report.

Contributor Author

I tend to agree with @sponce. Maybe I'm assuming too much prior knowledge that the average student doesn't/won't have. I guess in that case, showing just how to do the simplest case, which is to collect and view a report just using the default of cycles for the event is good enough for the course, and we can point people to other sets of slides when more advanced material is needed.

Contributor

But I'm sure HSF people would love to create a full course dedicated to perf. And I promise I would be one of your first students :-)

Contributor Author (@amadio, Nov 8, 2022)

I've given a few talks here and there, so I have many slides on perf (not using LaTeX, though). I could think about converting the material I have into a course on performance analysis, and including other less known tools, like bpftrace, uftrace, bcc, etc. That said, perf itself is more than enough for a full course, as I doubt many people have used perf data, perf c2c, perf mem, and other less well known commands as well. Plus there is the post-processing and data visualization as well, which is also interesting (gprof2dot, flamegraph, d3js).


$ # Precise on-CPU user stack traces (no skid) using PEBS (Intel CPUs):
Contributor

What is an on-CPU stack trace? And what is skid? And what's PEBS? :)
I am asking because a future presenter of these slides might not know this. Is all the information relevant?

Maybe we need a slide introducing some terms of art and defining the acronyms. Or a glossary.

Contributor Author

I explained on-CPU above. Basically, there is a margin of error when attributing samples to instructions, as a number of instructions are in flight in parallel on the CPU at any given time. This error is called the skid in the sampling (see more information here). PEBS stands for Precise Event Based Sampling, a feature of Intel CPUs that allows sampling with low or no skid. The rough equivalent on AMD CPUs is IBS, or Instruction-Based Sampling.

Contributor Author

I am asking because a future presenter of these slides might not know this. Is all the information relevant?

I hope that someone presenting perf to others will read the manual pages and understand these examples ahead of time. I tried to give a general overview of how to do several different things with each of the most important commands, so of course I think that what I added is relevant information for people trying to use perf. Maybe this is all too complicated for a C++ course and we should really just point people to the actual documentation or other material instead. I'm starting to think that that will be easier.

Contributor

Maybe this is all too complicated for a C++ course

Do we need a tools section in the expert part? That could be a solution.

Contributor Author

Maybe a tools course, separate from a C++ course. VTune, perf, valgrind, can all be used for much more than just C++, so we can bundle this together with bash, coreutils, and some other command line tools that are used very often and make a new course.

$ perf record -g -e cycles:up -- <command>

$ # Sample CPU stack traces using Instruction-based sampling (AMD CPUs):
$ # (Note that you need to use system-wide sampling for IBS on AMD CPUs)
$ perf record -a -g -e cycles:pp -- <command>
Comment on lines +157 to +159
Contributor

Isn't -a a system-wide sampling? Why do I need a <command> then? What is IBS?

Contributor Author (@amadio, Nov 7, 2022)

IBS is explained above. The requirement to use system-wide sampling is a hardware requirement when using IBS on AMD CPUs. This is also explained in perf's documentation (see man perf-list). I added this example to show how to use event modifiers and to remind people that IBS requires system-wide sampling to work.


$ # Sample CPU stack traces once every 10k L1 data cache misses, for 5s:
$ perf record -a -g -e L1-dcache-load-misses -c 10000 -- sleep 5

$ # Sample CPUs at 100 Hertz, and show top addresses and symbols, live:
$ perf top -F 100
\end{minted}
\end{block}
}
\end{frame}
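
As explained in the discussion above, `perf record -p <PID> -- sleep N` profiles the process given by `<PID>`; `sleep` only tells perf when to stop. The following is a minimal sketch that wraps this idiom. The function name `profile_pid` and the output file naming scheme are hypothetical, and the sampling frequency of 100 Hz simply mirrors the examples on the slide.

```shell
#!/bin/sh
# Hypothetical wrapper: sample a running process for a fixed duration.
# usage: profile_pid <PID> [seconds]
profile_pid() {
    pid=$1
    secs=${2:-10}
    # Reject empty or non-numeric PIDs before touching perf.
    case $pid in
        ''|*[!0-9]*) echo "usage: profile_pid <PID> [seconds]" >&2; return 2 ;;
    esac
    if ! command -v perf >/dev/null 2>&1; then
        echo "perf not found in PATH" >&2
        return 127
    fi
    # -p attaches to the target; `sleep` is only the timer that ends recording.
    perf record -F 100 -g -p "$pid" -o "perf.$pid.data" -- sleep "$secs"
}
```

For example, `profile_pid 12345 10` would sample process 12345 for ten seconds, and `perf report -i perf.12345.data` would then display the result. Attaching to a process owned by another user generally requires elevated privileges.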

\begin{frame}[fragile]
\frametitle{Reporting and annotating source code with \mintinline{bash}{perf}}
{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # Standard reporting of perf.data in the text UI:
$ perf report

$ # Report by self-time (excluding time spent in callees):
$ perf report --no-children

$ # Report per source line of code (needs debugging info to work):
$ perf report --no-children -s srcline

$ # Single inverted (caller-based) call-graph per binary:
$ perf report --inverted -s comm

$ # Text-based report per library, without call graph:
$ perf report --stdio -g none -s dso

$ # Hierarchical report for functions taking at least 1% of runtime:
$ perf report --stdio -g none --hierarchy --percent-limit 1

$ # Disassemble and annotate a symbol (instructions with percentages):
$ # (Needs debugging information available to show source code as well)
$ perf annotate <symbol>
\end{minted}
\end{block}
}
\end{frame}

\begin{frame}[fragile]
\frametitle{Further information on \mintinline{bash}{perf}}
\begin{itemize}
\item Official documentation in the Linux repository at
\href{https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/Documentation}
{linux/tools/perf/Documentation}
\item Perf Wiki at \url{https://perf.wiki.kernel.org/}
\item Linux \mintinline{bash}{perf} examples by Brendan Gregg
\url{https://www.brendangregg.com/linuxperf.html}
\item Scripts to visualize profiles as flamegraphs
\url{https://github.com/brendangregg/FlameGraph}
\item HSF Tools \& Packaging Working Group talk on Indico\\
\href{https://indico.cern.ch/event/974382/}
{Linux Systems Performance: Tracing, Profiling \& Visualization}
\end{itemize}
\end{frame}
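
The FlameGraph scripts listed above turn a recorded profile into an SVG. The following is a minimal sketch of the usual pipeline, assuming a local checkout of the FlameGraph repository and a profile recorded with call graphs (for instance with `perf record -g`). The function name `make_flamegraph` and the default file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: render a perf profile as a flame graph.
# usage: make_flamegraph <FlameGraph checkout> [perf.data] [out.svg]
make_flamegraph() {
    fg=$1
    data=${2:-perf.data}
    out=${3:-flamegraph.svg}
    if [ ! -f "$fg/stackcollapse-perf.pl" ] || [ ! -f "$fg/flamegraph.pl" ]; then
        echo "FlameGraph scripts not found in '$fg'" >&2
        return 1
    fi
    if [ ! -r "$data" ]; then
        echo "cannot read profile '$data'" >&2
        return 1
    fi
    # perf script dumps the samples, stackcollapse-perf.pl folds each stack
    # into a single line, and flamegraph.pl draws the SVG.
    perf script -i "$data" \
        | perl "$fg/stackcollapse-perf.pl" \
        | perl "$fg/flamegraph.pl" > "$out"
}
```

A possible invocation is `make_flamegraph ~/FlameGraph perf.data out.svg`; the resulting SVG can be opened in a web browser.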

\begin{frame}[fragile]
\frametitle{Intel VTune Profiler}
\centering
\includegraphics[width=0.75\textwidth]{tools/vtune.png}
\begin{itemize}
\item Very powerful GUI-based profiler for Intel CPUs and GPUs
\item Now free to use with
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html}{Intel oneAPI Base Toolkit} or
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{standalone}
\item See the \href{https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/}
{official online documentation} for more information
\end{itemize}
\end{frame}
Comment on lines +219 to 231
Contributor

Does the picture bring anything for people who do not know the tool? I would maybe replace it with a bullet highlighting the things it can do which perf cannot (if any) and another giving the downsides.

Contributor Author

Since VTune is a graphical tool, I thought it would be nice to show what it looks like when you open it. You can use the picture to show the types of analyses that VTune is able to do instead of a bullet list, and just tell people when presenting about the extra features it has over perf. For detailed usage information, I'd point people to the online docs. One thing I'd mention while presenting is the Top-Down Microarchitecture Analysis, which is a very good method to find bottlenecks. While perf can also do it, it cannot show you detailed information for each symbol like VTune does, and the annotation of source code by VTune is also a lot easier to use than perf's.

Contributor Author

We could also link a talk from Ahmad Yasin, who was behind the creation of the Top-Down Microarchitecture Analysis Method at Intel. It's a very nice talk.

Contributor

I would even like to have more pictures. E.g. I love the microarchitecture analysis with the pipeline visualization. Or what a general hierarchical profile looks like. Or the pane showing contention between threads. Or even better, a live demonstration :)

Contributor

I do not care about the pictures themselves. I care that if there is a picture, it's understandable, that is, that we explain what appears there. In this case, there are a LOT of explanations missing, and I'm not sure we actually want to include them.

Binary file added talk/tools/vtune.png