Skip to content
This repository has been archived by the owner on Aug 1, 2023. It is now read-only.
/ CA-Project Public archive

Implementation and perfomance analisys of the conv2D algorithm in C++ and Cuda.

Notifications You must be signed in to change notification settings

federic0casu/CA-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

How to do performance analisys with perf

1 - Install perf

$ sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`

Getting started with perf is straightforward (you should be able to see what follows):

$ perf

 usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

 The most commonly used perf commands are:
   annotate        Read perf.data (created by perf record) and display annotated code
   archive         Create archive with object files with build-ids found in perf.data file
   bench           General framework for benchmark suites
   buildid-cache   Manage build-id cache.
   buildid-list    List the buildids in a perf.data file
   c2c             Shared Data C2C/HITM Analyzer.
   config          Get and set variables in a configuration file.
   daemon          Run record sessions on background
   data            Data file related processing
   diff            Read perf.data files and display the differential profile
   evlist          List the event names in a perf.data file
   ftrace          simple wrapper for kernel's ftrace functionality
   inject          Filter to augment the events stream with additional information
   iostat          Show I/O performance metrics
   kallsyms        Searches running kernel for symbols
   kmem            Tool to trace/measure kernel memory properties
   kvm             Tool to trace/measure kvm guest os
   list            List all symbolic event types
   lock            Analyze lock events
   mem             Profile memory accesses
   record          Run a command and record its profile into perf.data
   report          Read perf.data (created by perf record) and display the profile
   sched           Tool to trace/measure scheduler properties (latencies)
   script          Read perf.data (created by perf record) and display trace output
   stat            Run a command and gather performance counter statistics
   test            Runs sanity tests.
   timechart       Tool to visualize total system behavior during a workload
   top             System profiling tool.
   version         display the version of perf binary
   probe           Define new dynamic tracepoints
   trace           strace inspired tool

 See 'perf help COMMAND' for more information on a specific command.

2 - How to use perf

The perf tool supports a list of measurable events that you can view with perf list command. The tool and underlying kernel interface can measure events coming from different sources. For instance, some events are pure kernel counters, in this case, they are called software events. Examples include context-switches, minor-faults, page-faults and others.

Another source of events is the processor itself and its Performance Monitoring Unit (PMU). It provides a list of events to measure micro-architectural events such as the number of cycles, instructions retired, L1 cache misses and so on. Those events are called “PMU hardware events” or “hardware events” for short. They vary with each processor type and model.

To see the list of hardware events (if the list is empty you should run the command with root privileges):

$ perf list hw

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

To see the list of software events:

$ perf list hw

List of pre-defined events (to be used in -e):

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  cgroup-switches                                    [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  duration_time                                      [Tool event]
  user_time                                          [Tool event]
  system_time                                        [Tool event]

Now we can start to do some analisys: perf stat

$ perf stat -r 10 ls -la > /dev/null

 Performance counter stats for 'ls -la' (10 runs):

              1,52 msec task-clock                #    0,552 CPUs utilized            ( +- 10,03% )
                 0      context-switches          #    0,000 /sec                   
                 0      cpu-migrations            #    0,000 /sec                   
               115      page-faults               #   57,560 K/sec                    ( +-  0,22% )
         3.623.911      cycles                    #    1,814 GHz                      ( +-  1,71% )
           298.384      stalled-cycles-frontend   #    8,56% frontend cycles idle     ( +-  4,90% )
           428.097      stalled-cycles-backend    #   12,28% backend cycles idle      ( +-  4,63% )
         3.422.327      instructions              #    0,98  insn per cycle         
                                                  #    0,13  stalled cycles per insn  ( +-  0,07% )
           700.873      branches                  #  350,805 M/sec                    ( +-  0,06% )
     <not counted>      branch-misses                                                 (0,00%)

          0,002750 +- 0,000213 seconds time elapsed  ( +-  7,76% )

What did we just do? We have recorded statistics about events (such as page-faults, stalled-cycles, branch-misses, ...) regarding ls -la. Ok, that's great. But if we are interested in some others events (cache-references, cache-misses, ...)?

$ perf stat -r 10 -e cache-references,cache-misses ls -la > /dev/null

 Performance counter stats for 'ls -la' (10 runs):

           291.222      cache-references                                              ( +-  0,84% )
            55.840      cache-misses              #   19,806 % of all cache refs      ( +-  2,42% )

          0,003197 +- 0,000249 seconds time elapsed  ( +-  7,80% )

Once again: ok, that's great, but we need to do perfomance analisys on our program! No problem folks:

$ sudo perf stat -r 30 -e cache-references,cache-misses ./conv img.txt ker.txt 0 2

 Performance counter stats for './conv img.txt ker.txt 0 2' (30 runs):

        28.383.742      cache-references                                              ( +- 12,49% )
         1.768.478      cache-misses              #    5,554 % of all cache refs      ( +-  1,27% )

           2,37044 +- 0,00781 seconds time elapsed  ( +-  0,33% )

How to improve perfomance with perf

The answer is perf record:

$ sudo perf record -e cache-references,cache-misses -r 10 ./conv img.txt ker.txt 0 2
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0,560 MB perf.data (11720 samples) ]

To see the statistics recorded by perf for us we need to run perf report:

$ sudo perf report -v

Output:

Available samples
10K cache-references
1K cache-misses

perf report allows us to choose which event (either hardware or software) we want to analyze. The events are those recorded by perf record during its execution.

cache references:

Samples: 10K of event 'cache-references', Event count (approx.): 36665484
Overhead  Command  Shared Object                                        Symbol
  23,02%  conv     /home/federico/Documents/CA-Project/simulation/conv  0x1c1f             B [.] convolute(void*)
  15,43%  conv     /proc/kcore                                          0xffffffff99586aa7 k [k] clear_page_rep
   7,22%  conv     /proc/kcore                                          0xffffffff995871ec k [k] copy_user_generic_string
   3,66%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x62aae            D [.] _IO_file_seekoff@@GLIBC_2.2.5
   3,34%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x477e1            D [.] __vfwscanf_internal
   2,81%  conv     /proc/kcore                                          0xffffffff990f0ec2 k [k] filemap_get_read_batch
   2,44%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x671cb            D [.] _IO_remove_marker
   1,87%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x63059            D [.] _IO_file_xsgetn_mmap
   1,82%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x64140            D [.] _IO_file_open
   1,53%  conv     /proc/kcore                                          0xffffffff98f8812c k [k] exit_to_user_mode_prepare
   1,29%  conv     /proc/kcore                                          0xffffffff9913e343 k [k] do_anonymous_page
   1,09%  conv     /proc/kcore                                          0xffffffff991453d9 k [k] __handle_mm_fault
   1,05%  conv     /proc/kcore                                          0xffffffff9916b2ae k [k] rmqueue_pcplist.constprop.0
   1,00%  conv     /proc/kcore                                          0xffffffff9916d682 k [k] get_page_from_freelist
   0,93%  conv     /home/federico/Documents/CA-Project/simulation/conv  0x2941             B [.] read_file(char*, char*, int, int, int, int, int*, int*)
   0,84%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x64017            D [.] _IO_file_close_it@@GLIBC_2.2.5
   0,79%  conv     /proc/kcore                                          0xffffffff98f7605f k [k] __rcu_read_lock
   0,76%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x8e890            D [.] __GI___strcasecmp_l_sse2
   0,76%  conv     /proc/kcore                                          0xffffffff991cf4b6 k [k] __mod_lruvec_page_state
   0,69%  conv     /proc/kcore                                          0xffffffff99583f82 k [k] xas_load
   0,69%  conv     /proc/kcore                                          0xffffffff991ccec1 k [k] get_mem_cgroup_from_mm

cache misses:

Samples: 1K of event 'cache-misses', Event count (approx.): 2213404
Overhead  Command  Shared Object                                        Symbol
  19,21%  conv     /proc/kcore                                          0xffffffff99586aa7 k [k] clear_page_rep
   6,96%  conv     /proc/kcore                                          0xffffffff995871ec k [k] copy_user_generic_string
   5,10%  conv     /home/federico/Documents/CA-Project/simulation/conv  0x1c64             B [.] convolute(void*)
   3,24%  conv     /proc/kcore                                          0xffffffff99167d83 k [k] __free_one_page
   2,16%  conv     /proc/kcore                                          0xffffffff9916acaa k [k] rmqueue_bulk
   2,05%  conv     /proc/kcore                                          0xffffffff98e9f5ba k [k] native_read_msr
   1,74%  conv     /proc/kcore                                          0xffffffff9916f675 k [k] __folio_alloc
   1,73%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x63071            D [.] _IO_file_xsgetn_mmap
   1,43%  conv     /proc/kcore                                          0xffffffff990f0ec2 k [k] filemap_get_read_batch
   1,29%  conv     /proc/kcore                                          0xffffffff98e9f7a6 k [k] native_write_msr
   1,25%  conv     /proc/kcore                                          0xffffffff991452d4 k [k] __handle_mm_fault
   1,23%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x62ad2            D [.] _IO_file_seekoff@@GLIBC_2.2.5
   1,16%  conv     /proc/kcore                                          0xffffffff98f34ad7 k [k] __update_load_avg_se
   1,09%  conv     /proc/kcore                                          0xffffffff9916d6a7 k [k] get_page_from_freelist
   0,96%  conv     /proc/kcore                                          0xffffffff9916b2fd k [k] rmqueue_pcplist.constprop.0
   0,94%  conv     /usr/lib/x86_64-linux-gnu/libc.so.6                  0x671cb            D [.] _IO_remove_marker
   0,88%  conv     /proc/kcore                                          0xffffffff99d25057 k [k] _raw_spin_lock
   0,83%  conv     /proc/kcore                                          0xffffffff991aa045 k [k] kmem_cache_alloc
   0,78%  conv     /proc/kcore                                          0xffffffff9913f3fd k [k] zap_pte_range
   0,78%  conv     /proc/kcore                                          0xffffffff991431a5 k [k] handle_pte_fault

perf report ranks functions based on the number of collected samples of the event under analysis (the higher the number of samples the higher the function is ranked). It is possible to customize the sorting order and therefore to view the data differently. The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function. The second column reports the process from which the samples were collected. The third column shows the name of the ELF image where the samples came from. If a program is dynamically linked, then this may show the name of a shared library. The fourth column indicates the privilege level at which the sample was taken, i.e. when the program was running when it was interrupted:

  • [.]: user lever
  • [k]: kernel
  • [g]: guest kernel level (virtualization)
  • [u]: guest os user space
  • [H]: hypervisor

The final column shows the symbol name.

By going up and down with arrow keys and then pressing Enter (or 'a'), we can select the function we want to analyze. Example: we want to understand why the function convolute(void*) is responsible for the 23,02% of the total cache references samples (so convolute(void*) is likely to be the hottest function of our program with respect to cache references:

Samples: 10K of event 'cache-references', 4000 Hz, Event count (approx.): 36665484
convolute(void*)  /home/federico/Documents/CA-Project/simulation/conv [Percent: local period]
  0,03 │1be0:   add     %edx,%eax
  0,49 │1be2:   imul    -0x28(%rbp),%eax
  0,25 │1be6:   mov     -0x40(%rbp),%ecx
  8,32 │1be9:   mov     -0x38(%rbp),%edx
  0,76 │1bec:   add     %ecx,%edx
  0,11 │1bee:   add     %edx,%eax
  0,40 │1bf0:   cltq
  0,56 │1bf2:   lea     0x0(,%rax,4),%rdx
  1,70 │1bfa:   mov     -0x8(%rbp),%rax
  0,18 │1bfe:   add     %rdx,%rax
  1,27 │1c01:   mov     (%rax),%edx
 38,67 │1c03:   mov     -0x3c(%rbp),%eax
  0,04 │1c06:   imul    -0x24(%rbp),%eax
  0,08 │1c0a:   mov     %eax,%ecx
       │1c0c:   mov     -0x38(%rbp),%eax
       │1c0f:   add     %ecx,%eax
       │1c11:   cltq
  0,04 │1c13:   lea     0x0(,%rax,4),%rcx
       │1c1b:   mov     -0x10(%rbp),%rax
 10,79 │1c1f:   add     %rcx,%rax
  0,03 │1c22:   mov     (%rax),%eax

Each instruction have its relative percentage of samples reported (first column).

We can find the hottest instruction by pressing 'H'. Also, we can cycle through the hottest instructions by pressing TAB. In convolute(void*) the hottest instructions are the following:

38.67 |1c03:   mov     -0x3c(%rbp),%eax

18.56 |1c2a:   addl    $0x1,-0x38(%rbp)

10,79 |1c1f:   add     %rcx,%rax

 8,32 │1be9:   mov     -0x38(%rbp),%edx

 5,49 |1c27:   add      %eax,-48(%rbp)

 4,25 |1c24:   imul     %edx,%eax

it's time to inspect what 'convolute(void*)' is actually doing:

$ objdump -d conv

...

0000000000001b41 <_Z9convolutePv>:
    1b41:	f3 0f 1e fa          	endbr64 
    1b45:	55                   	push   %rbp
    1b46:	48 89 e5             	mov    %rsp,%rbp
    1b49:	48 89 7d a8          	mov    %rdi,-0x58(%rbp)
    1b4d:	48 8b 45 a8          	mov    -0x58(%rbp),%rax
    1b51:	48 89 45 e0          	mov    %rax,-0x20(%rbp)
    1b55:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b59:	48 8b 00             	mov    (%rax),%rax
    1b5c:	48 89 45 e8          	mov    %rax,-0x18(%rbp)
    1b60:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b64:	48 8b 40 10          	mov    0x10(%rax),%rax
    1b68:	48 89 45 f0          	mov    %rax,-0x10(%rbp)
    1b6c:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b70:	48 8b 40 08          	mov    0x8(%rax),%rax
    1b74:	48 89 45 f8          	mov    %rax,-0x8(%rbp)
    1b78:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b7c:	8b 40 18             	mov    0x18(%rax),%eax
    1b7f:	89 45 cc             	mov    %eax,-0x34(%rbp)
    1b82:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b86:	8b 40 1c             	mov    0x1c(%rax),%eax
    1b89:	89 45 d0             	mov    %eax,-0x30(%rbp)
    1b8c:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b90:	8b 40 20             	mov    0x20(%rax),%eax
    1b93:	89 45 d4             	mov    %eax,-0x2c(%rbp)
    1b96:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1b9a:	8b 40 24             	mov    0x24(%rax),%eax
    1b9d:	89 45 d8             	mov    %eax,-0x28(%rbp)
    1ba0:	48 8b 45 e0          	mov    -0x20(%rbp),%rax
    1ba4:	8b 40 28             	mov    0x28(%rax),%eax
    1ba7:	89 45 dc             	mov    %eax,-0x24(%rbp)
    1baa:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%rbp)
    1bb1:	8b 45 cc             	mov    -0x34(%rbp),%eax
    1bb4:	89 45 bc             	mov    %eax,-0x44(%rbp)
    1bb7:	e9 c5 00 00 00       	jmp    1c81 <_Z9convolutePv+0x140>
    1bbc:	c7 45 c0 00 00 00 00 	movl   $0x0,-0x40(%rbp)
    1bc3:	e9 a9 00 00 00       	jmp    1c71 <_Z9convolutePv+0x130>
    1bc8:	c7 45 c4 00 00 00 00 	movl   $0x0,-0x3c(%rbp)
    1bcf:	eb 69                	jmp    1c3a <_Z9convolutePv+0xf9>
    1bd1:	c7 45 c8 00 00 00 00 	movl   $0x0,-0x38(%rbp)
    1bd8:	eb 54                	jmp    1c2e <_Z9convolutePv+0xed>
    1bda:	8b 55 bc             	mov    -0x44(%rbp),%edx
    1bdd:	8b 45 c4             	mov    -0x3c(%rbp),%eax
    1be0:	01 d0                	add    %edx,%eax
    1be2:	0f af 45 d8          	imul   -0x28(%rbp),%eax
    1be6:	8b 4d c0             	mov    -0x40(%rbp),%ecx
    1be9:	8b 55 c8             	mov    -0x38(%rbp),%edx         # <-- 4th hottest instruction ( 8,32%)
    1bec:	01 ca                	add    %ecx,%edx
    1bee:	01 d0                	add    %edx,%eax
    1bf0:	48 98                	cltq   
    1bf2:	48 8d 14 85 00 00 00 	lea    0x0(,%rax,4),%rdx
    1bf9:	00 
    1bfa:	48 8b 45 f8          	mov    -0x8(%rbp),%rax
    1bfe:	48 01 d0             	add    %rdx,%rax
    1c01:	8b 10                	mov    (%rax),%edx
    1c03:	8b 45 c4             	mov    -0x3c(%rbp),%eax         # <-- 1st hottest instruction (38,67%)
    1c06:	0f af 45 dc          	imul   -0x24(%rbp),%eax
    1c0a:	89 c1                	mov    %eax,%ecx
    1c0c:	8b 45 c8             	mov    -0x38(%rbp),%eax
    1c0f:	01 c8                	add    %ecx,%eax
    1c11:	48 98                	cltq   
    1c13:	48 8d 0c 85 00 00 00 	lea    0x0(,%rax,4),%rcx
    1c1a:	00 
    1c1b:	48 8b 45 f0          	mov    -0x10(%rbp),%rax
    1c1f:	48 01 c8             	add    %rcx,%rax                # <-- 3rd hottest instruction (10,79%)
    1c22:	8b 00                	mov    (%rax),%eax
    1c24:	0f af c2             	imul   %edx,%eax                # <-- 6th hottest instruction ( 4,25%)
    1c27:	01 45 b8             	add    %eax,-0x48(%rbp)         # <-- 5th hottest instruction ( 5,49%)
    1c2a:	83 45 c8 01          	addl   $0x1,-0x38(%rbp)         # <-- 2nd hottest instruction (18,56%)
    1c2e:	8b 45 c8             	mov    -0x38(%rbp),%eax
    1c31:	3b 45 dc             	cmp    -0x24(%rbp),%eax
    1c34:	7c a4                	jl     1bda <_Z9convolutePv+0x99>
    1c36:	83 45 c4 01          	addl   $0x1,-0x3c(%rbp)
    1c3a:	8b 45 c4             	mov    -0x3c(%rbp),%eax
    1c3d:	3b 45 dc             	cmp    -0x24(%rbp),%eax
    1c40:	7c 8f                	jl     1bd1 <_Z9convolutePv+0x90>
    1c42:	8b 45 bc             	mov    -0x44(%rbp),%eax
    1c45:	0f af 45 d4          	imul   -0x2c(%rbp),%eax
    1c49:	89 c2                	mov    %eax,%edx
    1c4b:	8b 45 c0             	mov    -0x40(%rbp),%eax
    1c4e:	01 d0                	add    %edx,%eax
    1c50:	48 98                	cltq   
    1c52:	48 8d 14 85 00 00 00 	lea    0x0(,%rax,4),%rdx
    1c59:	00 
    1c5a:	48 8b 45 e8          	mov    -0x18(%rbp),%rax
    1c5e:	48 01 c2             	add    %rax,%rdx
    1c61:	8b 45 b8             	mov    -0x48(%rbp),%eax
    1c64:	89 02                	mov    %eax,(%rdx)
    1c66:	c7 45 b8 00 00 00 00 	movl   $0x0,-0x48(%rbp)
    1c6d:	83 45 c0 01          	addl   $0x1,-0x40(%rbp)
    1c71:	8b 45 c0             	mov    -0x40(%rbp),%eax
    1c74:	3b 45 d4             	cmp    -0x2c(%rbp),%eax
    1c77:	0f 8c 4b ff ff ff    	jl     1bc8 <_Z9convolutePv+0x87>
    1c7d:	83 45 bc 01          	addl   $0x1,-0x44(%rbp)
    1c81:	8b 45 bc             	mov    -0x44(%rbp),%eax
    1c84:	3b 45 d0             	cmp    -0x30(%rbp),%eax
    1c87:	0f 8c 2f ff ff ff    	jl     1bbc <_Z9convolutePv+0x7b>
    1c8d:	90                   	nop
    1c8e:	90                   	nop
    1c8f:	5d                   	pop    %rbp
    1c90:	c3                   	ret

    ...

Actually, at first glance, it's impossible to figure out what this assembly code is doing. After an hour I was able to figure out the meaning of each instruction. The result is the following:

#
# struct task {			// offset
#	int *output;		// 0
#   int *input;			// +8	
# 	int *kernel;		// +16
#   int start_row;		// +24
#	int end_row;		// +28
#	int output_columns;	// +32
#   int input_rows;		// +36
#	int kernel_size;	// +40
# };
#

_Z9convolutePv:
.LFB2422:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6

	movq	%rdi, -88(%rbp)
	movq	-88(%rbp), %rax	
	movq	%rax, -32(%rbp)		# struct task *__task = (struct task *)argument; 
                                	# -32(%rbp) = __task
	movq	-32(%rbp), %rax
	movq	(%rax), %rax
	movq	%rax, -24(%rbp)		# auto output = __task->output;
                                	# -24(%rbp) = output
	movq	-32(%rbp), %rax		
	movq	16(%rax), %rax
	movq	%rax, -16(%rbp)		# auto kernel = __task->kernel;
                                	# -16(%rbp) = kernel
	movq	-32(%rbp), %rax
	movq	8(%rax), %rax
	movq	%rax, -8(%rbp)		# auto input = __task->input;
                                	# -8(%rbp) = input
	movq	-32(%rbp), %rax
	movl	24(%rax), %eax
	movl	%eax, -52(%rbp)		# auto start = __task->start_row;
                                	# -52(%rbp) = start
	movq	-32(%rbp), %rax
	movl	28(%rax), %eax
	movl	%eax, -48(%rbp)		# auto end = __task->end_row;
                                	# -48(%rbp) = end
	movq	-32(%rbp), %rax
	movl	32(%rax), %eax
	movl	%eax, -44(%rbp)		# auto out_size_y  = __task->output_columns;
                                	# -44(%rbp) = out_size_y
	movq	-32(%rbp), %rax
	movl	36(%rax), %eax
	movl	%eax, -40(%rbp)		# auto in_size_x   = __task->input_rows;
                                	# -40(%rbp) = in_size_x
	movq	-32(%rbp), %rax
	movl	40(%rax), %eax
	movl	%eax, -36(%rbp)		# auto kernel_size = __task->kernel_size;
                                	# -36(%rbp) = kernel_size
	movl	$0, -72(%rbp)		# int convolute = 0;

	movl	-52(%rbp), %eax		# %eax = start
	movl	%eax, -68(%rbp)		# auto x = start;
	jmp	.L5
.L12:
	movl	$0, -64(%rbp)		# auto y = 0;
	jmp	.L6
.L11:
	movl	$0, -60(%rbp)		# auto kx = 0;
	jmp	.L7
.L10:
	movl	$0, -56(%rbp)		# auto ky = 0;
	jmp	.L8
.L9:
	movl	-68(%rbp), %edx		# %edx = x
	movl	-60(%rbp), %eax		# %eax = kx
	addl	%edx, %eax		# %eax = x + kx
	imull	-40(%rbp), %eax		# %eax = (x + kx) * in_size_x
	movl	-64(%rbp), %ecx		# %ecx = y
	movl	-56(%rbp), %edx		# %edx = ky
	addl	%ecx, %edx		# %edx = y + ky
	addl	%edx, %eax		# %eax = (x + kx) * in_size_x + (y +ky)
	cltq
	leaq	0(,%rax,4), %rdx	# %rdx = %rax * 4	// We can "interpret" this instruction as
					#			// the following: ((x + kx) * in_size_x + (y + ky)) * sizeof(int)	

	movq	-8(%rbp), %rax		# %rax = &input[0]  	// Since input is a pointer, this instruction 
					#               	// is effectively loading the address of the 
					#               	// 1-st element of the input image.
	
	addq	%rdx, %rax		# %rax = &input[(x + kx) * in_size_x + (y + ky)]
	movl	(%rax), %edx		# %edx = input[(x + kx) * in_size_x + (y + ky)]
	
	movl	-60(%rbp), %eax		# %eax = kx
	imull	-36(%rbp), %eax		# %eax = kx * kernel_size
	movl	%eax, %ecx		# %ecx = kx * kernel_size
	movl	-56(%rbp), %eax		# %eax = ky
	addl	%ecx, %eax		# %eax = kx * kernel_size + ky
	cltq
	leaq	0(,%rax,4), %rcx	# rcx = %rax * 4	// Same as above: (kx * kernel_size + ky) * sizeof(int)
	movq	-16(%rbp), %rax		# %rax = &kernel[0]
	addq	%rcx, %rax		# %rax = &kernel[kx * kernel_size + ky]
	movl	(%rax), %eax		# %eax = kernel[kx * kernel_size + ky]
	imull	%edx, %eax		# %eax = input[(x + kx) * in_size_x + (y + ky)] * kernel[kx * kernel_size + ky]
	addl	%eax, -72(%rbp)		# convolute += (input[(x + kx) * in_size_x + (y + ky)] * kernel[kx * kernel_size + ky]);
	addl	$1, -56(%rbp)		# ky++
.L8:
	movl	-56(%rbp), %eax		# %eax = ky
	cmpl	-36(%rbp), %eax		# ky < kernel_size
	jl	.L9
	addl	$1, -60(%rbp)		# kx++
.L7:
	movl	-60(%rbp), %eax		# %eax = kx
	cmpl	-36(%rbp), %eax		# kx < kernel_size
	jl	.L10
	movl	-68(%rbp), %eax		# %eax = x
	imull	-44(%rbp), %eax		# %eax = x * out_size_y
	movl	%eax, %edx		# %edx = x * out_size_y
	movl	-64(%rbp), %eax		# %eax = y
	addl	%edx, %eax		# %eax = x * out_size_y + y
	cltq
	leaq	0(,%rax,4), %rdx	# %rdx = (x * out_size_y + y) * sizeof(int)
	movq	-24(%rbp), %rax		# %rax = &output[0]
	addq	%rax, %rdx		# %rdx = &output[0] + (x * out_size_y + y) * sizeof(int)
	movl	-72(%rbp), %eax		# %eax = convolute
	movl	%eax, (%rdx)		# output[x * out_size_y + y] = convolute;
	movl	$0, -72(%rbp)		# convolute = 0;
	addl	$1, -64(%rbp)		# y++
.L6:
	movl	-64(%rbp), %eax		# %eax = y
	cmpl	-44(%rbp), %eax		# y < out_size_y
	jl	.L11
	addl	$1, -68(%rbp)		# x++
.L5:
	movl	-68(%rbp), %eax		# %eax = x
	cmpl	-48(%rbp), %eax     	# x < end
	jl	.L12
	nop
	nop
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc

Optimizations:

__task will not be used before the inizialization of variables:

  • output, input, kernel
  • start, end
  • out_size_y, in_size_x, kernel_size

Also, we can use registers to hold variables (instead of allocating stack memory):

  • %ebx = convolute
  • %r12 = x
  • %r13 = y
  • %r14 = kx
  • %r15 = ky
# convolute_opt.s

.global _Z13convolute_optPv:

_Z13convolute_optPv:                # void* convolute_opt(void*)
.LFB2422:
        .cfi_startproc
        endbr64
        pushq	%rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq	%rsp, %rbp
        subq    $0x80, %rsp         # Stack must be aligned at 16 bytes
        .cfi_def_cfa_register 6
        movq	(%rdi), %rax
        movq	%rax, -24(%rbp)     # -24(%rbp) = &output[0]
        movq	16(%rdi), %rax
        movq	%rax, -16(%rbp)     # -16(%rbp) = &kernel[0]
        movq	8(%rdi), %rax
        movq	%rax, -8(%rbp)      # -8(%rbp) = &input[0]
        movl	24(%rdi), %eax
        movl	%eax, -52(%rbp)     # -52(%rbp) = start
        movl	28(%rdi), %eax
        movl	%eax, -48(%rbp)     # -48(%rbp) = end
        movl	32(%rdi), %eax
        movl	%eax, -44(%rbp)     # -44(%rbp) = out_size_y 
        movl	36(%rdi), %eax
        movl	%eax, -40(%rbp)     # -40(%rbp) = in_size_x    
        movl	40(%rdi), %eax
        movl	%eax, -36(%rbp)     # -36(%rbp) = kernel_size
        push    %rbx                # Saving %rbx (%rbx is a callee-saved register)
        push    %r12                # Saving %r12 (%r12 is a callee-saved register)
        push    %r13                # Saving %r13 (%r13 is a callee-saved register)
        push    %r14                # Saving %r14 (%r14 is a callee-saved register)
        push    %r15                # Saving %r15 (%r15 is a callee-saved register)
        xor 	%rbx, %rbx          # %rbx = convolute = 0
        movl	-52(%rbp), %eax	    # %eax = start
        cltq                        
	movq	%rax, %r12          # %r12 = x = start
	jmp	.L5
.L12:    
        xor     %r13, %r13          # %r13 = y = 0
        jmp .L6
.L11:
        xor     %r14, %r14          # %r14 = kx = 0
        jmp	.L7
.L10:
	xor     %r15, %r15          # %r15 = ky = 0
        jmp .L8
.L9:
        movq    %r12, %rcx          # %rcx = x 
	add     %r14, %rcx          # %rcx = x + kx
        movl    -40(%rbp), %eax     # %eax = in_size_x
        cltq
        imul	%rcx, %rax	    # %rax = (x + kx) * in_size_x
	movq	%r13, %rcx	    # %rcx = y
        movq	%r15, %rdx          # %rdx = ky
        addq	%rcx, %rdx	    # %rdx = y + ky
        addq	%rdx, %rax          # %rax = (x + kx) * in_size_x + (y + ky)
        leaq	0(,%rax,4), %rdx    # %rdx = %rax * 4
	movq	-8(%rbp), %rax	    # %rax = &input[0]
        addq	%rdx, %rax          # %rax = &input[(x + kx) * in_size_x + (y + ky)]
	movl	(%rax), %edx	    # %edx = input[(x + kx) * in_size_x + (y + ky)]
	movq	%r14, %rcx	    # %rcx = kx
        movl    -36(%rbp), %eax     # %eax = kernel_size
        cltq
	imul	%rcx, %rax          # %rax = kx * kernel_size
	movq	%rax, %rcx	    # %rcx = kx * kernel_size
	movq	%r15, %rax  	    # %rax = ky
	addq	%rcx, %rax	    # %rax = kx * kernel_size + ky
	leaq	0(,%rax,4), %rcx    # rcx = %rax * 4
	movq	-16(%rbp), %rax	    # %rax = &kernel[0]
	addq	%rcx, %rax	    # %rax = &kernel[kx * kernel_size + ky]
	movl	(%rax), %eax	    # %eax = kernel[kx * kernel_size + ky]
        imull	%edx, %eax          # %eax = input[(x + kx) * in_size_x + (y + ky)] * kernel[kx * kernel_size + ky]
	addl	%eax, %ebx          # %ebx += %eax (convolute += %eax)
	addq    $1, %r15            # ky++
.L8:
        movl    -36(%rbp), %eax
        cltq
	cmpq	%rax, %r15  	    # ky < kernel_size
	jl	.L9 
        addq	$1, %r14    	    # kx++
.L7:
        movl    -36(%rbp), %eax
        cltq
	cmpq	%rax, %r14	    # kx < kernel_size
        jl	.L10
        movq	%r12, %rax  	    # %rax = x
        movslq  -44(%rbp), %rcx     # %rcx out_size_y
	imulq	%rcx, %rax	    # %rax = x * out_size_y
        movq	%rax, %rdx	    # %rdx = x * out_size_y
	movq	%r13, %rax  	    # %rax = y
        addq	%rdx, %rax	    # %rax = x * out_size_y + y	
        leaq	0(,%rax,4), %rdx    # %rdx = %rax * 4
	movq	-24(%rbp), %rax	    # %rax = &output[0]
        addq	%rax, %rdx	    # %rdx = &output[x * out_size_y + y]
	movl	%ebx, (%rdx)	    # output[x * out_size_y + y] = convolute;
	xorl    %ebx, %ebx          # convolute = 0
	addq	$1, %r13    	    # y++
.L6:
        movl    -44(%rbp), %eax     # %eax = out_size_y
        cltq
	cmpq	%rax, %r13  	    # y < out_size_y
	jl      .L11
	addq	$1, %r12    	    # x++
.L5:
        movl    -48(%rbp), %eax     # %eax = end
        cltq
	cmpq	%rax, %r12          # x < end
	jl	.L12
        popq %r15
        popq %r14
        popq %r13
	popq %r12
	popq %rbx
        movq %rbp, %rsp 
	popq %rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc

Let's compile and run the optimized code (I think that the code should perfom better than the previous one. We will see it...):

$ g++ -o conv -lpthread -Wall conv_main.cpp conv_pthread.cpp utility.cpp convolute_opt.s 
$ sudo perf stat -r 30 -e cache-references,cache-misses ./conv image.txt kernel.txt 0 2

 Performance counter stats for './conv image.txt kernel.txt 0 2' (30 runs):

         6.776.719      cache-references                                              ( +-  1,27% )
           470.948      cache-misses              #    6,794 % of all cache refs      ( +-  2,12% )

           0,41677 +- 0,00182 seconds time elapsed  ( +-  0,44% )

In the first simulation we obtained 1.768.478 cache misses. Now we have just obtained 470.948 cache misses: not so bad.

Scheduler anlisys with perf

Sometimes it can be worth spending some time figuring out how our multithread application exploits CPU resources (physical threads). The perf sched subcommand provides a number of tools for analyzing kernel CPU scheduler behaviour.

2 threads:

$ sudo perf sched record ./conv image.txt kernel.txt 0 2
$ sudo perf sched map > out.txt
$ cat out.txt

[...]
 
  *.   I0  .   M0  .   .   .   .    11821.565211 secs 
   .   I0  .   M0  .   .  *P0  .    11821.565417 secs P0 => budgie-wm:cs0:1567
   .   I0  .  *.   .   .   P0  .    11821.565427 secs 
   .  *Q0  .   .   .   .   P0  .    11821.565484 secs Q0 => gfx:850
   .   Q0  .  *M0  .   .   P0  .    11821.565495 secs 
   .   Q0  .   M0  .   .  *.   .    11821.565500 secs 
   .  *I0  .   M0  .   .   .   .    11821.565513 secs I0 => conv:18079
  *L0  I0  .   M0  .   .   .   .    11821.565515 secs 
   L0  I0  .  *.   .   .   .   .    11821.565522 secs 
   L0  I0  .   .   .   .  *K0  .    11821.565586 secs 
   L0  I0  .   .   .   .  *.   .    11821.565623 secs 
   L0  I0  .  *M0  .   .   .   .    11821.565627 secs 
  *.   I0  .   M0  .   .   .   .    11821.565635 secs 

[...]

   I0  .   .   M0  .  *.   .   .    11821.885617 secs 
   I0  .   .   M0  .  *L0  .   .    11821.885655 secs 
   I0  .   .   M0  .  *.   .   .    11821.885675 secs 
   I0  .   .  *.   .   .   .   .    11821.885677 secs 
   I0 *Q0  .   .   .   .   .   .    11821.885861 secs 
   I0 *.   .   .   .   .   .   .    11821.885864 secs 
   I0  .   .   .   .   .  *B2  .    11821.886138 secs B2 => conv:18081  // 1st thread: B2
   I0  .   .  *C2  .   .   B2  .    11821.886159 secs C2 => conv:18082  // 2nd thread: C2
  *.   .   .   C2  .   .   B2  .    11821.886160 secs 
   .   .   .   C2  .  *O0  B2  .    11821.888082 secs 
   .   .   .   C2  .  *.   B2  .    11821.888092 secs 
   .   .  *Z0  C2  .   .   B2  .    11821.891775 secs 
   .   .  *.   C2  .   .   B2  .    11821.891913 secs 
   .   .   .   C2  .  *O0  B2  .    11821.892076 secs 

[...]

   N0 *Q0  .   C2  .   K0  B2  .    11821.966168 secs 
   N0 *.   .   C2  .   K0  B2  .    11821.966180 secs 
   N0  .   .   C2  .  *.   B2  .    11821.966222 secs 
   N0  .   .   C2 *I0  .   B2  .    11821.966296 secs // 1st thread (B2) ends
   N0  .   .   C2  I0  .  *.   .    11821.966308 secs 
   N0  .   .   C2 *.   .   .   .    11821.966309 secs 
   N0  .   .   C2  .  *L0  .   .    11821.967038 secs 


[...]

N0  .   .   C2  .  *.   .   .    11821.983755 secs 
   N0  .   .   C2  .   .   .  *O0   11821.984131 secs 
   N0  .   .   C2  .   .   .  *.    11821.984139 secs 
   N0  .   .   C2 *I0  .   .   .    11821.984373 secs // 2nd thread (C2) ends
   N0  .   .  *.   I0  .   .   .    11821.984386 secs // thread main (I0) takes control
   N0  .   .   .   I0  .  *L0  .    11821.984508 secs 
  *.   .   .   .   I0  .   L0  .    11821.984522 secs 
   .   .  *S0  .   I0  .   L0  .    11821.984643 secs 

[...]

8 threads:

$ sudo perf sched record ./conv image.txt kernel.txt 0 8
$ sudo perf sched map > out.txt
$ cat out.txt
  *A0                                9861.654193 secs A0 => migration/0:16
  *.                                 9861.654217 secs .  => swapper:0
   .  *B0                            9861.654281 secs B0 => migration/1:22
   .  *.                             9861.654298 secs 
   .   .  *C0                        9861.654340 secs C0 => migration/2:28
   .   .  *.                         9861.654350 secs 
   .   .   .  *D0                    9861.654376 secs D0 => migration/3:34
   .   .   .  *.                     9861.654383 secs 
   .   .   .   .  *E0                9861.654483 secs E0 => migration/4:40
   .   .   .   .  *.                 9861.654509 secs 
   .   .   .   .   .  *F0            9861.654570 secs F0 => migration/5:46
   .   .   .   .   .  *.             9861.654585 secs 
   .   .   .   .   .   .  *G0        9861.654714 secs G0 => migration/6:52
   .   .   .   .   .   .  *.         9861.654740 secs 
   .   .   .   .   .   .  *H0        9861.654891 secs H0 => perf-exec:17223
   .   .   .   .   .   .  *G0        9861.654999 secs 
   .   .   .  *H0  .   .   G0        9861.655011 secs 
   .   .   .   H0  .   .  *.         9861.655019 secs 
   .   .   .   H0  .   .   .  *.     9861.655060 secs 
   .   .   .  *I0  .   .   .   .     9861.655565 secs I0 => ksoftirqd/3:35
   .   .   .  *H0  .   .   .   .     9861.655636 secs H0 => conv:17223	     // ./conv is starting
   .   .   .   H0  .  *J0  .   .     9861.659546 secs J0 => rcu_preempt:15
   .   .   .   H0  .  *.   .   .     9861.659561 secs 
   .   .   .   H0  .  *J0  .   .     9861.667561 secs 
   .   .   .   H0  .  *.   .   .     9861.667570 secs 
   .   .   .  *I0  .   .   .   .     9861.671564 secs 
   .   .   .  *H0  .   .   .   .     9861.671586 secs 
   .   .   .   H0  .  *J0  .   .     9861.675547 secs  

[...]

   .  *K0  .   H0  .   .   .   .     9861.917653 secs 
   .  *.   .   H0  .   .   .   .     9861.917709 secs 
   .   .   .   H0  .   .   .  *L0    9861.919026 secs 
   .  *K0  .   H0  .   .   .   L0    9861.919507 secs 
   .   K0  .   H0  .   .   .  *.     9861.919523 secs 
   .  *.   .   H0  .   .   .   .     9861.922049 secs 
   .   .   .   H0  .   .   .  *L0    9861.922671 secs 
   .   .   .   H0  .   .   .  *.     9861.922692 secs 
   .   .   .   H0  .  *S0  .   .     9861.964174 secs S0 => conv:17225 // 1st thread: S0
   .   .  *T0  H0  .   S0  .   .     9861.964222 secs T0 => conv:17228 // 2nd thread: T0
   .   .   T0  H0 *U0  S0  .   .     9861.964251 secs U0 => conv:17229 // 3rd thread: U0
   .   .   T0 *V0  U0  S0  .   .     9861.964324 secs V0 => conv:17232 // 4th thread: V0
  *W0  .   T0  V0  U0  S0  .   .     9861.964403 secs W0 => conv:17231 // 5th thread: W0
   W0 *X0  T0  V0  U0  S0  .   .     9861.964403 secs X0 => conv:17227 // 6th thread: X0
   W0  X0  T0  V0  U0  S0 *Y0  .     9861.964406 secs Y0 => conv:17230 // 7th thread: Y0
   W0  X0  T0  V0  U0  S0  Y0 *Z0    9861.964406 secs Z0 => conv:17226 // 8th thread: Z0
   W0  X0  T0  V0  U0 *J0  Y0  Z0    9861.967584 secs 
   W0  X0  T0  V0  U0 *S0  Y0  Z0    9861.967596 secs 
   W0  X0  T0  V0  U0 *J0  Y0  Z0    9861.975577 secs 
   W0  X0  T0  V0  U0 *S0  Y0  Z0    9861.975586 secs 
   W0  X0  T0  V0  U0 *J0  Y0  Z0    9861.987578 secs 
   W0  X0  T0  V0  U0 *S0  Y0  Z0    9861.987587 secs 
   W0  X0  T0 *.   U0  S0  Y0  Z0    9861.997128 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9861.997274 secs 
   W0  X0  .  *J0  U0  S0  Y0  Z0    9861.999576 secs 
   W0  X0  .  *.   U0  S0  Y0  Z0    9861.999581 secs 
   W0  X0 *A1  .   U0  S0  Y0  Z0    9862.003579 secs A1 => kcompactd0:70
  *O0  X0  A1  .   U0  S0  Y0  Z0    9862.003581 secs 
   O0  X0 *.   .   U0  S0  Y0  Z0    9862.003587 secs 
  *W0  X0  .   .   U0  S0  Y0  Z0    9862.003609 secs 
   W0  X0  .  *J0  U0  S0  Y0  Z0    9862.007567 secs 
   W0  X0  .  *.   U0  S0  Y0  Z0    9862.007570 secs 
   W0  X0 *K0  .   U0  S0  Y0  Z0    9862.010144 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9862.010214 secs 
   W0  X0  .  *J0  U0  S0  Y0  Z0    9862.011579 secs 
   W0  X0  .  *.   U0  S0  Y0  Z0    9862.011582 secs 
   W0  X0 *K0  .   U0  S0  Y0  Z0    9862.015262 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9862.015324 secs 
   W0  X0 *K0  .   U0  S0  Y0  Z0    9862.016376 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9862.016428 secs 
   W0  X0  .   .   U0  S0  Y0 *L0    9862.017400 secs 
   W0  X0 *M0  .   U0  S0  Y0  L0    9862.017437 secs 
   W0  X0 *.   .   U0  S0  Y0  L0    9862.017457 secs 
   W0  X0  .   .   U0  S0  Y0 *Z0    9862.017637 secs 
   W0  X0 *K0  .   U0  S0  Y0  Z0    9862.018477 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9862.018501 secs 
   W0  X0  .   .   U0  S0  Y0 *L0    9862.021156 secs 
   W0  X0 *K0  .   U0  S0  Y0  L0    9862.021201 secs 
   W0  X0 *.   .   U0  S0  Y0  L0    9862.021301 secs 
   W0  X0  .   .   U0  S0  Y0 *Z0    9862.021402 secs 
   W0  X0 *K0  .   U0  S0  Y0  Z0    9862.022346 secs 
   W0  X0 *.   .   U0  S0  Y0  Z0    9862.022366 secs 
  *A0  X0  .   .   U0  S0  Y0  Z0    9862.023581 secs 
   A0  X0 *W0  .   U0  S0  Y0  Z0    9862.023595 secs 
  *.   X0  W0  .   U0  S0  Y0  Z0    9862.023599 secs 
   .   X0  W0  .   U0  S0 *.   Z0    9862.024051 secs 
   .   X0  W0  .   U0  S0  .  *.     9862.024201 secs 
   .   X0  W0  .   U0  S0  .  *L0    9862.025187 secs 
  *N0  X0  W0  .   U0  S0  .   L0    9862.025245 secs 
  *K0  X0  W0  .   U0  S0  .   L0    9862.025308 secs 
  *.   X0  W0  .   U0  S0  .   L0    9862.025366 secs 
   .   X0  W0  .   U0  S0  .  *.     9862.025436 secs 
   .  *.   W0  .   U0  S0  .   .     9862.025601 secs 
   .   .   W0 *J0  U0  S0  .   .     9862.027581 secs 
   .   .   W0 *.   U0  S0  .   .     9862.027591 secs 
   .   .  *.   .   U0  S0  .   .     9862.028495 secs 
   .   .   .   .  *.   S0  .   .     9862.029336 secs 
   .   .   .  *H0  .   S0  .   .     9862.029388 secs // thread main (H0) take control
   .   .   .   H0  .  *.   .   .     9862.029389 secs // thread main (H0) ends
   .   .   .  *.   .   .   .   .     9862.032602 secs 
   .   .   .   .   .   .   .  *B1    9862.032604 secs B1 => perf:17220
   .   .   .   .   .   .   .  *C1    9862.032676 secs C1 => migration/7:58
  *B1  .   .   .   .   .   .   C1    9862.032690 secs 
   B1  .   .   .   .   .   .  *.     9862.032690 secs 
   B1 *B1  .   .   .   .   .   .     9862.032730 secs 
   B1  B1 *B1  .   .   .   .   .     9862.032758 secs 
   B1  B1  B1 *B1  .   .   .   .     9862.032782 secs 
   B1  B1  B1  B1 *B1  .   .   .     9862.032800 secs 
   B1  B1  B1  B1  B1 *B1  .   .     9862.032827 secs 
   B1  B1  B1  B1  B1  B1 *B1  .     9862.032850 secs 
   B1  B1  B1  B1  B1  B1  B1 *B1    9862.032882 secs 

A vertical dotted line means that the associated physical thread is idle. Basically we have just found that the application is using 7 physical threads.

About

Implementation and perfomance analisys of the conv2D algorithm in C++ and Cuda.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages