Reworked faster VPXBoolReader #65

Merged (11 commits) on Apr 22, 2024

Conversation

Melirius
Collaborator

@Melirius Melirius commented Apr 18, 2024

VPXBoolReader get is one of the hottest functions, so every operation removed from it gives a measurable performance improvement. Using a left-shifted range, we can get rid of split and the subtraction in the shift calculation, and the presented big_split calculation scheme reduces the dependency chain length from 5 to 4. With these changes I get a ~4% performance gain on decoding.
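To illustrate the shift part of this idea, here is a minimal sketch. It assumes the classic reader keeps `range` as an 8-bit quantity in a `u32`, and that the reworked reader keeps the same quantity pre-shifted left by 24 bits. The function names are hypothetical and are not taken from the PR.

```rust
/// Classic layout: `range` is an 8-bit value held in a u32, so the
/// normalization shift needs the extra `- 24` correction.
fn shift_classic(range: u32) -> u32 {
    range.leading_zeros() - 24
}

/// Left-shifted layout (assumed here to be `range << 24`): the
/// leading-zero count is the normalization shift directly.
fn shift_left_aligned(range_shifted: u32) -> u32 {
    range_shifted.leading_zeros()
}

fn main() {
    // Compare both layouts for every non-zero 8-bit range value.
    for range in 1u32..=255 {
        assert_eq!(shift_classic(range), shift_left_aligned(range << 24));
    }
    println!("shifts agree for every 8-bit range");
}
```

With the range already aligned to the top of the register, the correction disappears from the dependency chain.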

@Melirius
Collaborator Author

@microsoft-github-policy-service agree

@Melirius Melirius requested a review from mcroomp April 18, 2024 17:13
@Melirius Melirius self-assigned this Apr 18, 2024
@mcroomp
Collaborator

mcroomp commented Apr 18, 2024

Pretty cool... let me run the benchmark (I put in a CR to make perf tests more reproducible on machines with p-cores/e-cores).
Maybe @danielrh can have a look as well. Thanks!

@mcroomp
Collaborator

mcroomp commented Apr 19, 2024

Can you merge the latest changes? I added a flag to run the threads at high priority so that they don't get randomly assigned to e-cores and give random perf results. Thanks!

@Melirius
Collaborator Author

Melirius commented Apr 19, 2024

I've merged the changes and applied a quick fix to get it working on Linux.

Performance on my machine (AMD Ryzen 9 5950x)
HEAD 5ac1daf

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/1.jpg
2024-04-19T12:49:55.810Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2604ms of CPU time in 2611ms of wall time
2024-04-19T12:49:55.821Z INFO  [lepton_jpeg_util] Total CPU time consumed:5226ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/1.jpg':

       420 831 657      cache-references                                                        (45,68%)
        29 593 645      cache-misses                     #    7,03% of all cache refs           (45,53%)
    11 950 248 122      cycles                                                                  (45,53%)
       794 726 352      stalled-cycles-backend           #    6,65% backend cycles idle         (45,54%)
        30 584 690      stalled-cycles-frontend          #    0,26% frontend cycles idle        (45,62%)
    25 288 083 663      instructions                     #    2,12  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (45,46%)
     2 821 325 264      branch-instructions                                                     (45,46%)
       122 789 527      branch-misses                    #    4,35% of all branches             (45,60%)
     4 364 043 009      ic_fetch_stall.ic_stall_any                                             (45,58%)
        24 492 492      l2_cache_misses_from_ic_miss                                            (45,28%)
       719 649 048      l2_latency.l2_cycles_waiting_on_fills                                        (45,31%)
            83 331      faults                                                                
                 1      migrations                                                            

       2,670911983 seconds time elapsed

       2,478538000 seconds user
       0,176465000 seconds sys

With +avx2

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/8.jpg
2024-04-19T17:31:38.017Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2407ms of CPU time in 2415ms of wall time
2024-04-19T17:31:38.027Z INFO  [lepton_jpeg_util] Total CPU time consumed:4833ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/8.jpg':

       398 284 458      cache-references                                                        (45,36%)
        23 056 218      cache-misses                     #    5,79% of all cache refs           (45,36%)
    11 302 720 704      cycles                                                                  (45,36%)
       385 552 345      stalled-cycles-backend           #    3,41% backend cycles idle         (45,44%)
        27 924 718      stalled-cycles-frontend          #    0,25% frontend cycles idle        (45,67%)
    22 753 227 053      instructions                     #    2,01  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,76%)
     2 753 876 077      branch-instructions                                                     (45,76%)
       122 616 534      branch-misses                    #    4,45% of all branches             (45,76%)
     4 253 811 562      ic_fetch_stall.ic_stall_any                                             (45,62%)
        17 812 156      l2_cache_misses_from_ic_miss                                            (45,45%)
       748 264 211      l2_latency.l2_cycles_waiting_on_fills                                        (45,25%)
            83 322      faults                                                                
                 1      migrations                                                            

       2,470532743 seconds time elapsed

       2,287802000 seconds user
       0,179984000 seconds sys

This PR ae72622

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/6.jpg
2024-04-19T13:58:06.292Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2301ms of CPU time in 2309ms of wall time
2024-04-19T13:58:06.302Z INFO  [lepton_jpeg_util] Total CPU time consumed:4621ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/6.jpg':

       424 181 098      cache-references                                                        (45,14%)
        38 771 629      cache-misses                     #    9,14% of all cache refs           (45,31%)
    10 708 303 804      cycles                                                                  (45,48%)
       974 902 117      stalled-cycles-backend           #    9,10% backend cycles idle         (45,74%)
        19 834 697      stalled-cycles-frontend          #    0,19% frontend cycles idle        (46,00%)
    25 080 659 902      instructions                     #    2,34  insn per cycle            
                                                  #    0,04  stalled cycles per insn     (46,09%)
     2 470 568 447      branch-instructions                                                     (45,92%)
        83 787 955      branch-misses                    #    3,39% of all branches             (45,75%)
     4 425 221 507      ic_fetch_stall.ic_stall_any                                             (45,42%)
        30 889 820      l2_cache_misses_from_ic_miss                                            (45,08%)
       775 685 423      l2_latency.l2_cycles_waiting_on_fills                                        (44,86%)
            83 331      faults                                                                
                 1      migrations                                                            

       2,366150566 seconds time elapsed

       2,187776000 seconds user
       0,175982000 seconds sys

With +avx2

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg
2024-04-19T17:29:30.160Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2130ms of CPU time in 2138ms of wall time
2024-04-19T17:29:30.170Z INFO  [lepton_jpeg_util] Total CPU time consumed:4279ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg':

       393 597 797      cache-references                                                        (44,87%)
        24 348 442      cache-misses                     #    6,19% of all cache refs           (45,23%)
    10 027 485 596      cycles                                                                  (45,41%)
       507 383 544      stalled-cycles-backend           #    5,06% backend cycles idle         (45,61%)
        31 006 974      stalled-cycles-frontend          #    0,31% frontend cycles idle        (45,90%)
    22 609 551 128      instructions                     #    2,25  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (46,03%)
     2 402 630 001      branch-instructions                                                     (46,03%)
        82 632 040      branch-misses                    #    3,44% of all branches             (45,84%)
     4 741 516 339      ic_fetch_stall.ic_stall_any                                             (45,63%)
        18 350 203      l2_cache_misses_from_ic_miss                                            (45,27%)
       783 604 112      l2_latency.l2_cycles_waiting_on_fills                                        (44,90%)
            83 321      faults                                                                
                 1      migrations                                                            

       2,192710468 seconds time elapsed

       2,007697000 seconds user
       0,183972000 seconds sys

@Melirius Melirius requested a review from mcroomp April 19, 2024 14:17
@Melirius
Collaborator Author

Pretty cool... let me run the benchmark (I put in a CR to make perf tests more reproducible on machines with p-cores/e-cores). Maybe @danielrh can have a look as well. Thanks!

Do you have performance results on Intel? I don't have one to check, unfortunately.

@mcroomp
Collaborator

mcroomp commented Apr 19, 2024 via email

@Melirius
Collaborator Author

Melirius commented Apr 19, 2024

I have AMD Ryzen 9 5950x.

arm64, for example, produces pretty much the same code. The only difference is whether you subtract first and then shift, or vice versa. https://godbolt.org/z/hn5Ga8oj5

Huh, the simple variant has one additional mov instruction; maybe that can make a difference. There is also some difference in the assembly between the variants for x64: https://godbolt.org/z/GK4r3rjjh

But it is strange that both variants are effectively serialized, working with only one register.

@danielrh

danielrh commented Apr 19, 2024 via email

@mcroomp
Collaborator

mcroomp commented Apr 20, 2024

@Melirius @danielrh

https://godbolt.org/z/K3oPeo16e

Ryzen looks like it would benefit more because of the slow lzcnt instructions.

Here's the new code (with target-feature=+lzcnt):

        mov     eax, edi
        xor     ecx, ecx
        shr     eax, 8
        add     eax, -65536
        imul    eax, esi
        and     eax, -16777216
        add     eax, 16777216
        sub     edi, eax
        cmp     eax, edx
        cmova   edi, eax
        cmovbe  ecx, eax
        lzcnt   esi, edi
        sub     edx, ecx
        shlx    eax, edx, esi
        shlx    edx, edi, esi

old code (with lzcnt):

        lea     eax, [rdi - 1]
        imul    eax, esi
        shr     eax, 8
        inc     eax
        mov     rcx, rax
        shl     rcx, 56
        sub     edi, eax
        lzcnt   esi, edi
        lzcnt   r8d, eax
        xor     r9d, r9d
        cmp     rcx, rdx
        cmovbe  r9, rcx
        cmova   edi, eax
        cmovbe  r8d, esi
        sub     rdx, r9
        add     r8b, -24
        shlx    rax, rdx, r8
        shlx    edx, edi, r8d

@Melirius
Collaborator Author

I can run on an intel box this evening

Any results?

@danielrh

I can run on an intel box this evening

Any results?

Unfortunately I don't know where you got img_52MP_7k.lep...is that avail in the repo? should I run any old image?

@danielrh

This is what I see for iphonecity.lep


git rev-parse HEAD
5ac1daf79e22e26d954da52fe8006c31ad21134d


2024-04-22T08:39:30.152Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 398ms of CPU time in 399ms of wall time
2024-04-22T08:39:30.153Z INFO  [lepton_jpeg_util] Total CPU time consumed:800ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/iphonecity.lep /tmp/xx':

         5,151,138      cache-references                                                        (83.25%)
           471,472      cache-misses                     #    9.15% of all cache refs           (83.29%)
     1,480,178,367      cycles                                                                  (66.56%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
     2,720,312,899      instructions                     #    1.84  insn per cycle              (83.28%)
       315,389,419      branch-instructions                                                     (83.27%)
        14,548,275      branch-misses                    #    4.61% of all branches             (83.78%)
             5,469      faults                                                                
                 1      migrations                                                            

       0.407268567 seconds time elapsed

       0.394556000 seconds user
       0.011956000 seconds sys



+avx2:

Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/iphonecity.lep /tmp/xx':

         5,643,394      cache-references                                                        (82.69%)
           340,949      cache-misses                     #    6.04% of all cache refs           (83.55%)
     1,437,397,096      cycles                                                                  (67.44%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
     2,553,670,210      instructions                     #    1.78  insn per cycle              (83.72%)
       312,979,313      branch-instructions                                                     (83.71%)
        13,952,614      branch-misses                    #    4.46% of all branches             (82.81%)
             5,460      faults                                                                
                 1      migrations                                                            

       0.393540966 seconds time elapsed

       0.384563000 seconds user
       0.008011000 seconds sys



=============================================

ae726227a7a0e0ac0527f2739def3fe5dcc8a179

2024-04-22T08:45:32.654Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 370ms of CPU time in 372ms of wall time
2024-04-22T08:45:32.656Z INFO  [lepton_jpeg_util] Total CPU time consumed:744ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/iphonecity.lep /tmp/xx':

         6,300,659      cache-references                                                        (83.09%)
           486,259      cache-misses                     #    7.72% of all cache refs           (83.14%)
     1,371,574,776      cycles                                                                  (66.26%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
     2,762,692,620      instructions                     #    2.01  insn per cycle              (83.14%)
       275,235,285      branch-instructions                                                     (84.16%)
         9,633,215      branch-misses                    #    3.50% of all branches             (83.56%)
             5,468      faults                                                                
                 1      migrations                                                            

       0.379980205 seconds time elapsed

       0.363041000 seconds user
       0.015957000 seconds sys



 +avx2:


2024-04-22T08:44:40.328Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 356ms of CPU time in 357ms of wall time
2024-04-22T08:44:40.329Z INFO  [lepton_jpeg_util] Total CPU time consumed:715ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/iphonecity.lep /tmp/xx':

         5,189,905      cache-references                                                        (83.10%)
           492,160      cache-misses                     #    9.48% of all cache refs           (83.56%)
     1,317,461,577      cycles                                                                  (67.12%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
     2,564,398,480      instructions                     #    1.95  insn per cycle              (83.56%)
       267,417,297      branch-instructions                                                     (83.56%)
         9,603,285      branch-misses                    #    3.59% of all branches             (82.87%)
             5,455      faults                                                                
                 1      migrations                                                            

       0.365565362 seconds time elapsed

       0.356594000 seconds user
       0.008013000 seconds sys

@Melirius
Collaborator Author

I can run on an intel box this evening

Any results?

Unfortunately I don't know where you got img_52MP_7k.lep...is that avail in the repo? should I run any old image?

No, this one is used by me for benchmarking - it is large and has many differently filled DCT blocks. Here it is
img_52MP_7k.zip

@danielrh

ok with your image:


git rev-parse HEAD
5ac1daf79e22e26d954da52fe8006c31ad21134d

2024-04-22T08:50:53.165Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 3619ms of CPU time in 3629ms of wall time
2024-04-22T08:50:53.176Z INFO  [lepton_jpeg_util] Total CPU time consumed:7259ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util img_52MP_7k.lep /tmp/xx':

        58,694,640      cache-references                                                        (83.35%)
         6,914,281      cache-misses                     #   11.78% of all cache refs           (83.34%)
    13,398,962,522      cycles                                                                  (66.54%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
    25,387,078,859      instructions                     #    1.89  insn per cycle              (83.24%)
     2,816,885,906      branch-instructions                                                     (83.38%)
       115,346,979      branch-misses                    #    4.09% of all branches             (83.44%)
            83,325      faults                                                                
                 1      migrations                                                            

       3.675375864 seconds time elapsed

       3.512820000 seconds user
       0.159855000 seconds sys


+avx2


2024-04-22T08:51:45.531Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 3521ms of CPU time in 3531ms of wall time
2024-04-22T08:51:45.542Z INFO  [lepton_jpeg_util] Total CPU time consumed:7065ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util img_52MP_7k.lep /tmp/xx':

        60,577,588      cache-references                                                        (83.34%)
         7,007,498      cache-misses                     #   11.57% of all cache refs           (83.34%)
    12,863,141,628      cycles                                                                  (66.48%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
    23,010,499,173      instructions                     #    1.79  insn per cycle              (83.24%)
     2,826,061,161      branch-instructions                                                     (83.44%)
       116,254,158      branch-misses                    #    4.11% of all branches             (83.41%)
            83,315      faults                                                                
                 1      migrations                                                            

       3.578881555 seconds time elapsed

       3.425915000 seconds user
       0.151907000 seconds sys




=============================================



2024-04-22T08:48:38.807Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 3420ms of CPU time in 3431ms of wall time
2024-04-22T08:48:38.819Z INFO  [lepton_jpeg_util] Total CPU time consumed:6864ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util img_52MP_7k.lep /tmp/xx':

        65,685,939      cache-references                                                        (83.32%)
         7,091,943      cache-misses                     #   10.80% of all cache refs           (83.32%)
    12,688,279,906      cycles                                                                  (66.52%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
    25,521,145,969      instructions                     #    2.01  insn per cycle              (83.29%)
     2,469,191,083      branch-instructions                                                     (83.43%)
        77,374,253      branch-misses                    #    3.13% of all branches             (83.42%)
            83,328      faults                                                                
                 1      migrations                                                            

       3.477645504 seconds time elapsed

       3.320785000 seconds user
       0.155849000 seconds sys

 +avx2:

2024-04-22T08:49:42.679Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 3263ms of CPU time in 3273ms of wall time
2024-04-22T08:49:42.691Z INFO  [lepton_jpeg_util] Total CPU time consumed:6549ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util img_52MP_7k.lep /tmp/xx':

        59,461,447      cache-references                                                        (83.37%)
         7,436,967      cache-misses                     #   12.51% of all cache refs           (83.28%)
    11,906,332,086      cycles                                                                  (66.61%)
   <not supported>      stalled-cycles-backend                                                
   <not supported>      stalled-cycles-frontend                                               
    23,146,550,899      instructions                     #    1.94  insn per cycle              (83.36%)
     2,476,227,430      branch-instructions                                                     (83.38%)
        77,897,650      branch-misses                    #    3.15% of all branches             (83.38%)
            83,319      faults                                                                
                 1      migrations                                                            

       3.320737303 seconds time elapsed

       3.183607000 seconds user
       0.135983000 seconds sys

@Melirius
Collaborator Author

So we get a nice performance improvement on Intel too, good.
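For a rough sense of scale, the relative reduction in cycles can be computed from the perf counters quoted in this thread. This is only a sketch: the helper name is hypothetical, the figures are copied from the img_52MP_7k.lep runs, and the second set of Intel numbers is assumed to belong to ae72622, since that block carries no hash.

```rust
/// Percentage reduction going from `before` to `after` cycles.
fn cycle_reduction_percent(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    // (label, cycles at 5ac1daf, cycles at ae72622), img_52MP_7k.lep runs.
    // The Intel "after" figures are assumed to be from ae72622.
    let runs = [
        ("Intel", 13_398_962_522u64, 12_688_279_906u64),
        ("Intel +avx2", 12_863_141_628, 11_906_332_086),
        ("Ryzen 9 5950X", 11_950_248_122, 10_708_303_804),
        ("Ryzen 9 5950X +avx2", 11_302_720_704, 10_027_485_596),
    ];
    for (label, before, after) in runs {
        println!("{label}: {:.1}% fewer cycles", cycle_reduction_percent(before, after));
    }
}
```

These are single runs with multiplexed counters, so the percentages are indicative rather than precise.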

@mcroomp
Collaborator

mcroomp commented Apr 22, 2024

I think this change is great with two minor modifications:

  • Remove the unsafe for now: right now the library has no unsafe code, and the only real benefit is for CPUs that are more than 11 years old. If we add the following code right after the split calculation, it does a cold jump that never gets executed, so the branch predictor pretty much ignores it.
        // so optimizer understands that 0 should never happen and uses a cold jump
        // if we don't have LZCNT on x86 CPUs (older BSR instruction requires check for zero).
        // This is better since the branch prediction figures quickly this never happens and can run
        // the code sequentially.
        #[cfg(all(
            not(target_feature = "lzcnt"),
            any(target_arch = "x86", target_arch = "x86_64")
        ))]
        assert!(split < tmp_range);
  • I'd prefer:
    let split = ((((tmp_range - 0x1000000) >> 8) * probability) & 0xff000000) + 0x1000000;
    since this is what the compiler ends up emitting, and it makes it clearer, compared to the vanilla code, that we are doing exactly what we did before but with tmp_range << 24.
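The claimed equivalence can be checked exhaustively. The sketch below assumes `tmp_range` is an 8-bit range in 128..=255 shifted left by 24 bits with the low bits zero, and that the vanilla split is `(((range - 1) * probability) >> 8) + 1`, as read off the old assembly listing above. The function names are hypothetical.

```rust
/// Vanilla split on an 8-bit `range` (128..=255) and 8-bit `probability`.
fn split_classic(range: u32, probability: u32) -> u32 {
    (((range - 1) * probability) >> 8) + 1
}

/// Suggested form on `tmp_range`, assumed to be `range << 24`.
fn split_shifted(tmp_range: u32, probability: u32) -> u32 {
    ((((tmp_range - 0x1000000) >> 8) * probability) & 0xff000000) + 0x1000000
}

fn main() {
    // Every valid (range, probability) pair must give the same split,
    // up to the 24-bit left shift.
    for range in 128u32..=255 {
        for probability in 1u32..=255 {
            let expected = split_classic(range, probability) << 24;
            assert_eq!(split_shifted(range << 24, probability), expected);
        }
    }
    println!("both forms agree");
}
```

The intermediate product stays below 2^32 for these inputs, so the `u32` arithmetic does not overflow.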

Thanks!

@Melirius
Collaborator Author

Melirius commented Apr 22, 2024

I think this change is great with two minor modifications:
* Remove the unsafe for now, right now the library has no unsafe code and since the only real benefit is for CPUs that are more than 11 years old. If we add the following code right after the split calculation, it does a cold jump that never gets executed and so the branch predictor pretty much ignores it.
...
* I'd prefer:
let split = ((((tmp_range - 0x1000000) >> 8) * probability) & 0xff000000) + 0x1000000;
since this is what the compiler ends up emitting and it's clearer compared to the vanilla code that what we are doing is exactly what we did before but with tmp_range << 24.

Thanks!

The first point is very much detrimental to performance: I got

Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       392 835 432      cache-references                                                        (45,31%)
        26 246 563      cache-misses                     #    6,68% of all cache refs           (45,31%)
    11 148 423 929      cycles                                                                  (45,31%)
       448 831 266      stalled-cycles-backend           #    4,03% backend cycles idle         (45,41%)
        25 088 462      stalled-cycles-frontend          #    0,23% frontend cycles idle        (45,66%)
    22 407 825 690      instructions                     #    2,01  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,77%)
     2 626 947 282      branch-instructions                                                     (45,77%)
       126 178 314      branch-misses                    #    4,80% of all branches             (45,77%)
     3 955 491 492      ic_fetch_stall.ic_stall_any                                             (45,71%)
        20 961 529      l2_cache_misses_from_ic_miss                                            (45,48%)
       711 230 610      l2_latency.l2_cycles_waiting_on_fills                                        (45,25%)
            83 322      faults                                                                
                 1      migrations                                                            

       2,466859191 seconds time elapsed

       2,303568000 seconds user
       0,159970000 seconds sys

instead of the previous

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg
2024-04-19T17:29:30.160Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2130ms of CPU time in 2138ms of wall time
2024-04-19T17:29:30.170Z INFO  [lepton_jpeg_util] Total CPU time consumed:4279ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg':

       393 597 797      cache-references                                                        (44,87%)
        24 348 442      cache-misses                     #    6,19% of all cache refs           (45,23%)
    10 027 485 596      cycles                                                                  (45,41%)
       507 383 544      stalled-cycles-backend           #    5,06% backend cycles idle         (45,61%)
        31 006 974      stalled-cycles-frontend          #    0,31% frontend cycles idle        (45,90%)
    22 609 551 128      instructions                     #    2,25  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (46,03%)
     2 402 630 001      branch-instructions                                                     (46,03%)
        82 632 040      branch-misses                    #    3,44% of all branches             (45,84%)
     4 741 516 339      ic_fetch_stall.ic_stall_any                                             (45,63%)
        18 350 203      l2_cache_misses_from_ic_miss                                            (45,27%)
       783 604 112      l2_latency.l2_cycles_waiting_on_fills                                        (44,90%)
            83 321      faults                                                                
                 1      migrations                                                            

       2,192710468 seconds time elapsed

       2,007697000 seconds user
       0,183972000 seconds sys

The problem is that every instruction counts here, even a branch that is never executed.
Maybe use debug_assert then?
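A minimal sketch of what the `debug_assert` variant could look like, assuming the simplified split formula from the review comment. The function name is hypothetical. `debug_assert!` is only evaluated when debug assertions are enabled, which by default is not the case for optimized builds, so the release hot path would carry no check.

```rust
/// Hypothetical helper using the suggested split formula, with the
/// invariant checked via `debug_assert!` instead of `assert!`.
fn split_checked(tmp_range: u32, probability: u32) -> u32 {
    let split = ((((tmp_range - 0x1000000) >> 8) * probability) & 0xff000000) + 0x1000000;
    // Evaluated only when debug assertions are enabled.
    debug_assert!(split < tmp_range);
    split
}

fn main() {
    let split = split_checked(0xff00_0000, 128);
    println!("split = {split:#010x}");
}
```

The trade-off is that, without the check, the optimizer gets no hint in release builds that the value passed to the leading-zero count is non-zero.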

On the second I agree.

Excluded unsafe assume, simplified split by comments of @mcroomp
Assert commented out till the discussion result
@Melirius
Collaborator Author

Melirius commented Apr 22, 2024

Performance of +avx2 version of 8c841d8 with excluded assert

ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/0.jpg; sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg
2024-04-22T14:14:46.266Z INFO  [lepton_jpeg_util::structs::lepton_format] worker threads 2164ms of CPU time in 2171ms of wall time
2024-04-22T14:14:46.276Z INFO  [lepton_jpeg_util] Total CPU time consumed:4346ms

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       386 433 841      cache-references                                                        (45,26%)
        28 699 537      cache-misses                     #    7,43% of all cache refs           (45,44%)
    10 075 442 518      cycles                                                                  (45,62%)
       439 770 884      stalled-cycles-backend           #    4,36% backend cycles idle         (45,87%)
        20 099 023      stalled-cycles-frontend          #    0,20% frontend cycles idle        (46,02%)
    23 715 600 325      instructions                     #    2,35  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,95%)
     2 947 789 446      branch-instructions                                                     (45,78%)
        82 757 552      branch-misses                    #    2,81% of all branches             (45,60%)
     4 414 774 689      ic_fetch_stall.ic_stall_any                                             (45,29%)
        23 576 017      l2_cache_misses_from_ic_miss                                            (45,06%)
       697 260 755      l2_latency.l2_cycles_waiting_on_fills                                        (44,98%)
            83 320      faults                                                                
                 1      migrations                                                            

       2,226758367 seconds time elapsed

       2,051486000 seconds user
       0,171956000 seconds sys

@Melirius Melirius requested a review from mcroomp April 22, 2024 16:23
@Melirius
Collaborator Author

With new assert +avx2+lzcnt 0c60819

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       413 999 596      cache-references                                                        (45,34%)
        26 783 155      cache-misses                     #    6,47% of all cache refs           (45,54%)
     9 057 700 647      cycles                                                                  (45,74%)
       501 142 117      stalled-cycles-backend           #    5,53% backend cycles idle         (45,87%)
        19 111 073      stalled-cycles-frontend          #    0,21% frontend cycles idle        (45,98%)
    22 724 329 790      instructions                     #    2,51  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,92%)
     2 689 040 589      branch-instructions                                                     (45,72%)
        78 702 707      branch-misses                    #    2,93% of all branches             (45,52%)
     3 755 339 761      ic_fetch_stall.ic_stall_any                                             (45,32%)
        20 247 759      l2_cache_misses_from_ic_miss                                            (45,13%)
       851 064 970      l2_latency.l2_cycles_waiting_on_fills                                        (45,02%)
            83 322      faults                                                                
                 1      migrations                                                            

       2,003679910 seconds time elapsed

       1,831763000 seconds user
       0,167978000 seconds sys

HEAD 5ac1daf

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':

       390 126 953      cache-references                                                        (45,24%)
        28 850 992      cache-misses                     #    7,40% of all cache refs           (45,42%)
     9 172 030 367      cycles                                                                  (45,42%)
       395 459 199      stalled-cycles-backend           #    4,31% backend cycles idle         (45,43%)
        20 297 434      stalled-cycles-frontend          #    0,22% frontend cycles idle        (45,55%)
    23 285 910 196      instructions                     #    2,54  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (45,70%)
     2 387 202 436      branch-instructions                                                     (45,76%)
        72 726 804      branch-misses                    #    3,05% of all branches             (45,76%)
     4 172 462 359      ic_fetch_stall.ic_stall_any                                             (45,75%)
        23 132 818      l2_cache_misses_from_ic_miss                                            (45,56%)
       761 995 310      l2_latency.l2_cycles_waiting_on_fills                                        (45,35%)
            83 319      faults                                                                
                 1      migrations                                                            

       2,034150060 seconds time elapsed

       1,875813000 seconds user
       0,155984000 seconds sys
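For comparing the two runs above (0c60819 against HEAD 5ac1daf), a small sketch of the arithmetic may help. The helper `percent_reduction` is a hypothetical name introduced here; the counter values are copied from the output above. Since these are single runs with multiplexed counters (about 45% sampling), the results are rough indications only.

```rust
/// Percentage by which `new` is lower than `base` (positive means fewer).
/// Hypothetical helper for illustration only.
fn percent_reduction(base: f64, new: f64) -> f64 {
    (base - new) / base * 100.0
}

fn main() {
    // Values copied from the two `perf stat` runs above.
    let head_cycles = 9_172_030_367.0;
    let pr_cycles = 9_057_700_647.0;
    let head_instructions = 23_285_910_196.0;
    let pr_instructions = 22_724_329_790.0;

    println!(
        "cycles: {:.2}% fewer",
        percent_reduction(head_cycles, pr_cycles)
    );
    println!(
        "instructions: {:.2}% fewer",
        percent_reduction(head_instructions, pr_instructions)
    );
}
```

On these numbers the branch uses roughly 1.2% fewer cycles and 2.4% fewer instructions than HEAD.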

@danielrh

danielrh commented Apr 22, 2024 via email

@mcroomp mcroomp merged commit 949957e into microsoft:main Apr 22, 2024
2 checks passed
4 participants