Local stream state #101

Merged
merged 6 commits into from
Oct 19, 2024

Melirius
Collaborator

@Melirius Melirius commented Oct 17, 2024

Using local stream reader state, it is possible to reduce the number of reads and make branch prediction more successful.
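As a rough illustration of the idea (not the code from this PR), the sketch below copies a reader's fields into local variables, runs the hot loop on the locals, and writes them back once at the end. The `BitReader` type, its field names, and the plain MSB-first bit format are all assumptions made for the example; the real reader in this repository works differently.

```rust
/// Hypothetical bit reader used only for illustration; names and layout
/// are assumptions, not the types used in this repository.
pub struct BitReader<'a> {
    data: &'a [u8],
    pos: usize, // index of the next byte to load
    acc: u64,   // most recently loaded byte
    avail: u32, // number of unread bits left in `acc`
}

impl<'a> BitReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BitReader { data, pos: 0, acc: 0, avail: 0 }
    }

    /// Baseline: every bit read goes through the fields of `self`.
    pub fn read_bit(&mut self) -> u32 {
        if self.avail == 0 {
            // Past the end of the input, zero bytes are returned.
            self.acc = u64::from(self.data.get(self.pos).copied().unwrap_or(0));
            self.pos += 1;
            self.avail = 8;
        }
        self.avail -= 1;
        ((self.acc >> self.avail) & 1) as u32
    }

    /// Local-state variant: the state lives in locals for the whole loop
    /// and is stored back into `self` once. Assumes `n <= 32`.
    pub fn read_bits_local(&mut self, n: u32) -> u32 {
        let (mut acc, mut avail, mut pos) = (self.acc, self.avail, self.pos);
        let mut out = 0u32;
        for _ in 0..n {
            if avail == 0 {
                acc = u64::from(self.data.get(pos).copied().unwrap_or(0));
                pos += 1;
                avail = 8;
            }
            avail -= 1;
            out = (out << 1) | ((acc >> avail) & 1) as u32;
        }
        // Single write-back of the updated state.
        self.acc = acc;
        self.avail = avail;
        self.pos = pos;
        out
    }
}
```

Both methods return the same bits. Whether the local variant is actually faster depends on what the optimizer already does with the field accesses, so it needs to be measured, as in the perf runs in this thread.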

Base performance:

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/img_52MP_7k.jpg':

       444 051 506      cache-references                                                        (41,55%)
        41 665 486      cache-misses                     #    9,38% of all cache refs           (41,56%)
     8 113 956 517      cycles                                                                  (41,60%)
       440 862 436      ic_fetch_stall.ic_stall_back_pressure                                        (41,68%)
       457 817 718      stalled-cycles-frontend          #    5,64% frontend cycles idle        (41,90%)
    18 577 966 449      instructions                     #    2,29  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (41,91%)
     1 848 764 378      branch-instructions                                                     (41,88%)
        66 021 856      branch-misses                    #    3,57% of all branches             (41,83%)
     3 584 631 601      ic_fetch_stall.ic_stall_any                                             (41,74%)
        20 880 168      ic_fetch_stall.ic_stall_dq_empty                                        (41,66%)
        34 198 382      l2_cache_misses_from_ic_miss                                            (41,63%)
       875 464 756      l2_latency.l2_cycles_waiting_on_fills                                        (41,58%)
            99 041      faults                                                                
                 1      migrations                                                            

       1,789772123 seconds time elapsed

       1,595364000 seconds user
       0,194044000 seconds sys

MR performance:

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/img_52MP_7k.jpg':

       442 976 049      cache-references                                                        (41,92%)
        32 149 359      cache-misses                     #    7,26% of all cache refs           (41,79%)
     7 816 351 937      cycles                                                                  (41,71%)
       347 399 557      ic_fetch_stall.ic_stall_back_pressure                                        (41,65%)
       446 775 084      stalled-cycles-frontend          #    5,72% frontend cycles idle        (41,60%)
    18 219 535 389      instructions                     #    2,33  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (41,57%)
     1 901 228 430      branch-instructions                                                     (41,63%)
        67 440 269      branch-misses                    #    3,55% of all branches             (41,55%)
     3 245 622 864      ic_fetch_stall.ic_stall_any                                             (41,68%)
        20 712 859      ic_fetch_stall.ic_stall_dq_empty                                        (41,90%)
        25 226 527      l2_cache_misses_from_ic_miss                                            (41,91%)
       876 045 483      l2_latency.l2_cycles_waiting_on_fills                                        (41,91%)
            99 039      faults                                                                
                 1      migrations                                                            

       1,726217313 seconds time elapsed

       1,531874000 seconds user
       0,193984000 seconds sys

@Melirius Melirius requested a review from mcroomp October 17, 2024 23:06
@Melirius
Collaborator Author

I'll introduce the same scheme into the writer; let's see if it helps there.

@Melirius
Collaborator Author

With inlining it is even faster:

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/img_52MP_7k.jpg':

       460 519 848      cache-references                                                        (41,58%)
        32 673 291      cache-misses                     #    7,09% of all cache refs           (41,57%)
     7 616 518 496      cycles                                                                  (41,61%)
       416 529 485      ic_fetch_stall.ic_stall_back_pressure                                        (41,65%)
       465 646 089      stalled-cycles-frontend          #    6,11% frontend cycles idle        (41,72%)
    17 783 817 601      instructions                     #    2,33  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,75%)
     1 641 432 871      branch-instructions                                                     (41,81%)
        66 669 812      branch-misses                    #    4,06% of all branches             (41,91%)
     3 130 681 513      ic_fetch_stall.ic_stall_any                                             (41,85%)
        18 740 701      ic_fetch_stall.ic_stall_dq_empty                                        (41,74%)
        24 072 665      l2_cache_misses_from_ic_miss                                            (41,65%)
       937 944 473      l2_latency.l2_cycles_waiting_on_fills                                        (41,56%)
            99 040      faults                                                                
                 1      migrations                                                            

       1,684940568 seconds time elapsed

       1,487663000 seconds user
       0,196955000 seconds sys
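The thread does not say which functions were inlined or how. One common way to request inlining in Rust is the `#[inline(always)]` attribute, sketched below on a hypothetical helper; `next_bit` and `count_ones` are illustrative names, not functions from this repository.

```rust
/// Hypothetical hot helper. `#[inline(always)]` is a strong hint asking the
/// compiler to inline the function at its call sites.
#[inline(always)]
fn next_bit(byte: u8, index: u32) -> u32 {
    // Extract bit `index` of `byte`, counting from the most significant bit.
    u32::from((byte >> (7 - index)) & 1)
}

/// Calls the helper from a tight loop, which is where inlining tends to matter.
pub fn count_ones(data: &[u8]) -> u32 {
    let mut total = 0;
    for &byte in data {
        for index in 0..8 {
            total += next_bit(byte, index);
        }
    }
    total
}
```

The attribute is still a hint rather than a guarantee, and forced inlining can increase code size, so its effect is worth checking against counters such as the instruction-cache stalls shown above.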

@Melirius
Collaborator Author

Results for encoding.
Base:

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k2.lep':

       841 203 573      cache-references                                                        (41,95%)
        92 493 657      cache-misses                     #   11,00% of all cache refs           (42,07%)
    17 647 399 635      cycles                                                                  (41,94%)
       788 045 094      ic_fetch_stall.ic_stall_back_pressure                                        (41,80%)
     1 065 114 135      stalled-cycles-frontend          #    6,04% frontend cycles idle        (41,79%)
    43 791 165 868      instructions                     #    2,48  insn per cycle            
                                                  #    0,02  stalled cycles per insn     (41,74%)
     4 893 011 954      branch-instructions                                                     (41,70%)
       164 179 387      branch-misses                    #    3,36% of all branches             (41,66%)
     6 730 270 402      ic_fetch_stall.ic_stall_any                                             (41,62%)
        42 649 850      ic_fetch_stall.ic_stall_dq_empty                                        (41,73%)
        76 514 469      l2_cache_misses_from_ic_miss                                            (41,80%)
     1 878 302 906      l2_latency.l2_cycles_waiting_on_fills                                        (41,85%)
           182 917      faults                                                                
                 1      migrations                                                            

       3,929784849 seconds time elapsed

       3,602288000 seconds user
       0,325935000 seconds sys

Now:

 Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.jpg images/img_52MP_7k1.lep':

       854 138 732      cache-references                                                        (41,84%)
        84 858 736      cache-misses                     #    9,94% of all cache refs           (41,72%)
    16 311 102 949      cycles                                                                  (41,48%)
       730 305 471      ic_fetch_stall.ic_stall_back_pressure                                        (41,50%)
     1 052 505 605      stalled-cycles-frontend          #    6,45% frontend cycles idle        (41,81%)
    40 933 550 832      instructions                     #    2,51  insn per cycle            
                                                  #    0,03  stalled cycles per insn     (41,86%)
     4 452 381 966      branch-instructions                                                     (41,89%)
       166 099 799      branch-misses                    #    3,73% of all branches             (42,06%)
     5 633 254 959      ic_fetch_stall.ic_stall_any                                             (42,04%)
        39 811 150      ic_fetch_stall.ic_stall_dq_empty                                        (42,12%)
        71 293 887      l2_cache_misses_from_ic_miss                                            (41,98%)
     1 978 503 677      l2_latency.l2_cycles_waiting_on_fills                                        (41,88%)
           183 257      faults                                                                
                 1      migrations                                                            

       3,624653793 seconds time elapsed

       3,302782000 seconds user
       0,318786000 seconds sys
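To put the counters in relative terms, here is a small helper for the percent change between two runs. The function name is an assumption for illustration; the inputs are the `cycles` values copied from the perf output in this thread. These are single runs with multiplexed counters, so the percentages are approximate.

```rust
/// Relative change from `base` to `new`, in percent.
/// A negative result means the new run used fewer cycles.
pub fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}

// `cycles` counters copied from the perf runs in this thread.
pub const DECODE_BASE_CYCLES: f64 = 8_113_956_517.0;
pub const DECODE_MR_CYCLES: f64 = 7_816_351_937.0;
pub const DECODE_INLINED_CYCLES: f64 = 7_616_518_496.0;
pub const ENCODE_BASE_CYCLES: f64 = 17_647_399_635.0;
pub const ENCODE_NOW_CYCLES: f64 = 16_311_102_949.0;
```

With these inputs, decoding comes out at roughly 3.7% fewer cycles for the first version of the MR and roughly 6.1% fewer with inlining, and encoding at roughly 7.6% fewer cycles.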

@Melirius
Collaborator Author

In principle, the local stream state can be moved even higher in the call hierarchy, into model.rs, but the code would be less clear and I'm not sure it would give more performance.

@mcroomp mcroomp merged commit 7799db0 into main Oct 19, 2024
3 checks passed
@mcroomp mcroomp deleted the Local-stream-state branch October 19, 2024 17:46