-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reworked faster VPXBoolReader #65
Conversation
@microsoft-github-policy-service agree |
Pretty cool... let me run the benchmark (I put in a CR to make perf tests more reproducible on machines with p-cores/e-cores). |
Can you merge the latest changes? I added a flag to run the threads at high priority so that they don't get randomly assigned to e-cores and give random perf results. Thanks! |
I've merged the changes and applied a quick fix to get it working on Linux. Performance on my machine (AMD Ryzen 9 5950x) ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/1.jpg
2024-04-19T12:49:55.810Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2604ms of CPU time in 2611ms of wall time
2024-04-19T12:49:55.821Z INFO [lepton_jpeg_util] Total CPU time consumed:5226ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/1.jpg':
420 831 657 cache-references (45,68%)
29 593 645 cache-misses # 7,03% of all cache refs (45,53%)
11 950 248 122 cycles (45,53%)
794 726 352 stalled-cycles-backend # 6,65% backend cycles idle (45,54%)
30 584 690 stalled-cycles-frontend # 0,26% frontend cycles idle (45,62%)
25 288 083 663 instructions # 2,12 insn per cycle
# 0,03 stalled cycles per insn (45,46%)
2 821 325 264 branch-instructions (45,46%)
122 789 527 branch-misses # 4,35% of all branches (45,60%)
4 364 043 009 ic_fetch_stall.ic_stall_any (45,58%)
24 492 492 l2_cache_misses_from_ic_miss (45,28%)
719 649 048 l2_latency.l2_cycles_waiting_on_fills (45,31%)
83 331 faults
1 migrations
2,670911983 seconds time elapsed
2,478538000 seconds user
0,176465000 seconds sys With ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/8.jpg
2024-04-19T17:31:38.017Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2407ms of CPU time in 2415ms of wall time
2024-04-19T17:31:38.027Z INFO [lepton_jpeg_util] Total CPU time consumed:4833ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/8.jpg':
398 284 458 cache-references (45,36%)
23 056 218 cache-misses # 5,79% of all cache refs (45,36%)
11 302 720 704 cycles (45,36%)
385 552 345 stalled-cycles-backend # 3,41% backend cycles idle (45,44%)
27 924 718 stalled-cycles-frontend # 0,25% frontend cycles idle (45,67%)
22 753 227 053 instructions # 2,01 insn per cycle
# 0,02 stalled cycles per insn (45,76%)
2 753 876 077 branch-instructions (45,76%)
122 616 534 branch-misses # 4,45% of all branches (45,76%)
4 253 811 562 ic_fetch_stall.ic_stall_any (45,62%)
17 812 156 l2_cache_misses_from_ic_miss (45,45%)
748 264 211 l2_latency.l2_cycles_waiting_on_fills (45,25%)
83 322 faults
1 migrations
2,470532743 seconds time elapsed
2,287802000 seconds user
0,179984000 seconds sys This PR ae72622 ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/6.jpg
2024-04-19T13:58:06.292Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2301ms of CPU time in 2309ms of wall time
2024-04-19T13:58:06.302Z INFO [lepton_jpeg_util] Total CPU time consumed:4621ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/6.jpg':
424 181 098 cache-references (45,14%)
38 771 629 cache-misses # 9,14% of all cache refs (45,31%)
10 708 303 804 cycles (45,48%)
974 902 117 stalled-cycles-backend # 9,10% backend cycles idle (45,74%)
19 834 697 stalled-cycles-frontend # 0,19% frontend cycles idle (46,00%)
25 080 659 902 instructions # 2,34 insn per cycle
# 0,04 stalled cycles per insn (46,09%)
2 470 568 447 branch-instructions (45,92%)
83 787 955 branch-misses # 3,39% of all branches (45,75%)
4 425 221 507 ic_fetch_stall.ic_stall_any (45,42%)
30 889 820 l2_cache_misses_from_ic_miss (45,08%)
775 685 423 l2_latency.l2_cycles_waiting_on_fills (44,86%)
83 331 faults
1 migrations
2,366150566 seconds time elapsed
2,187776000 seconds user
0,175982000 seconds sys With ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg
2024-04-19T17:29:30.160Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2130ms of CPU time in 2138ms of wall time
2024-04-19T17:29:30.170Z INFO [lepton_jpeg_util] Total CPU time consumed:4279ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg':
393 597 797 cache-references (44,87%)
24 348 442 cache-misses # 6,19% of all cache refs (45,23%)
10 027 485 596 cycles (45,41%)
507 383 544 stalled-cycles-backend # 5,06% backend cycles idle (45,61%)
31 006 974 stalled-cycles-frontend # 0,31% frontend cycles idle (45,90%)
22 609 551 128 instructions # 2,25 insn per cycle
# 0,02 stalled cycles per insn (46,03%)
2 402 630 001 branch-instructions (46,03%)
82 632 040 branch-misses # 3,44% of all branches (45,84%)
4 741 516 339 ic_fetch_stall.ic_stall_any (45,63%)
18 350 203 l2_cache_misses_from_ic_miss (45,27%)
783 604 112 l2_latency.l2_cycles_waiting_on_fills (44,90%)
83 321 faults
1 migrations
2,192710468 seconds time elapsed
2,007697000 seconds user
0,183972000 seconds sys |
Do you have performance results on Intel? I don't have one to check, unfortunately. |
What cpu are you running? The only CPU that the 0 check for leading_zeros helps is the really old intel ones.
Here’s arm64 for example produces pretty much the same code. The only difference is whether you subtract first, then shift or vice versa.
https://godbolt.org/z/hn5Ga8oj5
From: Ivan Siutsou ***@***.***>
Sent: Friday, April 19, 2024 7:36 PM
To: microsoft/lepton_jpeg_rust ***@***.***>
Cc: Kristof Roomp ***@***.***>; Review requested ***@***.***>
Subject: Re: [microsoft/lepton_jpeg_rust] Reworked faster VPXBoolReader (PR #65)
Pretty cool... let me run the benchmark (I put in a CR to make perf tests more reproducible on machines with p-cores/e-cores). Maybe @danielrh <https://github.com/danielrh> can have a look as well. Thanks!
Do you have performance results on Intel? I don't have one to check, unfortunately.
—
Reply to this email directly, view it on GitHub <#65 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE6EMMTNUHIFI2TAIELWK3LY6FIYZAVCNFSM6AAAAABGNVJUTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGAYDSNJZGA> .
You are receiving this because your review was requested. <https://github.com/notifications/beacon/AE6EMMTMDMMIEE74OXO5SV3Y6FIYZA5CNFSM6AAAAABGNVJUTKWGG33NNVSW45C7OR4XAZNMJFZXG5LFINXW23LFNZ2KUY3PNVWWK3TUL5UWJTT3GQIDM.gif> Message ID: ***@***.*** ***@***.***> >
|
I have AMD Ryzen 9 5950x.
Huh, the simple variant have one additional But it is strange that both variants are effectively serialized working only with one register. |
I can run on an intel box this evening
…On Fri, Apr 19, 2024 at 11:48 AM Ivan Siutsou ***@***.***> wrote:
I have AMD Ryzen 9 5950x.
—
Reply to this email directly, view it on GitHub
<#65 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAALRELPLFUO5627L336YMLY6FRHRAVCNFSM6AAAAABGNVJUTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGEYTIMBUG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
https://godbolt.org/z/K3oPeo16e Ryzen looks like it would benefit more because of the slow lzcnt instructions. Here's the new code (with target-feature=+lzcnt):
old code (with lzcnt):
|
Any results? |
Unfortunately I don't know where you got img_52MP_7k.lep...is that avail in the repo? should I run any old image? |
This is what I see for iphonecity.lep
|
No, this one is used by me for benchmarking - it is large and has many differently filled DCT blocks. Here it is |
ok with your image:
|
So on Intel we have nice performance improvement too, good. |
I think this change is great with two minor modifications:
Thanks! |
The first point is very much detrimental to performance: I got Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
392 835 432 cache-references (45,31%)
26 246 563 cache-misses # 6,68% of all cache refs (45,31%)
11 148 423 929 cycles (45,31%)
448 831 266 stalled-cycles-backend # 4,03% backend cycles idle (45,41%)
25 088 462 stalled-cycles-frontend # 0,23% frontend cycles idle (45,66%)
22 407 825 690 instructions # 2,01 insn per cycle
# 0,02 stalled cycles per insn (45,77%)
2 626 947 282 branch-instructions (45,77%)
126 178 314 branch-misses # 4,80% of all branches (45,77%)
3 955 491 492 ic_fetch_stall.ic_stall_any (45,71%)
20 961 529 l2_cache_misses_from_ic_miss (45,48%)
711 230 610 l2_latency.l2_cycles_waiting_on_fills (45,25%)
83 322 faults
1 migrations
2,466859191 seconds time elapsed
2,303568000 seconds user
0,159970000 seconds sys instead of previous ivan@ivan-5950:~/lepton_jpeg_rust$ sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg
2024-04-19T17:29:30.160Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2130ms of CPU time in 2138ms of wall time
2024-04-19T17:29:30.170Z INFO [lepton_jpeg_util] Total CPU time consumed:4279ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/7.jpg':
393 597 797 cache-references (44,87%)
24 348 442 cache-misses # 6,19% of all cache refs (45,23%)
10 027 485 596 cycles (45,41%)
507 383 544 stalled-cycles-backend # 5,06% backend cycles idle (45,61%)
31 006 974 stalled-cycles-frontend # 0,31% frontend cycles idle (45,90%)
22 609 551 128 instructions # 2,25 insn per cycle
# 0,02 stalled cycles per insn (46,03%)
2 402 630 001 branch-instructions (46,03%)
82 632 040 branch-misses # 3,44% of all branches (45,84%)
4 741 516 339 ic_fetch_stall.ic_stall_any (45,63%)
18 350 203 l2_cache_misses_from_ic_miss (45,27%)
783 604 112 l2_latency.l2_cycles_waiting_on_fills (44,90%)
83 321 faults
1 migrations
2,192710468 seconds time elapsed
2,007697000 seconds user
0,183972000 seconds sys The pb is that each instruction counts here, even branch that never executed. On the second I agree. |
Excluded unsafe assume, simplified split by comments of @mcroomp Assert commented out till the discussion result
Performance of ivan@ivan-5950:~/lepton_jpeg_rust$ sudo rm images/0.jpg; sudo perf stat -B -e cache-references,cache-misses,cycles,stalled-cycles-backend,stalled-cycles-frontend,instructions,branch-instructions,branch-misses,ic_fetch_stall.ic_stall_any,l2_cache_misses_from_ic_miss,l2_latency.l2_cycles_waiting_on_fills,faults,migrations taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg
2024-04-22T14:14:46.266Z INFO [lepton_jpeg_util::structs::lepton_format] worker threads 2164ms of CPU time in 2171ms of wall time
2024-04-22T14:14:46.276Z INFO [lepton_jpeg_util] Total CPU time consumed:4346ms
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
386 433 841 cache-references (45,26%)
28 699 537 cache-misses # 7,43% of all cache refs (45,44%)
10 075 442 518 cycles (45,62%)
439 770 884 stalled-cycles-backend # 4,36% backend cycles idle (45,87%)
20 099 023 stalled-cycles-frontend # 0,20% frontend cycles idle (46,02%)
23 715 600 325 instructions # 2,35 insn per cycle
# 0,02 stalled cycles per insn (45,95%)
2 947 789 446 branch-instructions (45,78%)
82 757 552 branch-misses # 2,81% of all branches (45,60%)
4 414 774 689 ic_fetch_stall.ic_stall_any (45,29%)
23 576 017 l2_cache_misses_from_ic_miss (45,06%)
697 260 755 l2_latency.l2_cycles_waiting_on_fills (44,98%)
83 320 faults
1 migrations
2,226758367 seconds time elapsed
2,051486000 seconds user
0,171956000 seconds sys |
With new assert Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
413 999 596 cache-references (45,34%)
26 783 155 cache-misses # 6,47% of all cache refs (45,54%)
9 057 700 647 cycles (45,74%)
501 142 117 stalled-cycles-backend # 5,53% backend cycles idle (45,87%)
19 111 073 stalled-cycles-frontend # 0,21% frontend cycles idle (45,98%)
22 724 329 790 instructions # 2,51 insn per cycle
# 0,02 stalled cycles per insn (45,92%)
2 689 040 589 branch-instructions (45,72%)
78 702 707 branch-misses # 2,93% of all branches (45,52%)
3 755 339 761 ic_fetch_stall.ic_stall_any (45,32%)
20 247 759 l2_cache_misses_from_ic_miss (45,13%)
851 064 970 l2_latency.l2_cycles_waiting_on_fills (45,02%)
83 322 faults
1 migrations
2,003679910 seconds time elapsed
1,831763000 seconds user
0,167978000 seconds sys HEAD 5ac1daf Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
390 126 953 cache-references (45,24%)
28 850 992 cache-misses # 7,40% of all cache refs (45,42%)
9 172 030 367 cycles (45,42%)
395 459 199 stalled-cycles-backend # 4,31% backend cycles idle (45,43%)
20 297 434 stalled-cycles-frontend # 0,22% frontend cycles idle (45,55%)
23 285 910 196 instructions # 2,54 insn per cycle
# 0,02 stalled cycles per insn (45,70%)
2 387 202 436 branch-instructions (45,76%)
72 726 804 branch-misses # 3,05% of all branches (45,76%)
4 172 462 359 ic_fetch_stall.ic_stall_any (45,75%)
23 132 818 l2_cache_misses_from_ic_miss (45,56%)
761 995 310 l2_latency.l2_cycles_waiting_on_fills (45,35%)
83 319 faults
1 migrations
2,034150060 seconds time elapsed
1,875813000 seconds user
0,155984000 seconds sys |
We should avoid unsafe in all cases, if possible. I’ve not encountered
something that’s more than a sub percent difference with unsafe. Usually
it’s about teaching the optimizer that all array indices can be checked
with a single branch or marking one side of a branch cold somehow. Asserts
or constrained types can help. (Eg use u32 but later upcast to u64 to
demonstrate that an index doesn’t wrap,if the data fits in 32 bit
addressing)
…On Mon, Apr 22, 2024 at 9:30 AM Ivan Siutsou ***@***.***> wrote:
With new assert +avx2+lzcnt 0c60819
<0c60819>
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
413 999 596 cache-references (45,34%)
26 783 155 cache-misses # 6,47% of all cache refs (45,54%)
9 057 700 647 cycles (45,74%)
501 142 117 stalled-cycles-backend # 5,53% backend cycles idle (45,87%)
19 111 073 stalled-cycles-frontend # 0,21% frontend cycles idle (45,98%)
22 724 329 790 instructions # 2,51 insn per cycle
# 0,02 stalled cycles per insn (45,92%)
2 689 040 589 branch-instructions (45,72%)
78 702 707 branch-misses # 2,93% of all branches (45,52%)
3 755 339 761 ic_fetch_stall.ic_stall_any (45,32%)
20 247 759 l2_cache_misses_from_ic_miss (45,13%)
851 064 970 l2_latency.l2_cycles_waiting_on_fills (45,02%)
83 322 faults
1 migrations
2,003679910 seconds time elapsed
1,831763000 seconds user
0,167978000 seconds sys
HEAD 5ac1daf
<5ac1daf>
Performance counter stats for 'taskset -c 10 nice -n -20 target/release/lepton_jpeg_util images/img_52MP_7k.lep images/0.jpg':
390 126 953 cache-references (45,24%)
28 850 992 cache-misses # 7,40% of all cache refs (45,42%)
9 172 030 367 cycles (45,42%)
395 459 199 stalled-cycles-backend # 4,31% backend cycles idle (45,43%)
20 297 434 stalled-cycles-frontend # 0,22% frontend cycles idle (45,55%)
23 285 910 196 instructions # 2,54 insn per cycle
# 0,02 stalled cycles per insn (45,70%)
2 387 202 436 branch-instructions (45,76%)
72 726 804 branch-misses # 3,05% of all branches (45,76%)
4 172 462 359 ic_fetch_stall.ic_stall_any (45,75%)
23 132 818 l2_cache_misses_from_ic_miss (45,56%)
761 995 310 l2_latency.l2_cycles_waiting_on_fills (45,35%)
83 319 faults
1 migrations
2,034150060 seconds time elapsed
1,875813000 seconds user
0,155984000 seconds sys
—
Reply to this email directly, view it on GitHub
<#65 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAALREKY2JFA6EXFADPFXYTY6U3JHAVCNFSM6AAAAABGNVJUTKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZQGEZTIMZYG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
VPXBoolReader
get
is one the hottest functions so every single operation reduction there gives a nice performance improvement. Using left-shiftedrange
we can get rid ofsplit
and subtraction inshift
, and presentedbig_split
calculation scheme reduces dependency chain length from 5 to 4. Then I have ~ 4% performance gain on decoding.