Always inline InterpCx::layout_of after perf regression#143334
Always inline InterpCx::layout_of after perf regression#143334Stypox wants to merge 3 commits intorust-lang:masterfrom
Conversation
|
That's not quite the right command -- and indeed you won't have permission. @bors2 try |
Always inline InterpCx::layout_of after perf regression Followup to #142721 to fix the performance regression. I ran one quick benchmark locally (`ctfe-stress-5`) and it does seem to be faster. I further tried adding `#[inline(always)]` to `compiler/rustc_middle/src/ty/layout.rs` in `layout_of()` under `LayoutOfHelpers` but that didn't change the benchmark results at all. `@rust-timer` build e7e3c9e (I'm not sure I have permission to do the above) r? `@RalfJung` <!-- homu-ignore:start --> <!-- If this PR is related to an unstable feature or an otherwise tracked effort, please link to the relevant tracking issue here. If you don't know of a related tracking issue or there are none, feel free to ignore this. This PR will get automatically assigned to a reviewer. In case you would like a specific user to review your work, you can assign it to them by using r? <reviewer name> --> <!-- homu-ignore:end -->
|
Argh... |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (1ab17c0): comparison URL. Overall result: ❌ regressions - no action neededBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary 2.3%, secondary 1.4%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary -0.6%, secondary 0.9%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 462.452s -> 461.112s (-0.29%) |
|
So, uh, somehow this actually makes things even worse...? |
|
It seems potentially non-trivial to optimize away the drop glue for this type ENTERED_TRACING_SPAN = ();
fn enter_tracing_span(tracing::Span) -> Self::ENTERED_TRACING_SPAN { () }The macro then simply expands to |
|
@rustbot author |
|
Reminder, once the PR becomes ready for a review, use |
I couldn't use this directly as "associated type defaults are unstable" and, most importantly, since trait implementations would be able to override fn enter_trace_span(_span: tracing::Span) -> impl EnteredTraceSpan { () }
trait EnteredTraceSpan {}
impl EnteredTraceSpan for () {}
impl EnteredTraceSpan for tracing::span::EnteredSpan {}I again ran a very quick benchmark and again the latest commit seems slightly faster than the one before on my device, but it obviously shouldn't be trusted, so please run another full benchmark. @rustbot ready |
Hopefully this will make tracing calls be optimized out properly when tracing is disabled
Ah, fair.
Interesting, I would have just removed the default. ;) |
|
@bors2 try |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (ce51ffb): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -3.9%, secondary -0.6%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (secondary -1.9%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -1.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 460.935s -> 461.093s (0.03%) |
|
Fun, so that does seem to help for ctfe-stress, but other benchmarks don't like it, and it doesn't fully negate the effect of #143334. Can you try adding The only other thing I can think of is an entirely different, closure-based API. But given that the regression only affects secondary benchmarks / stress tests, I don't think that's worth it. |
|
I added inline(always) to the machine hook and also opened #143520 as an alternative using a closure (I'm not sure it's what you had in mind but given the simple change I wanted to try it anyway). Could you run the benchmarks on both? Thanks! |
Fix perf regression caused by tracing See #143334, this is another alternative that may be worth benchmarking as suggested in #143334 (comment). r? `@RalfJung` <!-- homu-ignore:start --> <!-- If this PR is related to an unstable feature or an otherwise tracked effort, please link to the relevant tracking issue here. If you don't know of a related tracking issue or there are none, feel free to ignore this. This PR will get automatically assigned to a reviewer. In case you would like a specific user to review your work, you can assign it to them by using r? <reviewer name> --> <!-- homu-ignore:end -->
|
@bors2 try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Always inline InterpCx::layout_of after perf regression Followup to #142721 to fix the performance regression. I ran one quick benchmark locally (`ctfe-stress-5`) and it does seem to be faster. I further tried adding `#[inline(always)]` to `compiler/rustc_middle/src/ty/layout.rs` in `layout_of()` under `LayoutOfHelpers` but that didn't change the benchmark results at all. `@rust-timer` build e7e3c9e (I'm not sure I have permission to do the above) r? `@RalfJung` <!-- homu-ignore:start --> <!-- If this PR is related to an unstable feature or an otherwise tracked effort, please link to the relevant tracking issue here. If you don't know of a related tracking issue or there are none, feel free to ignore this. This PR will get automatically assigned to a reviewer. In case you would like a specific user to review your work, you can assign it to them by using r? <reviewer name> --> <!-- homu-ignore:end -->
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (e5441bd): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -1.6%, secondary -2.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary -1.2%, secondary 3.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 460.756s -> 461.178s (0.09%) |
|
Pretty much the same as the previous one. Should we land this or do you want to try the other closure variant in the other PR? |
|
☔ The latest upstream changes (presumably #143582) made this pull request unmergeable. Please resolve the merge conflicts. |
|
TLDR: I would close this PR and merge #143520 instead, as it should be at least as good as what we have now (i.e. the I did a bit of analysis based on the strings contained in the
The command I used to make the comparisons was, among others ($1 and $2 are the executables): diff <(strings -n 20 $1 | rustfilt | sort) <(strings -n 20 $2 | rustfilt | sort) > ${1}_vs_${2}.txtThese are the binaries I compared: https://github.com/Stypox/testing-apks/releases/download/19/miri-binaries-comparisons.zip |
|
I agree, let's close in favor of #143520. |
…=RalfJung Fix perf regression caused by tracing See rust-lang#143334, this is another alternative that may be worth benchmarking as suggested in rust-lang#143334 (comment). r? `@RalfJung`
…=RalfJung Fix perf regression caused by tracing See rust-lang#143334, this is another alternative that may be worth benchmarking as suggested in rust-lang#143334 (comment). r? ``@RalfJung``
Rollup merge of #143520 - Stypox:enter_trace_span-closure, r=RalfJung Fix perf regression caused by tracing See #143334, this is another alternative that may be worth benchmarking as suggested in #143334 (comment). r? ``@RalfJung``
Fix perf regression caused by tracing See rust-lang/rust#143334, this is another alternative that may be worth benchmarking as suggested in rust-lang/rust#143334 (comment). r? ``@RalfJung``
Followup to #142721 to fix the performance regression. I ran one quick benchmark locally (
ctfe-stress-5) and it does seem to be faster. I further tried adding#[inline(always)]tocompiler/rustc_middle/src/ty/layout.rsinlayout_of()underLayoutOfHelpersbut that didn't change the benchmark results at all.@rust-timer build e7e3c9e
(I'm not sure I have permission to do the above)
r? @RalfJung