Enable full LTO on Python 3.12 and 3.13 #529

Closed

@zanieb zanieb commented Feb 13, 2025

Closes #528

@@ -427,7 +427,11 @@ if [ -n "${CPYTHON_OPTIMIZED}" ]; then
fi

if [ -n "${CPYTHON_LTO}" ]; then
    CONFIGURE_FLAGS="${CONFIGURE_FLAGS} --with-lto"
    # On Python 3.12 and 3.13, `--with-lto` enables ThinLTO by default, while on other versions it

What about 3.14? Did it change in 3.14?

3.14a5 and earlier still default to ThinLTO. 3.14a6 (upcoming) will default back to full LTO.

It's safe to just pass --with-lto=full to all versions of 3.14. That will always work.

Yes, the linked documentation says they're reverting the ThinLTO default in 3.14.

Seems best to just set it on all versions.
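
As a rough illustration of "set it on all versions", here is a minimal sketch that appends an explicit LTO mode whenever `CPYTHON_LTO` is set. `CPYTHON_LTO` comes from the diff above and `--with-lto=full` from the review comment; the `add_lto_flag` helper is hypothetical and not part of the build script.

```shell
# Hypothetical helper (not from the build script): append an explicit LTO mode
# to a string of configure flags when CPYTHON_LTO is set.
add_lto_flag() {
    flags="$1"
    if [ -n "${CPYTHON_LTO:-}" ]; then
        # Request full LTO explicitly instead of relying on the version-dependent
        # default of a bare `--with-lto`.
        flags="${flags} --with-lto=full"
    fi
    printf '%s\n' "${flags}"
}

# Example: CPYTHON_LTO=1; add_lto_flag "--enable-optimizations"
```

Passing the mode explicitly avoids any dependence on which default a given CPython version picks.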

zanieb commented Feb 13, 2025

I'll try to do a brief analysis of the performance and build time trade-offs here.

zanieb commented Feb 13, 2025

Total job time (main / branch)

  • Linux 34m / 37m
  • macOS 1h 13m / 1h 3m

Target job time (main / branch)

  • aarch64-apple-darwin / 3.12 / pgo+lto 15m / 16m
  • aarch64-apple-darwin / 3.13 / pgo+lto 14m / 16m
  • aarch64-apple-darwin / 3.14 / pgo+lto 15m / 16m
  • x86_64-apple-darwin / 3.12 / pgo+lto 15m / 17m
  • x86_64-apple-darwin / 3.13 / pgo+lto 16m / 19m
  • x86_64-apple-darwin / 3.14 / pgo+lto 16m / 18m
  • x86_64-unknown-linux-gnu / 3.13 / pgo+lto 28m / 28m
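
To put a single number on the per-target overhead, the pairs above can be averaged. This is only a sketch: `mean_delta_minutes` is a hypothetical helper, and the input is the seven (main, branch) minute pairs copied from the list.

```shell
# Hypothetical helper: read "main branch" minute pairs on stdin and print the
# mean of (branch - main) to one decimal place.
mean_delta_minutes() {
    awk '{ sum += $2 - $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Per-target pgo+lto job times (main, branch), copied from the list above.
printf '%s\n' "15 16" "14 16" "15 16" "15 17" "16 19" "16 18" "28 28" | mean_delta_minutes
```

With these inputs it prints `1.6`, i.e. roughly a minute and a half of extra build time per target on average, almost all of it on the macOS targets.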

zanieb commented Feb 13, 2025

Surprisingly, the build from this branch is consistently a bit slower than the uv-installed binary on the pystones benchmark used for #524 on aarch64-apple-darwin:

❯ hyperfine "$(uv python find 3.12.9) ./pystones.py" "./python/install/bin/python ./pystones.py"
Benchmark 1: /Users/zb/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/bin/python3.12 ./pystones.py
  Time (mean ± σ):      75.1 ms ±   1.0 ms    [User: 72.3 ms, System: 2.1 ms]
  Range (min … max):    73.2 ms …  76.6 ms    39 runs
 
Benchmark 2: ./python/install/bin/python ./pystones.py
  Time (mean ± σ):      79.2 ms ±   1.0 ms    [User: 76.0 ms, System: 2.5 ms]
  Range (min … max):    77.6 ms …  81.7 ms    36 runs
 
Summary
  /Users/zb/.local/share/uv/python/cpython-3.12.9-macos-aarch64-none/bin/python3.12 ./pystones.py ran
    1.05 ± 0.02 times faster than ./python/install/bin/python ./pystones.py
❯ hyperfine "$(uv python find 3.13.2) ./pystones.py" "./python/install/bin/python3.13 ./pystones.py"
Benchmark 1: /Users/zb/workspace/python-build-standalone/.venv/bin/python3 ./pystones.py
  Time (mean ± σ):      75.7 ms ±   2.3 ms    [User: 72.4 ms, System: 2.4 ms]
  Range (min … max):    72.6 ms …  84.6 ms    39 runs
 
Benchmark 2: ./python/install/bin/python3.13 ./pystones.py
  Time (mean ± σ):      80.6 ms ±   2.4 ms    [User: 76.5 ms, System: 3.0 ms]
  Range (min … max):    77.1 ms …  88.4 ms    34 runs
 
Summary
  /Users/zb/workspace/python-build-standalone/.venv/bin/python3 ./pystones.py ran
    1.06 ± 0.05 times faster than ./python/install/bin/python3.13 ./pystones.py
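
The "times faster" figures in the summaries are the ratio of the two reported means. A minimal sketch that reproduces them from the numbers above (`slowdown` is a hypothetical helper, not a hyperfine feature):

```shell
# Hypothetical helper: print candidate_mean / baseline_mean to two decimals.
# $1: baseline mean, $2: candidate mean (same unit).
slowdown() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", cand / base }'
}

slowdown 75.1 79.2   # 3.12 run: uv-installed binary vs. branch build
slowdown 75.7 80.6   # 3.13 run
```

These print `1.05` and `1.06`, matching the two hyperfine summaries.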

zanieb commented Feb 13, 2025

The gap is less pronounced (but still present?) when I compare against an artifact from main instead of a uv-installed binary:

❯ hyperfine "./main/install/bin/python3.13 ./pystones.py" "./branch/install/bin/python3.13 ./pystones.py" --min-runs 100
Benchmark 1: ./main/install/bin/python3.13 ./pystones.py
  Time (mean ± σ):      77.4 ms ±   1.1 ms    [User: 74.5 ms, System: 2.3 ms]
  Range (min … max):    75.1 ms …  80.4 ms    100 runs
 
Benchmark 2: ./branch/install/bin/python3.13 ./pystones.py
  Time (mean ± σ):      79.2 ms ±   1.1 ms    [User: 76.3 ms, System: 2.3 ms]
  Range (min … max):    77.4 ms …  85.0 ms    100 runs
 
Summary
  ./main/install/bin/python3.13 ./pystones.py ran
    1.02 ± 0.02 times faster than ./branch/install/bin/python3.13 ./pystones.py
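
For repeating this main-versus-branch comparison, a small wrapper can keep the invocation consistent. This is a sketch under assumptions: `compare_builds` and the `DRY_RUN` switch are hypothetical, and the `--warmup 3` setting is an illustrative choice that was not used in the run above; `--min-runs 100` mirrors the command shown.

```shell
# Hypothetical wrapper around hyperfine.
# $1: interpreter from main, $2: interpreter from the branch, $3: benchmark script.
# With DRY_RUN set, the command is printed instead of executed.
compare_builds() {
    # The positional parameters are expanded before `set --` replaces them.
    set -- hyperfine --warmup 3 --min-runs 100 "$1 $3" "$2 $3"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example:
#   compare_builds ./main/install/bin/python3.13 ./branch/install/bin/python3.13 ./pystones.py
```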

zanieb commented Feb 13, 2025

I'll find another benchmark to try.

The summary at python/cpython#122580 (comment) is relatively compelling.

zanieb commented Feb 13, 2025

With the script from python/cpython#122580 (comment), which was the original reproduction for a regression, I also don't see an improvement:

❯ hyperfine "./main/install/bin/python3.13 ./repro.py" "./branch/install/bin/python3.13 ./repro.py"
Benchmark 1: ./main/install/bin/python3.13 ./repro.py
  Time (mean ± σ):      5.926 s ±  0.048 s    [User: 5.900 s, System: 0.021 s]
  Range (min … max):    5.833 s …  5.975 s    10 runs
 
Benchmark 2: ./branch/install/bin/python3.13 ./repro.py
  Time (mean ± σ):      5.988 s ±  0.035 s    [User: 5.962 s, System: 0.018 s]
  Range (min … max):    5.927 s …  6.050 s    10 runs
 
Summary
  ./main/install/bin/python3.13 ./repro.py ran
    1.01 ± 0.01 times faster than ./branch/install/bin/python3.13 ./repro.py

@Fidget-Spinner

@zanieb It might be that this was a bug in Apple Clang that is already fixed in upstream Clang. The version of Clang Ned used was clang-1500, but knowing Apple, that isn't actually Clang 15 but some older internal version. The Clang version here is pretty new in comparison.

So I'm sorry for the noise if that turns out to be the case.
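
When comparing results across toolchains, it can help to record which Clang flavor produced a build. A minimal sketch, assuming the first line of `clang --version` begins with `Apple clang version` on Apple's toolchain and contains `clang version` elsewhere; `clang_vendor` is a hypothetical helper.

```shell
# Hypothetical helper: classify the first line of `clang --version` output.
clang_vendor() {
    case "$1" in
        "Apple clang version"*) echo apple ;;
        *"clang version"*)      echo other ;;
        *)                      echo unknown ;;
    esac
}

# Example: clang_vendor "$(clang --version | head -n 1)"
```

The version number printed by Apple's toolchain follows Apple's own scheme, so it should not be compared directly with upstream release numbers.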

@Fidget-Spinner

I think my guess is probably right. I will close the issue as the build time tradeoff is not worth it. So sorry for the noise!

zanieb commented Feb 13, 2025

Does this mean I want to explicitly request thin on 3.14 now? :)

@Fidget-Spinner

> Does this mean I want to explicitly request thin on 3.14 now? :)

I think I will revert the change upstream. It's probably somewhat disruptive now anyways to flip-flop. Thanks.

@zanieb zanieb closed this Feb 13, 2025

zanieb commented Feb 25, 2025

I'm going to explore this again in the context of #535 and #539 — I think thin LTO is one of the few configuration differences between our build and the conda-forge one.

I'll test some of the binaries here with the pyperformance suite.

@zanieb zanieb mentioned this pull request Feb 25, 2025
Successfully merging this pull request may close these issues.

On Clang, LTO is ThinLTO, which leaves a lot of performance out for macOS