CPython 3.13 installed with uv is slow and not compiled with --enable-experimental-jit=yes-off #535

Open
paugier opened this issue Feb 20, 2025 · 9 comments

Comments

@paugier

paugier commented Feb 20, 2025

I tried a very simple pure Python benchmark (see https://gricad-gitlab.univ-grenoble-alpes.fr/augierpi/augierpi.gricad-pages.univ-grenoble-alpes.fr/-/tree/branch/default/content/docs/2025/about-py-jit) and found that CPython 3.13 installed with uv is slower than CPython 3.13 installed from conda-forge.

The benchmark is deliberately simple; the goal was to observe an effect of the new JIT in CPython 3.13:

def short_calcul(n):
    result = 0
    for i in range(1, n+1):
        result += i
    return result

def long_calcul(num):
    result = 0
    for i in range(num):
        result += short_calcul(i) - short_calcul(i)
    return result
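
For readers who want to reproduce the "Number of long_calcul per second" figure, here is a minimal sketch of a timing harness around the two functions above. The helper name `calls_per_second`, the argument `200`, and the one-second measurement window are assumptions made for illustration; the linked benchmark script may use different values and a different measurement method.

```python
import sys
import time


def short_calcul(n):
    result = 0
    for i in range(1, n + 1):
        result += i
    return result


def long_calcul(num):
    result = 0
    for i in range(num):
        result += short_calcul(i) - short_calcul(i)
    return result


def calls_per_second(func, arg, min_duration=1.0):
    """Call func(arg) repeatedly for at least min_duration seconds
    and return the average number of calls per second."""
    count = 0
    start = time.perf_counter()
    while True:
        func(arg)
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_duration:
            return count / elapsed


if __name__ == "__main__":
    print(sys.version)
    # 200 is an assumed problem size, not the value used in the report.
    rate = calls_per_second(long_calcul, 200)
    print(f"Number of long_calcul per second: {rate:.2f}")
```

Running the same file under each interpreter gives numbers that are comparable with each other, though not necessarily with the ones reported below, since the problem size is a guess.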

The results:

$ /usr/bin/python3 bench_loops_sum.py
3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0]
Number of long_calcul per second: 56.10

$ pypy bench_loops_sum.py
3.11.11 (b38de282cead, Feb 05 2025, 16:26:37)
[PyPy 7.3.18 with GCC 10.2.1 20210130 (Red Hat 10.2.1-11)]
Number of long_calcul per second: 1992.83

$ python bench_loops_sum.py
3.13.2 | packaged by conda-forge | (main, Feb 17 2025, 14:10:22) [GCC 13.3.0]
Number of long_calcul per second: 51.12

$ PYTHON_JIT=1 python bench_loops_sum.py
3.13.2 | packaged by conda-forge | (main, Feb 17 2025, 14:10:22) [GCC 13.3.0]
Number of long_calcul per second: 60.39

$ python bench_loops_sum.py
3.13.2 (main, Feb 12 2025, 14:51:17) [Clang 19.1.6 ]
Number of long_calcul per second: 40.91

$ python bench_loops_sum.py
3.14.0a5 (main, Feb 12 2025, 14:51:40) [Clang 19.1.6 ]
Number of long_calcul per second: 51.41

This is bad:

  • CPython 3.13 installed with uv is slow.
  • The CPython builds installed with uv are not compiled with --enable-experimental-jit=yes-off.
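
One way to check the second point on a given interpreter is to inspect the configure arguments recorded at build time. The sketch below assumes a POSIX build where `sysconfig.get_config_var("CONFIG_ARGS")` returns the configure command line; on some platforms it returns `None`, and a redistributed build may record arguments that differ from what was effectively used. The helper name `experimental_jit_setting` is hypothetical.

```python
import shlex
import sysconfig


def experimental_jit_setting(config_args):
    """Return the value given to --enable-experimental-jit in a configure
    argument string, "yes" for the bare flag, or None if the flag is absent."""
    if not config_args:
        return None
    for arg in shlex.split(config_args):
        if arg == "--enable-experimental-jit":
            return "yes"
        if arg.startswith("--enable-experimental-jit="):
            return arg.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # CONFIG_ARGS may be None (for example on Windows builds).
    recorded = sysconfig.get_config_var("CONFIG_ARGS")
    print(experimental_jit_setting(recorded))
```

A result of `None` would be consistent with the interpreter having been built without JIT support, in which case setting `PYTHON_JIT=1` is not expected to have any effect.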

@zanieb
Member

zanieb commented Feb 24, 2025

Thanks for the report. Do you know if Conda builds with any particular performance flags?

@zanieb
Member

zanieb commented Feb 25, 2025

I did an actual performance run (Linux x86-64), and it looks like there's a significant difference here:

❯ uvx pyperf compare_to conda-forge-313.json pbs-313.json
All benchmarks:
===============

2to3: Mean +- std dev: [conda-forge-313] 230 ms +- 1 ms -> [pbs-313] 279 ms +- 2 ms: 1.21x slower
async_generators: Mean +- std dev: [conda-forge-313] 347 ms +- 4 ms -> [pbs-313] 451 ms +- 4 ms: 1.30x slower
async_tree_none: Mean +- std dev: [conda-forge-313] 325 ms +- 9 ms -> [pbs-313] 402 ms +- 10 ms: 1.24x slower
async_tree_cpu_io_mixed: Mean +- std dev: [conda-forge-313] 516 ms +- 6 ms -> [pbs-313] 620 ms +- 6 ms: 1.20x slower
async_tree_cpu_io_mixed_tg: Mean +- std dev: [conda-forge-313] 518 ms +- 33 ms -> [pbs-313] 635 ms +- 34 ms: 1.23x slower
async_tree_eager: Mean +- std dev: [conda-forge-313] 104 ms +- 2 ms -> [pbs-313] 138 ms +- 2 ms: 1.33x slower
async_tree_eager_cpu_io_mixed: Mean +- std dev: [conda-forge-313] 369 ms +- 9 ms -> [pbs-313] 427 ms +- 9 ms: 1.16x slower
async_tree_eager_cpu_io_mixed_tg: Mean +- std dev: [conda-forge-313] 471 ms +- 23 ms -> [pbs-313] 552 ms +- 21 ms: 1.17x slower
async_tree_eager_io: Mean +- std dev: [conda-forge-313] 780 ms +- 37 ms -> [pbs-313] 926 ms +- 41 ms: 1.19x slower
async_tree_eager_io_tg: Mean +- std dev: [conda-forge-313] 785 ms +- 51 ms -> [pbs-313] 917 ms +- 47 ms: 1.17x slower
async_tree_eager_memoization: Mean +- std dev: [conda-forge-313] 235 ms +- 13 ms -> [pbs-313] 283 ms +- 12 ms: 1.21x slower
async_tree_eager_memoization_tg: Mean +- std dev: [conda-forge-313] 321 ms +- 19 ms -> [pbs-313] 396 ms +- 19 ms: 1.23x slower
async_tree_eager_tg: Mean +- std dev: [conda-forge-313] 247 ms +- 11 ms -> [pbs-313] 302 ms +- 12 ms: 1.23x slower
async_tree_io: Mean +- std dev: [conda-forge-313] 744 ms +- 34 ms -> [pbs-313] 907 ms +- 36 ms: 1.22x slower
async_tree_io_tg: Mean +- std dev: [conda-forge-313] 746 ms +- 37 ms -> [pbs-313] 910 ms +- 37 ms: 1.22x slower
async_tree_memoization: Mean +- std dev: [conda-forge-313] 400 ms +- 43 ms -> [pbs-313] 494 ms +- 47 ms: 1.24x slower
async_tree_memoization_tg: Mean +- std dev: [conda-forge-313] 404 ms +- 5 ms -> [pbs-313] 500 ms +- 3 ms: 1.24x slower
async_tree_none_tg: Mean +- std dev: [conda-forge-313] 295 ms +- 7 ms -> [pbs-313] 372 ms +- 6 ms: 1.26x slower
asyncio_tcp: Mean +- std dev: [conda-forge-313] 365 ms +- 4 ms -> [pbs-313] 359 ms +- 3 ms: 1.02x faster
asyncio_tcp_ssl: Mean +- std dev: [conda-forge-313] 1.34 sec +- 0.01 sec -> [pbs-313] 1.35 sec +- 0.00 sec: 1.01x slower
asyncio_websockets: Mean +- std dev: [conda-forge-313] 519 ms +- 7 ms -> [pbs-313] 1.54 sec +- 0.00 sec: 2.96x slower
chameleon: Mean +- std dev: [conda-forge-313] 6.17 ms +- 0.07 ms -> [pbs-313] 8.54 ms +- 0.06 ms: 1.38x slower
chaos: Mean +- std dev: [conda-forge-313] 54.5 ms +- 1.5 ms -> [pbs-313] 72.7 ms +- 0.7 ms: 1.33x slower
comprehensions: Mean +- std dev: [conda-forge-313] 15.0 us +- 0.2 us -> [pbs-313] 19.7 us +- 0.2 us: 1.31x slower
bench_mp_pool: Mean +- std dev: [conda-forge-313] 15.4 ms +- 9.2 ms -> [pbs-313] 7.15 ms +- 1.30 ms: 2.15x faster
bench_thread_pool: Mean +- std dev: [conda-forge-313] 913 us +- 39 us -> [pbs-313] 980 us +- 30 us: 1.07x slower
coroutines: Mean +- std dev: [conda-forge-313] 21.9 ms +- 0.2 ms -> [pbs-313] 27.8 ms +- 0.2 ms: 1.27x slower
coverage: Mean +- std dev: [conda-forge-313] 74.7 ms +- 0.9 ms -> [pbs-313] 88.8 ms +- 1.3 ms: 1.19x slower
crypto_pyaes: Mean +- std dev: [conda-forge-313] 62.5 ms +- 0.5 ms -> [pbs-313] 82.2 ms +- 0.8 ms: 1.31x slower
dask: Mean +- std dev: [conda-forge-313] 301 ms +- 15 ms -> [pbs-313] 351 ms +- 14 ms: 1.17x slower
deepcopy: Mean +- std dev: [conda-forge-313] 333 us +- 5 us -> [pbs-313] 437 us +- 4 us: 1.31x slower
deepcopy_reduce: Mean +- std dev: [conda-forge-313] 3.04 us +- 0.06 us -> [pbs-313] 4.08 us +- 0.03 us: 1.35x slower
deepcopy_memo: Mean +- std dev: [conda-forge-313] 36.9 us +- 0.4 us -> [pbs-313] 47.3 us +- 0.6 us: 1.28x slower
deltablue: Mean +- std dev: [conda-forge-313] 2.74 ms +- 0.02 ms -> [pbs-313] 3.95 ms +- 0.03 ms: 1.44x slower
django_template: Mean +- std dev: [conda-forge-313] 31.3 ms +- 0.4 ms -> [pbs-313] 43.4 ms +- 0.4 ms: 1.39x slower
docutils: Mean +- std dev: [conda-forge-313] 2.05 sec +- 0.01 sec -> [pbs-313] 2.39 sec +- 0.02 sec: 1.16x slower
dulwich_log: Mean +- std dev: [conda-forge-313] 59.2 ms +- 0.5 ms -> [pbs-313] 76.3 ms +- 0.4 ms: 1.29x slower
fannkuch: Mean +- std dev: [conda-forge-313] 354 ms +- 2 ms -> [pbs-313] 475 ms +- 4 ms: 1.34x slower
float: Mean +- std dev: [conda-forge-313] 71.7 ms +- 0.9 ms -> [pbs-313] 94.7 ms +- 1.3 ms: 1.32x slower
create_gc_cycles: Mean +- std dev: [conda-forge-313] 921 us +- 6 us -> [pbs-313] 1.07 ms +- 0.00 ms: 1.16x slower
gc_traversal: Mean +- std dev: [conda-forge-313] 3.15 ms +- 0.28 ms -> [pbs-313] 3.51 ms +- 0.09 ms: 1.12x slower
generators: Mean +- std dev: [conda-forge-313] 28.2 ms +- 0.5 ms -> [pbs-313] 35.0 ms +- 0.2 ms: 1.24x slower
genshi_text: Mean +- std dev: [conda-forge-313] 20.2 ms +- 0.3 ms -> [pbs-313] 27.8 ms +- 0.3 ms: 1.37x slower
genshi_xml: Mean +- std dev: [conda-forge-313] 48.4 ms +- 0.8 ms -> [pbs-313] 67.1 ms +- 0.6 ms: 1.39x slower
go: Mean +- std dev: [conda-forge-313] 130 ms +- 1 ms -> [pbs-313] 158 ms +- 1 ms: 1.21x slower
hexiom: Mean +- std dev: [conda-forge-313] 5.57 ms +- 0.04 ms -> [pbs-313] 7.56 ms +- 0.05 ms: 1.36x slower
html5lib: Mean +- std dev: [conda-forge-313] 62.8 ms +- 0.9 ms -> [pbs-313] 69.0 ms +- 0.5 ms: 1.10x slower
json_dumps: Mean +- std dev: [conda-forge-313] 9.08 ms +- 0.16 ms -> [pbs-313] 10.4 ms +- 0.1 ms: 1.14x slower
json_loads: Mean +- std dev: [conda-forge-313] 21.2 us +- 0.2 us -> [pbs-313] 24.6 us +- 0.2 us: 1.16x slower
logging_format: Mean +- std dev: [conda-forge-313] 5.91 us +- 0.20 us -> [pbs-313] 8.46 us +- 0.12 us: 1.43x slower
logging_silent: Mean +- std dev: [conda-forge-313] 92.0 ns +- 2.3 ns -> [pbs-313] 114 ns +- 2 ns: 1.24x slower
logging_simple: Mean +- std dev: [conda-forge-313] 5.29 us +- 0.08 us -> [pbs-313] 7.55 us +- 0.09 us: 1.43x slower
mako: Mean +- std dev: [conda-forge-313] 9.36 ms +- 0.20 ms -> [pbs-313] 12.0 ms +- 0.1 ms: 1.28x slower
mdp: Mean +- std dev: [conda-forge-313] 2.27 sec +- 0.03 sec -> [pbs-313] 2.34 sec +- 0.02 sec: 1.03x slower
meteor_contest: Mean +- std dev: [conda-forge-313] 87.1 ms +- 0.6 ms -> [pbs-313] 103 ms +- 1 ms: 1.18x slower
nbody: Mean +- std dev: [conda-forge-313] 82.5 ms +- 1.3 ms -> [pbs-313] 138 ms +- 4 ms: 1.67x slower
nqueens: Mean +- std dev: [conda-forge-313] 70.7 ms +- 0.9 ms -> [pbs-313] 98.3 ms +- 0.6 ms: 1.39x slower
pathlib: Mean +- std dev: [conda-forge-313] 19.7 ms +- 0.1 ms -> [pbs-313] 22.1 ms +- 0.1 ms: 1.12x slower
pickle: Mean +- std dev: [conda-forge-313] 10.6 us +- 0.1 us -> [pbs-313] 10.7 us +- 0.1 us: 1.01x slower
pickle_dict: Mean +- std dev: [conda-forge-313] 25.6 us +- 0.3 us -> [pbs-313] 18.9 us +- 0.6 us: 1.36x faster
pickle_list: Mean +- std dev: [conda-forge-313] 3.96 us +- 0.06 us -> [pbs-313] 3.61 us +- 0.08 us: 1.10x faster
pickle_pure_python: Mean +- std dev: [conda-forge-313] 267 us +- 2 us -> [pbs-313] 362 us +- 4 us: 1.35x slower
pidigits: Mean +- std dev: [conda-forge-313] 166 ms +- 1 ms -> [pbs-313] 180 ms +- 0 ms: 1.08x slower
pprint_safe_repr: Mean +- std dev: [conda-forge-313] 667 ms +- 12 ms -> [pbs-313] 953 ms +- 5 ms: 1.43x slower
pprint_pformat: Mean +- std dev: [conda-forge-313] 1.37 sec +- 0.02 sec -> [pbs-313] 1.95 sec +- 0.02 sec: 1.43x slower
pyflate: Mean +- std dev: [conda-forge-313] 403 ms +- 2 ms -> [pbs-313] 498 ms +- 2 ms: 1.24x slower
python_startup: Mean +- std dev: [conda-forge-313] 9.68 ms +- 0.03 ms -> [pbs-313] 13.6 ms +- 0.1 ms: 1.40x slower
python_startup_no_site: Mean +- std dev: [conda-forge-313] 6.76 ms +- 0.03 ms -> [pbs-313] 10.5 ms +- 0.1 ms: 1.55x slower
raytrace: Mean +- std dev: [conda-forge-313] 241 ms +- 3 ms -> [pbs-313] 300 ms +- 4 ms: 1.24x slower
regex_compile: Mean +- std dev: [conda-forge-313] 117 ms +- 1 ms -> [pbs-313] 157 ms +- 1 ms: 1.34x slower
regex_dna: Mean +- std dev: [conda-forge-313] 153 ms +- 3 ms -> [pbs-313] 151 ms +- 1 ms: 1.01x faster
regex_effbot: Mean +- std dev: [conda-forge-313] 2.43 ms +- 0.06 ms -> [pbs-313] 2.49 ms +- 0.05 ms: 1.02x slower
regex_v8: Mean +- std dev: [conda-forge-313] 21.7 ms +- 0.6 ms -> [pbs-313] 23.0 ms +- 0.2 ms: 1.06x slower
richards: Mean +- std dev: [conda-forge-313] 46.0 ms +- 0.6 ms -> [pbs-313] 57.4 ms +- 0.4 ms: 1.25x slower
richards_super: Mean +- std dev: [conda-forge-313] 52.3 ms +- 0.9 ms -> [pbs-313] 62.9 ms +- 0.4 ms: 1.20x slower
scimark_fft: Mean +- std dev: [conda-forge-313] 325 ms +- 5 ms -> [pbs-313] 426 ms +- 17 ms: 1.31x slower
scimark_lu: Mean +- std dev: [conda-forge-313] 109 ms +- 1 ms -> [pbs-313] 119 ms +- 1 ms: 1.09x slower
scimark_monte_carlo: Mean +- std dev: [conda-forge-313] 61.2 ms +- 0.5 ms -> [pbs-313] 75.0 ms +- 2.6 ms: 1.23x slower
scimark_sor: Mean +- std dev: [conda-forge-313] 123 ms +- 1 ms -> [pbs-313] 161 ms +- 1 ms: 1.31x slower
scimark_sparse_mat_mult: Mean +- std dev: [conda-forge-313] 3.89 ms +- 0.12 ms -> [pbs-313] 5.78 ms +- 0.40 ms: 1.49x slower
spectral_norm: Mean +- std dev: [conda-forge-313] 104 ms +- 1 ms -> [pbs-313] 138 ms +- 3 ms: 1.33x slower
sqlglot_normalize: Mean +- std dev: [conda-forge-313] 261 ms +- 3 ms -> [pbs-313] 131 ms +- 1 ms: 1.99x faster
sqlglot_optimize: Mean +- std dev: [conda-forge-313] 47.5 ms +- 0.4 ms -> [pbs-313] 61.6 ms +- 0.4 ms: 1.30x slower
sqlglot_parse: Mean +- std dev: [conda-forge-313] 1.12 ms +- 0.01 ms -> [pbs-313] 1.47 ms +- 0.01 ms: 1.31x slower
sqlglot_transpile: Mean +- std dev: [conda-forge-313] 1.37 ms +- 0.01 ms -> [pbs-313] 1.77 ms +- 0.01 ms: 1.29x slower
sqlite_synth: Mean +- std dev: [conda-forge-313] 2.09 us +- 0.03 us -> [pbs-313] 3.42 us +- 0.01 us: 1.63x slower
sympy_expand: Mean +- std dev: [conda-forge-313] 409 ms +- 3 ms -> [pbs-313] 514 ms +- 3 ms: 1.26x slower
sympy_integrate: Mean +- std dev: [conda-forge-313] 16.2 ms +- 0.1 ms -> [pbs-313] 19.1 ms +- 0.1 ms: 1.18x slower
sympy_sum: Mean +- std dev: [conda-forge-313] 122 ms +- 1 ms -> [pbs-313] 149 ms +- 1 ms: 1.22x slower
sympy_str: Mean +- std dev: [conda-forge-313] 238 ms +- 3 ms -> [pbs-313] 291 ms +- 2 ms: 1.22x slower
telco: Mean +- std dev: [conda-forge-313] 7.37 ms +- 0.18 ms -> [pbs-313] 9.37 ms +- 0.27 ms: 1.27x slower
tomli_loads: Mean +- std dev: [conda-forge-313] 1.94 sec +- 0.03 sec -> [pbs-313] 2.84 sec +- 0.07 sec: 1.47x slower
tornado_http: Mean +- std dev: [conda-forge-313] 91.0 ms +- 1.0 ms -> [pbs-313] 107 ms +- 1 ms: 1.18x slower
typing_runtime_protocols: Mean +- std dev: [conda-forge-313] 146 us +- 4 us -> [pbs-313] 185 us +- 3 us: 1.27x slower
unpack_sequence: Mean +- std dev: [conda-forge-313] 35.7 ns +- 0.4 ns -> [pbs-313] 48.1 ns +- 1.8 ns: 1.35x slower
unpickle: Mean +- std dev: [conda-forge-313] 11.6 us +- 0.2 us -> [pbs-313] 14.1 us +- 0.2 us: 1.22x slower
unpickle_list: Mean +- std dev: [conda-forge-313] 4.41 us +- 0.06 us -> [pbs-313] 4.80 us +- 0.05 us: 1.09x slower
unpickle_pure_python: Mean +- std dev: [conda-forge-313] 193 us +- 1 us -> [pbs-313] 247 us +- 2 us: 1.28x slower
xml_etree_parse: Mean +- std dev: [conda-forge-313] 129 ms +- 2 ms -> [pbs-313] 254 ms +- 2 ms: 1.96x slower
xml_etree_iterparse: Mean +- std dev: [conda-forge-313] 85.0 ms +- 1.1 ms -> [pbs-313] 141 ms +- 2 ms: 1.66x slower
xml_etree_generate: Mean +- std dev: [conda-forge-313] 77.0 ms +- 0.7 ms -> [pbs-313] 98.0 ms +- 0.7 ms: 1.27x slower
xml_etree_process: Mean +- std dev: [conda-forge-313] 53.0 ms +- 0.6 ms -> [pbs-313] 69.5 ms +- 0.6 ms: 1.31x slower

Geometric mean: 1.24x slower
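
As a rough illustration of how a summary figure like this relates to the per-benchmark lines, the sketch below parses `compare_to` output lines and takes the geometric mean of the ratios, counting "faster" results as ratios below 1. The helper names are hypothetical, and because it works from the rounded ratios printed in the text rather than from the raw JSON data, it can only approximate what pyperf itself reports.

```python
import re
from statistics import geometric_mean

# Matches a per-benchmark line and captures the trailing
# "1.21x slower" / "1.02x faster" part.
_RATIO = re.compile(r"Mean \+- std dev:.*:\s*([0-9.]+)x (slower|faster)\s*$")


def slowdown_ratio(line):
    """Return the slowdown factor for one line (values below 1 mean faster),
    or None if the line is not a per-benchmark result."""
    match = _RATIO.search(line)
    if match is None:
        return None
    factor = float(match.group(1))
    return factor if match.group(2) == "slower" else 1.0 / factor


def geometric_mean_slowdown(lines):
    """Geometric mean of the slowdown factors found in the given lines."""
    ratios = [r for r in map(slowdown_ratio, lines) if r is not None]
    return geometric_mean(ratios)


if __name__ == "__main__":
    # Three lines copied from the output above, for illustration only.
    sample = [
        "2to3: Mean +- std dev: [conda-forge-313] 230 ms +- 1 ms -> [pbs-313] 279 ms +- 2 ms: 1.21x slower",
        "asyncio_tcp: Mean +- std dev: [conda-forge-313] 365 ms +- 4 ms -> [pbs-313] 359 ms +- 3 ms: 1.02x faster",
        "nbody: Mean +- std dev: [conda-forge-313] 82.5 ms +- 1.3 ms -> [pbs-313] 138 ms +- 4 ms: 1.67x slower",
    ]
    print(f"{geometric_mean_slowdown(sample):.2f}x")
```

The geometric mean is used instead of the arithmetic mean so that a 2x slowdown and a 2x speedup cancel out exactly.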

I consider this fairly high priority, but I don't know what the source of the difference is.

@zanieb
Member

zanieb commented Feb 25, 2025

Looking at https://github.com/conda-forge/python-feedstock/blob/main/recipe/build_base.sh and not seeing anything obvious.

@zanieb
Member

zanieb commented Feb 25, 2025

@paugier What platform and architecture did you run your benchmarks on?

@zanieb
Member

zanieb commented Feb 25, 2025

Our v3 (x86-64-v3) builds are a bit better, but that's not the bulk of it (geometric mean: 1.20x slower).

@Fidget-Spinner

FWIW, I don't see any slowdown on conda-forge 3.13 vs uv 3.14.0a5 on my machine (AMD64 Linux) on this benchmark:

(py313-conda-forge) ken@ken-Legion-5-Pro-16IAH7H:~/Documents/GitHub/cpython$ time python ./bm_calc.py

real	0m1.492s
user	0m1.488s
sys	0m0.003s

time uv run --python 3.14.0a5 python ./bm_calc.py

real	0m1.267s
user	0m1.255s
sys	0m0.011s

In fact, the conda-forge build is significantly slower.

@zanieb
Member

zanieb commented Feb 25, 2025

Comparing 3.14 with 3.13 doesn't feel fair, since our 3.14 builds use the tail-calling interpreter.

@Fidget-Spinner

Oh wow, I do see a significant slowdown on uv's 3.13 (compare with the timings in my previous comment):

(cpython) ken@ken-Legion-5-Pro-16IAH7H:~/Documents/GitHub/cpython$ time uv run --python 3.13 python ./bm_calc.py

real	0m1.939s
user	0m1.897s
sys	0m0.025s
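
A single `time` run can be noisy, so a comparison like the one above could be repeated a few times per interpreter. The following is a minimal sketch under stated assumptions: the helper `best_wall_time` is hypothetical, it measures wall-clock time of the whole process (including interpreter startup and, for `uv run`, uv's own overhead), and it keeps the fastest run.

```python
import subprocess
import sys
import time


def best_wall_time(cmd, repeats=3):
    """Run cmd several times and return the smallest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Self-contained demo: time an empty script with the current interpreter.
    print(f"{best_wall_time([sys.executable, '-c', 'pass']):.3f} s")
```

To compare builds, the same function could be called with command lists such as `["uv", "run", "--python", "3.13", "python", "./bm_calc.py"]` and the path to a conda-forge `python`, assuming both are installed and `bm_calc.py` exists locally.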
