Extra slow pytorch imports (~30s) #889

Open · kpister opened this issue Nov 17, 2024 · 7 comments · May be fixed by #891

kpister commented Nov 17, 2024

Describe the bug
PyTorch-related imports take roughly 15x longer to resolve (~30s vs ~2s) when running under Scalene than under plain Python.

To Reproduce
I have a simple test.py file containing only `import torch` (a timed variant is sketched below).

1. Run `scalene test.py` and wait ~30s for the report to finish.
2. Run `python test.py` and wait ~2s.
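
A minimal sketch of the reproducer, assuming the same environment as above; it extends test.py with nothing but a timer so the import cost is visible under both `python test.py` and `scalene test.py`:

```python
# test.py -- the reproducer, extended only with a timer that prints the import cost.
import time

start = time.perf_counter()
import torch  # ~2s under plain Python vs ~30s under Scalene, per this report
print(f"import torch took {time.perf_counter() - start:.1f}s")
```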

Screenshots
(Screenshot of the Scalene profile attached in the original issue.)

**Versions**

  • OS: Debian 11
  • Python: 3.11.10
  • Scalene: 1.5.48
  • Torch: 2.5.1+cu121

I enabled GPU profiling in Scalene.

sternj self-assigned this Nov 18, 2024
emeryberger (Member) commented:
Thanks for the report. We've been able to reproduce this locally and are looking into it.

emeryberger (Member) commented:
In the interim, as a work-around, you can specify `--cpu --gpu` (the culprit at the moment appears to be the memory/copy profiling).
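
For completeness, a hedged sketch of invoking the work-around, assuming Scalene is installed and the test.py above is on hand (wrapped in Python's subprocess only to keep all snippets in one language; running the command directly in a shell works the same way):

```python
# Run Scalene with only CPU and GPU profiling enabled, leaving the
# memory/copy profiler (the suspected culprit) switched off.
import subprocess

subprocess.run(["scalene", "--cpu", "--gpu", "test.py"], check=True)
```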

sternj (Collaborator) commented Nov 18, 2024

Likewise reproduced, also with around a 50x slowdown.

sternj (Collaborator) commented Nov 19, 2024

On torch==2.5.1, disabling the settrace call reduces the runtime from ~100s to ~4s. I'm going to check whether the regression was introduced by a particular Scalene commit or by a particular PyTorch version.
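
To make that cost concrete, here is a minimal, self-contained sketch (not Scalene's code) of why an installed trace callback slows imports: sys.settrace is the Python-level counterpart of PyEval_SetTrace, and once a tracer is installed the interpreter calls back into it for events raised by every frame, including all the library code executed during an import. The module imported here is `json`, used purely as a small stand-in for `torch`:

```python
import importlib
import sys
import time


def tracer(frame, event, arg):
    # Invoked for every 'call' event; returning the tracer keeps local tracing
    # enabled in that frame, so 'line' and 'return' events fire there as well.
    return tracer


def time_import(module_name: str, with_trace: bool) -> float:
    """Time a fresh import of module_name, optionally under the no-op tracer."""
    # Drop the module and its submodules so the import machinery re-executes them.
    for name in [m for m in list(sys.modules)
                 if m == module_name or m.startswith(module_name + ".")]:
        del sys.modules[name]
    if with_trace:
        sys.settrace(tracer)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    sys.settrace(None)
    return elapsed


print(f"plain import:  {time_import('json', with_trace=False):.4f}s")
print(f"traced import: {time_import('json', with_trace=True):.4f}s")
```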

sternj (Collaborator) commented Nov 19, 2024

I do not see this performance degradation at commit 5e457916606b1ebc. Bisecting.

sternj (Collaborator) commented Nov 19, 2024

The problem was introduced in commit b9ad0a56582cf4d.

sternj (Collaborator) commented Nov 20, 2024

I've been looking into this more, and the root problem has to do with how and when the interpreter calls the tracing function.

At the moment, it seems that the C logic that decides when to disable the PyEval_SetTrace callback isn't actually disabling it for library calls. This incurs a function-call overhead unconditionally for every single event (opcode, line, call, and return) in every part of every library executed anywhere in the program. Since importing PyTorch executes a lot of code, that overhead adds up incredibly quickly.

I'm making headway on figuring out precisely how to do this disabling; CPython checks in several different places and in several different ways whether a trace callback exists and whether to execute it. This is governed by both the PyThreadState struct and the _PyCFrame struct, with much of that logic in ceval.c. I think I'll have a solution by the end of tomorrow.
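
As a rough Python-level illustration of the idea (not Scalene's actual fix, which lives in its C code around PyEval_SetTrace and the structures named above), a trace function can return None from the 'call' event for frames that belong to library code, so the interpreter stops delivering per-line and return events for those frames:

```python
import sys
import sysconfig

# Directory where third-party libraries (e.g. torch) are installed.
_LIB_PREFIX = sysconfig.get_paths()["purelib"]


def selective_tracer(frame, event, arg):
    if event == "call" and frame.f_code.co_filename.startswith(_LIB_PREFIX):
        return None  # disable local tracing (line/return events) in this library frame
    # ... profiling logic for application frames would go here ...
    return selective_tracer  # keep tracing application frames


sys.settrace(selective_tracer)
```

Even then, the global 'call' callback still fires once per call into library code, which is why the real fix has to manipulate the interpreter-level state (PyThreadState / _PyCFrame) that ceval.c consults before invoking the callback at all.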

sternj linked a pull request (#891) on Nov 24, 2024 that will close this issue.