
Embedded interpreter is 10% slower #270

Open · vitrun opened this issue Dec 2, 2022 · 5 comments

@vitrun

vitrun commented Dec 2, 2022

Playing around with multipy, I found that the same program executes noticeably slower in the embedded interpreter. The original code is echo.py:

import time

def echo(_):
    start = time.time_ns()
    sum = 0
    for i in range(1000_000):
        sum += i
    cost = (time.time_ns() - start)/1000_000
    print(cost)

The cost is around 40 when executed by the system Python. It remains the same when bytecode generation (.pyc files in __pycache__) is disabled.

40.741847
40.761805
40.764069
40.746884
40.773777
40.747697
40.765722
40.761535
40.782256

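For reference, one way to disable bytecode generation from inside the driver script; the driver loop below is an assumption, not shown in the original report:

import sys

# Equivalent to running `python -B` or setting PYTHONDONTWRITEBYTECODE=1:
# CPython will not write .pyc files into __pycache__.
sys.dont_write_bytecode = True

from echo import echo

for _ in range(10):
    echo(0)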
It is packaged by:

from torch.package import PackageExporter
from echo import echo

with PackageExporter("echo.zip") as ex:
    ex.intern("echo")
    ex.save_pickle("model", "model.pkl", echo)

and executed by multipy as follows:

// Headers assumed for a standalone build; exact include paths may differ per setup.
#include <multipy/runtime/deploy.h>
#include <iostream>
#include <string>

int main(int argc, const char *argv[])
{
    if (argc != 3) {
        std::cerr << "usage: example-app <path-to-exported-script-module> thread_count\n";
        return -1;
    }

    // Pool of 4 embedded CPython interpreters.
    torch::deploy::InterpreterManager manager(4);
    torch::deploy::ReplicatedObj model;
    try {
        torch::deploy::Package package = manager.loadPackage(argv[1]);
        model = package.loadPickle("model", "model.pkl");
        int n = std::stoi(argv[2]);
        for (int i=0; i<n; i++) {
            // Acquire one interpreter from the pool and call echo() on it.
            auto I = manager.acquireOne();
            auto echo = I.fromMovable(model);
            echo({1});
        }
        return 0;
    } catch (const c10::Error &e)
    {
        std::cerr << "error loading the model\n";
        std::cerr << e.msg();
        return -1;
    }
}

which prints out:

44.830043
45.339226
44.659302
44.789638
44.924977
44.725203
44.823946
44.660343
44.622977

Why is it 10% slower?
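A sanity check to separate torch.package overhead from the embedded interpreter itself would be to load the same archive back in the stock interpreter; this sketch is not part of the original report:

from torch.package import PackageImporter

# Load the same archive with the regular system interpreter and time echo()
# there. If these numbers match the plain-script run, the gap comes from the
# embedded interpreter rather than from torch.package itself.
importer = PackageImporter("echo.zip")
echo = importer.load_pickle("model", "model.pkl")

for _ in range(10):
    echo(0)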

@d4l3k
Member

d4l3k commented Dec 6, 2022

Just tried reproing this. That's super weird -- it should be identical performance-wise since we just statically link against libpython.a.

(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> python /tmp/echo.py
43.73609
41.270134
41.685253
41.603169
42.208917
41.864256
42.094766
42.056876
42.097169
42.261114
(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> build/interactive_embedded_interpreter --pyscript /tmp/echo.py
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
[W OperatorEntry.cpp:150] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::get_gradients(int context_id) -> Dict(Tensor, Tensor)
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: (catch all)
  previous kernel: registered at ../torch/csrc/jit/runtime/register_distributed_ops.cpp:278
       new kernel: registered at ../torch/csrc/jit/runtime/register_distributed_ops.cpp:278 (function registerKernel)
47.25619
49.349608
49.558454
49.680924
50.044119
49.850917
49.532072
49.720465
49.203451
49.277325
(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> cat /tmp/echo.py 
import time

def echo(_):
    start = time.time_ns()
    sum = 0
    for i in range(1000_000):
        sum += i
    cost = (time.time_ns() - start)/1000_000
    print(cost)

for i in range(10):
    echo(0)

@d4l3k
Member

d4l3k commented Dec 6, 2022

This seems to be related to some of the compiler flags we set when building -- I'm investigating further.

@d4l3k
Member

d4l3k commented Dec 6, 2022

Python installed via pyenv seems to be much slower than the native version that's installed on Arch Linux.

tristanr@tristanr-arch2 ~/Developer [SIGINT]> sudo nice -n -20 /usr/bin/python3 ~/Developer/echo.py
378.472401
370.134925
381.323995
374.603642
375.891184
374.331494
375.040931
376.909634
378.385388
tristanr@tristanr-arch2 ~/Developer [SIGINT]> sudo nice -n -20 ~/.pyenv/versions/3.10.8/bin/python3 ~/Developer/echo.py
430.036246
440.862393
430.479176
431.409667
434.115699
431.353992
452.264208
431.333353
437.43321

Same Python version in both cases.
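One way to compare the two builds is to dump their configure options and compiler flags under each interpreter; pyenv builds typically skip PGO/LTO unless PYTHON_CONFIGURE_OPTS requests them. A sketch, not from the original thread:

import sysconfig

# PGO/LTO show up in CONFIG_ARGS as --enable-optimizations / --with-lto;
# CFLAGS shows what the interpreter core was compiled with.
for var in ("CONFIG_ARGS", "CFLAGS"):
    print(var, "=", sysconfig.get_config_var(var))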

@vitrun
Author

vitrun commented Dec 8, 2022

> Python installed via pyenv seems to be much slower than the native version that's installed on Arch Linux

Even weirder, on my Mac I found exactly the opposite:

➜ /tmp sudo nice -n -20 /usr/local/bin/python3.10 echo.py
50.569
50.604
51.091
51.228
50.237
50.766
49.796
51.722
50.149
51.529

(tmp) ➜ /tmp sudo nice -n -20 ~/miniconda3/envs/tmp/bin/python3.10 echo.py
41.665
41.686
41.608
41.038
41.729
41.845
44.16
43.437
43.53
43.494

@d4l3k
Member

d4l3k commented Dec 16, 2022

There seems to be a lot of special sauce that goes into tuning the Python compilation. Conda seems to be more consistent than pyenv, and my Arch Linux native Python works better than pyenv.

We can likely play with the compilation flags for libpython.a to make this more performant. Unfortunately, deploy requires -fPIC, which seems to have an unavoidable performance impact since it means the compiler can't optimize quite as much.

We're doing some experimentation with Dynamo, which could help alleviate Python overhead.
