
Embedded interpreter is 10% slower #270

Open · vitrun opened this issue Dec 2, 2022 · 5 comments

@vitrun

vitrun commented Dec 2, 2022

Playing around with multipy, I found that the same program executes noticeably slower in the embedded interpreter. The original code is echo.py:

import time

def echo(_):
    start = time.time_ns()
    sum = 0
    for i in range(1000_000):
        sum += i
    cost = (time.time_ns() - start)/1000_000
    print(cost)

The cost is around 40 when executed by the system Python. It remains the same when bytecode generation (.pyc files in __pycache__) is disabled.

40.741847
40.761805
40.764069
40.746884
40.773777
40.747697
40.765722
40.761535
40.782256

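For reference, one way to disable bytecode generation from inside the driver script; the driver loop below is an assumption, not shown in the original report:

import sys

# Equivalent to running `python -B` or setting PYTHONDONTWRITEBYTECODE=1:
# CPython will not write .pyc files into __pycache__.
sys.dont_write_bytecode = True

from echo import echo

for _ in range(10):
    echo(0)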
It is packaged by:

from torch.package import PackageExporter
from echo import echo

with PackageExporter("echo.zip") as ex:
    ex.intern("echo")
    ex.save_pickle("model", "model.pkl", echo)

and executed by multipy as follows:

// Headers assumed for a standalone build; exact include paths may differ per setup.
#include <multipy/runtime/deploy.h>
#include <iostream>
#include <string>

int main(int argc, const char *argv[])
{
    if (argc != 3) {
        std::cerr << "usage: example-app <path-to-exported-script-module> thread_count\n";
        return -1;
    }

    // Pool of 4 embedded CPython interpreters.
    torch::deploy::InterpreterManager manager(4);
    torch::deploy::ReplicatedObj model;
    try {
        torch::deploy::Package package = manager.loadPackage(argv[1]);
        model = package.loadPickle("model", "model.pkl");
        int n = std::stoi(argv[2]);
        for (int i=0; i<n; i++) {
            // Acquire one interpreter from the pool and call echo() on it.
            auto I = manager.acquireOne();
            auto echo = I.fromMovable(model);
            echo({1});
        }
        return 0;
    } catch (const c10::Error &e)
    {
        std::cerr << "error loading the model\n";
        std::cerr << e.msg();
        return -1;
    }
}

which prints out:

44.830043
45.339226
44.659302
44.789638
44.924977
44.725203
44.823946
44.660343
44.622977

Why is it 10% slower?
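A sanity check to separate torch.package overhead from the embedded interpreter itself would be to load the same archive back in the stock interpreter; this sketch is not part of the original report:

from torch.package import PackageImporter

# Load the same archive with the regular system interpreter and time echo()
# there. If these numbers match the plain-script run, the gap comes from the
# embedded interpreter rather than from torch.package itself.
importer = PackageImporter("echo.zip")
echo = importer.load_pickle("model", "model.pkl")

for _ in range(10):
    echo(0)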

@d4l3k
Member

d4l3k commented Dec 6, 2022

Just tried reproing this. That's super weird -- it should be identical performance-wise since we just statically link against libpython.a.

(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> python /tmp/echo.py
43.73609
41.270134
41.685253
41.603169
42.208917
41.864256
42.094766
42.056876
42.097169
42.261114
(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> build/interactive_embedded_interpreter --pyscript /tmp/echo.py
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
Registering torch::deploy builtin library tensorrt (idx 0) with 0 builtin modules
torch::deploy builtin tensorrt contains 0 modules
Registering torch::deploy builtin library cpython_internal (idx 1) with 0 builtin modules
torch::deploy builtin cpython_internal contains 6 modules
[W OperatorEntry.cpp:150] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::get_gradients(int context_id) -> Dict(Tensor, Tensor)
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: (catch all)
  previous kernel: registered at ../torch/csrc/jit/runtime/register_distributed_ops.cpp:278
       new kernel: registered at ../torch/csrc/jit/runtime/register_distributed_ops.cpp:278 (function registerKernel)
47.25619
49.349608
49.558454
49.680924
50.044119
49.850917
49.532072
49.720465
49.203451
49.277325
(multipy3.8.6) tristanr@tristanr-arch2 ~/D/m/m/runtime (main)> cat /tmp/echo.py 
import time

def echo(_):
    start = time.time_ns()
    sum = 0
    for i in range(1000_000):
        sum += i
    cost = (time.time_ns() - start)/1000_000
    print(cost)

for i in range(10):
    echo(0)

@d4l3k
Member

d4l3k commented Dec 6, 2022

This seems to be related to some of the compiler flags we set when building -- I'm investigating further.

@d4l3k
Member

d4l3k commented Dec 6, 2022

Python installed via pyenv seems to be much slower than the native version that's installed on Arch Linux.

tristanr@tristanr-arch2 ~/Developer [SIGINT]> sudo nice -n -20 /usr/bin/python3 ~/Developer/echo.py
378.472401
370.134925
381.323995
374.603642
375.891184
374.331494
375.040931
376.909634
378.385388
tristanr@tristanr-arch2 ~/Developer [SIGINT]> sudo nice -n -20 ~/.pyenv/versions/3.10.8/bin/python3 ~/Developer/echo.py
430.036246
440.862393
430.479176
431.409667
434.115699
431.353992
452.264208
431.333353
437.43321

Same Python version in both cases.
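One way to compare the two builds is to dump their configure options and compiler flags under each interpreter; pyenv builds typically skip PGO/LTO unless PYTHON_CONFIGURE_OPTS requests them. A sketch, not from the original thread:

import sysconfig

# PGO/LTO show up in CONFIG_ARGS as --enable-optimizations / --with-lto;
# CFLAGS shows what the interpreter core was compiled with.
for var in ("CONFIG_ARGS", "CFLAGS"):
    print(var, "=", sysconfig.get_config_var(var))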

@vitrun
Author

vitrun commented Dec 8, 2022

> Python installed via pyenv seems to be much slower than the native version that's installed on Arch Linux

Even weirder, on my Mac I found exactly the opposite:

➜ /tmp sudo nice -n -20 /usr/local/bin/python3.10 echo.py
50.569
50.604
51.091
51.228
50.237
50.766
49.796
51.722
50.149
51.529

(tmp) ➜ /tmp sudo nice -n -20 ~/miniconda3/envs/tmp/bin/python3.10 echo.py
41.665
41.686
41.608
41.038
41.729
41.845
44.16
43.437
43.53
43.494

@d4l3k
Member

d4l3k commented Dec 16, 2022

There seems to be a lot of special sauce that goes into tuning the Python compilation. Conda seems to be more consistent than pyenv, and my Arch Linux native Python works better than pyenv.

We can likely play with the compilation flags for libpython.a to make this more performant. Unfortunately, deploy requires -fPIC, which seems to have an unavoidable performance impact since it means the compiler can't optimize quite as much.

We're doing some experimentation with Dynamo, which could help alleviate Python overhead.
