
Speedup CumSum for large arrays #22048

Merged: 10 commits merged into microsoft:main on Sep 17, 2024

Conversation

neNasko1 (Contributor)

Description

This PR refactors the CPU kernel for the CumSum operator. The new implementation strives to have as little indirection as possible.
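
To illustrate the approach, here is a minimal Python sketch of a cumulative sum computed with plain index arithmetic over an (outer, axis, inner) decomposition; this is only a sketch of the access pattern, not the actual C++ kernel:

def cumsum_flat(data, dims, axis):
    # Decompose the shape into (outer, axis_len, inner) so every element
    # can be addressed with flat index arithmetic, with no iterator
    # indirection in the inner loop.
    outer = 1
    for d in dims[:axis]:
        outer *= d
    axis_len = dims[axis]
    inner = 1
    for d in dims[axis + 1:]:
        inner *= d

    out = [0] * len(data)
    for o in range(outer):
        base = o * axis_len * inner
        # The first slice along the axis is copied as-is.
        for i in range(inner):
            out[base + i] = data[base + i]
        # Every later slice adds the previous output slice element-wise.
        for a in range(1, axis_len):
            off = base + a * inner
            for i in range(inner):
                out[off + i] = out[off - inner + i] + data[off + i]
    return out

print(cumsum_flat([1, 1, 1, 1], (4,), 0))  # [1, 2, 3, 4]

For a 1D tensor, outer and inner are both 1, so the whole computation collapses to a single tight loop over contiguous memory.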

Motivation and Context

Currently, the CumSum operator performs very poorly on 1D tensors (it was slower than a Python loop). This is caused by the extensive use of SliceIterators.

Here is a benchmark demonstrating the issue:

import time
import ndonnx as ndx
import onnxruntime as ort
import numpy as np
import onnx

def test_cumsum(sz):
    # Build a one-node CumSum model with ndonnx, then time a single
    # onnxruntime inference over a length-sz int64 tensor.
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")

    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start

def test_cumsum_by_hand(sz):
    # Baseline: the same cumulative sum computed by a plain Python loop.
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start

print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))

Before:

test_cumsum:         0.9794480800628662
test_cumsum_by_hand: 0.4518160820007324

After:

test_cumsum:         0.02483987808227539
test_cumsum_by_hand: 0.5496008396148682
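
For reference, an equivalent one-node model can be constructed without ndonnx by using onnx.helper directly. The following is a hedged sketch (the tensor names and opset version are assumptions, not taken from the PR):

import numpy as np
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# One CumSum node; the axis is supplied as a scalar int64 initializer.
node = helper.make_node("CumSum", inputs=["a", "axis"], outputs=["b"])
axis_init = helper.make_tensor("axis", TensorProto.INT64, dims=[], vals=[0])
graph = helper.make_graph(
    [node],
    "cumsum_bench",
    inputs=[helper.make_tensor_value_info("a", TensorProto.INT64, [None])],
    outputs=[helper.make_tensor_value_info("b", TensorProto.INT64, [None])],
    initializer=[axis_init],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

# Sanity check: the cumulative sum of ten ones is 1..10.
sess = ort.InferenceSession(model.SerializeToString())
(result,) = sess.run(None, {"a": np.ones(10, np.int64)})
assert np.array_equal(result, np.arange(1, 11))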

The model.onnx: [image]

The flame graph: [image: profile-3]

snnn added the core runtime label (issues related to core runtime) on Sep 13, 2024
snnn (Member) commented Sep 13, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline


You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

snnn (Member) commented Sep 13, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 13, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 13, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

neNasko1 (Contributor, Author) commented:

@snnn can we rerun the pipelines? It seems there was some conversion issue.

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 17, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 17, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 17, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

neNasko1 (Contributor, Author) commented:

@snnn all CPU tests are passing. I don't think the CI failures are caused by this change, since they occur only on the CUDA and DML pipelines and the logs contain this line:

1: Run failed but expected success: CUDA failure 702: the launch timed out and was terminated ; GPU=0 ; hostname=7e50a0e1c000000 ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle())); 

Could you let me know if there is anything left for me to do?

snnn (Member) commented Sep 17, 2024:

I will run the pipelines again.

neNasko1 (Contributor, Author) commented:

I think the test was failing because the tensor was too big and it timed out. Can we rerun one last time?

snnn (Member) commented Sep 17, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 17, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 17, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn merged commit 275eb40 into microsoft:main on Sep 17, 2024. 71 checks passed.
snnn (Member) commented Sep 17, 2024:

Thank you!
