
Speedup CumSum for large arrays #22048

Merged: 10 commits merged into microsoft:main on Sep 17, 2024

Conversation

neNasko1 (Contributor)

Description

This PR refactors the CPU kernel for the CumSum operator. The new implementation strives to have as little indirection as possible.
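
To illustrate the approach, here is a minimal Python sketch of a cumulative sum computed with plain index arithmetic over an (outer, axis, inner) decomposition; this is only a sketch of the access pattern, not the actual C++ kernel:

def cumsum_flat(data, dims, axis):
    # Decompose the shape into (outer, axis_len, inner) so every element
    # can be addressed with flat index arithmetic, with no iterator
    # indirection in the inner loop.
    outer = 1
    for d in dims[:axis]:
        outer *= d
    axis_len = dims[axis]
    inner = 1
    for d in dims[axis + 1:]:
        inner *= d

    out = [0] * len(data)
    for o in range(outer):
        base = o * axis_len * inner
        # The first slice along the axis is copied as-is.
        for i in range(inner):
            out[base + i] = data[base + i]
        # Every later slice adds the previous output slice element-wise.
        for a in range(1, axis_len):
            off = base + a * inner
            for i in range(inner):
                out[off + i] = out[off - inner + i] + data[off + i]
    return out

print(cumsum_flat([1, 1, 1, 1], (4,), 0))  # [1, 2, 3, 4]

For a 1D tensor, outer and inner are both 1, so the whole computation collapses to a single tight loop over contiguous memory.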

Motivation and Context

Currently, the CumSum operator performs very poorly on 1D tensors (it was slower than a Python loop). This is caused by the extensive use of SliceIterators.

Here is a benchmark demonstrating the issue:

import time
import ndonnx as ndx
import onnxruntime as ort
import numpy as np
import onnx

def test_cumsum(sz):
    # Build a one-node CumSum model with ndonnx, then time a single
    # onnxruntime inference over a length-sz int64 tensor.
    a = ndx.array(shape=(sz,), dtype=ndx.int64)
    b = ndx.cumsum(a)
    model = ndx.build({'a': a}, {'b': b})
    onnx.save(model, "model.onnx")

    input = np.ones(sz, np.int64)
    start = time.time()
    result = ort.InferenceSession(model.SerializeToString()).run(None, {'a': input})
    end = time.time()
    return end - start

def test_cumsum_by_hand(sz):
    # Baseline: the same cumulative sum computed by a plain Python loop.
    input = np.ones(sz, np.int64)
    start = time.time()
    answer = [0]
    for i in input:
        answer.append(answer[-1] + i)
    end = time.time()
    return end - start

print(test_cumsum(int(1e7)))
print(test_cumsum_by_hand(int(1e7)))

Before:

test_cumsum:         0.9794480800628662
test_cumsum_by_hand: 0.4518160820007324

After:

test_cumsum:         0.02483987808227539
test_cumsum_by_hand: 0.5496008396148682
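
For reference, an equivalent one-node model can be constructed without ndonnx by using onnx.helper directly. The following is a hedged sketch (the tensor names and opset version are assumptions, not taken from the PR):

import numpy as np
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# One CumSum node; the axis is supplied as a scalar int64 initializer.
node = helper.make_node("CumSum", inputs=["a", "axis"], outputs=["b"])
axis_init = helper.make_tensor("axis", TensorProto.INT64, dims=[], vals=[0])
graph = helper.make_graph(
    [node],
    "cumsum_bench",
    inputs=[helper.make_tensor_value_info("a", TensorProto.INT64, [None])],
    outputs=[helper.make_tensor_value_info("b", TensorProto.INT64, [None])],
    initializer=[axis_init],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)

# Sanity check: the cumulative sum of ten ones is 1..10.
sess = ort.InferenceSession(model.SerializeToString())
(result,) = sess.run(None, {"a": np.ones(10, np.int64)})
assert np.array_equal(result, np.arange(1, 11))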

The model.onnx: [image]

The flame graph: [image: profile-3]

snnn added the core runtime label (issues related to core runtime) on Sep 13, 2024
snnn (Member) commented Sep 13, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline


You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

snnn (Member) commented Sep 13, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 13, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 13, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

neNasko1 (Contributor, Author) commented:

@snnn can we rerun the pipelines? It seems there was some conversion issue.

snnn (Member) commented Sep 16, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 16, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 16, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn (Member) commented Sep 17, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 17, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 17, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

neNasko1 (Contributor, Author) commented:

@snnn all CPU tests are passing. I don't think the CI failures are caused by this change, since they occur only on the CUDA and DML pipelines and the logs contain this line:

1: Run failed but expected success: CUDA failure 702: the launch timed out and was terminated ; GPU=0 ; hostname=7e50a0e1c000000 ; file=D:\a\_work\1\s\onnxruntime\core\providers\cuda\gpu_data_transfer.cc ; line=73 ; expr=cudaMemcpyAsync(dst_data, src_data, bytes, cudaMemcpyDeviceToHost, static_cast<cudaStream_t>(stream.GetHandle())); 

Could you let me know if there is anything left for me to do?

snnn (Member) commented Sep 17, 2024:

I will run the pipelines again.

neNasko1 (Contributor, Author) commented:

I think the test was failing because the tensor was too big and it timed out. Can we rerun one last time?

snnn (Member) commented Sep 17, 2024:

/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline,

snnn (Member) commented Sep 17, 2024:

/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline

snnn (Member) commented Sep 17, 2024:

/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 5 pipeline(s).

snnn merged commit 275eb40 into microsoft:main on Sep 17, 2024. 71 checks passed.
snnn (Member) commented Sep 17, 2024:

Thank you!
