
Conversation

Arkar-Hema
Contributor

Combine Parallel Dense

CombineParallelDense is an optimization pass that merges multiple parallel ONNXGemmOp (Dense / fully connected) operations into a single, more efficient Dense layer. This reduces redundant computation, improves memory efficiency, and enhances hardware utilization.

The pass identifies Dense (Gemm) operations that:

  • Share the same input tensor.
  • Have identical attributes such as alpha, beta, transA and transB (ensuring compatibility).
  • May have different output dimensions (number of neurons) but maintain compatible weight shapes for concatenation.

Let's assume an example input:

  • Input Shape: (1, 512)
  • Dense A: out_features = 256
  • Dense B: out_features = 128
  • Dense C: out_features = 64
  • Attributes: transB = 0, alpha = 1.0, beta = 1.0

Before Optimization (Three Parallel Gemms)

  • Each Gemm performs a full matrix multiplication ((1×512) × (512×N)).
  • Three separate weight and bias tensors are read, and three separate outputs are produced.
    - Memory reads: the full input is read three times (once per Gemm).
  • Post-processing: a Concat(axis=1) merges the three outputs into a single output Y (1×448).

After Optimization (Combined Dense)

  • Total output features: 256 + 128 + 64 = 448
  • All weights are concatenated along the output-channel axis → new weight shape: (512, 448)
  • Biases are concatenated as well → new bias shape: (448)
  • A single ONNXGemmOp computes Y (1×448) directly (see the sketch below)
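
As a sanity check of the arithmetic, here is a minimal numpy sketch of the transformation's math (not the onnx-mlir implementation); shapes follow the example above:

    import numpy as np

    # Shapes from the example above (1x512 input; 256/128/64 output features).
    x = np.random.rand(1, 512).astype(np.float32)
    wa, wb, wc = (np.random.rand(512, n).astype(np.float32) for n in (256, 128, 64))
    ba, bb, bc = (np.random.rand(n).astype(np.float32) for n in (256, 128, 64))

    # Before: three parallel Gemm ops (alpha = beta = 1.0, transB = 0) plus a Concat.
    y_parallel = np.concatenate([x @ wa + ba, x @ wb + bb, x @ wc + bc], axis=1)

    # After: one Gemm whose weight and bias are concatenated along the output-channel axis.
    w_comb = np.concatenate([wa, wb, wc], axis=1)   # (512, 448)
    b_comb = np.concatenate([ba, bb, bc])           # (448,)
    y_combined = x @ w_comb + b_comb                # (1, 448)

    assert np.allclose(y_parallel, y_combined, atol=1e-5)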

Improvement in performance metrics

Latency Improvement: 7-15%
Throughput Improvement: 8-14%
Memory Usage Improvement: 10-12%

@jenkins-droid
Collaborator

Can one of the admins verify this patch?

Signed-off-by: Arkar-Hema <[email protected]>


@tungld
Collaborator

tungld commented Apr 17, 2025

@Arkar-Hema A general question: in what kind of models have you seen this kind of pattern: multiple Gemm ops followed by a Concat op? and also similar patterns you have recently created PRs for? Just curious on how practical it is. Thanks!

@Arkar-Hema
Contributor Author

@Arkar-Hema A general question: in what kind of models have you seen this kind of pattern: multiple Gemm ops followed by a Concat op? and also similar patterns you have recently created PRs for? Just curious on how practical it is. Thanks!

  • Models with the CombineParallelDense pattern (Combine parallel dense Optimization pass in ONNX Dialect #3123):
    These contain multiple Gemm ops, though not always followed by a Concat. I added the Concat condition to the pass so it would still handle those cases gracefully if present. Some models with this pattern include:
  1. Bertsquad-8
  2. Bertsquad-10
  3. Bertsquad-12
  4. FasterRCNN-10

  1. ResNet101-DUC-12
  2. ResNet101-DUC-7
  3. emotion-ferplus models
  4. caffenet models
  5. Densenet models
  6. googlenet models
  7. inception models
  8. rcnn-ilsvrc13 models
  9. resnet models
  10. vgg models

  1. retinanet models
  2. version-RFB-320
  3. version-RFB-640
  4. googlenet models
  5. inception models
  6. resnet models
  7. squeezenet models

@Arkar-Hema
Contributor Author

@tungld could you please verify this patch?

@tungld
Collaborator

tungld commented Apr 22, 2025

@Arkar-Hema thank you for the information!

I have some general comments:

  • I think that when multiple GEMM ops are followed by a concat, the performance should in theory be better. But could you run with multiple input sizes to see how much the performance benefits in practice?
  • When multiple GEMM ops are NOT followed by a concat (the case for the models you listed), you need a split, and I think the split axis is the innermost dimension. I am not sure how slow the split is and whether we can get a speedup or not. Could you do a performance comparison to see if you can achieve a speedup in this case?
  • Are you targeting this optimization for CPU, or is it beneficial for AI accelerators as well, given that AI accelerators may use special data layouts that are not convenient for concat or split?

Thanks.

@Arkar-Hema
Contributor Author

I ran performance benchmarks across a range of input sizes for both the GEMM → Concat and the Combined GEMM → Split cases. Results show that:

  • In the Concat case, the optimization provides a consistent latency improvement of 2-7% and a throughput improvement of 1-5%.
    [benchmark charts]

  • In the cases where it splits, the optimization provides a consistent latency improvement of 1-7% and a throughput improvement of 1-8% (see the sketch below).
    [benchmark charts]

  • I've currently targeted this pass at CPU backends only.
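
To illustrate the Split case numerically, here is a minimal numpy sketch of the math being exercised (not the pass implementation; shapes follow the example in the PR description):

    import numpy as np

    x = np.random.rand(1, 512).astype(np.float32)
    sizes = [256, 128, 64]
    weights = [np.random.rand(512, n).astype(np.float32) for n in sizes]
    biases = [np.random.rand(n).astype(np.float32) for n in sizes]

    # Combined Gemm, then a Split along the innermost (output-channel) axis.
    y_combined = x @ np.concatenate(weights, axis=1) + np.concatenate(biases)
    y_a, y_b, y_c = np.split(y_combined, np.cumsum(sizes)[:-1], axis=1)

    # Each slice matches the corresponding standalone Gemm output.
    for y_part, w, b in zip((y_a, y_b, y_c), weights, biases):
        assert np.allclose(y_part, x @ w + b, atol=1e-5)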

Collaborator

@tungld tungld left a comment


Thanks @Arkar-Hema for the experiments! Did you compile your programs with -O3?

Since this parallel fusion may not work for accelerators, could you create a compile option to enable this if needed, for example -fuse-parallel-onnx-gemm?

I don't think you need to handle the case where there is a concat after multiple gemms. Just emit a split op, then later you can write a simple canonicalization rule for concat to fuse Split -> Concat.
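
For reference, this canonicalization relies on the identity that splitting a tensor and concatenating all of the pieces back along the same axis reproduces the original tensor; a minimal numpy illustration (not onnx-mlir code):

    import numpy as np

    y = np.random.rand(1, 448).astype(np.float32)
    pieces = np.split(y, [256, 384], axis=1)                  # 256/128/64 channels
    assert np.array_equal(np.concatenate(pieces, axis=1), y)  # Concat undoes the Split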

Below are my first-round comments; most of them are for simplifying the code and making it easier to follow. However, the important thing is that you need to check the input C carefully because it is broadcastable.


@Arkar-Hema
Contributor Author

Thanks @Arkar-Hema for the experiments! Did you compile your programs with -O3?

Since this parallel fusion may not work for accelerators, could you create a compile option to enable this if needed, for example -fuse-parallel-onnx-gemm?

I don't think you need to handle the case where there is a concat after multiple gemms. Just emit a split op, then later you can write a simple canonicalization rule for concat to fuse Split -> Concat.

Below are my first-round comments; most of them are for simplifying the code and making it easier to follow. However, the important thing is that you need to check the input C carefully because it is broadcastable.

I have added it. Thanks!

Arkar-Hema added 3 commits May 2, 2025 05:00
Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>


Signed-off-by: Arkar-Hema <[email protected]>


@AlexandreEichenberger
Collaborator

@jenkins-droid test this please


Signed-off-by: Arkar-Hema <[email protected]>



Signed-off-by: Arkar-Hema <[email protected]>


@AlexandreEichenberger
Collaborator

@jenkins-droid test this please

@tungld
Collaborator

tungld commented May 12, 2025

Hi @Arkar-Hema, when addressing a comment, could you please provide a brief explanation of how you did so? This will make the review process easier. Thanks!

Signed-off-by: Arkar-Hema <[email protected]>
Signed-off-by: Arkar-Hema <[email protected]>

@Arkar-Hema Arkar-Hema requested a review from tungld May 19, 2025 03:33
      mlir::cast<ShapedType>(b.getResult().getType()).getShape();
  // Output channels is the last dim
  if (aOutputShape.back() != bOutputShape.back())
    return false;
Collaborator

Both Biases as tensor<1xf32>:
If both biases are of shape tensor<1xf32>, I now check their corresponding Gemm output shapes and ensure their output channels (last dimension) match before considering them compatible. If they differ, the function returns false, as merging them without this check would be invalid.

It does not make sense to me how this can solve the problem. You must check that there is no broadcasting here; for example, the last dim of the output must also be 1:

if (aOutputShape.back() != 1 || bOutputShape.back() != 1)
   return false;

Also, please do add a lit test for this case, to make sure gemm ops are not merged.
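
For intuition, here is a small numpy example (hypothetical shapes) of why two broadcastable tensor<1xf32> biases cannot simply be concatenated when the Gemm outputs have more than one channel:

    import numpy as np

    x = np.random.rand(1, 4).astype(np.float32)
    wa = np.random.rand(4, 3).astype(np.float32)   # Gemm A: 3 output channels
    wb = np.random.rand(4, 2).astype(np.float32)   # Gemm B: 2 output channels
    ba = np.array([0.5], dtype=np.float32)         # bias of shape (1,), broadcast over 3 channels
    bb = np.array([1.5], dtype=np.float32)         # bias of shape (1,), broadcast over 2 channels

    # Original ops: each scalar-like bias broadcasts over its own Gemm's channels.
    y_ref = np.concatenate([x @ wa + ba, x @ wb + bb], axis=1)   # shape (1, 5)

    # Naive merge: the concatenated bias has shape (2,), not (5,), so the original
    # per-Gemm broadcast is lost and the add is not even shape-compatible.
    w_merged = np.concatenate([wa, wb], axis=1)                  # shape (4, 5)
    b_merged = np.concatenate([ba, bb])                          # shape (2,)
    # (x @ w_merged) + b_merged  ->  would raise a broadcasting error here.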

  Type unrankedTensorType = mlir::UnrankedTensorType::get(elementType);
  Type newWeightType = unrankedTensorType;
  Value newWeight =
      create.onnx.concat(newWeightType, weightValues, concatWeightAxis);
Collaborator

Replace newWeightType by unrankedTensorType. It's redundant to define newWeightType.

  }

  Type newBiasType = unrankedTensorType;
  Value newBias = create.onnx.concat(newBiasType, biasValues, 0);
Collaborator

Replace newBiasType by unrankedTensorType. It's redundant to define newBiasType.

  auto aOutputShape =
      mlir::cast<ShapedType>(a.getResult().getType()).getShape();
  auto bOutputShape =
      mlir::cast<ShapedType>(b.getResult().getType()).getShape();
Collaborator

Replace these by:

ArrayRef<int64_t> aOutputShape = getShape(a.getY().getType());
ArrayRef<int64_t> bOutputShape = getShape(b.getY().getType());


  auto newGemm = rewriter.create<ONNXGemmOp>(loc, newOutputType, input,
      newWeight, newBias, gemmOp1.getAlphaAttr(), gemmOp1.getBetaAttr(),
      gemmOp1.getTransAAttr(), gemmOp1.getTransBAttr());
Collaborator

Please replace this by

Value newGemmOutput = create.onnx.gemm(unrankedTensorType, input,
        newWeight, newBias, gemmOp1.getAlphaAttr(), gemmOp1.getBetaAttr(),
        gemmOp1.getTransAAttr(), gemmOp1.getTransBAttr());

    biasValues.push_back(gemm.getC());
  } else {
    auto gemmShape =
        mlir::cast<ShapedType>(gemm.getResult().getType()).getShape();
Collaborator

Please use: ArrayRef<int64_t> gemmShape = getShape(gemm.getY().getType());


  ArrayRef<int64_t> splitSizes(splitSizesVec);
  ValueRange splitResults = onnx_mlir::emitSplitByChannels(
      rewriter, loc, newGemm.getResult(), splitSizes, splitAxis);
Collaborator

Please replace this by

    SmallVector<Type, 4> splitTypes(splitSizes.size(), unrankedTensorType);
    ValueRange splitResults = create.onnx.split(
        splitTypes, newGemmOutput, create.onnx.constantInt64(splitSizes), splitAxis);

  auto gemmShape =
      mlir::cast<ShapedType>(gemm.getResult().getType()).getShape();
  Value zeroBias = create.onnx.constant(DenseElementsAttr::get(
      RankedTensorType::get({gemmShape[splitAxis]}, elementType), 0.0));
Collaborator

Check in the areCompatible() function that gemmShape[splitAxis] is a static dimension. Otherwise, creating the constant tensor here fails.

  else if (aCShape[0] != bCShape[0])
    return false;
  }
  return true;
Collaborator

When aC is None, do check that the last dim of aOutput is static. Otherwise, the later code that creates a constant tensor of zeros using the last dim of aOutput will fail.

Check the same thing for bC.

Please add a lit test for the case where aC or bC is None.
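
As a side note on why the static dimension matters, a minimal numpy sketch (hypothetical shapes): a Gemm with C == None behaves like a Gemm with an all-zero bias, but materializing that zero bias requires the output-channel count to be a known constant:

    import numpy as np

    x = np.random.rand(1, 4).astype(np.float32)
    w = np.random.rand(4, 3).astype(np.float32)

    out_channels = w.shape[1]                      # must be static to build the constant
    zero_bias = np.zeros(out_channels, dtype=np.float32)
    assert np.allclose(x @ w, x @ w + zero_bias)   # C == None is equivalent to a zero bias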

      newWeight, newBias, gemmOp1.getAlphaAttr(), gemmOp1.getBetaAttr(),
      gemmOp1.getTransAAttr(), gemmOp1.getTransBAttr());

  // Check for common ConcatOp
Collaborator

Check this earlier, just after you collect all parallelGemms. The reason is that you have a return failure() here, which may abort the whole rewrite after the new weight, new bias, and new gemm have already been created. Moving this check before creating any new ops keeps the IR clean.
