FigConvNet performance improvements #822

Open · wants to merge 4 commits into main
Conversation

coreyjadams
Collaborator

PhysicsNeMo Pull Request

Description

While investigating how to make FigConvNet and DoMINO domain parallel, I ran a profile of FigConvNet and discovered some low-hanging fruit for performance improvements. This PR addresses them; I'll summarize the changes below, but first some selected results. At batch size 1, we see over a 2x improvement on A100:

[benchmark figure: A100, batch size 1]

And that's a 3x improvement on Hopper:
[benchmark figure: Hopper, batch size 1]

Batch size 8 cannot fit the larger image size, but for smaller images we see a 2.5x improvement on A100:
[benchmark figure: A100, batch size 8]

And a 2x improvement on Hopper:
[benchmark figure: Hopper, batch size 8]

The changes:

  • A number of functions, most notably grid_init, were doing data transfers during the forward pass of the network (see the grid-buffer sketch after this list).
  • The layer normalization layers are significantly more efficient when leveraging Transformer Engine; it's now a configurable option in the network (see the LayerNorm sketch after this list).
  • Emptying the CUDA cache was a bottleneck in the training script; dropping it makes memory constraints slightly tighter, but manageable.
  • Finally, the Warp-based radius search introduces an unavoidable sync point. The original implementation did this once per batch item (B times total); I've refactored it to sync once per batch, which gives a few-percent boost at larger batch sizes (see the batched-sync sketch after this list).
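
A minimal sketch of the kind of fix for the grid_init item, assuming the grid coordinates can be built once at construction time and registered as a buffer; GridInit, its resolution argument, and the shapes are illustrative, not FigConvNet's actual API:

import torch
import torch.nn as nn

class GridInit(nn.Module):
    # Hypothetical stand-in for a grid_init-style helper; names and shapes
    # are illustrative only.
    def __init__(self, resolution=(64, 64, 64)):
        super().__init__()
        axes = [torch.linspace(-1.0, 1.0, r) for r in resolution]
        grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
        # Registering the grid as a buffer moves it to the GPU once with
        # .to(device), instead of rebuilding it on the CPU and copying it
        # to the device inside every forward pass.
        self.register_buffer("grid", grid, persistent=False)

    def forward(self) -> torch.Tensor:
        # Already on the right device: no host-to-device transfer here.
        return self.grid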
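
The LayerNorm sketch: a hypothetical factory (make_layer_norm and the use_te flag are not the PR's actual interface) that picks the fused Transformer Engine implementation when requested and falls back to standard PyTorch otherwise:

import torch.nn as nn

def make_layer_norm(hidden_size: int, use_te: bool = False) -> nn.Module:
    # Hypothetical helper: choose the fused Transformer Engine LayerNorm when
    # enabled, otherwise use the standard PyTorch LayerNorm.
    if use_te:
        import transformer_engine.pytorch as te
        return te.LayerNorm(hidden_size)
    return nn.LayerNorm(hidden_size)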
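
The batched-sync sketch: assuming the per-item neighbor-count tensors are already on the GPU (e.g. via wp.to_torch from the Warp count pass), their totals can be reduced on the device and read back with one pinned-buffer copy and one synchronization for the whole batch, instead of one per item. This is a sketch of the idea, not the PR's exact code:

import torch

def batch_totals_single_sync(count_tensors):
    # count_tensors: one GPU tensor of per-query result counts per batch item.
    totals = torch.stack([c.sum() for c in count_tensors])  # stays on the GPU
    # One pinned host buffer, one async copy, one sync for the whole batch.
    host_totals = torch.empty(totals.shape, dtype=totals.dtype, pin_memory=True)
    host_totals.copy_(totals, non_blocking=True)
    torch.cuda.synchronize()
    return host_totals.tolist()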

What's left on the table?

  • The model has significant work that is independent (point→grid and grid→point) and could be parallelized onto separate streams (see the stream sketch after this list).
    • Because of the blocking CPU transfers in the Warp kernel, we can't get a boost from this yet; it would also need threading.
  • The radius search could be accelerated with stream concurrency too. However, it's part PyTorch, part Warp, and I was seeing race conditions and illegal memory accesses in certain cases. It's not included here, but it's possible in the future.
  • The central part of the network, performing the grid down/up blocks, suffers from poor GPU occupancy. It could be accelerated with CUDA graphs and kernel fusion, but the organization of the model makes that a challenge: those techniques require pure tensors in and out, while the inputs and outputs here are Python classes containing tensors (see the graph-capture sketch after this list).
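
The stream sketch: how the two independent branches could overlap on separate CUDA streams (the function and branch names are placeholders); as noted above, the blocking CPU transfers in the Warp path currently prevent any real overlap:

import torch

def run_branches_concurrently(point_to_grid, grid_to_point, points, grid):
    # point_to_grid / grid_to_point are stand-ins for the model's two
    # independent branches.
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Both side streams wait for prior work on the default stream.
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s1):
        grid_feats = point_to_grid(points)
    with torch.cuda.stream(s2):
        point_feats = grid_to_point(grid)

    # Rejoin: the default stream waits for both branches before using results.
    torch.cuda.current_stream().wait_stream(s1)
    torch.cuda.current_stream().wait_stream(s2)
    return grid_feats, point_feats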
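
The graph-capture sketch: one way to work around the tensors-in-classes problem is to flatten the container to plain tensors, capture the block, and hand back the static outputs. GridBlockIO, the single warm-up pass, and the two-field layout are illustrative simplifications, not the model's real types:

from dataclasses import dataclass
import torch

@dataclass
class GridBlockIO:
    # Stand-in for a Python class holding tensors.
    features: torch.Tensor
    coords: torch.Tensor

def graph_capture_block(block, example: GridBlockIO):
    static_in = [example.features.clone(), example.coords.clone()]

    # Warm-up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        block(GridBlockIO(*static_in))
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = block(GridBlockIO(*static_in))

    def replay(io: GridBlockIO) -> GridBlockIO:
        # Copy new inputs into the captured (static) buffers, replay the
        # graph, and return the static outputs (same Python class instance).
        static_in[0].copy_(io.features)
        static_in[1].copy_(io.coords)
        graph.replay()
        return static_out

    return replay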

So I stopped there; it's still better than it was!

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Review comment on this diff hunk:

result_count_torch = wp.to_torch(result_count)
torch.cumsum(result_count_torch, dim=0, out=torch_offset[1:])
# Allocate a pinned tensor on the CPU:
torch_count = torch.empty(1, dtype=torch.int32, pin_memory=True)

Collaborator:

total_count?
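
For context, a hedged guess (not the PR's actual code) at how a pinned buffer like torch_count is then used: copy the final prefix-sum entry, i.e. the total result count, to the host asynchronously, then synchronize once before reading it.

# Likely continuation (assumed): async device-to-host copy of the last offset
# into the pinned buffer, then a single sync before the host reads it.
torch_count.copy_(torch_offset[-1:], non_blocking=True)
torch.cuda.synchronize()
total_count = int(torch_count.item())  # safe: the copy above has completed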
