FigConvNet performance improvements #822

Open · wants to merge 4 commits into main
Conversation

coreyjadams
Collaborator

PhysicsNeMo Pull Request

Description

While investigating how to make FigConvNet and DoMINO domain parallel, I ran a profile of FigConvNet and discovered some low-hanging fruit for performance improvements. This PR addresses them; I'll summarize the changes below, but first some selected results. At batch size 1, we see over a 2x improvement on A100:

[benchmark figure: A100, batch size 1]

And that's a 3x improvement on Hopper:
[benchmark figure: Hopper, batch size 1]

Batch size 8 cannot fit the larger image size, but for smaller images we see a 2.5x improvement on A100:
[benchmark figure: A100, batch size 8]

And a 2x improvement on Hopper:
[benchmark figure: Hopper, batch size 8]

The changes:

  • A number of functions, most notably grid_init, were doing data transfers during the forward pass of the network (see the grid-buffer sketch after this list).
  • The layer normalization layers are significantly more efficient when leveraging Transformer Engine; it's now a configurable option in the network (see the LayerNorm sketch after this list).
  • Emptying the CUDA cache was a bottleneck in the training script; dropping it makes memory constraints slightly tighter, but manageable.
  • Finally, the Warp-based radius search introduces an unavoidable sync point. The original implementation did this once per batch item (B times total); I've refactored it to sync once per batch, which gives a few-percent boost at larger batch sizes (see the batched-sync sketch after this list).
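
A minimal sketch of the kind of fix for the grid_init item, assuming the grid coordinates can be built once at construction time and registered as a buffer; GridInit, its resolution argument, and the shapes are illustrative, not FigConvNet's actual API:

import torch
import torch.nn as nn

class GridInit(nn.Module):
    # Hypothetical stand-in for a grid_init-style helper; names and shapes
    # are illustrative only.
    def __init__(self, resolution=(64, 64, 64)):
        super().__init__()
        axes = [torch.linspace(-1.0, 1.0, r) for r in resolution]
        grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
        # Registering the grid as a buffer moves it to the GPU once with
        # .to(device), instead of rebuilding it on the CPU and copying it
        # to the device inside every forward pass.
        self.register_buffer("grid", grid, persistent=False)

    def forward(self) -> torch.Tensor:
        # Already on the right device: no host-to-device transfer here.
        return self.grid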
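
The LayerNorm sketch: a hypothetical factory (make_layer_norm and the use_te flag are not the PR's actual interface) that picks the fused Transformer Engine implementation when requested and falls back to standard PyTorch otherwise:

import torch.nn as nn

def make_layer_norm(hidden_size: int, use_te: bool = False) -> nn.Module:
    # Hypothetical helper: choose the fused Transformer Engine LayerNorm when
    # enabled, otherwise use the standard PyTorch LayerNorm.
    if use_te:
        import transformer_engine.pytorch as te
        return te.LayerNorm(hidden_size)
    return nn.LayerNorm(hidden_size)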
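
The batched-sync sketch: assuming the per-item neighbor-count tensors are already on the GPU (e.g. via wp.to_torch from the Warp count pass), their totals can be reduced on the device and read back with one pinned-buffer copy and one synchronization for the whole batch, instead of one per item. This is a sketch of the idea, not the PR's exact code:

import torch

def batch_totals_single_sync(count_tensors):
    # count_tensors: one GPU tensor of per-query result counts per batch item.
    totals = torch.stack([c.sum() for c in count_tensors])  # stays on the GPU
    # One pinned host buffer, one async copy, one sync for the whole batch.
    host_totals = torch.empty(totals.shape, dtype=totals.dtype, pin_memory=True)
    host_totals.copy_(totals, non_blocking=True)
    torch.cuda.synchronize()
    return host_totals.tolist()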

What's left on the table?

  • The model has significant work that is independent (point→grid and grid→point) and could be parallelized onto separate streams (see the stream sketch after this list).
    • Because of the blocking CPU transfers in the Warp kernel, we can't get a boost from this yet; it would also need threading.
  • The radius search could be accelerated with stream concurrency too. However, it's part PyTorch, part Warp, and I was seeing race conditions and illegal memory accesses in certain cases. It's not included here, but it's possible in the future.
  • The central part of the network, performing the grid down/up blocks, suffers from poor GPU occupancy. It could be accelerated with CUDA graphs and kernel fusion, but the organization of the model makes that a challenge: those techniques require pure tensors in and out, while the inputs and outputs here are Python classes containing tensors (see the graph-capture sketch after this list).
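
The stream sketch: how the two independent branches could overlap on separate CUDA streams (the function and branch names are placeholders); as noted above, the blocking CPU transfers in the Warp path currently prevent any real overlap:

import torch

def run_branches_concurrently(point_to_grid, grid_to_point, points, grid):
    # point_to_grid / grid_to_point are stand-ins for the model's two
    # independent branches.
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    # Both side streams wait for prior work on the default stream.
    s1.wait_stream(torch.cuda.current_stream())
    s2.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(s1):
        grid_feats = point_to_grid(points)
    with torch.cuda.stream(s2):
        point_feats = grid_to_point(grid)

    # Rejoin: the default stream waits for both branches before using results.
    torch.cuda.current_stream().wait_stream(s1)
    torch.cuda.current_stream().wait_stream(s2)
    return grid_feats, point_feats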
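
The graph-capture sketch: one way to work around the tensors-in-classes problem is to flatten the container to plain tensors, capture the block, and hand back the static outputs. GridBlockIO, the single warm-up pass, and the two-field layout are illustrative simplifications, not the model's real types:

from dataclasses import dataclass
import torch

@dataclass
class GridBlockIO:
    # Stand-in for a Python class holding tensors.
    features: torch.Tensor
    coords: torch.Tensor

def graph_capture_block(block, example: GridBlockIO):
    static_in = [example.features.clone(), example.coords.clone()]

    # Warm-up on a side stream before capture, as CUDA graphs require.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        block(GridBlockIO(*static_in))
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = block(GridBlockIO(*static_in))

    def replay(io: GridBlockIO) -> GridBlockIO:
        # Copy new inputs into the captured (static) buffers, replay the
        # graph, and return the static outputs (same Python class instance).
        static_in[0].copy_(io.features)
        static_in[1].copy_(io.coords)
        graph.replay()
        return static_out

    return replay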

So I stopped there; it's still better than it was!

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

Review comment on this diff hunk:

result_count_torch = wp.to_torch(result_count)
torch.cumsum(result_count_torch, dim=0, out=torch_offset[1:])
# Allocate a pinned tensor on the CPU:
torch_count = torch.empty(1, dtype=torch.int32, pin_memory=True)

Collaborator:

total_count?
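
For context, a hedged guess (not the PR's actual code) at how a pinned buffer like torch_count is then used: copy the final prefix-sum entry, i.e. the total result count, to the host asynchronously, then synchronize once before reading it.

# Likely continuation (assumed): async device-to-host copy of the last offset
# into the pinned buffer, then a single sync before the host reads it.
torch_count.copy_(torch_offset[-1:], non_blocking=True)
torch.cuda.synchronize()
total_count = int(torch_count.item())  # safe: the copy above has completed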
