
Conversation

@liqiangxl
Collaborator

No description provided.


github-actions bot commented Nov 25, 2025

Review updated until commit 5cb25f3

Description

  • Implement TMA inner persistent scheduler for normalization operations

  • Split normalization inner operations into TMA and non-TMA paths

  • Add manual scheduling optimizations for dynamic block dimensions

  • Implement vectorized smem2regs operations for improved memory access

  • Add RMS norm and layer norm implementations with workspace support

Changes walkthrough

Relevant files

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Missing Error Handling

In the computeHeuristics method, getInnerPersistentHeuristics may return nullptr when TMA is enabled but no valid heuristic can be derived; the NVF_ERROR on line 184 then aborts instead of reporting a recoverable scheduling failure. Consider checking for nullptr and handling that case explicitly before the result is used.

auto tma_params = normalization_inner::tma::getInnerPersistentHeuristics(
    fusion, runtime_info, data_cache);
NVF_ERROR(tma_params != nullptr);
return tma_params;
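As a sketch of the suggested defensive pattern (the types and helper below are illustrative stand-ins, not nvFuser's actual API), the pointer can be validated with a clear failure path before any dereference:

```cpp
#include <memory>
#include <stdexcept>

// Illustrative stand-in for the heuristics parameter type; not the
// real nvFuser class.
struct HeuristicParams {
  int persistent_batch = 1;
};

// Hypothetical helper that can fail and return nullptr, mirroring
// getInnerPersistentHeuristics in the review comment.
std::unique_ptr<HeuristicParams> tryComputeTmaHeuristics(bool schedulable) {
  if (!schedulable) {
    return nullptr;  // heuristics unavailable for this fusion
  }
  return std::make_unique<HeuristicParams>();
}

std::unique_ptr<HeuristicParams> computeHeuristics(bool schedulable) {
  auto tma_params = tryComputeTmaHeuristics(schedulable);
  // Check before any dereference: fail with a clear message rather
  // than crashing on a null pointer.
  if (tma_params == nullptr) {
    throw std::runtime_error("TMA heuristics could not be computed");
  }
  return tma_params;
}
```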
Unchecked Cast

The dynamic_cast in the schedule method returns nullptr when params is not a ReductionParams; the result must be validated before use, since dereferencing a null pointer is undefined behavior. Note also that the second NVF_ERROR repeats the scheduler_type check already performed by the first.

auto rparams = dynamic_cast<const ReductionParams*>(params);
NVF_ERROR(
    rparams != nullptr && rparams->scheduler_type == schedulerType(),
    "Incorrect parameters sent to InnerPersistentKernelScheduler::schedule",
    params);
NVF_ERROR(
    rparams->scheduler_type ==
    InnerPersistentKernelScheduler::schedulerType());
normalization_inner::non_tma::scheduleInnerPersistent(fusion, rparams);
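The same guard pattern applies to the cast itself. A minimal sketch, using illustrative stand-in types rather than nvFuser's real class hierarchy:

```cpp
#include <stdexcept>

// Illustrative stand-ins for the scheduler parameter hierarchy.
struct HeuristicParams {
  virtual ~HeuristicParams() = default;
};
struct ReductionParams : HeuristicParams {
  int vectorization_factor = 1;
};

// dynamic_cast on a pointer yields nullptr on a type mismatch, so the
// result must be checked before it is dereferenced.
const ReductionParams* asReductionParams(const HeuristicParams* params) {
  auto* rparams = dynamic_cast<const ReductionParams*>(params);
  if (rparams == nullptr) {
    throw std::runtime_error(
        "Incorrect parameters sent to InnerPersistentKernelScheduler::schedule");
  }
  return rparams;
}
```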
Division by Zero Risk

In the getRegisterSharing function, the register redistribution divides by computation_threads and by padded_threads (line 162). If either value is 0, integer division by zero is undefined behavior. Add validation to prevent this.

      (reg_per_thread - tma_branch_regs) * padded_threads / computation_threads;
  if (compute_branch_regs % regs_granularity != 0) {
    compute_branch_regs -= compute_branch_regs % regs_granularity;
    tma_branch_regs = reg_per_thread -
        (compute_branch_regs - reg_per_thread) * computation_threads /
            padded_threads;
  }
  compute_branch_regs =
      std::min(compute_branch_regs, scheduler_utils::max_registers_per_thread);
  return std::make_pair(tma_branch_regs, compute_branch_regs);
}
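A self-contained sketch of the suggested validation, with the divisors checked up front. The formula and constants are illustrative, following the shape of the snippet above rather than the exact nvFuser code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <utility>

// Hypothetical register split between the TMA (producer) branch and the
// computation branch of a warp-specialized kernel. Returns
// {tma_branch_regs, compute_branch_regs}.
std::pair<int64_t, int64_t> splitRegisters(
    int64_t reg_per_thread,
    int64_t tma_branch_regs,
    int64_t padded_threads,
    int64_t computation_threads,
    int64_t regs_granularity) {
  // Validate divisors before any division: integer division by zero is
  // undefined behavior in C++.
  if (padded_threads <= 0 || computation_threads <= 0 ||
      regs_granularity <= 0) {
    throw std::invalid_argument("thread counts and granularity must be > 0");
  }
  // Registers given up by the TMA branch are redistributed to the
  // computation threads, scaled by the thread-count ratio.
  int64_t compute_branch_regs = reg_per_thread +
      (reg_per_thread - tma_branch_regs) * padded_threads /
          computation_threads;
  // Round down to the register-allocation granularity.
  compute_branch_regs -= compute_branch_regs % regs_granularity;
  return {tma_branch_regs, compute_branch_regs};
}
```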

