perf: optimize CGRANode::isOccupied with O(1) periodic lookup (~7.5x speedup) #70

Open
shiyunyao wants to merge 3 commits into tancheng:master from shiyunyao:master
Conversation

@shiyunyao
Contributor

Profiling the mapper using perf (with OpenMP disabled to isolate algorithmic costs) revealed that CGRANode::isOccupied consumes approximately 70-80% of the total execution time.

I replaced the O(N) linear loop with an O(1) direct index lookup.
This change produces results identical to the original logic while eliminating the loop overhead.
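
To make the cost difference concrete, here is a minimal sketch that counts how many slots each form inspects. `OccupancyModel`, its `occupied` vector, and the `probes` counter are hypothetical stand-ins for illustration, not the actual `CGRANode` members; only the index pattern `t_II + t_cycle % t_II` is taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-cycle occupancy table; true means "occupied".
struct OccupancyModel {
    std::vector<bool> occupied;  // one slot per cycle, size == cycle boundary

    // Loop form: probe t_cycle, t_cycle + II, ... and count the slots visited.
    bool isOccupiedLoop(int t_cycle, int t_II, int& probes) const {
        probes = 0;
        for (std::size_t c = static_cast<std::size_t>(t_cycle);
             c < occupied.size(); c += static_cast<std::size_t>(t_II)) {
            ++probes;
            if (occupied[c]) return true;
        }
        return false;
    }

    // Direct form: probe the single slot t_II + t_cycle % t_II.
    // at() throws if that slot lies beyond the cycle boundary.
    bool isOccupiedDirect(int t_cycle, int t_II, int& probes) const {
        probes = 1;
        return occupied.at(static_cast<std::size_t>(t_II + t_cycle % t_II));
    }
};
```

With a 64-cycle table that is entirely free and II = 4, the loop form starting at cycle 0 inspects 16 slots before answering, while the direct form inspects one. The free case is the worst case for the loop, since it cannot exit early.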

To demonstrate scalability, I tested the fir kernel with an enlarged CGRA configuration (rows=16, cols=16 in param.json):

| Metric | Original | This PR |
| --- | --- | --- |
| Time | ~197s | ~26s |
| Speedup | - | ~7.5x |
| Result (II) | 16 | 16 |

With this bottleneck removed, the overhead of OpenMP thread management is now significant relative to the reduced computation time. We may want to revisit the current OpenMP strategy in future optimizations.

src/CGRANode.cpp Outdated
if (p.second == START_PIPE_OCCUPY or p.second == SINGLE_OCCUPY or m_supportDVFS) {
return true;
}
for (pair<DFGNode*, int> p: *(m_dfgNodesWithOccupyStatus[t_II+(t_cycle)%t_II])){
Owner

Is removing this correct? Removing it would mean checking only one specific cycle. However, we need to check cycle, cycle + II, cycle + 2 * II, cycle + 3 * II, etc. WDYT?

Contributor Author

Hi @tancheng, thanks for the review!

You are completely right that modulo scheduling theoretically requires verifying all equivalent cycles.

I double-checked the implementation of setDFGNode and confirmed that m_dfgNodesWithOccupyStatus is only modified in that function. Crucially, the population logic there is strictly periodic:

// In setDFGNode:
for (int cycle = t_cycle % interval; cycle < m_cycleBoundary; cycle += interval) {
    // This loop ensures the occupancy status is identical for ALL modulo cycles.
    m_dfgNodesWithOccupyStatus[cycle]->push_back(...);
}

Since the data is already populated identically for every cycle + k*II, checking a single valid cycle (like t_II + t_cycle % t_II) is mathematically equivalent to checking the entire loop, but much faster.
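
The equivalence claim can be checked on a small model. The sketch below is a simplified illustration, not the mapper's code: `Slots`, `setPeriodic`, `occupiedLoop`, `occupiedDirect`, and `lookupsAgree` are hypothetical names, occupancy is reduced to "slot is non-empty", and it assumes that `interval` equals `t_II` and that the table holds at least `2 * t_II` cycles so the probed index stays in range.

```cpp
#include <cassert>
#include <vector>

// Simplified model: each cycle slot holds the ids of the DFG nodes occupying it.
using Slots = std::vector<std::vector<int>>;

// Mirrors the quoted setDFGNode loop: write into every slot congruent to t_cycle.
void setPeriodic(Slots& slots, int nodeId, int t_cycle, int interval) {
    for (int c = t_cycle % interval; c < static_cast<int>(slots.size()); c += interval)
        slots[c].push_back(nodeId);
}

// Loop-based lookup: visit t_cycle, t_cycle + II, ... up to the boundary.
bool occupiedLoop(const Slots& slots, int t_cycle, int t_II) {
    for (int c = t_cycle; c < static_cast<int>(slots.size()); c += t_II)
        if (!slots[c].empty()) return true;
    return false;
}

// Single-slot lookup at t_II + t_cycle % t_II (assumes size >= 2 * t_II).
bool occupiedDirect(const Slots& slots, int t_cycle, int t_II) {
    return !slots[t_II + t_cycle % t_II].empty();
}

// Returns true when both lookups agree for every in-range t_cycle.
bool lookupsAgree(const Slots& slots, int t_II) {
    for (int t = 0; t < static_cast<int>(slots.size()); ++t)
        if (occupiedLoop(slots, t, t_II) != occupiedDirect(slots, t, t_II)) return false;
    return true;
}
```

After `setPeriodic(slots, 1, 6, 4)` on a 16-cycle table, `lookupsAgree(slots, 4)` holds. If a slot is written without the periodic loop, for example only `slots[2]`, the two lookups diverge at `t_cycle = 2`, so the shortcut depends on every writer preserving the periodic invariant.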

Collaborator

You mean there is no need to check cycle + k*II? Could you go through the other kernels' mapping results to ensure the correctness?

Owner

Then please leave a comment there noting that the DFG node mapping has already been materialized across all cycles at II intervals in setDFGNode().

Owner

@MeowMJ WDYT?

Contributor Author

@MeowMJ I have performed a regression test on 5 kernels (fir, conv, nonlinear, multicycle, dvfs) to verify the correctness.

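| Kernel | Baseline II | Optimized II | Result Status | Routing |
| --- | --- | --- | --- | --- |
| fir | 6 | 6 | Pass | Identical |
| conv | 4 | 4 | Pass | Identical |
| nonlinear | 2 | 2 | Pass | Identical |
| multicycle | 4 | 4 | Pass | Identical |
| dvfs | 4 | 4 | Pass | Identical |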

The optimization is bit-exact on all tested kernels: each achieved the same Initiation Interval (II), and the final placement and routing layouts were identical to the baseline.

Collaborator

@tancheng, I have checked the logic of setDFGNode and m_dfgNodesWithOccupyStatus. This change is correct and helpful. We can even update all functions related to setDFGNode and m_dfgNodesWithOccupyStatus to a simpler implementation.

Owner

> We can even update all functions related to setDFGNode and m_dfgNodesWithOccupyStatus to a simpler implementation.

@MeowMJ can you please elaborate on this, and let @shiyunyao try your idea?

@tancheng tancheng requested a review from MeowMJ January 14, 2026 17:26
@MeowMJ
Collaborator

MeowMJ commented Jan 31, 2026

@shiyunyao I found a problem. In setDFGNode, the duplicate data is placed into m_dfgNodesWithOccupyStatus[k*t_II], where k < CGRANode*t_II, or into m_dfgNodesWithOccupyStatus[t_cycle % t_II + k'*t_II], where k' < CGRANode*t_II - (t_cycle % t_II) / t_II. However, in functions that are related to m_dfgNodesWithOccupyStatus, that is, isOccupied, isStartOrInPipe, isInOrEndPipe, and isEndPipe, they fetch data in m_dfgNodesWithOccupyStatus[t_cycle + k''*t_II], where k'' < CGRANode*t_II - t_cycle / t_II. Please note that the t_cycle is a parameter of these functions.

Though we can update the data fetch pattern in four functions related to m_dfgNodesWithOccupyStatus, it may raise other issues when t_cycle changes in their callers. WDYT? Can you check the value of t_cycle in the four functions?
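
One way to inspect the index patterns described above is to enumerate them for concrete values of `t_cycle`. The helpers below are a hypothetical sketch, not functions from the repository: `writtenSlots` follows the quoted setDFGNode loop, `readSlots` follows the loop-based readers, and `probeIsCovered` tests whether the single slot `t_II + t_cycle % t_II` is in range and belongs to the set the writer fills. `boundary` stands in for `m_cycleBoundary`, and whether `interval` always equals `t_II` is not stated in this thread, so both are kept as separate parameters.

```cpp
#include <cassert>
#include <vector>

// Indices the quoted setDFGNode loop writes for a given t_cycle.
std::vector<int> writtenSlots(int t_cycle, int interval, int boundary) {
    std::vector<int> out;
    for (int c = t_cycle % interval; c < boundary; c += interval) out.push_back(c);
    return out;
}

// Indices the loop-based readers visit: t_cycle, t_cycle + II, ...
std::vector<int> readSlots(int t_cycle, int t_II, int boundary) {
    std::vector<int> out;
    for (int c = t_cycle; c < boundary; c += t_II) out.push_back(c);
    return out;
}

// True when the single probed slot is in range and is one the writer fills.
bool probeIsCovered(int t_cycle, int t_II, int interval, int boundary) {
    const int probe = t_II + t_cycle % t_II;
    if (probe >= boundary) return false;  // would index past the table
    return probe % interval == t_cycle % interval;
}
```

For `t_cycle = 6`, `t_II = interval = 4`, and `boundary = 16`, the writer fills slots 2, 6, 10, 14, the loop-based readers visit 6, 10, 14, and the probed slot 6 is covered. The probe falls outside the table when `boundary` is smaller than `t_II + t_cycle % t_II + 1`, and it misses the written set in the hypothetical case where `interval` differs from `t_II` (for example `t_cycle = 2`, `t_II = 4`, `interval = 8`).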
