perf: optimize CGRANode::isOccupied with O(1) periodic lookup (~7.5x speedup)#70
perf: optimize CGRANode::isOccupied with O(1) periodic lookup (~7.5x speedup)#70shiyunyao wants to merge 3 commits intotancheng:masterfrom
Conversation
src/CGRANode.cpp
Outdated
| if (p.second == START_PIPE_OCCUPY or p.second == SINGLE_OCCUPY or m_supportDVFS) { | ||
| return true; | ||
| } | ||
| for (pair<DFGNode*, int> p: *(m_dfgNodesWithOccupyStatus[t_II+(t_cycle)%t_II])){ |
There was a problem hiding this comment.
Removing this is correct? Removing this would cause only checking one specific cycle. However, we need to check cycle, cycle + II, cycle + 2 * II, cycle + 3 * II, etc. WDYT?
There was a problem hiding this comment.
Hi @tancheng, thanks for the review!
You are completely right that modulo scheduling theoretically requires verifying all equivalent cycles.
I double-checked the implementation of setDFGNode and confirmed that m_dfgNodesWithOccupyStatus is only modified in that function. Crucially, the population logic there is strictly periodic:
// In setDFGNode:
for (int cycle = t_cycle % interval; cycle < m_cycleBoundary; cycle += interval) {
// This loop ensures the occupancy status is identical for ALL modulo cycles.
m_dfgNodesWithOccupyStatus[cycle]->push_back(...);
}Since the data is already populated identically for every cycle + k*II, checking a single valid cycle (like t_II + t_cycle % t_II) is mathematically equivalent to checking the entire loop, but much faster.
There was a problem hiding this comment.
You mean there is no need to check cycle + k*II? Could you go through the other kernels' mapping results to ensure the correctness?
There was a problem hiding this comment.
Then plz leave a comment there, mentioning materializing DFG node mapping across all cycles with II interval has already been done during setDFGNode().
There was a problem hiding this comment.
@MeowMJ I have performed a regression test on 5 kernels (fir, conv, nonlinear, multicycle, dvfs) to verify the correctness.
| Kernel | Baseline II | Optimized II | Result Status | Routing |
|---|---|---|---|---|
fir |
6 | 6 | Pass | Identical |
conv |
4 | 4 | Pass | Identical |
nonlinear |
2 | 2 | Pass | Identical |
multicycle |
4 | 4 | Pass | Identical |
dvfs |
4 | 4 | Pass | Identical |
The optimization is strictly bit-exact. For all tested kernels, not only did we achieve the same Initiation Interval (II), but the final Placement and Routing layouts were also identical to the baseline.
There was a problem hiding this comment.
@tancheng, I have checked the logic of setDFGNode and m_dfgNodesWithOccupyStatus. This change is correct and helpful. We can even update all functions related to setDFGNode and m_dfgNodesWithOccupyStatus to a simpler implementation.
There was a problem hiding this comment.
We can even update all functions related to setDFGNode and m_dfgNodesWithOccupyStatus to a simpler implementation.
@MeowMJ can you please elaborate on this, and let @shiyunyao try your idea?
src/CGRANode.cpp
Outdated
| if (p.second == START_PIPE_OCCUPY or p.second == SINGLE_OCCUPY or m_supportDVFS) { | ||
| return true; | ||
| } | ||
| for (pair<DFGNode*, int> p: *(m_dfgNodesWithOccupyStatus[t_II+(t_cycle)%t_II])){ |
|
@shiyunyao I found a problem. In Though we can update the data fetch pattern in four functions related to |
Profiling the mapper using perf (with OpenMP disabled to isolate algorithmic costs) revealed that CGRANode::isOccupied consumes approximately 70-80% of the total execution time.
I replaced the O(N) linear loop with an O(1) direct index lookup
This change maintains bit-perfect consistency with the original logic but eliminates the loop overhead.
To demonstrate scalability, I tested the fir kernel with an enlarged CGRA configuration (rows=16, cols=16 in param.json):
After removing this bottleneck, the overhead of OpenMP thread management has become relatively significant compared to the reduced computation time. We might need to reconsider the necessity of the current OpenMP strategy in future optimizations