megaboom place stopped working [ERROR GPL-0307] RePlAce divergence detected. Re-run with a smaller max_phi_cof value. #6465
Comments
@maliberty @jeffng-or A standalone test case of failure on megaboom main
Hi! Recently we have noticed divergences happening on stage 3-1 (skip io), and we made modifications to gpl regarding a bivariate normal distribution adjustment used with macros, reducing its effect (PR #6438). Here is the design without any changes, diverging on iteration 500: I tried removing the bivariate effect locally for this megaboom package, but the divergence still happens. I will run again to make sure it is not a false-positive detection. Something that caught my attention is that gpl is not using the space available to the left and to the bottom to place the cells.
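(For reference, the standard bivariate normal density behind the adjustment mentioned above is shown below; exactly how gpl centers and scales it around macros is not spelled out in this thread, so treat the symbols as generic parameters.)

$$
f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^{2}}}
\exp\!\left( -\frac{1}{2(1-\rho^{2})} \left[ \frac{(x-\mu_x)^{2}}{\sigma_x^{2}} - \frac{2\rho (x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y} + \frac{(y-\mu_y)^{2}}{\sigma_y^{2}} \right] \right)
$$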
I am also observing this issue on two different designs on two different non-public PDKs. Both designs are large (3.7 / 4.6M instances). One has macros, one does not. I have not seen this issue previously with these designs. We are using an OpenROAD build from December 11th: 8495fc8. We have not changed our TCL parameterization of OR within that timeframe. If there has been a regression, it may have occurred before 12/11. These designs are private and large so they would be difficult to share, but if the megaboom testcase is not sufficient to identify the issue let me know.
Hi @mikesinouye, at what stage exactly do you get the error? Is it on stage 3-1 (skip io), or during global placement itself on stage 3-3?
Hey @gudeh, we use our own custom flow instead of ORFS, but it is the second/final iteration of global placement with pins/macros already set, etc. I believe it would best align with ORFS 3-3. I noticed that the recently enabled resizer in gpl is causing large area swings, and in these cases causing the percentage of overlap to regress:
After the final timing-driven non-virtual resizing, the overflow goes from 0.150 to 0.607, which seems unexpected to me.
It is particularly odd since the
Indeed, that's a big jump after the last timing-driven iteration. I would have to take a look in debug mode; it is unfortunate that it is a private PDK. I believe @mikesinouye's issue is different from the one on megaboom in the current GH issue. Either way, you can remove the non-virtual iterations with the new gpl TCL command:
Concerning megaboom, I am investigating the issue and should provide new insights soon enough.
@gudeh Purely for my planning purposes, do you have any idea on how long this will take to fix? Days, weeks, months? Any tips on a workaround? |
Hello @oharboe. I apologize, but we have not yet been able to determine the exact cause of the divergence. There is evidence suggesting it may be related to how we handle blockages and the limited ability of gpl to move cells out of blockages. Unfortunately, we are still investigating and cannot confirm the root cause at this point. I estimate that finding a solution could take some time, potentially weeks or even months. In the meantime, I plan to try a workaround: maintaining a snapshot of the placement with the lowest overflow achieved. If a divergence is detected, we could deliver the saved snapshot instead of throwing an error. I hope this is not too hard to implement; I will try to reuse what is already done in routability mode.
@gudeh Could it be that this problem has been there all along and that it is just bad luck that some initial conditions for the flow have changed and now it no longer works? I have a feeling that any slight rearrangement of the macros would make placement work...
Potentially. I tried running this megaboom with a gpl version from Oct/2024, before some changes I made, and it still diverged. I am thinking of trying older versions as well.
I agree. I believe a floorplan that offers a more unified area for the instances makes it easier for the placer. The original GPL implementation allowed macros to move, but fixing the macros has certain implications, as I understand it; for example, our limited ability to move cells off of macros effectively. The divergences occurring with GPL have been happening far more frequently (if not exclusively) in designs that include macros.
@oharboe did you try delta debug on this? |
no |
@gudeh @maliberty @jeffng-or FYI, modifying the die/core area slightly (I increased it; I didn't try making it slightly smaller) gets megaboom past placement: The-OpenROAD-Project/megaboom#227
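(For anyone trying the same workaround outside ORFS, here is a rough sketch of regenerating the floorplan with a slightly larger die/core area; the coordinates are placeholders rather than megaboom's real dimensions, and in ORFS the same change is normally made through the DIE_AREA/CORE_AREA config variables.)

```tcl
# Sketch only: placeholder die/core coordinates in microns, not megaboom's values.
# ORFS users would instead bump DIE_AREA / CORE_AREA in the design's config.mk.
initialize_floorplan \
    -die_area  {0 0 5100 5100} \
    -core_area {10 10 5090 5090} \
    -site      $::env(PLACE_SITE)  ;# placement site name, assumed exported by ORFS
```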
@maliberty @tspyrou I suspect that deltaDebug.py is just timing out on read_sdc, not making progress... |
Allow me to share some thoughts. From the experiments I tried, there were two that led to a convergence:
As a temporary workaround you could try adding |
Will try next time, but for now increasing the die/core area worked. |
@gudeh @maliberty I don't think deltaDebug.py will do much; it has been running for a couple of days but only got down to:
I imagine that to make it easier to understand what is going wrong here more quickly, there would need to be tens, hundreds, maybe thousands, but not tens of thousands of instances. A 10-minute turnaround time isn't a big improvement here according to my testing.
Hi. I implemented the idea of saving a snapshot and finishing with it instead of throwing an error. For this megaboom, it finished with the following placement, which looks a lot better than the previous ones with clear divergences.
whittled down 2_floorplan.odb
I had been thinking the macros were the trigger for the issue, but this new test case doesn't have any macros! It's a good elimination of a false track.
Nvm, the blockages are still there. |
I substituted the recent odb shared by @oharboe into the artifact previously sent. When I run the run-me for stage 3-1 I see these messages and it takes around 2 hours to get to placement. Does this happen to you too?
As it isn't a timing-driven placement, I just removed read_sdc.
Describe the bug
The-OpenROAD-Project/megaboom#225
Tried a PLACE_DENSITY sweep with higher and lower values, no luck.
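(For reference, a minimal sketch of what such a sweep maps to at the OpenROAD TCL level; the ODB name and density value are placeholders, and ORFS normally drives this through the PLACE_DENSITY variable. Each density value would be tried in a fresh openroad session against the same floorplan snapshot.)

```tcl
# Sketch only: file names and the 0.60 density are placeholders for the real inputs.
read_db 2_floorplan.odb
global_placement -density 0.60   ;# PLACE_DENSITY in ORFS maps to this switch
write_db 3_place.odb
```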
Untar and run https://drive.google.com/file/d/1xbTQs6K922zpV9Qw7jSq4Il2Tgjg8ab-/view?usp=sharing
Expected Behavior
Placement should work or give an actionable error message.
Environment
To Reproduce
See above
Relevant log output
No response
Screenshots
No response
Additional Context
No response