megaboom place stopped working [ERROR GPL-0307] RePlAce divergence detected. Re-run with a smaller max_phi_cof value. #6465

Open
oharboe opened this issue Jan 3, 2025 · 26 comments

@oharboe
Collaborator

oharboe commented Jan 3, 2025

Describe the bug

The-OpenROAD-Project/megaboom#225

Tried a PLACE_DENSITY sweep with higher and lower values, no luck.

image

untar and run https://drive.google.com/file/d/1xbTQs6K922zpV9Qw7jSq4Il2Tgjg8ab-/view?usp=sharing

OpenROAD v2.0-17966-g1ffb8502d 
Features included (+) or not (-): +GPU +GUI +Python
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
global_placement -skip_io -density 0.54 -pad_left 0 -pad_right 0
[INFO GPL-0002] DBU: 1000
[INFO GPL-0003] SiteSize: (  0.054  0.270 ) um
[INFO GPL-0004] CoreBBox: (  2.052  2.160 ) ( 1247.994 1247.940 ) um
[INFO GPL-0006] NumInstances:           1072964
[INFO GPL-0007] NumPlaceInstances:      1046339
[INFO GPL-0008] NumFixedInstances:           72
[INFO GPL-0009] NumDummyInstances:        26553
[INFO GPL-0010] NumNets:                1076855
[INFO GPL-0011] NumPins:                3743460
[INFO GPL-0012] DieBBox:  (  0.000  0.000 ) ( 1250.000 1250.000 ) um
[INFO GPL-0013] CoreBBox: (  2.052  2.160 ) ( 1247.994 1247.940 ) um
[INFO GPL-0016] CoreArea:            1552169.625 um^2
[INFO GPL-0017] NonPlaceInstsArea:   937003.689 um^2
[INFO GPL-0018] PlaceInstsArea:      127453.578 um^2
[INFO GPL-0019] Util:                    20.719 %
[INFO GPL-0020] StdInstsArea:        127453.578 um^2
[INFO GPL-0021] MacroInstsArea:           0.000 um^2
[INFO GPL-0031] FillerInit:NumGCells:   2801620
[INFO GPL-0032] FillerInit:NumGNets:    1076855
[INFO GPL-0033] FillerInit:NumGPins:    3743460
[INFO GPL-0023] TargetDensity:            0.540
[INFO GPL-0024] AvrgPlaceInstArea:        0.122 um^2
[INFO GPL-0025] IdealBinArea:             0.226 um^2
[INFO GPL-0026] IdealBinCnt:            6881038
[INFO GPL-0027] TotalBinArea:        1552169.625 um^2
[INFO GPL-0028] BinCnt:      2048   2048
[INFO GPL-0029] BinSize: (  0.609  0.609 )
[INFO GPL-0030] NumBins: 4194304
[NesterovSolve] Iter:    1 overflow: 1.006 HPWL: 7705769250
[NesterovSolve] Iter:   10 overflow: 1.001 HPWL: 7068976680
[NesterovSolve] Iter:   20 overflow: 0.994 HPWL: 6657066255
[NesterovSolve] Iter:   30 overflow: 0.986 HPWL: 6113800038
[NesterovSolve] Iter:   40 overflow: 0.982 HPWL: 5754923358
[NesterovSolve] Iter:   50 overflow: 0.978 HPWL: 5642920248
[NesterovSolve] Iter:   60 overflow: 0.975 HPWL: 5630791832
[NesterovSolve] Iter:   70 overflow: 0.973 HPWL: 5628620695
[NesterovSolve] Iter:   80 overflow: 0.971 HPWL: 5648920206
[deleted]
[NesterovSolve] Iter:  480 overflow: 0.204 HPWL: 17020286863
[NesterovSolve] Iter:  490 overflow: 0.163 HPWL: 16820153887
[NesterovSolve] Iter:  500 overflow: 0.152 HPWL: 45295649114
[ERROR GPL-0307] RePlAce divergence detected. Re-run with a smaller max_phi_cof value.
Error: global_place_skip_io.tcl, 12 GPL-0307
openroad> 

Expected Behavior

Placement should work, or fail with an actionable error message.

Environment

OpenROAD v2.0-17966-g1ffb8502d

To Reproduce

See above


@oharboe
Collaborator Author

oharboe commented Jan 3, 2025

@maliberty @jeffng-or A standalone test case of failure on megaboom main

@maliberty maliberty added the gpl Global Placement label Jan 3, 2025
@gudeh
Contributor

gudeh commented Jan 3, 2025

Hi! We have recently noticed divergences at stage 3-1 (skip io), and we modified gpl to reduce the effect of the bivariate normal distribution adjustment applied to macros (PR #6438).

Here is the design without any changes, diverging at iteration 500:
image

I tried removing the bivariate effect locally for this megaboom package, but the divergence still happens. I will run it again to make sure it is not a false-positive detection.
image

Something that caught my attention is that gpl is not using the space available on the left and at the bottom to place the cells.

@mikesinouye
Contributor

I am also observing this issue on two different designs on two different non-public PDKs. Both designs are large (3.7 / 4.6M instances). One has macros, one does not. I have not seen this issue previously with these designs.

We are using an OpenROAD build from December 11th: 8495fc8. We have not changed our TCL parameterization of OR within that timeframe. If there has been a regression, it may have occurred before 12/11.

These designs are private and large so they would be difficult to share, but if the megaboom testcase is not sufficient to identify the issue let me know.

@gudeh
Contributor

gudeh commented Jan 6, 2025

Hi @mikesinouye, at what stage exactly do you get the error? Is it at the skip io stage 3-1, or during global placement itself at stage 3-3?

@mikesinouye
Contributor

Hey @gudeh, we use our own custom flow instead of ORFS, but it is the second/final iteration of global placement with set pins/macros etc. I believe it would best align with ORFS 3-3.

I noticed that the recently enabled resizer in gpl is causing large area swings, and in these cases causing the percentage of overlap to regress:

[NesterovSolve] Iter:  600 overflow: 0.289 HPWL: 67302121984
[INFO GPL-0100] Timing-driven iteration 4/6, virtual: false.
[INFO GPL-0101] Iter: 602, overflow: 0.284, keep rsz at: 0.3
[INFO GPL-0106] Timing-driven: worst slack -3.13e-09
[INFO GPL-0103] Timing-driven: weighted 457358 nets.
[INFO GPL-0107] Timing-driven: RSZ delta area:     171131.849683
[INFO GPL-0108] Timing-driven: new target density: 1.1083497
[INFO GPL-0100] Timing-driven iteration 5/6, virtual: false.
[INFO GPL-0101] Iter: 608, overflow: 0.204, keep rsz at: 0.3
[INFO GPL-0106] Timing-driven: worst slack -3.44e-09
[INFO GPL-0103] Timing-driven: weighted 457357 nets.
[INFO GPL-0107] Timing-driven: RSZ delta area:     135257.596875
[INFO GPL-0108] Timing-driven: new target density: 1.4269857
[NesterovSolve] Iter:  610 overflow: 0.206 HPWL: 136988748908
[NesterovSolve] Iter:  620 overflow: 0.169 HPWL: 127833675226
[NesterovSolve] Iter:  630 overflow: 0.168 HPWL: 120356659720
[NesterovSolve] Iter:  640 overflow: 0.169 HPWL: 112627453336
[NesterovSolve] Iter:  650 overflow: 0.169 HPWL: 105231071433
[NesterovSolve] Iter:  660 overflow: 0.169 HPWL: 99857577668
[NesterovSolve] Iter:  670 overflow: 0.168 HPWL: 96970269792
[NesterovSolve] Iter:  680 overflow: 0.168 HPWL: 93953985280
[NesterovSolve] Iter:  690 overflow: 0.168 HPWL: 91624357017
[NesterovSolve] Iter:  700 overflow: 0.168 HPWL: 89226007808
[NesterovSolve] Iter:  710 overflow: 0.168 HPWL: 86916387174
[NesterovSolve] Iter:  720 overflow: 0.169 HPWL: 84213743906
[NesterovSolve] Iter:  730 overflow: 0.170 HPWL: 81640217219
[NesterovSolve] Iter:  740 overflow: 0.171 HPWL: 78962074863
[NesterovSolve] Iter:  750 overflow: 0.172 HPWL: 76398076537
[NesterovSolve] Iter:  760 overflow: 0.172 HPWL: 73899545408
[NesterovSolve] Iter:  770 overflow: 0.171 HPWL: 71603526945
[NesterovSolve] Iter:  780 overflow: 0.167 HPWL: 69535734015
[NesterovSolve] Iter:  790 overflow: 0.161 HPWL: 67765426870
[NesterovSolve] Iter:  800 overflow: 0.150 HPWL: 66489127872
[INFO GPL-0100] Timing-driven iteration 6/6, virtual: false.
[INFO GPL-0101] Iter: 804, overflow: 0.143, keep rsz at: 0.3
[INFO GPL-0106] Timing-driven: worst slack -3.01e-09
[INFO GPL-0103] Timing-driven: weighted 457357 nets.
[INFO GPL-0107] Timing-driven: RSZ delta area:     -112784.908768
[INFO GPL-0108] Timing-driven: new target density: 1.1612902
[NesterovSolve] Iter:  810 overflow: 0.607 HPWL: 640895754879
[NesterovSolve] Iter:  820 overflow: 0.254 HPWL: 755318003477
[NesterovSolve] Iter:  830 overflow: 0.239 HPWL: 499468799259
[NesterovSolve] Iter:  840 overflow: 0.229 HPWL: 337789212007
[NesterovSolve] Iter:  850 overflow: 0.219 HPWL: 257770740234
[NesterovSolve] Iter:  860 overflow: 0.216 HPWL: 211072788780
[NesterovSolve] Iter:  870 overflow: 0.209 HPWL: 179756633409
[NesterovSolve] Iter:  880 overflow: 0.203 HPWL: 157765978747

After the final timing driven non-virtual resizing, the overflow goes from 0.150 to 0.607, which seems unexpected to me.
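One way to spot this kind of regression in a long log is to scan the `[NesterovSolve]` lines and flag any report where overflow rises sharply compared with the previous report. The following is a minimal Python sketch, not part of OpenROAD. The function names `parse_nesterov` and `find_overflow_jumps` and the 0.1 threshold are assumptions chosen for illustration, and the regular expression assumes only the line format shown in the log above.

```python
import re

# Matches lines such as:
# [NesterovSolve] Iter:  810 overflow: 0.607 HPWL: 640895754879
LINE_RE = re.compile(
    r"\[NesterovSolve\] Iter:\s*(\d+) overflow:\s*([\d.]+) HPWL:\s*(\d+)"
)


def parse_nesterov(log_text):
    """Return a list of (iteration, overflow, hpwl) tuples found in the log."""
    return [
        (int(m.group(1)), float(m.group(2)), int(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def find_overflow_jumps(records, threshold=0.1):
    """Return (prev_iter, iter, prev_overflow, overflow) for every pair of
    consecutive reports where overflow increased by more than `threshold`.
    The threshold is an illustrative assumption, not a gpl setting."""
    jumps = []
    for (i0, o0, _), (i1, o1, _) in zip(records, records[1:]):
        if o1 - o0 > threshold:
            jumps.append((i0, i1, o0, o1))
    return jumps


if __name__ == "__main__":
    # A few lines copied from the log above; non-matching lines are ignored.
    sample = """\
[NesterovSolve] Iter:  790 overflow: 0.161 HPWL: 67765426870
[NesterovSolve] Iter:  800 overflow: 0.150 HPWL: 66489127872
[INFO GPL-0107] Timing-driven: RSZ delta area:     -112784.908768
[NesterovSolve] Iter:  810 overflow: 0.607 HPWL: 640895754879
[NesterovSolve] Iter:  820 overflow: 0.254 HPWL: 755318003477
"""
    for prev_it, it, prev_of, of in find_overflow_jumps(parse_nesterov(sample)):
        print(f"overflow rose from {prev_of:.3f} (iter {prev_it}) "
              f"to {of:.3f} (iter {it})")
```

On the sample lines this flags the step from iteration 800 to 810. Reports are only printed every 10 iterations, so the actual jump may happen anywhere between two reports.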

@maliberty
Member

It is particularly odd since the RSZ delta area is negative, suggesting we removed more logic than we added, which should tend to reduce overflow.

@gudeh
Contributor

gudeh commented Jan 6, 2025

Indeed, that's a big jump after the last timing-driven iteration. I would have to take a look in debug mode; it is unfortunate that it is a private PDK. I believe @mikesinouye's issue is different from the megaboom one tracked in this GH issue.

Either way, you can remove the non-virtual iterations with the new gpl TCL command keep_resize_below_overflow. The current default is 0.3; if you set it to 0, you should get only virtual timing-driven iterations, meaning the rsz work is undone.

@gudeh
Contributor

gudeh commented Jan 6, 2025

Concerning megaboom, I am investigating the issue and should be able to share new insights soon.

@oharboe
Collaborator Author

oharboe commented Jan 15, 2025

@gudeh Purely for my planning purposes, do you have any idea on how long this will take to fix? Days, weeks, months?

Any tips on a workaround?

@gudeh
Contributor

gudeh commented Jan 15, 2025

Hello @oharboe. I apologize, but we have not yet been able to determine the exact cause of the divergence. There is evidence suggesting it may be related to how we handle blockages and the limited ability of gpl to move cells outside of blockages. Unfortunately, we are still investigating and cannot confirm the root cause at this point. I estimate that finding a solution could take some time, potentially weeks or even months.

In the meantime, I plan to try a workaround: maintaining a snapshot of the lowest overflow achieved. If a divergence is detected, we could deliver the saved snapshot instead of throwing an error. I hope this is not too hard to implement; I will try to reuse what is already done in routability mode.
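The snapshot idea can be sketched generically as follows. This is a toy Python illustration of the technique only, not gpl's implementation: the function `run_with_snapshot`, the `step` callback, the single scalar `cost` (which could stand for overflow or HPWL), and the `divergence_factor` test are all hypothetical names and assumptions made for the example.

```python
import copy


def run_with_snapshot(step, initial_state, max_iters, divergence_factor=2.0):
    """Iterate `step` while remembering the best state seen so far.

    `step(state)` must return (new_state, cost). If the cost exceeds
    `divergence_factor` times the best cost seen so far, the run is treated
    as diverged and the saved snapshot is returned instead of raising an
    error. Returns (best_state, best_cost, best_iter, diverged).
    """
    state = initial_state
    best_state = copy.deepcopy(initial_state)
    best_cost = float("inf")
    best_iter = 0
    for it in range(1, max_iters + 1):
        state, cost = step(state)
        if cost < best_cost:
            # New best: save a copy so later updates cannot modify it.
            best_state = copy.deepcopy(state)
            best_cost, best_iter = cost, it
        elif cost > divergence_factor * best_cost:
            # Hypothetical divergence test; revert to the snapshot.
            return best_state, best_cost, best_iter, True
    return best_state, best_cost, best_iter, False


if __name__ == "__main__":
    # Toy "optimizer": the state is an index into a fixed list of costs.
    costs = [100.0, 80.0, 70.0, 300.0]

    def toy_step(i):
        return i + 1, costs[i]

    state, cost, it, diverged = run_with_snapshot(toy_step, 0, len(costs))
    print(f"diverged={diverged}, reverted to iter {it} with cost {cost}")
```

Keeping a deep copy on every improvement is simple but can be expensive for a large design; a real implementation would likely snapshot less often or store only cell coordinates.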

@oharboe
Collaborator Author

oharboe commented Jan 15, 2025

@gudeh Could it be that this problem has been there all along, and that it is just bad luck that some initial conditions for the flow have changed and now it no longer works?

I have a feeling that any slight rearrangement of the macros would make placement work...

@gudeh
Contributor

gudeh commented Jan 15, 2025

> @gudeh Could it be that this problem has been there all along, and that it is just bad luck that some initial conditions for the flow have changed and now it no longer works?

Potentially. I tried running this megaboom with a gpl version from Oct 2024, before some changes I made, and it still diverged. I am thinking of trying older versions as well.

> I have a feeling that any slight rearrangement of the macros would make placement work...

I agree. I believe a floorplan that offers a more unified area for the instances makes it easier for the placer.

The original GPL implementation allowed macros to move. As I understand it, fixing the macros has certain implications, for example our limited ability to move cells off of macros effectively.

The divergences occurring with GPL have been happening far more frequently (if not exclusively) in designs that include macros.

@maliberty
Member

@oharboe did you try delta debug on this?

@oharboe
Collaborator Author

oharboe commented Jan 16, 2025

> @oharboe did you try delta debug on this?

no

@oharboe
Collaborator Author

oharboe commented Jan 16, 2025

@gudeh @maliberty @jeffng-or FYI, slightly modifying the die/core area gets megaboom past placement (I increased it; I didn't try making it slightly smaller): The-OpenROAD-Project/megaboom#227

@oharboe
Collaborator Author

oharboe commented Jan 16, 2025

@maliberty @tspyrou I suspect that deltaDebug.py is just timing out on read_sdc, not making progress...

@gudeh
Contributor

gudeh commented Jan 16, 2025

Allow me to share some thoughts. Of the experiments I tried, two led to convergence:

  • Increasing the bin size by 10x or 5x.
    As of today, we use the average area of movable cells divided by the target density as the bin area. I am thinking of investigating further on this front.

  • Entirely ignoring the blockages. I did that by removing the fixed dummy cells gpl uses to represent the blockages. This may indicate room for improvement in how gpl sees blockages. As of now, we only insert these dummy cells, which are fixed and occupy the area covered by the blockages.
    Furthermore, specifically for macros, we use a bivariate adjustment to increase the gradient (higher at the center) and attempt to repel cells off of macros. This adjustment does not include the macro blockages, and this is not necessarily wrong, although I observed that cells sometimes stay off macros but on top of blockages. I also tried modifying the code to use the bivariate adjustment on the blockages instead of macros, but it did not converge.
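The bin-area rule in the first bullet can be checked against the numbers in the original log. The Python sketch below is only arithmetic on the logged values. The helper `ideal_bin` and its `scale` parameter are hypothetical, and treating "10x or 5x" as a multiplier on bin area (rather than on bin side length) is an assumption.

```python
# Values copied from the global_placement log at the top of this issue.
core_area_um2 = 1552169.625        # GPL-0016 CoreArea
place_insts_area_um2 = 127453.578  # GPL-0018 PlaceInstsArea
num_place_insts = 1046339          # GPL-0007 NumPlaceInstances
target_density = 0.54              # GPL-0023 TargetDensity


def ideal_bin(core_area, place_area, num_insts, density, scale=1.0):
    """Return (ideal_bin_area, ideal_bin_count).

    bin area = average movable-cell area / target density, optionally
    multiplied by `scale` to model a larger-bin experiment (assumption:
    the scale applies to area, not to the bin side length).
    """
    avg_inst_area = place_area / num_insts
    bin_area = scale * avg_inst_area / density
    return bin_area, core_area / bin_area


if __name__ == "__main__":
    for scale in (1, 5, 10):
        area, count = ideal_bin(core_area_um2, place_insts_area_um2,
                                num_place_insts, target_density, scale)
        print(f"scale {scale:>2}x: ideal bin area {area:.3f} um^2, "
              f"ideal bin count {count:,.0f}")
```

With `scale=1` the bin area rounds to 0.226 um^2, matching GPL-0025, and the bin count lands within a few tens of the logged IdealBinCnt of 6881038. The log then reports a 2048 x 2048 grid (4194304 bins); how gpl maps the ideal count to that grid is not covered by this sketch.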

@maliberty
Member

maliberty commented Jan 16, 2025

As a temporary workaround you could try adding -overflow 0.25, which should "converge" before the trouble starts. The detailed placer will have to work harder, so I'm not certain it will work.
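A quick check on the reported numbers shows why 0.25 is a plausible stopping point. The Python sketch below uses (iteration, overflow, HPWL) samples reported for the megaboom run in this thread. The helper names and the "HPWL more than doubles between reports" spike test are assumptions for illustration, not gpl's divergence criterion.

```python
# (iteration, overflow, HPWL) samples reported for megaboom in this thread.
SAMPLES = [
    (420, 0.293, 22095512725),
    (430, 0.266, 21407360318),
    (440, 0.248, 20571769592),
    (450, 0.234, 19572800240),
    (460, 0.224, 18593654059),
    (470, 0.213, 17801015611),
    (480, 0.204, 17020286863),
    (490, 0.163, 16820153887),
    (500, 0.152, 45295649114),
]


def first_iter_at_or_below(samples, target_overflow):
    """Return the first reported iteration with overflow <= target, or None."""
    for iteration, overflow, _hpwl in samples:
        if overflow <= target_overflow:
            return iteration
    return None


def first_hpwl_spike(samples, factor=2.0):
    """Return the first iteration whose HPWL is more than `factor` times the
    previous report's HPWL, or None. The factor is an illustrative assumption."""
    for (_, _, prev_hpwl), (iteration, _, hpwl) in zip(samples, samples[1:]):
        if hpwl > factor * prev_hpwl:
            return iteration
    return None


if __name__ == "__main__":
    stop = first_iter_at_or_below(SAMPLES, 0.25)
    spike = first_hpwl_spike(SAMPLES)
    print(f"overflow <= 0.25 first reported at iter {stop}; "
          f"HPWL spike first reported at iter {spike}")
```

On these samples the 0.25 target is first met at the iteration 440 report, while the HPWL spike shows up at the iteration 500 report, so stopping at 0.25 would end the run well before the spike in this particular log.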

@oharboe
Collaborator Author

oharboe commented Jan 16, 2025

> As a temporary workaround you could try adding -overflow 0.25, which should "converge" before the trouble starts. The detailed placer will have to work harder, so I'm not certain it will work.

Will try next time, but for now increasing the die/core area worked.

@oharboe
Collaborator Author

oharboe commented Jan 17, 2025

@gudeh @maliberty I don't think deltaDebug.py will do much. It has been running for a couple of days but only got down to:

Step 100, Insts level debugging, Insts 68216, Nets 1076857, cut elements 2132, timeout 10 minutes

I imagine that to really make it easier to understand what is going wrong here more quickly, there would need to be tens, hundreds, maybe thousands, but not tens of thousands of instances.

10 minutes turnaround time isn't a big improvement here according to my testing.

@gudeh
Contributor

gudeh commented Jan 17, 2025

Hi. I implemented the idea of saving a snapshot and finishing with it instead of throwing an error. For this megaboom, it finished with the following placement, which looks a lot better than the previous ones with clear divergences.

Image

[NesterovSolve] Iter:  340 overflow: 0.753 HPWL: 17444905767
[NesterovSolve] Iter:  350 overflow: 0.710 HPWL: 18568474905
[NesterovSolve] Iter:  360 overflow: 0.680 HPWL: 18544460503
[NesterovSolve] Iter:  370 overflow: 0.626 HPWL: 18498658485
[NesterovSolve] Iter:  380 overflow: 0.556 HPWL: 18809251582
[NesterovSolve] Iter:  390 overflow: 0.480 HPWL: 18858795696
[NesterovSolve] Iter:  400 overflow: 0.377 HPWL: 20699462389
[NesterovSolve] Iter:  410 overflow: 0.332 HPWL: 22269671463
[NesterovSolve] Iter:  420 overflow: 0.293 HPWL: 22095512725
[NesterovSolve] Iter:  430 overflow: 0.266 HPWL: 21407360318
[NesterovSolve] Iter:  440 overflow: 0.248 HPWL: 20571769592
[NesterovSolve] Iter:  450 overflow: 0.234 HPWL: 19572800240
[NesterovSolve] Iter:  460 overflow: 0.224 HPWL: 18593654059
[NesterovSolve] Iter:  470 overflow: 0.213 HPWL: 17801015611
[NesterovSolve] Iter:  480 overflow: 0.204 HPWL: 17020286863
[NesterovSolve] Iter:  490 overflow: 0.163 HPWL: 16820153887
[NesterovSolve] Iter:  500 overflow: 0.152 HPWL: 45295649114
Divergence detected, reverting to snapshot with min hpwl.
Revert to iter:  482 overflow: 0.203 HPWL: 16284579539

@oharboe
Collaborator Author

oharboe commented Jan 17, 2025

whittled down 2_floorplan.odb

Step 52, Nets level debugging, Insts 67151, Nets 1076857, cut elements 33652, timeout 12 minutes

@maliberty
Member

I had been thinking the macros were the trigger for the issue, but this new test case doesn't have any macros! It's a good elimination of a false track.

@maliberty
Member

Nvm, the blockages are still there.

@gudeh
Contributor

gudeh commented Jan 21, 2025

I substituted the recent odb shared by @oharboe into the artifact previously sent. When I run the run-me for stage 3-1, I see these messages, and it takes around 2 hours to get to placement. Does this happen to you too?

[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[30]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[31]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[32]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[33]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[34]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] pin 'core.FpPipeline.fpu_exe_unit.FPUUnit.fpu.fpiu_out_pipe_pipe_pipe_b_toint[35]$_DFFE_PP_/QN' not found.
[WARNING STA-0363] message limit (1000) reached. This message will no longer print.

@maliberty
Member

As it isn't a timing-driven placement, I just removed read_sdc.
