Question about minimum clock period for retiming #8080

oharboe · 2025-08-21T06:01:57Z

oharboe
Aug 21, 2025
Collaborator

Why, as I add more pipeline stages, does the minimum clock period level off at about 600ps for an FPU and about 200ps for a 64x64 bit multiplier?

Replace 10 below with the desired number of pipeline stages.

FPU:

Clone https://github.com/The-OpenROAD-Project/RegFileStudy/tree/main/multiplier
Run bazelisk run //multiplier:FloatingPointUnit_10_1_retimed_synth /tmp/synth gui_synth. This is a retimed Berkeley hardfloat FPU with 10 pipeline stages.
Analyze the reg2reg paths and comment on whether there's something pathological going on in our use of the OpenROAD tools.
Attach analysis here, I can approve the analysis for publication to github issues/discussions if desirable.

64x64 bit integer multiplier with 10 pipeline stages has ca. 200ps lower limit

bazelisk run //multiplier:Multiplier_10_1_retimed_synth /tmp/synth gui_synth

To create a plot:

bazelisk run //multiplier:fpu_plot

maliberty · 2025-08-21T07:01:51Z

maliberty
Aug 21, 2025
Maintainer

Is the critical path of the FPU through the multiplier?

3 replies

oharboe Aug 21, 2025
Collaborator Author

Yosys erased that information... no signal names or module names are preserved...

>>> report_checks -path_group reg2reg
Startpoint: core/_095297_ (rising edge-triggered flip-flop clocked by clock)
Endpoint: core/_091420_ (rising edge-triggered flip-flop clocked by clock)
Path Group: reg2reg
Path Type: max

  Delay    Time   Description
---------------------------------------------------------
   0.00    0.00   clock clock (rise edge)
   0.00    0.00   clock network delay (ideal)
   0.00    0.00 ^ core/_095297_/CLK (DFFHQNx1_ASAP7_75t_R)
  31.51   31.51 ^ core/_095297_/QN (DFFHQNx1_ASAP7_75t_R)
  24.04   55.55 ^ core/_056261_/Y (BUFx2_ASAP7_75t_R)
  23.19   78.74 ^ core/_056264_/Y (AND3x1_ASAP7_75t_R)
  21.37  100.11 ^ core/_056265_/Y (BUFx2_ASAP7_75t_R)
  25.97  126.08 ^ core/_056266_/Y (BUFx2_ASAP7_75t_R)
  26.96  153.03 ^ core/_056267_/Y (BUFx2_ASAP7_75t_R)
  27.33  180.36 ^ core/_056268_/Y (BUFx2_ASAP7_75t_R)
  20.78  201.14 ^ core/_056269_/Y (OA21x2_ASAP7_75t_R)
  21.53  222.67 v core/_099159_/CON (FAx1_ASAP7_75t_R)
  27.78  250.45 ^ core/_057254_/Y (NAND3x1_ASAP7_75t_R)
  10.20  260.65 v core/_083546_/Y (INVx1_ASAP7_75t_R)
 145.56  406.21 v core/_102118_/SN (HAxp5_ASAP7_75t_R)
 113.01  519.22 v core/_102163_/SN (HAxp5_ASAP7_75t_R)
  27.82  547.04 ^ core/_102166_/CON (HAxp5_ASAP7_75t_R)
  12.75  559.79 v core/_102166_/SN (HAxp5_ASAP7_75t_R)
  18.45  578.24 v core/_059277_/Y (AO21x1_ASAP7_75t_R)
  13.77  592.02 v core/_059278_/Y (AO21x1_ASAP7_75t_R)
  17.98  610.00 v core/_059279_/Y (AND2x2_ASAP7_75t_R)
  17.17  627.17 v core/_070532_/Y (OA21x2_ASAP7_75t_R)
  18.58  645.74 v core/_070533_/Y (OA21x2_ASAP7_75t_R)
  55.33  701.07 v core/_099164_/SN (FAx1_ASAP7_75t_R)
  28.24  729.31 v core/_059286_/Y (AND2x2_ASAP7_75t_R)
   0.00  729.31 v core/_091420_/D (DFFHQNx1_ASAP7_75t_R)
         729.31   data arrival time

 250.00  250.00   clock clock (rise edge)
   0.00  250.00   clock network delay (ideal)
   0.00  250.00   clock reconvergence pessimism
         250.00 ^ core/_091420_/CLK (DFFHQNx1_ASAP7_75t_R)
  -8.19  241.81   library setup time
         241.81   data required time
---------------------------------------------------------
         241.81   data required time
        -729.31   data arrival time
---------------------------------------------------------
        -487.50   slack (VIOLATED)

QuantamHD Aug 21, 2025
Collaborator

That kind of looks like a multiplier, but the other suspicious thing is the large number of buffers.

oharboe Aug 21, 2025
Collaborator Author

> That kind of looks like a multiplier, but the other suspicious thing is the large number of buffers.

I would expect enormous fanouts to result in many levels of buffering.
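To put a number on how much of a path is buffering, the per-cell delays in a report_checks listing can be tallied by cell type. The following is a minimal sketch, not part of the flow: the function name delay_by_cell and the regular expression are illustrative, the input is a short excerpt of the FPU path shown earlier in this thread, and it assumes that cell names follow the ASAP7 pattern seen there (cell type before the first underscore).

```python
import re
from collections import defaultdict

# Short excerpt of the FPU reg2reg path quoted earlier in this thread.
REPORT = """\
  31.51   31.51 ^ core/_095297_/QN (DFFHQNx1_ASAP7_75t_R)
  24.04   55.55 ^ core/_056261_/Y (BUFx2_ASAP7_75t_R)
  23.19   78.74 ^ core/_056264_/Y (AND3x1_ASAP7_75t_R)
  21.37  100.11 ^ core/_056265_/Y (BUFx2_ASAP7_75t_R)
  25.97  126.08 ^ core/_056266_/Y (BUFx2_ASAP7_75t_R)
 145.56  406.21 v core/_102118_/SN (HAxp5_ASAP7_75t_R)
"""

# Matches "<delay> <time> ^|v <pin> (<cell>)" lines; all other lines are skipped.
LINE = re.compile(r"^\s*(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+[\^v]\s+(\S+)\s+\((\w+)\)")


def delay_by_cell(report: str) -> dict[str, float]:
    """Sum the incremental delay column per cell type (e.g. BUFx2, HAxp5)."""
    totals: dict[str, float] = defaultdict(float)
    for line in report.splitlines():
        match = LINE.match(line)
        if match:
            delay = float(match.group(1))
            # Assumption: the cell type is the text before the first underscore.
            cell_type = match.group(4).split("_")[0]
            totals[cell_type] += delay
    return dict(totals)


if __name__ == "__main__":
    for cell_type, total in sorted(delay_by_cell(REPORT).items(), key=lambda kv: -kv[1]):
        print(f"{cell_type:10s} {total:8.2f} ps")
```

Pasting a full path into REPORT gives the share of delay spent in BUFx2 cells versus the adder cells, which makes it easier to judge whether buffering or logic depth dominates.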

maliberty · 2025-08-21T07:02:11Z

maliberty
Aug 21, 2025
Maintainer

@povik any thoughts?

1 reply

povik Aug 21, 2025
Collaborator

The timing model used is crude but still I would have expected the asymptotic delay to go closer to zero.
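As a rough illustration of what "closer to zero" could mean, here is a back-of-the-envelope model, assuming perfectly balanced retiming and N pipeline registers that split the logic into N + 1 reg2reg segments. The clock-to-Q and setup values are taken from the FPU report earlier in this thread; the total combinational delay of 3000 ps is a made-up placeholder, not a measurement.

```python
def ideal_period_ps(t_comb_ps: float, stages: int,
                    t_clk_q_ps: float, t_setup_ps: float) -> float:
    """Best-case clock period if t_comb_ps were split evenly over stages + 1 segments."""
    return t_comb_ps / (stages + 1) + t_clk_q_ps + t_setup_ps


T_CLK_Q_PS = 31.51   # DFFHQNx1 CLK -> QN delay from the FPU report above
T_SETUP_PS = 8.19    # library setup time from the same report
T_COMB_PS = 3000.0   # hypothetical unpipelined logic delay, for illustration only

if __name__ == "__main__":
    for stages in (0, 1, 2, 5, 10, 20, 50):
        period = ideal_period_ps(T_COMB_PS, stages, T_CLK_Q_PS, T_SETUP_PS)
        print(f"{stages:3d} stages -> {period:8.2f} ps")
```

In this idealized model the period tends toward the flip-flop overhead (roughly 40 ps with the numbers above) as stages are added, so a plateau at 200ps or 600ps would have to come from something the model leaves out.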

QuantamHD · 2025-08-21T07:03:04Z

QuantamHD
Aug 21, 2025
Collaborator

I don't know how you've implemented your FPU, but frequently the problem in FPUs is that you can't pipeline through operators such as +, -, /, <, >. Chisel and other HLS type things don't lower them all the way to gates, so you can't insert registers inside them.

It's usually the rounding and ULP stuff that kills you in FP.

7 replies

QuantamHD Aug 21, 2025
Collaborator

What's actually doing the retiming? Like what tool and when?

oharboe Aug 21, 2025
Collaborator Author

ORFS + yosys/abc.

QuantamHD Aug 21, 2025
Collaborator

Which command?

oharboe Aug 21, 2025
Collaborator Author

The-OpenROAD-Project/OpenROAD-flow-scripts#3417

QuantamHD Aug 21, 2025
Collaborator

Dump the verilog for the module before the call to abc. I see you're using the abc retime command which is a multi-bit operator retimer. Perhaps the multi-ands or something aren't being bit blasted

oharboe · 2025-08-21T07:15:17Z

oharboe
Aug 21, 2025
Collaborator Author

I quickly commented out all but 64 bit floating point add:

diff --git a/multiplier/src/main/scala/FloatingPointUnit.scala b/multiplier/src/main/scala/FloatingPointUnit.scala
index 39ea766..e6a9e85 100644
--- a/multiplier/src/main/scala/FloatingPointUnit.scala
+++ b/multiplier/src/main/scala/FloatingPointUnit.scala
@@ -71,7 +71,6 @@ class MockFloatingPointUnit(data: UInt) extends Module {
   io.dst := DontCare
   for (
     float <- Seq(
-      Architecture.Representation.F32,
       Architecture.Representation.F64
     )
   ) {
@@ -93,115 +92,6 @@ class MockFloatingPointUnit(data: UInt) extends Module {
             adder.io.out
           )
         }
-        is(FloatingPointOperation.Mul) {
-          val mul =
-            Module(new hardfloat.MulRecFN(float.expBits, float.fracBits))
-          mul.io.a := recs(0)
-          mul.io.b := recs(1)
-          mul.io.detectTininess := hardfloat.consts.tininess_afterRounding
-          mul.io.roundingMode := hardfloat.consts.round_near_even
-          io.dst := hardfloat.fNFromRecFN(
-            float.expBits,
-            float.fracBits,
-            mul.io.out
-          )
-        }
-        is(FloatingPointOperation.Compare) {
-          val comparator =
-            Module(new hardfloat.CompareRecFN(float.expBits, float.fracBits))
-          comparator.io.a := recs(0)
-          comparator.io.b := recs(1)
-          comparator.io.signaling := false.B
-          io.dst := Mux1H(
-            Seq(
-              (io.comparison === FloatingPointComparison.Eq) -> comparator.io.eq,
-              (io.comparison === FloatingPointComparison.Lt) -> comparator.io.lt,
-              (io.comparison === FloatingPointComparison.Le) -> !comparator.io.gt
-            )
-          )
-        }
-        is(FloatingPointOperation.Convert) {
-          when(io.toInt) {
-            val toInt = Module(
-              new hardfloat.RecFNToIN(
-                float.expBits,
-                float.fracBits,
-                data.getWidth
-              )
-            )
-            toInt.io.in := hardfloat.recFNFromFN(
-              float.expBits,
-              float.fracBits,
-              io.srcs(0)
-            )
-            toInt.io.roundingMode := hardfloat.consts.round_minMag
-            toInt.io.signedOut := io.toSigned
-            io.dst := toInt.io.out
-          }
-        }
-        is(FloatingPointOperation.Inject) {
-          val dst =
-            (Mux(io.xor, recs(0).head(1), io.negate) ^ recs(1).head(1)) ## recs(
-              0
-            ).tail(1)
-          io.dst := hardfloat.fNFromRecFN(float.expBits, float.fracBits, dst)
-        }
-      }
-    }
-
-    when(
-      io.toWidthHalfWordsLog2 === float.halfWordsLog2.U && io.operation === FloatingPointOperation.Convert
-    ) {
-      when(io.fromInt) {
-        val fromInt =
-          Module(
-            new hardfloat.INToRecFN(
-              data.getWidth,
-              float.expBits,
-              float.fracBits
-            )
-          )
-
-        fromInt.io.in := io.srcs(0)
-        fromInt.io.signedIn := io.fromSigned
-        fromInt.io.detectTininess := hardfloat.consts.tininess_afterRounding
-        fromInt.io.roundingMode := hardfloat.consts.round_near_even
-        io.dst := hardfloat.fNFromRecFN(
-          float.expBits,
-          float.fracBits,
-          fromInt.io.out
-        )
-      }
-
-      when(!io.fromInt && !io.toInt) {
-        val fromFloat =
-          if (float != Architecture.Representation.F32)
-            Architecture.Representation.F32
-          else Architecture.Representation.F64
-
-        val converter =
-          Module(
-            new hardfloat.RecFNToRecFN(
-              fromFloat.expBits,
-              fromFloat.fracBits,
-              float.expBits,
-              float.fracBits
-            )
-          )
-
-        converter.io.in := hardfloat.recFNFromFN(
-          fromFloat.expBits,
-          fromFloat.fracBits,
-          io.srcs(0)
-        )
-        converter.io.roundingMode := hardfloat.consts.round_near_even
-        converter.io.detectTininess := hardfloat.consts.tininess_afterRounding
-
-        io.dst := hardfloat.fNFromRecFN(
-          float.expBits,
-          float.fracBits,
-          converter.io.out
-        )
       }
     }
   }

0 replies

oharboe · 2025-08-21T07:18:34Z

oharboe
Aug 21, 2025
Collaborator Author

@QuantamHD The main motivation for using retiming is to identify performance left on the table by a lack of architectural/manual retiming, and also to improve ranking accuracy with a minimum of effort (don't manually retime architectural dead ends).

0 replies

maliberty · 2025-08-21T15:43:28Z

maliberty
Aug 21, 2025
Maintainer

Have you looked at the paths coming into core/_095297_ and out of core/_091420_ to see if there is slack that could have been reapportioned by moving the register?

0 replies

oharboe · 2025-08-22T05:44:51Z

oharboe
Aug 22, 2025
Collaborator Author

To rule out the Berkeley FPU, I quickly tried 64 bit division instead of 64 bit multiplication. Obviously a combinational 64 bit division is pathological RTL (it should be a variable latency operation), but it seems to be a useful test case for examining retiming.

diff --git a/multiplier/BUILD b/multiplier/BUILD
index 7787715..5b83542 100644
--- a/multiplier/BUILD
+++ b/multiplier/BUILD
@@ -4,7 +4,7 @@ load(":study.bzl", "study")
 LATENCIES = [i for i in range(11)]
 
 INFO = {
-    "name": "64x64 multiplier",
+    "name": "64x64 divider",
     "stage": "synth",
     "study": [
         {
diff --git a/multiplier/src/main/scala/Multiplier.scala b/multiplier/src/main/scala/Multiplier.scala
index cacea73..bf74aaf 100644
--- a/multiplier/src/main/scala/Multiplier.scala
+++ b/multiplier/src/main/scala/Multiplier.scala
@@ -30,7 +30,7 @@ class Multiplier(config: MultiplierConfig) extends Module {
     withReset(ShiftRegister(reset.asBool, config.latency)) {
       // the pipeline registers are free-running, thus not hooked up to the
       // synchronous reset
-      core_io.out := ShiftRegister(core_io.in.reduce(_ * _), config.latency)
+      core_io.out := ShiftRegister(core_io.in.reduce(_ / _), config.latency)
     }
   })

bazelisk build //multiplier:multiplier_plot

bazelisk run //multiplier:Multiplier_10_1_retimed_synth /tmp/synth gui_synth

This is from the WNS bucket:

>>> report_checks -from {core/_34028_/QN} -to {core/_34022_/D}
Startpoint: core/_34028_ (rising edge-triggered flip-flop clocked by clock)
Endpoint: core/_34022_ (rising edge-triggered flip-flop clocked by clock)
Path Group: reg2reg
Path Type: max

  Delay    Time   Description
---------------------------------------------------------
   0.00    0.00   clock clock (rise edge)
   0.00    0.00   clock network delay (ideal)
   0.00    0.00 ^ core/_34028_/CLK (DFFHQNx1_ASAP7_75t_R)
  30.52   30.52 v core/_34028_/QN (DFFHQNx1_ASAP7_75t_R)
   8.28   38.80 ^ core/_35138_/Y (INVx1_ASAP7_75t_R)
  14.22   53.02 v core/_35139_/Y (NAND2x1_ASAP7_75t_R)
  25.21   78.23 v core/_35141_/Y (AO221x1_ASAP7_75t_R)
  13.90   92.13 ^ core/_35142_/Y (AOI21x1_ASAP7_75t_R)
  17.89  110.02 ^ core/_35143_/Y (OA21x2_ASAP7_75t_R)
  16.17  126.19 ^ core/_35144_/Y (OA21x2_ASAP7_75t_R)
  20.33  146.52 ^ core/_35146_/Y (OA211x2_ASAP7_75t_R)
   9.32  155.84 v core/_35148_/Y (OAI21x1_ASAP7_75t_R)
  15.45  171.29 v core/_35150_/Y (AO21x1_ASAP7_75t_R)
  13.67  184.96 v core/_35152_/Y (AO21x1_ASAP7_75t_R)
  16.90  201.86 v core/_35153_/Y (BUFx2_ASAP7_75t_R)
   8.79  210.65 ^ core/_35159_/Y (AOI21x1_ASAP7_75t_R)
  18.24  228.89 ^ core/_35160_/Y (OA22x2_ASAP7_75t_R)
  15.32  244.21 v core/_35161_/Y (INVx1_ASAP7_75t_R)
  26.38  270.59 v core/_35162_/Y (BUFx2_ASAP7_75t_R)
  26.48  297.07 v core/_35163_/Y (BUFx2_ASAP7_75t_R)
  23.82  320.89 v core/_35276_/Y (BUFx2_ASAP7_75t_R)
  19.24  340.13 v core/_35339_/Y (AND2x2_ASAP7_75t_R)
  18.02  358.15 v core/_35340_/Y (AO21x1_ASAP7_75t_R)
  48.98  407.13 v core/_55656_/SN (HAxp5_ASAP7_75t_R)
  26.94  434.07 v core/_35361_/Y (OR2x2_ASAP7_75t_R)
  14.08  448.15 v core/_35362_/Y (AO21x1_ASAP7_75t_R)
  13.31  461.45 v core/_35363_/Y (AO21x1_ASAP7_75t_R)
  24.30  485.75 v core/_35364_/Y (OA211x2_ASAP7_75t_R)
  17.40  503.15 v core/_35365_/Y (BUFx2_ASAP7_75t_R)
  22.27  525.42 v core/_35369_/Y (OA211x2_ASAP7_75t_R)
  17.40  542.82 v core/_35370_/Y (BUFx2_ASAP7_75t_R)
  22.27  565.09 v core/_35374_/Y (OA211x2_ASAP7_75t_R)
  17.40  582.49 v core/_35375_/Y (BUFx2_ASAP7_75t_R)
  22.27  604.76 v core/_35379_/Y (OA211x2_ASAP7_75t_R)
  17.36  622.12 v core/_35380_/Y (BUFx2_ASAP7_75t_R)
  21.96  644.08 v core/_35435_/Y (OA211x2_ASAP7_75t_R)
  17.40  661.48 v core/_35436_/Y (BUFx2_ASAP7_75t_R)
  22.27  683.75 v core/_35440_/Y (OA211x2_ASAP7_75t_R)
  17.40  701.15 v core/_35441_/Y (BUFx2_ASAP7_75t_R)
  22.27  723.42 v core/_35445_/Y (OA211x2_ASAP7_75t_R)
  17.40  740.82 v core/_35446_/Y (BUFx2_ASAP7_75t_R)
  22.27  763.09 v core/_35450_/Y (OA211x2_ASAP7_75t_R)
  18.74  781.83 v core/_35451_/Y (BUFx2_ASAP7_75t_R)
   8.70  790.53 ^ core/_35467_/Y (OAI21x1_ASAP7_75t_R)
  17.35  807.88 ^ core/_35468_/Y (AO32x1_ASAP7_75t_R)
  22.86  830.73 ^ core/_35469_/Y (BUFx2_ASAP7_75t_R)
  21.87  852.61 v core/_35477_/Y (INVx1_ASAP7_75t_R)
  27.52  880.13 v core/_35498_/Y (BUFx2_ASAP7_75t_R)
  20.14  900.27 v core/_35585_/Y (AO21x1_ASAP7_75t_R)
  51.32  951.59 v core/_55697_/SN (HAxp5_ASAP7_75t_R)
  26.05  977.64 v core/_35647_/Y (OR2x2_ASAP7_75t_R)
  14.01  991.66 v core/_35648_/Y (AO21x1_ASAP7_75t_R)
  13.32 1004.98 v core/_35649_/Y (AO21x1_ASAP7_75t_R)
  24.28 1029.26 v core/_35650_/Y (OA211x2_ASAP7_75t_R)
  16.60 1045.86 v core/_35651_/Y (BUFx2_ASAP7_75t_R)
  16.92 1062.78 v core/_35652_/Y (OA21x2_ASAP7_75t_R)
  19.35 1082.13 v core/_35653_/Y (OA21x2_ASAP7_75t_R)
  17.43 1099.56 v core/_35654_/Y (OA21x2_ASAP7_75t_R)
  20.23 1119.79 v core/_35655_/Y (OA21x2_ASAP7_75t_R)
  22.80 1142.59 v core/_35659_/Y (OA211x2_ASAP7_75t_R)
  17.41 1160.00 v core/_35660_/Y (BUFx2_ASAP7_75t_R)
  22.27 1182.27 v core/_35664_/Y (OA211x2_ASAP7_75t_R)
  17.47 1199.74 v core/_35665_/Y (BUFx2_ASAP7_75t_R)
  16.99 1216.73 v core/_35674_/Y (OA21x2_ASAP7_75t_R)
  20.18 1236.91 v core/_35675_/Y (OA21x2_ASAP7_75t_R)
  22.80 1259.71 v core/_35679_/Y (OA211x2_ASAP7_75t_R)
  16.88 1276.59 v core/_35680_/Y (BUFx2_ASAP7_75t_R)
  17.04 1293.63 v core/_35681_/Y (OA21x2_ASAP7_75t_R)
  19.35 1312.98 v core/_35682_/Y (OA21x2_ASAP7_75t_R)
  17.44 1330.42 v core/_35683_/Y (OA21x2_ASAP7_75t_R)
  20.23 1350.65 v core/_35684_/Y (OA21x2_ASAP7_75t_R)
  22.80 1373.45 v core/_35688_/Y (OA211x2_ASAP7_75t_R)
  17.41 1390.86 v core/_35689_/Y (BUFx2_ASAP7_75t_R)
  22.27 1413.13 v core/_35693_/Y (OA211x2_ASAP7_75t_R)
  17.40 1430.53 v core/_35694_/Y (BUFx2_ASAP7_75t_R)
  22.27 1452.80 v core/_35698_/Y (OA211x2_ASAP7_75t_R)
  19.90 1472.70 v core/_35699_/Y (BUFx2_ASAP7_75t_R)
  18.71 1491.41 v core/_35715_/Y (OA21x2_ASAP7_75t_R)
  20.86 1512.27 v core/_35716_/Y (BUFx2_ASAP7_75t_R)
  23.89 1536.16 v core/_35748_/Y (BUFx2_ASAP7_75t_R)
  23.60 1559.76 v core/_35842_/Y (BUFx2_ASAP7_75t_R)
  16.21 1575.97 v core/_35861_/Y (AND3x1_ASAP7_75t_R)
  17.85 1593.83 v core/_35862_/Y (AO21x1_ASAP7_75t_R)
  51.09 1644.92 v core/_55754_/SN (HAxp5_ASAP7_75t_R)
  26.03 1670.95 v core/_35913_/Y (OR2x2_ASAP7_75t_R)
  14.01 1684.96 v core/_35914_/Y (AO21x1_ASAP7_75t_R)
  13.31 1698.27 v core/_35915_/Y (AO21x1_ASAP7_75t_R)
  24.28 1722.55 v core/_35916_/Y (OA211x2_ASAP7_75t_R)
  17.40 1739.95 v core/_35917_/Y (BUFx2_ASAP7_75t_R)
  22.27 1762.22 v core/_35921_/Y (OA211x2_ASAP7_75t_R)
  17.40 1779.62 v core/_35922_/Y (BUFx2_ASAP7_75t_R)
  22.27 1801.89 v core/_35926_/Y (OA211x2_ASAP7_75t_R)
  17.64 1819.52 v core/_35927_/Y (BUFx2_ASAP7_75t_R)
  22.40 1841.92 v core/_35931_/Y (OA211x2_ASAP7_75t_R)
  17.36 1859.28 v core/_35932_/Y (BUFx2_ASAP7_75t_R)
  21.96 1881.24 v core/_35966_/Y (OA211x2_ASAP7_75t_R)
  17.64 1898.88 v core/_35967_/Y (BUFx2_ASAP7_75t_R)
  22.40 1921.28 v core/_35971_/Y (OA211x2_ASAP7_75t_R)
  17.40 1938.68 v core/_35972_/Y (BUFx2_ASAP7_75t_R)
  22.27 1960.95 v core/_35976_/Y (OA211x2_ASAP7_75t_R)
  17.64 1978.59 v core/_35977_/Y (BUFx2_ASAP7_75t_R)
  22.40 2000.98 v core/_35981_/Y (OA211x2_ASAP7_75t_R)
  17.40 2018.38 v core/_35982_/Y (BUFx2_ASAP7_75t_R)
  22.27 2040.65 v core/_35986_/Y (OA211x2_ASAP7_75t_R)
  17.64 2058.29 v core/_35987_/Y (BUFx2_ASAP7_75t_R)
  22.40 2080.69 v core/_35991_/Y (OA211x2_ASAP7_75t_R)
  17.40 2098.09 v core/_35992_/Y (BUFx2_ASAP7_75t_R)
  24.79 2122.87 v core/_35996_/Y (OA211x2_ASAP7_75t_R)
  37.75 2160.63 v core/_35997_/Y (OR5x1_ASAP7_75t_R)
  26.01 2186.64 v core/_35998_/Y (OA211x2_ASAP7_75t_R)
  28.16 2214.79 v core/_35999_/Y (OR3x1_ASAP7_75t_R)
  14.60 2229.39 v core/_36000_/Y (OA21x2_ASAP7_75t_R)
  19.13 2248.52 v core/_36001_/Y (BUFx2_ASAP7_75t_R)
  10.69 2259.20 ^ core/_36002_/Y (INVx1_ASAP7_75t_R)
  22.55 2281.75 ^ core/_36003_/Y (BUFx2_ASAP7_75t_R)
  24.58 2306.33 ^ core/_36031_/Y (BUFx2_ASAP7_75t_R)
  20.87 2327.20 ^ core/_36175_/Y (AO21x1_ASAP7_75t_R)
  27.79 2355.00 ^ core/_55824_/SN (HAxp5_ASAP7_75t_R)
  15.35 2370.34 ^ core/_36179_/Y (AO21x1_ASAP7_75t_R)
  14.80 2385.15 ^ core/_36180_/Y (AO21x1_ASAP7_75t_R)
  13.68 2398.83 ^ core/_36181_/Y (AO21x1_ASAP7_75t_R)
  14.66 2413.49 ^ core/_36182_/Y (AO21x1_ASAP7_75t_R)
  15.28 2428.78 ^ core/_36187_/Y (AO21x1_ASAP7_75t_R)
  13.00 2441.78 ^ core/_36191_/Y (AO21x1_ASAP7_75t_R)
  12.54 2454.31 ^ core/_36192_/Y (AO21x1_ASAP7_75t_R)
  19.97 2474.28 ^ core/_36193_/Y (AND2x2_ASAP7_75t_R)
  16.45 2490.74 ^ core/_36194_/Y (OA21x2_ASAP7_75t_R)
  19.08 2509.82 ^ core/_36195_/Y (OA21x2_ASAP7_75t_R)
  19.86 2529.68 ^ core/_36199_/Y (OA211x2_ASAP7_75t_R)
  17.63 2547.30 ^ core/_36200_/Y (BUFx2_ASAP7_75t_R)
   7.96 2555.27 v core/_36204_/Y (OAI21x1_ASAP7_75t_R)
  15.17 2570.44 ^ core/_36206_/Y (AOI21x1_ASAP7_75t_R)
  22.76 2593.20 ^ core/_36217_/Y (OA211x2_ASAP7_75t_R)
  16.47 2609.67 ^ core/_36218_/Y (BUFx2_ASAP7_75t_R)
  19.15 2628.82 ^ core/_36222_/Y (OA211x2_ASAP7_75t_R)
  16.73 2645.55 ^ core/_36223_/Y (BUFx2_ASAP7_75t_R)
  19.30 2664.85 ^ core/_36227_/Y (OA211x2_ASAP7_75t_R)
  16.73 2681.58 ^ core/_36228_/Y (BUFx2_ASAP7_75t_R)
  19.30 2700.88 ^ core/_36232_/Y (OA211x2_ASAP7_75t_R)
  16.72 2717.60 ^ core/_36233_/Y (BUFx2_ASAP7_75t_R)
  19.30 2736.90 ^ core/_36237_/Y (OA211x2_ASAP7_75t_R)
  15.91 2752.81 ^ core/_36238_/Y (BUFx2_ASAP7_75t_R)
  15.61 2768.42 ^ core/_36239_/Y (OA21x2_ASAP7_75t_R)
  18.20 2786.62 ^ core/_36240_/Y (OA21x2_ASAP7_75t_R)
  16.17 2802.79 ^ core/_36241_/Y (OA21x2_ASAP7_75t_R)
  19.08 2821.87 ^ core/_36242_/Y (OA21x2_ASAP7_75t_R)
  22.47 2844.34 ^ core/_36246_/Y (OA211x2_ASAP7_75t_R)
  18.15 2862.49 ^ core/_36247_/Y (OR2x2_ASAP7_75t_R)
  16.30 2878.79 ^ core/_36252_/Y (AO21x1_ASAP7_75t_R)
  23.70 2902.48 ^ core/_36263_/Y (AND2x2_ASAP7_75t_R)
  21.12 2923.60 v core/_36264_/Y (INVx1_ASAP7_75t_R)
  26.90 2950.50 v core/_49405_/Y (BUFx2_ASAP7_75t_R)
  19.86 2970.36 v core/_49497_/Y (AND2x2_ASAP7_75t_R)
  19.68 2990.04 v core/_49498_/Y (AO21x1_ASAP7_75t_R)
  58.64 3048.67 v core/_57954_/SN (HAxp5_ASAP7_75t_R)
  26.88 3075.56 v core/_49575_/Y (OR2x2_ASAP7_75t_R)
  14.06 3089.62 v core/_49576_/Y (AO21x1_ASAP7_75t_R)
  13.39 3103.00 v core/_49577_/Y (AO21x1_ASAP7_75t_R)
  25.84 3128.85 v core/_49578_/Y (OA211x2_ASAP7_75t_R)
  17.34 3146.19 v core/_49579_/Y (BUFx2_ASAP7_75t_R)
  24.53 3170.72 v core/_49583_/Y (OA211x2_ASAP7_75t_R)
  25.21 3195.94 v core/_49587_/Y (OA211x2_ASAP7_75t_R)
  25.21 3221.15 v core/_49591_/Y (OA211x2_ASAP7_75t_R)
  27.66 3248.81 v core/_49595_/Y (OA211x2_ASAP7_75t_R)
  22.97 3271.78 v core/_49611_/Y (OA21x2_ASAP7_75t_R)
  17.64 3289.42 v core/_49627_/Y (OA21x2_ASAP7_75t_R)
  27.80 3317.22 v core/_49628_/Y (OR3x1_ASAP7_75t_R)
  20.00 3337.22 v core/_49629_/Y (OA21x2_ASAP7_75t_R)
  21.85 3359.07 ^ core/_49630_/Y (INVx1_ASAP7_75t_R)
  30.39 3389.45 ^ core/_49650_/Y (BUFx2_ASAP7_75t_R)
  28.23 3417.68 ^ core/_49700_/Y (BUFx2_ASAP7_75t_R)
  31.75 3449.44 ^ core/_49790_/Y (BUFx2_ASAP7_75t_R)
  10.60 3460.04 v core/_54311_/Y (AOI21x1_ASAP7_75t_R)
  39.33 3499.37 v core/_58180_/SN (HAxp5_ASAP7_75t_R)
  12.05 3511.42 ^ core/_51828_/Y (NOR2x1_ASAP7_75t_R)
   0.00 3511.42 ^ core/_34022_/D (DFFHQNx1_ASAP7_75t_R)
        3511.42   data arrival time

 250.00  250.00   clock clock (rise edge)
   0.00  250.00   clock network delay (ideal)
   0.00  250.00   clock reconvergence pessimism
         250.00 ^ core/_34022_/CLK (DFFHQNx1_ASAP7_75t_R)
 -13.71  236.29   library setup time
         236.29   data required time
---------------------------------------------------------
         236.29   data required time
        -3511.42   data arrival time
---------------------------------------------------------
        -3275.13   slack (VIOLATED)

This is from a bucket with ca. 500ps clock period:

>>> report_checks -from {REG_1\[21\]$_DFF_P_/QN} -to {core/_31299_/D}
Startpoint: REG_1[21]$_DFF_P_ (rising edge-triggered flip-flop clocked by clock)
Endpoint: core/_31299_ (rising edge-triggered flip-flop clocked by clock)
Path Group: reg2reg
Path Type: max

  Delay    Time   Description
---------------------------------------------------------
   0.00    0.00   clock clock (rise edge)
   0.00    0.00   clock network delay (ideal)
   0.00    0.00 ^ REG_1[21]$_DFF_P_/CLK (DFFHQNx1_ASAP7_75t_R)
  37.51   37.51 ^ REG_1[21]$_DFF_P_/QN (DFFHQNx1_ASAP7_75t_R)
   7.51   45.03 v _270_/Y (INVx3_ASAP7_75t_R)
  28.00   73.02 v core/_47159_/Y (OR3x1_ASAP7_75t_R)
  25.10   98.12 v core/_47160_/Y (OR2x2_ASAP7_75t_R)
  48.59  146.71 v core/_47161_/Y (OR5x1_ASAP7_75t_R)
  35.76  182.48 v core/_47162_/Y (OR4x1_ASAP7_75t_R)
  22.26  204.74 v core/_47164_/Y (OR2x2_ASAP7_75t_R)
  39.19  243.93 v core/_47167_/Y (OR5x1_ASAP7_75t_R)
  29.74  273.67 v core/_47169_/Y (OR3x1_ASAP7_75t_R)
  11.31  284.97 ^ core/_47171_/Y (INVx1_ASAP7_75t_R)
  21.78  306.75 ^ core/_47172_/Y (BUFx2_ASAP7_75t_R)
  18.29  325.04 ^ core/_47174_/Y (AND2x2_ASAP7_75t_R)
  13.37  338.41 ^ core/_47175_/Y (AO21x1_ASAP7_75t_R)
  32.36  370.77 ^ core/_57589_/SN (HAxp5_ASAP7_75t_R)
  15.68  386.45 ^ core/_47250_/Y (AO21x1_ASAP7_75t_R)
  12.76  399.21 ^ core/_47252_/Y (AO21x1_ASAP7_75t_R)
  16.51  415.72 ^ core/_47253_/Y (OR3x1_ASAP7_75t_R)
   9.87  425.59 v core/_47254_/Y (INVx1_ASAP7_75t_R)
  27.97  453.56 v core/_47255_/Y (BUFx2_ASAP7_75t_R)
  32.31  485.87 v core/_47256_/Y (BUFx2_ASAP7_75t_R)
   0.00  485.87 v core/_31299_/D (DFFHQNx1_ASAP7_75t_R)
         485.87   data arrival time

 250.00  250.00   clock clock (rise edge)
   0.00  250.00   clock network delay (ideal)
   0.00  250.00   clock reconvergence pessimism
         250.00 ^ core/_31299_/CLK (DFFHQNx1_ASAP7_75t_R)
 -12.21  237.79   library setup time
         237.79   data required time
---------------------------------------------------------
         237.79   data required time
        -485.87   data arrival time
---------------------------------------------------------
        -248.08   slack (VIOLATED)

Could it be that the BUF are giving retiming constipation? If retiming doesn't know what a BUF is, then it leaves it alone?

3 replies

povik Aug 27, 2025
Collaborator

Retiming takes place before mapping to technology and before buffering. It doesn't see the buffer tree delay on high fanout nets and won't take it into account when moving registers to balance delay.

oharboe Aug 27, 2025
Collaborator Author

Silly question: could I feed the netlist post retiming back into synthesis to help retiming take high fanout buffering into account?

povik Aug 27, 2025
Collaborator

If you are thinking of retiming on the mapped and buffered netlist, it won't work out of the box. The infrastructure isn't there for the mapped cells to be visible to the retime command in use.

povik · 2025-08-27T13:32:02Z

povik
Aug 27, 2025
Collaborator

It looks like retiming isn't able to attain the optimum if it's too far off from the starting point.

E.g. take this synthetic example:

module top(input wire clk, input wire [4:0] x, output wire [4:0] yr);
	// 100 chained increments form one long combinational path
	reg [4:0] y;
	always @(*) begin
		y = x;
		for (integer i = 0; i < 100; i++)
			y = y + 1;
	end

	// 100 registers at the end, available for retiming to redistribute
	reg [4:0] y_shift [99:0];
	always @(posedge clk) begin
		y_shift[0] <= y;
		for (integer i = 0; i < 99; i++) begin
			y_shift[i + 1] <= y_shift[i];
		end
	end
	assign yr = y_shift[99];
endmodule

I'm skipping any optimizations which would fuse the additions. Theoretically the 100 registers at the end can be evenly distributed to get a 100x reduction in the critical path.

Abc's retime command starts by seeing a depth of 204.

ABC: Performing analysis:
ABC: Fwd Iter =   0. Delay = 204. Latches =   500. Delta =   0.00. Ratio = 0.00 %

and it achieves final depth of 20.

ABC: Bwd Iter = 379. Delay =  23. Latches =   552. Delta =   4.00. Ratio = 0.72 %
ABC: Bwd Iter = 388. Delay =  22. Latches =   554. Delta =   2.00. Ratio = 0.36 %
ABC: Bwd Iter = 398. Delay =  21. Latches =   559. Delta =   5.00. Ratio = 0.89 %
ABC: Bwd Iter = 408. Delay =  20. Latches =   559. Delta =   0.00. Ratio = 0.00 %
ABC: Backward : Starting delay = 204.  Final delay =  20.  IterBest = 408 (out of 408).

If I iterate it 7 times (I need to use retime -M 4 -v -o for the follow-up invocations so that it takes the previous retiming result as a starting point), I can squeeze it down to 11.

At the same time, another option on the retime command (-M 6) tells me the optimum would be a depth of 3, which would match the 100x reduction after rounding.

ABC: Period = 204.  Iterations = 101.      Feasible
ABC: Period = 102.  Iterations = 101.      Feasible
ABC: Period =  51.  Iterations = 101.      Feasible
ABC: Period =  25.  Iterations = 101.      Feasible
ABC: Period =  12.  Iterations = 101.      Feasible
ABC: Period =   6.  Iterations = 101.      Feasible
ABC: Period =   3.  Iterations = 101.      Feasible
ABC: Period =   1.  Iterations = 100.    Infeasible 
ABC: Period =   2.  Iterations = 100.    Infeasible 
ABC: Period =   3.  Iterations = 101.      Feasible
ABC: The best clock period is   3. (Currently, network is not modified.)

This option looks to be informational and doesn't seem to support actually transforming the netlist to attain a depth of 3.
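The feasibility search in that log can be imitated with a toy model. This is only a sketch under strong assumptions, not ABC's algorithm: it treats the design as a single chain of unit-delay nodes of the reported depth, and assumes that R registers can be placed anywhere on the chain, giving R + 1 segments. The function names are illustrative.

```python
def feasible(period: int, depth: int, registers: int) -> bool:
    """True if a chain of `depth` unit delays fits in registers + 1 segments of length <= period."""
    return period * (registers + 1) >= depth


def best_period(depth: int, registers: int) -> int:
    """Binary search for the smallest feasible period in the toy chain model."""
    lo, hi = 1, depth  # period == depth is always feasible (no registers moved)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid, depth, registers):
            hi = mid       # mid works, try something smaller
        else:
            lo = mid + 1   # mid is too short, the optimum is above it
    return lo


if __name__ == "__main__":
    # Depth 204 and 100 registers, as in the synthetic example above.
    print(best_period(204, 100))
```

With a depth of 204 and 100 registers the toy model agrees with the log: a period of 2 is infeasible (2 * 101 = 202 < 204) and 3 is the smallest feasible value.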

1 reply

povik Aug 27, 2025
Collaborator

@oharboe @maliberty I think the asymptotic delays observed by Oyvind are explained by this (the retiming algorithm not attaining the best balancing for long pipelines, i.e. beyond length 10), and by the crude timing model in use by retiming which doesn't consider buffer tree delays and distinct delays of cells.
