Improve the instruction-level parallelism capability of buckyballs #23

@shirohasuki

Buckyball ILP Proposal: Bank Scoreboard for Dependency Tracking

1. Background & Problem

1.1 Current State

Buckyball's Global ROB supports out-of-order issue and out-of-order completion. However, software currently inserts bb_fence() between every pair of instructions. A fence forces the ROB to drain completely before it accepts the next instruction, which reduces effective ILP to 1 and leaves the out-of-order machinery unused.

Typical pattern:

MVIN(bank0) → fence → RELU(bank0→bank1) → fence → MVOUT(bank1)

The fundamental problem fences solve is bank-level RAW/WAR/WAW data hazards. But fences use the coarsest possible granularity — a global barrier. Even instructions operating on completely independent banks are forced to serialize.

1.2 Goals

  • Implement an instruction-agnostic Bank Scoreboard in framework/frontend/scoreboard/
  • The scoreboard only cares about each instruction's read bank set and write bank set, not the instruction type
  • Retain fence instruction semantics, but redefine fence as a barrier point within the ROB — the ROB scan for issue candidates stops at the fence boundary
  • Unify all instructions' bank information encoding into designated bits of rs1
  • Enable out-of-order skip-issue: instructions on different banks can leapfrog hazard-blocked instructions

2. Unified rs1 Bank Encoding

2.1 Design Principle

All instructions encode bank operands in rs1:

rs1[7:0]   = bank_0  (1st operand bank / MVIN write bank / MVOUT read bank)
rs1[15:8]  = bank_1  (2nd operand bank, dual-operand instructions only)
rs1[23:16] = bank_2  (write/result bank)
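
The layout above could be mirrored on the software side by a small packing helper. This is a minimal sketch in C, assuming the three byte-wide fields listed above; the names bb_pack_banks and bb_bank_field are hypothetical and do not come from the existing bb-tests macros.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the rs1 layout in section 2.1. */
#define BANK0_SHIFT 0u   /* rs1[7:0]   */
#define BANK1_SHIFT 8u   /* rs1[15:8]  */
#define BANK2_SHIFT 16u  /* rs1[23:16] */
#define BANK_MASK   0xFFu

/* Hypothetical helper: pack up to three bank IDs into the low 24 bits of rs1. */
static inline uint64_t bb_pack_banks(uint8_t bank_0, uint8_t bank_1, uint8_t bank_2) {
    return ((uint64_t)bank_0 << BANK0_SHIFT) |
           ((uint64_t)bank_1 << BANK1_SHIFT) |
           ((uint64_t)bank_2 << BANK2_SHIFT);
}

/* Hypothetical helper: extract one bank field, given its shift. */
static inline uint8_t bb_bank_field(uint64_t rs1, unsigned shift) {
    assert(shift == BANK0_SHIFT || shift == BANK1_SHIFT || shift == BANK2_SHIFT);
    return (uint8_t)((rs1 >> shift) & BANK_MASK);
}
```

For a single-operand instruction such as RELU(bank0→bank1), the call would be bb_pack_banks(0, 0, 1), with bank_1 left at zero because its valid flag is off.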

2.2 Per-Instruction New Encoding

Mem Domain Instructions

Instruction | rs1                            | rs2                                          | Change Summary
MVIN (24)   | [7:0]=wr_bank, [63:8]=mem_addr | [9:0]=depth, [28:10]=stride, [63:29]=special | bank moved from rs2 to rs1[7:0]; mem_addr shifted to upper rs1
MVOUT (25)  | [7:0]=rd_bank, [63:8]=mem_addr | [9:0]=depth, [28:10]=stride, [63:29]=special | Same as MVIN
MSET (23)   | [7:0]=bank_id                  | [4:0]=row, [9:5]=col, [10]=alloc, ...        | bank moved from rs2 to rs1[7:0]
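
As an illustration of the new MVIN operand layout, the two registers could be assembled as follows. This is a sketch under the field widths listed for MVIN (56-bit mem_addr, 10-bit depth, 19-bit stride, 35-bit special); the function names mvin_rs1 and mvin_rs2 are hypothetical, and MVOUT would use the same layout with rd_bank in rs1[7:0].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rs1 = { mem_addr[55:0], wr_bank[7:0] }. */
static inline uint64_t mvin_rs1(uint8_t wr_bank, uint64_t mem_addr) {
    assert(mem_addr < (1ULL << 56));   /* must fit in rs1[63:8] */
    return (mem_addr << 8) | wr_bank;
}

/* Hypothetical helper: rs2 = { special[34:0], stride[18:0], depth[9:0] }. */
static inline uint64_t mvin_rs2(uint32_t depth, uint32_t stride, uint64_t special) {
    assert(depth   < (1u << 10));      /* rs2[9:0]   */
    assert(stride  < (1u << 19));      /* rs2[28:10] */
    assert(special < (1ULL << 35));    /* rs2[63:29] */
    return (special << 29) | ((uint64_t)stride << 10) | depth;
}
```

The range checks make an out-of-range field fail loudly in tests instead of silently spilling into a neighbouring field.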

Ball Domain Instructions

Instruction        | Current Encoding                         | Change in New Encoding
MATMUL_WARP16 (32) | op1=rs1[7:0], op2=rs1[15:8], wr=rs2[7:0] | wr moves to rs1[23:16]; rs2 repacked (iter starts at bit 0)
RELU (38)          | op1=rs1[7:0], wr=rs2[7:0]                | wr moves to rs1[23:16]
TRANSPOSE (34)     | op1=rs1[7:0], wr=rs1[15:8]               | wr moves to rs1[23:16]
IM2COL (33)        | op1=rs1[7:0], wr=rs1[15:8]               | wr moves to rs1[23:16]
CONCAT (39)        | op1=rs1[7:0], wr=rs2[7:0]                | wr moves to rs1[23:16]
TRANSFER (45)      | op1=rs1[7:0], wr=rs2[7:0]                | wr moves to rs1[23:16]

BBFP_MUL(26), MATMUL_WS(27), ABFT_SYSTOLIC(42), CONV(43), CIM(44) and other dual-operand instructions follow the MATMUL_WARP16 pattern.

2.3 Unified Extraction

After unification, the GlobalDecoder only needs:

rd_bank_0 = rs1[7:0]       // always
rd_bank_1 = rs1[15:8]      // always (valid flag controls usage)
wr_bank   = rs1[23:16]     // for Ball instructions
          or rs1[7:0]      // for MVIN/MSET (write bank == bank_0)

Plus a simple valid lookup table indexed by func7, producing a fully instruction-agnostic BankAccessInfo.
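
The extraction plus valid lookup can be sketched as a C reference model. The func7 values and access patterns are taken from the tables in section 2.2; the AccessKind enum and the function names are hypothetical, and any func7 not listed is assumed to touch no bank.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool rd_bank_0_valid; uint8_t rd_bank_0_id;
    bool rd_bank_1_valid; uint8_t rd_bank_1_id;
    bool wr_bank_valid;   uint8_t wr_bank_id;
} BankAccessInfo;

/* Hypothetical classification of how an instruction uses the rs1 bank fields. */
typedef enum { ACC_NONE, ACC_WR0, ACC_RD0, ACC_RD0_WR2, ACC_RD01_WR2 } AccessKind;

/* The "valid lookup table", indexed by func7. */
static AccessKind access_kind(unsigned func7) {
    switch (func7) {
    case 23: case 24:                    /* MSET, MVIN: write bank_0 */
        return ACC_WR0;
    case 25:                             /* MVOUT: read bank_0 */
        return ACC_RD0;
    case 33: case 34: case 38: case 39: case 45:
        return ACC_RD0_WR2;              /* single-operand Ball instructions */
    case 26: case 27: case 32: case 42: case 43: case 44:
        return ACC_RD01_WR2;             /* dual-operand Ball instructions */
    default:
        return ACC_NONE;                 /* assumed: no scratchpad bank access */
    }
}

static BankAccessInfo decode_bank_access(unsigned func7, uint64_t rs1) {
    assert(func7 < 128);                 /* func7 is a 7-bit field */
    AccessKind k = access_kind(func7);
    uint8_t b0 = (uint8_t)(rs1 & 0xFF);
    uint8_t b1 = (uint8_t)((rs1 >> 8) & 0xFF);
    uint8_t b2 = (uint8_t)((rs1 >> 16) & 0xFF);

    BankAccessInfo info;
    info.rd_bank_0_valid = (k == ACC_RD0 || k == ACC_RD0_WR2 || k == ACC_RD01_WR2);
    info.rd_bank_0_id    = b0;           /* always extracted; valid flag gates use */
    info.rd_bank_1_valid = (k == ACC_RD01_WR2);
    info.rd_bank_1_id    = b1;
    info.wr_bank_valid   = (k == ACC_WR0 || k == ACC_RD0_WR2 || k == ACC_RD01_WR2);
    info.wr_bank_id      = (k == ACC_WR0) ? b0 : b2;  /* MVIN/MSET write bank_0 */
    return info;
}
```

Only access_kind depends on the instruction type; everything downstream of decode_bank_access sees the same six fields regardless of func7.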


3. Bank Scoreboard Design

3.1 Location

framework/frontend/scoreboard/BankScoreboard.scala

3.2 Core Interface

class BankAccessInfo(bankIdLen: Int) extends Bundle {
  val rd_bank_0_valid = Bool()
  val rd_bank_0_id    = UInt(bankIdLen.W)
  val rd_bank_1_valid = Bool()
  val rd_bank_1_id    = UInt(bankIdLen.W)
  val wr_bank_valid   = Bool()
  val wr_bank_id      = UInt(bankIdLen.W)
}

The scoreboard only receives BankAccessInfo — it has no knowledge of instruction types, domains, or func7 codes.

3.3 Internal Data Structure

val bankRdCount = RegInit(VecInit(Seq.fill(bankNum)(0.U(cntWidth.W))))
val bankWrCount = RegInit(VecInit(Seq.fill(bankNum)(0.U(cntWidth.W))))
// cntWidth = log2Ceil(rob_entries + 1)
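
The counter width can be checked with a short calculation. The +1 is presumably there because a counter has to represent every value from 0 to rob_entries inclusive, on the assumption that each in-flight ROB entry contributes at most one count per bank. The helpers below are a hypothetical C equivalent of Chisel's log2Ceil, written only to work through the arithmetic.

```c
#include <assert.h>

/* Smallest number of bits b such that 2^b >= n (same result as log2Ceil for n >= 1). */
static unsigned log2_ceil(unsigned n) {
    assert(n > 0);
    unsigned bits = 0;
    while ((1u << bits) < n) {
        bits++;
    }
    return bits;
}

/* cntWidth = log2Ceil(rob_entries + 1) */
static unsigned cnt_width(unsigned rob_entries) {
    return log2_ceil(rob_entries + 1u);
}
```

With a 16-entry ROB this gives 5 bits, since a 4-bit counter would top out at 15 and could not hold the value 16.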

3.4 Hazard Detection Rules

New instruction reads bank X  → requires bankWrCount[X] == 0     (RAW hazard)
New instruction writes bank X → requires bankRdCount[X] == 0     (WAR hazard)
                                 AND     bankWrCount[X] == 0     (WAW hazard)

def hasHazard(info: BankAccessInfo): Bool = {
  val rd0 = info.rd_bank_0_valid && bankWrCount(info.rd_bank_0_id) =/= 0.U
  val rd1 = info.rd_bank_1_valid && bankWrCount(info.rd_bank_1_id) =/= 0.U
  val wr  = info.wr_bank_valid && (
    bankRdCount(info.wr_bank_id) =/= 0.U ||
    bankWrCount(info.wr_bank_id) =/= 0.U
  )
  rd0 || rd1 || wr
}
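
The same rules can be written as a software reference model, which may be handy for checking expected behaviour in tests. This is a sketch in C, assuming 8-bit bank IDs so that BANK_NUM is 256; the struct and function names are hypothetical and simply mirror the Chisel code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_NUM 256  /* assumption: one slot per possible 8-bit bank ID */

typedef struct {
    bool rd_bank_0_valid; uint8_t rd_bank_0_id;
    bool rd_bank_1_valid; uint8_t rd_bank_1_id;
    bool wr_bank_valid;   uint8_t wr_bank_id;
} BankAccessInfo;

typedef struct {
    unsigned rd_count[BANK_NUM];  /* in-flight readers per bank */
    unsigned wr_count[BANK_NUM];  /* in-flight writers per bank */
} BankScoreboard;

static bool has_hazard(const BankScoreboard *sb, const BankAccessInfo *a) {
    assert(sb != NULL && a != NULL);
    /* RAW: a read must wait for every in-flight writer of that bank. */
    bool rd0 = a->rd_bank_0_valid && sb->wr_count[a->rd_bank_0_id] != 0;
    bool rd1 = a->rd_bank_1_valid && sb->wr_count[a->rd_bank_1_id] != 0;
    /* WAR and WAW: a write must wait for in-flight readers and writers. */
    bool wr  = a->wr_bank_valid &&
               (sb->rd_count[a->wr_bank_id] != 0 || sb->wr_count[a->wr_bank_id] != 0);
    return rd0 || rd1 || wr;
}
```

With an MVIN to bank 0 in flight (wr_count[0] == 1), a RELU reading bank 0 is blocked, while an instruction that only touches banks 2 and 3 is free to issue. Two readers of the same bank never block each other, because reads only check the write counter.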

3.5 Counter Updates

  • On issue (issue.fire): increment bankRdCount for each bank the instruction reads and bankWrCount for the bank it writes
  • On complete (complete.fire): decrement the same counters for the same banks

Completion requires retrieving BankAccessInfo from the ROB entry, so GlobalRobEntry must store BankAccessInfo.
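
The issue and complete updates could be modelled as a symmetric pair of functions. This is a sketch in C with hypothetical names. It assumes that an instruction whose two read operands name the same bank is counted once for that bank, which keeps each counter within rob_entries; the proposal does not specify this case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BANK_NUM 256  /* assumption: one slot per possible 8-bit bank ID */

typedef struct {
    bool rd_bank_0_valid; uint8_t rd_bank_0_id;
    bool rd_bank_1_valid; uint8_t rd_bank_1_id;
    bool wr_bank_valid;   uint8_t wr_bank_id;
} BankAccessInfo;

typedef struct {
    unsigned rd_count[BANK_NUM];
    unsigned wr_count[BANK_NUM];
} BankScoreboard;

/* True when rd_bank_1 repeats rd_bank_0 and so must not be counted again. */
static bool rd1_is_duplicate(const BankAccessInfo *a) {
    return a->rd_bank_0_valid && a->rd_bank_1_valid &&
           a->rd_bank_0_id == a->rd_bank_1_id;
}

/* Called when the instruction issues (issue.fire in the hardware). */
static void sb_issue(BankScoreboard *sb, const BankAccessInfo *a) {
    if (a->rd_bank_0_valid) sb->rd_count[a->rd_bank_0_id]++;
    if (a->rd_bank_1_valid && !rd1_is_duplicate(a)) sb->rd_count[a->rd_bank_1_id]++;
    if (a->wr_bank_valid) sb->wr_count[a->wr_bank_id]++;
}

/* Called on completion with the BankAccessInfo stored in the ROB entry. */
static void sb_complete(BankScoreboard *sb, const BankAccessInfo *a) {
    if (a->rd_bank_0_valid) {
        assert(sb->rd_count[a->rd_bank_0_id] > 0);
        sb->rd_count[a->rd_bank_0_id]--;
    }
    if (a->rd_bank_1_valid && !rd1_is_duplicate(a)) {
        assert(sb->rd_count[a->rd_bank_1_id] > 0);
        sb->rd_count[a->rd_bank_1_id]--;
    }
    if (a->wr_bank_valid) {
        assert(sb->wr_count[a->wr_bank_id] > 0);
        sb->wr_count[a->wr_bank_id]--;
    }
}
```

Because sb_complete consumes the same BankAccessInfo that sb_issue saw, every increment is matched by exactly one decrement, which is why the ROB entry has to carry the structure.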

3.6 MSET Precise Tracking

MSET is marked as wr_bank_valid=true, wr_bank_id=rs1[7:0]. It serializes only with instructions accessing the same bank, without blocking other banks.

3.7 GP Domain (RVV) Instructions

GP instructions don't access scratchpad banks. Their BankAccessInfo has all valid flags set to false — they are never blocked by the scoreboard and don't block others.


4. Fence Instruction — New Semantics

4.1 Design

Fence now enters the ROB (currently it does not), but it is never dispatched to any execution domain.

Fence acts as an issue barrier within the ROB. When the ROB scans from head to find issuable instructions, it stops scanning at the first unresolved fence — all instructions after the fence are invisible to the issue logic.

4.2 Fence Lifecycle

  1. Fence enters ROB, receives an entry, marked with a fence flag
  2. ROB issue scan uses fence as boundary — only instructions before fence are candidates
  3. When fence reaches head (all prior instructions committed), fence auto-completes
  4. Head advances past fence, subsequent instructions become visible for issue

4.3 Relationship with Scoreboard

Instructions before the fence are still managed by the scoreboard — different-bank instructions can issue out of order. The fence only limits the scan window boundary, ensuring post-fence instructions cannot leapfrog the fence.

4.4 Practical Effect

  • Without fence: All instructions in ROB are candidates based on bank dependencies — maximum parallelism
  • With fence: Instructions before and after fence are strictly ordered; instructions within each group can still reorder freely
  • Software can selectively use fence for forced ordering (debugging, special semantics, etc.)

5. ROB Issue Logic Modification

5.1 Current Logic

// Scan from head, find first valid && !issued && !complete instruction
scanValid(i) := robValid(ptr) && !robIssued(ptr) && !robComplete(ptr)

5.2 New Logic

// Add two conditions: (1) no bank hazard (2) not behind a fence
scanValid(i) := robValid(ptr) && !robIssued(ptr) && !robComplete(ptr)
                && !hasHazard(robEntries(ptr).bankAccess)
                && !isBehindFence(i)

isBehindFence computation:
Scan from head; once a valid, uncompleted fence entry is encountered, all subsequent positions are marked isBehindFence = true.

val isBehindFence = Wire(Vec(rob_entries, Bool()))
var seenFence = false.B
for (i <- 0 until rob_entries) {
  val ptr = (headPtr + i.U) % rob_entries.U
  val isFence = robValid(ptr) && isFenceEntry(ptr) && !robComplete(ptr)
  seenFence = seenFence || isFence
  isBehindFence(i) := seenFence  // the fence itself and everything after it are masked
}
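
The scan can also be expressed as a C reference model to make the skip-issue and barrier behaviour concrete. This is a sketch with hypothetical names and an assumed ROB depth of 8. The hazard array stands in for hasHazard evaluated at each scan position, and positions are offsets from head, as in the Chisel loop.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_ENTRIES 8  /* assumed ROB depth for this sketch */

typedef struct {
    bool valid, issued, complete, is_fence;
} RobEntry;

/* behind[i] is true when scan position i is at or after the first
   valid, uncompleted fence, counting from head. */
static void compute_fence_barrier(const RobEntry *rob, unsigned head, bool *behind) {
    assert(head < ROB_ENTRIES);
    bool seen_fence = false;
    for (unsigned i = 0; i < ROB_ENTRIES; i++) {
        const RobEntry *e = &rob[(head + i) % ROB_ENTRIES];
        seen_fence = seen_fence || (e->valid && e->is_fence && !e->complete);
        behind[i] = seen_fence;
    }
}

/* Returns the scan offset of the first issuable entry, or -1 if there is none.
   hazard[i] models hasHazard for the entry at scan position i. */
static int first_issue_candidate(const RobEntry *rob, unsigned head, const bool *hazard) {
    bool behind[ROB_ENTRIES];
    compute_fence_barrier(rob, head, behind);
    for (unsigned i = 0; i < ROB_ENTRIES; i++) {
        const RobEntry *e = &rob[(head + i) % ROB_ENTRIES];
        if (e->valid && !e->issued && !e->complete && !hazard[i] && !behind[i]) {
            return (int)i;
        }
    }
    return -1;
}
```

If the head entry is hazard-blocked and the next entry is independent, the scan returns offset 1, which is the leapfrog case. If a fence sits between the two, the scan returns -1 until the fence completes.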

5.3 Fence Auto-Completion

When fence becomes head (all prior instructions committed), it auto-marks as issued + complete:

when (robValid(headPtr) && isFenceEntry(headPtr)) {
  robIssued(headPtr)   := true.B
  robComplete(headPtr) := true.B
}
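
The fence lifecycle at the head of the ROB can be modelled in a few lines. This is a sketch in C with hypothetical names and an assumed ROB depth of 8. It folds auto-completion and head retirement into one step, whereas the hardware would likely spread them over separate cycles, and it assumes the head retires as soon as its entry is complete.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_ENTRIES 8  /* assumed ROB depth for this sketch */

typedef struct {
    bool valid, issued, complete, is_fence;
} RobEntry;

typedef struct {
    RobEntry e[ROB_ENTRIES];
    unsigned head;
} Rob;

/* One simplified head step. Returns true if the head entry retired. */
static bool rob_step_head(Rob *r) {
    assert(r->head < ROB_ENTRIES);
    RobEntry *h = &r->e[r->head];

    /* A fence at head means all older entries have retired: auto-complete it. */
    if (h->valid && h->is_fence) {
        h->issued = true;
        h->complete = true;
    }

    /* Retire a completed head entry and advance (assumed commit policy). */
    if (h->valid && h->complete) {
        h->valid = false;
        r->head = (r->head + 1) % ROB_ENTRIES;
        return true;
    }
    return false;
}
```

A fence that is not yet at head is left untouched, so it keeps masking younger entries until every older instruction has completed and retired.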

6. Files to Modify

Hardware (Scala)

File                                                                 | Change
framework/frontend/scoreboard/BankScoreboard.scala (NEW)             | BankAccessInfo definition, BankScoreboard module
framework/frontend/decoder/GobalDecoder.scala                        | Add BankAccessInfo to PostGDCmd; add bank extraction logic with valid lookup table
framework/frontend/globalrs/GlobalROB.scala                          | Add BankAccessInfo + isFence to GlobalRobEntry; integrate BankScoreboard; modify issue logic (hazard + fence barrier); fence auto-completion
framework/frontend/globalrs/GlobalReservationStation.scala           | Remove old fence logic (fenceActive etc.); fence now enters ROB instead of stalling until drain
framework/memdomain/frontend/cmd_channel/decoder/DomainDecoder.scala | Extract bank_id from rs1[bankIdLen-1:0]; mem_addr from rs1 upper bits; iter/stride from new rs2 positions
examples/toy/balldomain/DomainDecoder.scala                          | wr_bank from rs1[23:16]; update decode table

Software (C instruction macros)

File                                       | Change
bb-tests/.../isa/24_mvin.c                 | bank_id to rs1[7:0], mem_addr to rs1 upper bits
bb-tests/.../isa/25_mvout.c                | Same as MVIN
bb-tests/.../isa/23_mset.c (or equivalent) | bank_id to rs1[7:0]
bb-tests/.../isa/32_mul_warp16.c           | wr_bank from rs2[7:0] to rs1[23:16]
bb-tests/.../isa/38_relu.c                 | wr_bank from rs2[7:0] to rs1[23:16]
bb-tests/.../isa/34_transpose.c            | wr_bank from rs1[15:8] to rs1[23:16]
bb-tests/.../isa/33_im2col.c               | wr_bank from rs1[15:8] to rs1[23:16]
bb-tests/.../isa/45_transfer.c             | wr_bank from rs2[7:0] to rs1[23:16]
Other Ball instruction macro files         | Similar treatment
All test .c files                          | Remove bb_fence() calls

7. Implementation Steps

  1. Define BankAccessInfo Bundle; create BankScoreboard module in framework/frontend/scoreboard/
  2. Modify software instruction macros (MVIN/MVOUT/MSET/Ball instructions) — unify bank encoding to rs1
  3. Modify MemDomainDecoder to adapt to new rs1/rs2 field positions
  4. Modify BallDomainDecoder — wr_bank from rs1[23:16]
  5. Modify GlobalDecoder — add BankAccessInfo extraction logic
  6. Modify GlobalROB — integrate BankScoreboard, implement hazard detection + fence barrier scan
  7. Modify GlobalReservationStation — remove old fence logic, fence enters ROB
  8. Remove bb_fence() calls from tests
  9. Compile + run tests
