---
title: 'CS151: Computer Architecture'
subtitle: Professor Reinman
date: Winter 2021
author: Thilan Tran
mainfont: Libertinus Serif
monofont: Iosevka
fontsize: 14pt
toc: true
documentclass: extarticle
header-includes: |
  \definecolor{Light}{HTML}{F4F4F4}
  \let\oldtexttt\texttt
  \renewcommand{\texttt}[1]{
    \colorbox{Light}{\oldtexttt{#1}}
  }
  \usepackage{fancyhdr}
  \pagestyle{fancy}
  \usepackage{circuitikz}
---
\newpage{}
- eight central ideas of computer architecture design:
- design for Moore's law
- use abstraction to simplify design
- make the common case fast
- performance via parallelism
- eg. memory operations
- performance via pipelining
- ie. using an assembly line of specialized tasks
- performance via prediction
- eg. predicting control flow branches
- memory hierarchy
- dependability via redundancy
\newpage{}
- general hardware components of a computer:
- processor with:
- datapaths that perform operations on data
- control operations that sequence the datapath
- cache memory
- main system memory
- other devices eg. I/O with input devices, external storage devices, network adaptors, etc.
- stack of layers between a user app and the hardware of a computer:
- user apps
- system apps
- Java API framework
- native C/C++ libraries
- hardware abstraction layer
- Linux kernel
- the actual silicon hardware eg. may be a system on chip (SoC)
- need to balance specificity of hardware with portability of applications
- in a SoC, instead of just a single component, the chip contains multiple components packaged together that interact with each other:
- eg. CPU, GPU, memory, I/O, media, etc.
- pros:
- better integration and lower latency vs. separated components that do I/O through pins
- on-chip components are embedded and stacked together more tightly
- cons:
- may lead to new problems managing heat
- manufacturing silicon chips:
- slice a silicon ingot into wafers
- process the wafers with patterns through a lithography process
- dice the wafers into dies
- package dies
- at each step, failures can occur
- yield is the proportion of working dies per wafer
- layers of program code:
- high-level language
- very abstracted, portable across different systems
- assembly language:
- textual representation of instructions
- specific to a system ie. machine-specific instructions, eg. MIPS vs. x86
- hardware representation:
- encoded instructions and data in binary
- what the machine actually reads
- the instruction set architecture (ISA) exposes a set of primitives to the layers above it in order to work with the silicon hardware below the ISA:
- ie. the interface or template for interactions between the CPU and the drivers, OS, and apps above it
- defines different instructions like `add` or `movsql`, calling protocols, register conventions, etc.
- in this class: how do we implement the silicon hardware to actually implement an ISA, using the building blocks of a microprocessor?
- evaluating performance:
- performance is one aspect of evaluating a microprocessor:
- other areas include power management, reliability, size, etc.
- can be evaluated in terms of latency or response time (delay), vs. throughput (bandwidth)
- big-picture constraints on performance are power, instruction-level parallelism, and memory latency
- consider the following equations for response time:
$$\textit{ET} = \textit{IC} \times \textit{CPI} \times T_C$$
$$\textit{ET} = \frac{\textit{IC} \times \textit{CPI}}{T_R}$$
- performance is inversely related to response time
- ET is the execution time
- IC are the instruction counts
- dynamic runtime count considering loops and recursion, rather than the static count of a program
- CPI are the clock cycles per instruction:
- clocks need to be used due to the synchronous nature of memory in circuits
- but since different instruction types will have different cycle counts, depends on both:
- mix of instructions used
- hardware implementation of these instructions
- thus to find the CPI as a weighted average, need to compute the percentages and cycles of the mix of instructions that make up the IC:
$$\textit{CPI} = \sum_{i=1}^n \textit{CPI}_i \times f_i$$
- where $f_i$ is the relative frequency of instruction type $i$, which has cycle count $\textit{CPI}_i$
- note that the CPI is a count dependent on the particular program and the particular hardware it is running on
- $T_C$ is the cycle time ie. clock period:
- inverse to clock rate $T_R$
- at a shorter clock period, CPI may increase since there are fewer cycles for a more complex instruction to complete
- direct tradeoff between CPI and $T_C$
- eg. $T_R$ is commonly between 3.5-4 GHz
- recently topping out due to power efficiency limitations
- note that while total elapsed time records processing, I/O, and idle times, CPU time only records the time spent processing a single job
- discounts the time of other jobs' shares, I/O, etc.
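The execution-time equations above can be sanity-checked with a small Python sketch; the instruction mix, counts, and clock rate below are made-up illustrative values, not course data:

```python
def weighted_cpi(mix):
    """mix: list of (relative frequency f_i, CPI_i) pairs summing to 1."""
    return sum(f * cpi for f, cpi in mix)

# hypothetical mix: ALU ops, loads/stores, branches
mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = weighted_cpi(mix)        # 0.5*1 + 0.3*2 + 0.2*4 = 1.9
ic = 1_000_000                 # dynamic instruction count
t_c = 1 / 2e9                  # 2 GHz clock rate -> 0.5 ns period
et = ic * cpi * t_c            # ET = IC * CPI * T_C
print(cpi, et)
```

Note how the same ET falls out of either form of the equation, since $T_C = 1 / T_R$.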
- effect of stack layers on performance:
- algorithm affects IC, possibly CPI
- could change the number or type of instructions used eg. multiplication vs. addition
- similar logic for programming language and compiler choice
- programming language affects IC, CPI
- compiler affects IC, CPI
- ISA affects IC, CPI, $T_C$
- ISA may limit the expressiveness of higher-level languages as well as the expressiveness of underlying hardware implementations
- hardware affects CPI, $T_C$
- eg. change clock period by reducing wire delay, change CPI by improving implementation of an instruction
- performance pitfalls:
- Amdahl's law explains how improving one aspect of a computer affects the overall performance:
$$T_{\textit{improved}} = T_{\textit{unaffected}} + \frac{T_{\textit{affected}}}{\textit{improvement factor}}$$
- thus, we should strive to make the common case fast
- using millions of instructions per second (MIPS) as a performance metric can be extremely misleading:
- ie. ignores instruction count when considering execution time
- doesn't account for different ISAs and different instruction complexities
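Amdahl's law can be checked numerically; a minimal Python sketch with made-up times:

```python
def amdahl(t_total, t_affected, improvement_factor):
    """T_improved = T_unaffected + T_affected / improvement factor."""
    return (t_total - t_affected) + t_affected / improvement_factor

# speeding up the part that takes 80 s of a 100 s program by 4x:
print(amdahl(100, 80, 4))   # the other 20 s is untouched
```

Even an infinite speedup of the affected part can never push the total below the unaffected 20 s, which is why the common case should be made fast.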
- consider the following equation for power:
$$\textit{power} = \textit{capacitive load} \times \textit{voltage}^2 \times \textit{frequency}$$
- recently hitting a "wall" of power, ie. voltage has not been scaling down enough
- problems with removing heat caused by so much power
- there is a need to find other ways to decrease power usage
- eg. dynamic voltage scaling or dynamic frequency scaling to decrease power proportionally depending on the usage of the chip
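The quadratic voltage term dominates the power equation; a quick Python sketch (capacitance and frequency values are arbitrary placeholders):

```python
def dynamic_power(c_load, voltage, frequency):
    """power = capacitive load * voltage^2 * frequency"""
    return c_load * voltage ** 2 * frequency

# voltage enters squared: halving it alone cuts power by 4x
p_full = dynamic_power(1e-9, 1.0, 2e9)
p_half = dynamic_power(1e-9, 0.5, 2e9)
print(p_full / p_half)
```

This is why voltage scaling was historically the main lever, and why power hit a "wall" once voltage could no longer shrink.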
- while performance has been steadily improving, power demands for chips have similarly increased dramatically:
- thus, rather than pushing for more performant, power-hungry, monolithic cores, instead design towards multi-core designs in a many-core revolution
- in a multi-core design, many cores can work together or alternatively multitask
- put more than one processor per chip
- pros:
- can dramatically improve performance, beyond the power wall limitation
- cons:
- requires explicitly parallel programming by programmers
- most programs are not "embarrassingly" parallel
- vs. instruction level parallelism:
- where the hardware executes multiple instructions at once
- hidden from programmers
\newpage{}
- the instruction set architecture (ISA) is the repertoire of instructions of a computer:
- different computers have different ISAs
- early computers had simple ISAs
- some modern computers continue to use simple ISAs
- key ISA decisions:
- number, length, type of operations
- number, location, type of operands
- also how to specify operands
- instruction format, eg. limits on size
- main classes of ISAs are CISC and RISC
- complex instruction set computers (CISC) provide a large number of instructions
- pros:
- many specialized complex instructions with dedicated hardware components
- can optimize for specialized operations
- good for tight memory restrictions where program size was optimized
- eg. VAX, x86
- eg. `a*b + c*d` may be implemented with a single multiply-add-multiply (MAM) instruction:
- need more overall opcodes to account for many different instructions
- a MAM would take 5 operands
- reduced instruction set computers (RISC) have relatively fewer instructions
- pros:
- enables pipelining and parallelism because of simplicity of hardware:
- smaller, simpler pieces
- no need to individually parallelize the hardware for many complex instructions
- easier to optimize and reuse hardware for efficiency
- easier to validate hardware
- eg. MIPS, PowerPC, ARM
- this class uses MIPS
- eg. `a*b + c*d` may be implemented with two multiplies and an add instruction:
- reusing basic instructions, but uses 3 opcodes total for the 3 instructions
- the 3 instructions would altogether take $3 \times 3 = 9$ total operands
- blurring distinction between CISC and RISC:
- on the CISC side, x86 reduces instructions to micro-ops that look like RISC instructions
- downside is that it is more difficult for compiler to optimize these complex instructions that will later break down into micro-ops
- on the RISC side, ARM is also performing a process called micro-op fusion that takes smaller instructions and fuses them into more complex operations, until they can be broken apart again:
- with some more complex instructions, may take up less space even in the hardware to implement
- in part motivated by the limited instruction window available in out-of-order execution optimization
- performance tradeoff of RISC vs. CISC:
- RISC typically has a higher IC
- less expressive instructions available
- RISC can have a lower cycle time or lower CPI:
- $T_C$ can decrease since latency for more complex operations is gone
- average CPI can drop since expensive complex operations are broken down
- considerations with ISA instruction lengths:
- variable length (VL) instructions have different sizes, eg. a `ret` instruction may only need 8 bits, while a MAM instruction may need 64 bits (with some alignment rules)
- contrasted to fixed length (FL) instructions
- typically, CISC uses VL and RISC uses FL
- performance tradeoff:
- less memory fragmentation with VL
- more flexibility and less bit-width related limits with VL
- thus VL can have a lower IC, but may also have a higher $T_C$ and average CPI, due to increased decoding time in the hardware
- easier instruction decoding with FL
- ISA instruction operands can generally refer to different values:
- an immediate, stored directly in the instruction code
- a register in a register file:
- latency of 1-3 cycles
- a 5-bit register number specifies one of the 32 MIPS registers (one level of indirection)
- fewer bits required than the full register address
- memory:
- latency of 100s of cycles
- $2^{32}$ addressable locations in MIPS
- one way to form an effective address to memory is to refer to a register that then contains a memory address (two levels of indirection)
- could also supply an offset added to an address in a register to form an effective address
- note that branch and jump instructions treat the addresses passed as immediates at a word offset (since neither instruction type accepts a full 32-bit immediate)
- the types of operands a MIPS instruction can take (ie. the addressing mode) is based off its type:
- R-type instructions take 3 registers eg. `add`:
- `0` is the single R-type opcode, but many possible instructions are specified through a `funct` field eg. `add, sub, or`
- I-type instructions take 2 registers and an immediate eg. `addi, lw, sw`:
- different opcodes for each I-type
- J-type instructions just take an immediate eg. `j, jal`:
- different opcodes for each J-type
- on the other hand, in the addressing mode of x86, every instruction can refer into memory:
- leads to increased complexity for all instructions
- more difficult to isolate memory loads and stores in instructions and perform optimizations like prefetching
- thus in x86-like addressing, IC can go down, but the CPI and $T_C$ could go up
- in addition, in RISC compared to CISC, there is more pressure on the register file since the only way to load memory is to use `lw` to load it into a register first:
- can lead to thrashing and smaller working sets
- RISC thus typically has larger register files
- instruction format field tradeoffs:
- the 5-bit fields for register numbers are dependent on the total number of registers
- eg. if we increased the register file size, may instead require 6 bits in the instruction format for all registers
- with more registers, there is less spilling and fewer `lw` instructions, which have the most latency:
- in a `lw`, have to go to memory (after effective address computation from immediate and register) and afterwards back to the register file
- thus lower IC and CPI
- but must take the extra bits for the register fields from somewhere else
- could reduce the number of bits for immediates
- may increase IC, since programs may need more instructions to build the same immediates
- could reduce the number of bits for an opcode:
- may lose the number of functions we can specify
- in addition, every format's opcode size must change, since they must be equal
- may increase IC
- could reduce the number of bits for a shift
- less expressiveness, may increase IC
- thus if we don't use a certain feature while performing many spills, it may be beneficial to adjust field bit widths
- eg. if not many shifts are performed while often spilling to memory, it may be a good idea to expand the register fields
- on the other hand, a hardware impact of increasing register file size is increased latency ie. potentially increased $T_C$
- more hardware in order to implement the register file
- pipelining and parallelism sidenote:
- pipelining in hardware
- pipelining the actual hardware components, eg. staggering frontend / backend instruction execution at the same time
- pipelining in software:
- software pipelining acts more as a compiler optimization
- although the hardware isn't being overlapped, the code is, and the compiler has access to more layered code to optimize
- parallelism in hardware
- designing hardware components that can run at the same time
- parallelism in software
- using multithreading
- in three-op code, two sources and one destination are given:
- all arithmetic operations follow this form
- eg. in `add a, b, c`, `a` gets `b + c`
Translating `f = (g+h) - (i+j)`:

```asm
; pseudocode that abstracts memory locations of identifiers
add t0, g, h ; uses temporary registers
add t1, i, j
sub f, t0, t1
```
- register operands are often used in arithmetic operations:
- as the top of memory hierarchy, registers are used for frequently accessed data:
- smaller memory is typically faster, related to physical design of the wired delay
- much faster than main memory
- very important for compilers to perform register allocation to use registers as much as possible
- in MIPS, registers are numbered from 0 to 31 in a 32 by 32-bit register file
- `t0...t9` for temporaries, `s0...s7` for saved variables
- note that 5 bits (to store 0 to 31) are used to store the location a desired 32-bit value is in
- double indirection if 32 bits hold an address
- main memory is primarily used for large data (eg. arrays, structures):
- memory values must be loaded into registers
- results must then be stored from registers to memory
- MIPS specifics:
- memory is byte addressed
- though larger values may be pulled in from memory than only a byte (eg. a word or 4 bytes)
- addresses must be a multiple of 4
- MIPS is big endian
- memory reference operands take in a static offset:
- I-type instructions take two registers and an immediate
- eg. in `lw t0, 32(s3)`, `rt = t0` gets the contents of the address at `rs = s3` offset by the immediate `i` (which has been sign-extended from 16 to 32 bits)
- eg. in `sw t0, 16(s3)`, the contents of `rt = t0` are stored into the address at `rs = s3` offset by the sign-extended immediate `i`
- unlike most instructions, `lw, sw` have non-deterministic latency since they have to go into main memory
- introduces a lot of different problems with scheduling, etc.
Translating `g = h + A[8]`:

```asm
; g in s1, h in s2, base address of A in s3
lw t0, 32(s3) ; 4 bytes per word, index 8 requires offset of 32
add s1, s2, t0
```

Translating `A[12] = h + A[8]`:

```asm
lw t0, 32(s3) ; load word
add t0, s2, t0
sw t0, 48(s3) ; store word back into memory
```
- immediates are constant data values specified in an instruction:
- ie. literal values that will not change
- eg. `addi s3, s3, 4` or `addi s2, s1, -1`
- no subtract immediate instruction (minimizing instructions)
- has a limit of 16 bits, in order to make the common case fast and avoid a load
- logical ops:
- `<<` is `sll`:
- `shamt` instruction field specifies how many positions to shift
- `>>` is `srl`
- `&` is `and, andi`:
- 3-op code with two sources and one destination
- `|` is `or, ori`
- `~` is `nor`
- sometimes, a 32-bit constant is needed instead of the typical 16-bit immediate:
- `lui rt, constant` is the load-upper-immediate instruction:
- copies the 16-bit constant to the left 16 bits of `rt`
- while clearing the right 16 bits of `rt` to 0
- `ori rd, rt, constant` is the or-immediate instruction:
- can use it to load the lower 16 bits of the 32-bit constant
- together, can be used to form a 32-bit constant in a register
- can now be used for arithmetic operations or for jumping to a larger 32-bit address
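The `lui`/`ori` pairing can be modeled bit-for-bit in Python; a minimal sketch of the two steps:

```python
def lui(constant16):
    """lui: place the 16-bit constant in the upper half, clear the lower half."""
    return (constant16 & 0xFFFF) << 16

def ori(reg, constant16):
    """ori: OR a zero-extended 16-bit immediate into the register value."""
    return reg | (constant16 & 0xFFFF)

# build the 32-bit constant 0x12345678 in two instructions:
value = ori(lui(0x1234), 0x5678)
print(hex(value))
```

Zero-extension in `ori` is what makes this work; a sign-extending instruction like `addi` could corrupt the upper half already loaded by `lui`.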
- with branch instructions, control of the program is conditioned on data:
- ie. modifying the program counter out of sequence
- `beq rs, rt, L1` will branch to the instruction labeled `L1` if `rs == rt`
- `bne rs, rt, L1` will branch to the instruction labeled `L1` if `rs != rt`
- `j L1` is an unconditional jump to the instruction labeled `L1`
- the branch instructions `beq, bne` are both I-type instructions that only take a 16-bit immediate:
- but the PC holds a 32-bit address
- most branches don't go very far, so PC-relative addressing is used (forward or backward):
- not-taken address is `PC + 4`
- target ie. taken address is `(PC + 4) + offset * 4` where `offset` is the immediate value:
- instructions are always on a 32-bit granularity, so `offset * 4` allows an 18-bit address to be formed from 16 bits of space using word-level addressing (rather than the byte-level addressing of `lw, sw`)
- using a sign extension with a shift by two
- the initial `PC + 4` is a convention since the program counter has already been incremented
- note that the labels given to the instructions become encoded as immediates during linking based on the layout in memory
- if the branch target is too far to encode in a 16-bit immediate, assembler will rewrite the code using a jump
Branching far away:

```asm
beq s0, s1, L1
; becomes rewritten as
bne s0, s1, L2
j L1
L2:
```
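The PC-relative target calculation can be sketched in Python; the PC values below are arbitrary word-aligned examples:

```python
def branch_target(pc, offset_field):
    """Taken address: (PC + 4) + sign_extend(offset) * 4.
    offset_field is the 16-bit immediate as encoded in the instruction."""
    offset = offset_field & 0xFFFF
    if offset & 0x8000:          # sign-extend from 16 to 32 bits
        offset -= 1 << 16
    return (pc + 4) + offset * 4

print(hex(branch_target(0x00400000, 3)))    # forward 3 instructions
print(hex(branch_target(0x00400010, -5)))   # backward 5 instructions
```

The multiply by 4 is the "shift by two" from above: the hardware stores a word offset, then shifts it left before adding to `PC + 4`.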
- the jump instructions `j, jal` are J-type instructions that take a 26-bit immediate:
- no need to test registers in a branch test
- need a 32-bit address for the PC:
- like the branch instructions, the address is a word address ie. really a 28-bit address
- all PC-related instructions use word-level addressing
- upper 4 bits are taken from the current PC
- this is called pseudodirect addressing
- `jal` sets register `ra` to `PC + 4` as part of its execution:
- this is an example of an implicit operand
- alternatively, `jr` is an R-type instruction that jumps to the 32-bit address in a register:
- allows for a full 32-bit PC jump
- requires a `lw` or `lui, ori` to build the full address up before the `jr`
- but may unnecessarily pollute registers, so `j` is provided as another way to jump
- summary of addressing modes:
- immediate addressing uses an immediate in the instruction
- register addressing uses the address stored in a register
- base addressing adds a register address base with an immediate address offset
- PC-relative addressing offsets the current PC with an immediate address
- pseudodirect addressing concatenates the current PC with an immediate address
- a basic block is a sequence of instructions with:
- no embedded branches (except at the end)
- no branch targets (except at beginning)
- compiler can treat this block in a coarser granularity as a basic unit for optimizations
Translating C if-else:

```c
if (i==j) f = g+h;
else f = g-h;
```

To MIPS:

```asm
bne s3, s4, Else
add s0, s1, s2
j Exit ; assembler will calculate addresses of labels
Else: sub s0, s1, s2
Exit: ...
```
Translating C while:

```c
while (save[i] == k) i+=1;
```

To MIPS:

```asm
; i in s3, k in s5, base address of save in s6
Loop: sll t1, s3, 2 ; array elements are each 4 bytes
add t1, t1, s6
lw t0, 0(t1)
bne t0, s5, Exit
addi s3, s3, 1
j Loop
Exit: ...
```
- other conditional operations:
- `slt rd, rs, rt` will set `rd = 1` if `rs < rt`, else `rd = 0`
- `slti rd, rs, constant` will set `rd = 1` if `rs < constant`, else `rd = 0`
- also `sltu, sltui` for unsigned comparisons
- often used in combination with branch instructions
- `blt, bge` are not real instructions since the hardware would be much slower, penalizing with a slower clock:
- want to instead optimize for the common case
- these are pseudo-instructions that will later be compiled down into real instructions
- procedure call steps:
- place parameters in registers
- transfer control to the procedure
- acquire storage for procedure on stack
- execute procedure code
- place return value in register for caller
- return to address of call
- some register usages:
- `a0-a3` are arguments
- rest of arguments are stored on the stack
- `v0, v1` are result values
- `t0-t9` are temporaries
- `s0-s7` are callee-saved
- `gp` is the global pointer for static data
- `sp` is the stack pointer
- `fp` is the frame pointer
- `ra` is the return address
- procedure call instructions:
- `jal ProcedureLabel` jumps and links:
- address of the following instruction is put in `ra`
- jumps to the target address at the procedure label
- `jr ra` jumps to a register:
- copies `ra` to the program counter (could use another register)
- can also be used for computed jumps eg. `switch` statements
Translating a C function:

```c
int foo(int g, int h, int i, int j) {
  int f;
  f = (g+h) - (i+j);
  return f;
}
```

To MIPS:

```asm
foo:
addi sp, sp, -4 ; push stack
sw s0, 0(sp) ; saving callee-saved register
add t0, a0, a1
add t1, a2, a3
sub s0, t0, t1
add v0, s0, zero ; $zero register is always 0
lw s0, 0(sp)
addi sp, sp, 4 ; pop stack
jr ra
```
Translating a recursive C function:

```c
int fact(int n) {
  if (n < 1) return 1;
  else return n * fact(n-1);
}
```

To MIPS:

```asm
fact:
addi sp, sp, -8
sw ra, 4(sp) ; recursive, need to save return
sw a0, 0(sp) ; save arg
slti t0, a0, 1 ; test n < 1
beq t0, zero, L1
addi v0, zero, 1 ; return 1
addi sp, sp, 8
jr ra
L1: addi a0, a0, -1
jal fact
lw a0, 0(sp) ; restore arg
lw ra, 4(sp) ; restore return
addi sp, sp, 8
mul v0, a0, v0 ; multiply
jr ra
```
- instructions are encoded in binary ie. machine code
- MIPS instructions:
- encoded as 32-bit instructions
- specific format for opcode and register numbers (5-bit)
- supporting different formats complicates decoding, but MIPS keeps formats as similar as possible
- very regular
- MIPS R-type instruction fields:
- 6-bit `op` for opcode:
- specifies the format of the instruction
- always `0` in R-type
- 5-bit `rs` for first source register number
- 5-bit `rt` for second source register number
- 5-bit `rd` for destination register number
- 5-bit `shamt` for shift amount
- 6-bit `funct` for function code (extends opcode):
- specific function to perform when in R-type
- allows the opcode field to be as short as possible, while allowing a specific instruction format to have extra functionality:
- increases decoding complexity while allowing for more expressiveness
- possible since R-types don't use a large immediate field
- used for all arithmetic and logical instructions
- MIPS I-type instruction fields:
- 6-bit `op`
- 5-bit `rs`
- 5-bit `rt`
- 16-bit immediate value
- used for branches, immediates, data transfer
- MIPS J-type instruction fields:
- 6-bit `op`
- 26-bit target address
- used for jump instructions
Representing the instruction `add $t0, $s1, $s2`:

```
000000 10001 10010 01000 00000 100000
; op = 0 (the R-type opcode), rs = s1, rt = s2, rd = t0,
; shamt = 0 (no shift), funct = 32 represents the add function (part of the ISA)
```
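The field packing above can be reproduced with shifts and ORs; a small Python sketch of the R-type layout:

```python
def encode_rtype(op, rs, rt, rd, shamt, funct):
    """Pack R-type fields into 32 bits: 6|5|5|5|5|6 from MSB to LSB."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 : s1 = reg 17, s2 = reg 18, t0 = reg 8, funct 32 = add
word = encode_rtype(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")
```

Decoding is the reverse: shift right and mask each field, with the `op` field checked first to pick the format.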
- segments in memory layout:
- reserved segment
- text segment containing program code:
- instructions themselves lie in memory
- program pointer points somewhere here
- static data segment for global variables eg. static variables, constants, etc.
- global pointer points here
- heap for dynamic data eg. created by `malloc` in C or `new` in Java
- stack and frame pointers point here
\newpage{}
- integer addition:
- addition is done bitwise
- overflow may occur if the result is out of range:
- adding a positive and negative operand never causes overflow
- adding two positive operands overflows if the result sign is 1
- adding two negative operands overflows if the result sign is 0
- integer subtraction:
- comparable to addition, just add the negation of second operand (flip all bits and add 1)
- overflow can again occur:
- subtracting two operands of the same sign never causes overflow
- subtracting positive from negative operand overflows if result sign is 0
- subtracting negative from positive operand overflows if result sign is 1
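The sign-based overflow rules can be checked against a direct range test; a Python sketch using unbounded ints rather than real 32-bit hardware:

```python
def add_overflows(a, b, bits=32):
    """True if signed two's-complement a + b overflows in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return not (lo <= a + b <= hi)

print(add_overflows(2**31 - 1, 1))    # pos + pos wraps negative
print(add_overflows(2**31 - 1, -1))   # mixed signs never overflow
print(add_overflows(-(2**31), -1))    # neg + neg wraps positive
```

Subtraction follows by negating the second operand, matching the rules above with the signs flipped.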
- dealing with overflow:
- some languages such as C ignore overflow
- directly uses MIPS `addu, addiu, subu`
- while other languages like Ada and Fortran raise exceptions
- uses MIPS `add, addi, sub`
- the arithmetic logic unit (ALU) in the CPU handles arithmetic in the computer:
- eg. addition, subtraction, multiplication, division
- must handle overflow as well as floating point numbers
- in the CPU pipeline, ALU is used during instruction execution
- after instruction fetch / decode and operand fetch
- general ALU inputs and outputs:
- two inputs `a` and `b`
- some way to specify a specific ALU operation
- this ALU `op` is interpreted from both the instruction `opcode` and `funct` field by a secondary ALU controller
- four outputs `CarryOut, Zero, Result, Overflow`:
- `Zero` is operation independent and is high whenever the ALU result is zero, typically used for branch conditions
- Figure \ref{1bit} indicates a 1-bit implementation of an ALU:
- contains an `AND` gate, `OR` gate, and a 1-bit adder
- note that all three operations are performed at once, and a multiplexer is used to select one result based on the specified `op` operation
- a 1-bit full adder AKA a (3,2) adder:
- adds two 1-bit inputs
- a half adder would have no `CarryIn` input
From its truth table, the full adder follows the boolean equations:

```
CarryOut = (b & CarryIn) | (a & CarryIn) | (a & b) // sum of products
Sum = (!a & !b & CarryIn) | (!a & b & !CarryIn) |
      (a & !b & !CarryIn) | (a & b & CarryIn)
// Alternatively, use a 3-input XOR (outputs 1 when odd number of input 1s).
// Can be implemented with 2 XORs as: Sum = a XOR b XOR CarryIn
```
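The boolean equations above can be verified exhaustively; a minimal Python model of the (3,2) adder:

```python
def full_adder(a, b, carry_in):
    """Sum of products for carry-out, 3-input XOR for sum (bits are 0/1)."""
    carry_out = (b & carry_in) | (a & carry_in) | (a & b)
    s = a ^ b ^ carry_in
    return s, carry_out

# exhaustively matches the truth table: sum + 2*carry equals the bit count
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, co = full_adder(a, b, c)
            assert s + 2 * co == a + b + c
print("truth table verified")
```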
- we can chain together 1-bit ALUs to create a 32-bit ALU:
- called a ripple carry ALU
- each 1-bit ALU handles one of the places of the computation
- each `CarryOut` gets propagated to the next place's `CarryIn`
- there is an overall delay associated with this propagation
- gives an overall `CarryOut` and 32-bit result
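The chaining can be sketched in Python, including the standard overflow check (carry-in XOR carry-out of the most significant bit):

```python
def ripple_add(a, b, bits=32):
    """Chain 1-bit full adders; each CarryOut feeds the next CarryIn."""
    carry, result, overflow = 0, 0, 0
    for i in range(bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        carry_out = (ai & bi) | (ai & carry) | (bi & carry)
        if i == bits - 1:
            overflow = carry ^ carry_out   # CarryIn[n-1] XOR CarryOut[n-1]
        result |= s << i
        carry = carry_out
    return result, carry, overflow

res, cout, ovf = ripple_add(0x7FFFFFFF, 1)
print(hex(res), ovf)   # pos + pos wrapped negative, so overflow is set
```

The loop itself mirrors the hardware delay: bit $i$ cannot finish until bit $i-1$ has produced its carry.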
- to handle subtraction:
- as in Figure \ref{sub}, a `Binvert` signal can select the flipped bits of input `B`
- but still need to add 1 to properly negate the input, so ALU0 would have a `CarryIn` set to 1 during subtraction
- adding an `Ainvert` signal for inverting `A` as well allows the ALU to perform a `NOR` ie. `~A & ~B`
- to detect overflow in an $n$-bit ALU:
- need to compare the carry-in and carry-out of the most significant bit:

```
Overflow = CarryIn[n-1] XOR CarryOut[n-1]
```
- to detect zero:

```
Zero = !(Res_0 + Res_1 + ... + Res_{n-1})
```

- use one large `NOR` gate
- to implement the `slt` or set-on-less-than instruction as in Figure \ref{slt}:
- need to produce a 1 in `rd` if `rs < rt`, else 0
- all but the least significant bit of the result in a `slt` is always 0
- to compare registers, can use subtraction through `rs - rt < 0`
- add additional input `Less` and output `Set`:
- `Less` acts as a passthrough that can be selected as another operation
- `Set` appears as an output only on the most significant bit and gives the output of the adder
- ie. after the subtraction, `Set` will hold the sign bit of the result
- `Set` in the MSB ALU is then connected back to the `Less` of the LSB ALU:
- whereas all the other `Less` inputs are hardwired to zero
- the flexibility of the ALU to perform various operations requires every bit to be specified
- thus only the least significant bit in `slt` will vary depending on the result of `rs - rt`
- note this leads to a long delay before the `slt` result becomes stable
- exposes a race condition when pipelining
- another possibility is to just perform a subtraction and defer the `slt`:
- would require additional instruction logic like `beq, bne`
- thus the full 32-bit ALU can be implemented as in Figure \ref{final}:
- `Bnegate` hooks up `Binvert` to automatically set the `CarryIn` of ALU0 to high in order to perform negation
- can still perform a `NOR`, since setting the carry-in is a no-op when not using the adder's output
- altogether, can perform the operations `add, sub, AND, OR, NOR, slt`
- the adder has the longest delay in the ALU:
- particularly, want to examine the carry chain in the adder:
- each `CarryOut` calculation must go through 2 gate delays (sum of products calculation)
- leads to a delay of $2n$
- how can we lessen this delay?
- timing analysis of a normal ripple-carry adder:
- `C1` has delay $5T$ due to the sum of products calculation with a triple-OR:
- to actually perform the sum at bit 1 requires two more `XOR` calculations, one of which uses `C1`, so calculating the sum requires an additional $2T$
- the other `A XOR B` calculation can be performed earlier, since it only has delay $2T$
- thus the sum at bit 1 has delay $7T$
- `C2` has delay $10T$, with the same structure as `C1`
- `C3` has delay $15T$
- etc.
- assuming gate delays are proportional to their fan-in
| A | B | CarryOut | Type |
|---|---|----------|------|
| 0 | 0 | 0 | kill |
| 0 | 1 | CarryIn | propagate |
| 1 | 0 | CarryIn | propagate |
| 1 | 1 | 1 | generate |

Table: Carry Look Ahead Logic \label{cla}
- in a carry look ahead adder, generate the `CarryOut` ahead of time:
- use logic on the side ie. trade area for parallel performance
- eg. as seen in Table \ref{cla}:
- when `A, B` are both 0, regardless of what the carry-in is, a carry-out cannot be generated
- similarly, when `A, B` are both 1, a carry-out will always be generated
- when `A, B` are different, then the carry-in will be propagated
- thus we can calculate the carry-out without considering the immediate carry-in in some scenarios
- augment each adder with two signals `G = A & B` and `P = A XOR B`:
- the `CarryIn` of one adder is no longer connected to the `CarryOut` of the next
- there is no more delay between the adders, since each `CarryOut` can be calculated purely in terms of `G, P`, which depend only on `A, B` and the earliest carry-in
- pros:
- all `G, P` can be calculated in parallel, and then all carry-outs can be calculated in parallel
- cons:
- the growing fan-in of the sum of products calculation seen below can become quickly expensive area-wise
Carry-out calculations in a look ahead adder:

    C1 = G0 + C0*P0              // generate at 0, or propagate earliest carry-in
    C2 = G1 + G0*P1 + C0*P0*P1   // generate at 1, or propagate at 1 AND generate at 0,
                                 // or propagate at 1 AND propagate earliest carry-in
    C3 = G2 + G1*P2 + G0*P1*P2 + C0*P0*P1*P2
    // etc.
    // Note how C0 is the only carry-in that appears in calculations.
    // Otherwise, the calculations rely solely on G, P.
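The equations above can be checked mechanically. The following is a small Python sketch (my own model, not from the notes) that computes the 4-bit carries both serially, as a ripple-carry chain would, and from the lookahead equations written only in terms of `G`, `P`, and `C0`:

```python
# Sketch: checking the 4-bit carry lookahead equations against a plain
# ripple-carry chain, bit by bit.

def ripple_carries(a, b, c0):
    """Carry into each bit position, computed serially."""
    carries = [c0]
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carries.append((ai & bi) | ((ai ^ bi) & carries[-1]))
    return carries  # [C0, C1, C2, C3, C4]

def lookahead_carries(a, b, c0):
    """Same carries, but each Ci written only in terms of G, P, and C0."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]  # propagate
    c1 = g[0] | (c0 & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])
    c3 = g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2]) | (c0 & p[0] & p[1] & p[2])
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3]) | (c0 & p[0] & p[1] & p[2] & p[3]))
    return [c0, c1, c2, c3, c4]

# exhaustive check over all 4-bit operands and both carry-ins
assert all(ripple_carries(a, b, c) == lookahead_carries(a, b, c)
           for a in range(16) for b in range(16) for c in (0, 1))
```

The hardware tradeoff is visible in the code: `c4` is a 5-way OR over AND terms of growing fan-in, which is exactly the area cost noted above.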
- timing analysis of a flat CLA adder:
    - every `G, P` has delay $2T$, since they only depend on `A, B`
    - `C1` has delay $6T$, since it sums up `G0` and `C0*P0`, which have delay $2T$ and $4T$ respectively, and the sum at bit 1 has delay $8T$
    - `C2` has delay $8T$, since it has a triple-OR with triple-AND along the slowest path, and the sum at bit 2 has delay $10T$
    - `C3` has delay $10T$, and the sum at bit 3 has delay $12T$
        - vs. in ripple carry, this sum requires $15T + 2T = 17T$
    - etc.
    - assuming gate delays are proportional to their fan-in
        - in reality, a larger fan-in incurs a nonlinear increase in delay
- another way to chain together lookahead adders is using hierarchical CLA (HCLA):
    - create more layers of generate and propagate values `G_a, P_a`:
        - eg. to get the overall `G_a, P_a` for a larger granularity unit of four 1-bit adders, we can use the following equations:
            - `G_a = G0*P1*P2*P3 + G1*P2*P3 + G2*P3 + G3` ie. generate at one of the adders, and propagate through the rest
            - `P_a = P0*P1*P2*P3` ie. propagate through all adders
            - note that `G_a, P_a` are completely independent of carry-ins
        - then we can calculate a higher level of carry-ins that feed back into the units of four 1-bit adders:
            - eg. can calculate `C4, C8, C12, ...` from `G_a, P_a, G_b, P_b, ...`
            - eg. `C4 = G_a + C0*P_a` and `C8 = G_b + G_a*P_b + C0*P_a*P_b`, etc.
            - note that at this higher level, the delays of `G_a, G_b, ...` are all the same and the delays of `P_a, P_b, ...` are all the same, similarly to the `G, P` delays at lower levels
        - thus we can add another hierarchy of carry lookaheads
            - identical logic to a normal CLA, but using a different layer of `G, P`
    - pros:
        - helps mitigate the expensive area cost of the normal CLA fan-in growth
Pass 1 | Pass 2 | Pass 3 | Pass 4 | Pass 5 |
---|---|---|---|---|
Table: Granularity of Dependencies for a HCLA with 4-bit Units
- can also use a partial CLA:
    - multiple lookahead adders can be connected directly together, carry-out to carry-in:
        - eg. four 8-bit carry lookahead adders can form a 32-bit partial CLA
        - ie. a rippling of lookahead adders
    - pros:
        - simpler design with less area
    - cons:
        - not as fast as hierarchical CLA
            - since they are rippled together, there is a dependence between each CLA that causes additional delay
- in a carry select adder, trade even more area for parallel performance:
    - calculate adder results for both possible carry-ins, 1 and 0
        - twice as many ALUs used for redundant calculations
    - then, select one of the results depending on the actual carry-in
        - the delay of the multiplexer is relatively small
    - more useful when performed in conjunction with HCLA or partial CLA, and/or at the end of a sum chain
- timing analysis of a CSA, given the delay of a gate is proportional to its fan-in:
    - have to use MUXs to choose between the carry-in calculations:
        - need to choose the correct carry-out and correct sum
        - the sum calculation will usually dominate the carry-out calculation
    - however the main carry-in will have a delay of 0 since both possibilities are calculated
        - this delay reduction is emphasized when CSA is used at the end of an addition chain
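The select idea can be sketched in a few lines. This is an illustrative Python model (not the lecture's circuit): a 16-bit adder built from two 8-bit halves, where the upper half is computed for both possible carry-ins in parallel and the lower half's real carry-out drives the selecting MUX:

```python
# Sketch: a 16-bit carry-select adder from two 8-bit halves. The upper half
# is computed for BOTH possible carry-ins; the lower half's real carry-out
# then selects between the two precomputed results.

def add8(a, b, cin):
    """8-bit add returning (sum, carry_out)."""
    total = a + b + cin
    return total & 0xFF, total >> 8

def carry_select_add16(a, b, cin=0):
    lo_a, hi_a = a & 0xFF, (a >> 8) & 0xFF
    lo_b, hi_b = b & 0xFF, (b >> 8) & 0xFF
    lo_sum, lo_carry = add8(lo_a, lo_b, cin)   # lower half, real carry-in
    hi_if0 = add8(hi_a, hi_b, 0)               # upper half assuming carry 0
    hi_if1 = add8(hi_a, hi_b, 1)               # upper half assuming carry 1
    hi_sum, carry_out = hi_if1 if lo_carry else hi_if0  # the "select" MUX
    return (hi_sum << 8) | lo_sum, carry_out

assert carry_select_add16(0x00FF, 0x0001) == (0x0100, 0)
```

Both `add8` calls for the upper half can run concurrently with the lower half, so the upper half no longer waits on the lower carry; only the final MUX delay is serial.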
- starting with the long multiplication approach for binary:
    - analogous to long multiplication in base-10 ie. follows a sequence of adds and shifts
    - the length of the product is the sum of the operand lengths
        - eg. a 32-bit multiplier would use a 64-bit multiplicand and have a 64-bit product
- datapath for Figure \ref{mult64}, using a 64-bit ALU:
    - product is initially zero
    - test the least significant bit of the multiplier
        - if 1, add the multiplicand to the product and place the result in the `Product` register
    - shift the `Multiplicand` register left by 1 bit
    - shift the `Multiplier` register right by 1 bit
        - gets the new least significant bit
    - if this is the 32nd repetition, we are done, otherwise continue to loop
- can we optimize the ALU down from a 64-bit ALU to a 32-bit one?
    - would save latency and power on the ALU
    - yes, as seen in Figure \ref{mult32}, use a 32-bit ALU and fix the multiplicand at 32 bits, instead of shifting it to the left:
        - instead, equivalently shift the product register (which remains at 64 bits) to the right
        - ie. add into the upper bits of the product, and then shift that to the right
    - in addition, remove the multiplier shifting and load it directly into the lower 32 bits of the product:
        - while still zeroing out the upper bits of the product
        - thus as we shift the product register right, we strip off the individual bits of the multiplier on the right side of the product and use them in the control test
    - the control test thus handles:
        - whether we write to the register (if the stripped bit is 1)
        - performing a right-shift of the product every iteration
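The refined loop can be modeled directly. This Python sketch (my own model of the Figure \ref{mult32} datapath, not the figure itself) keeps a 64-bit product register whose low half initially holds the multiplier, and on each of 32 iterations conditionally adds the fixed multiplicand into the upper half before shifting right:

```python
# Sketch of the refined multiply datapath: a 64-bit product register whose
# low half starts as the multiplier, a fixed 32-bit multiplicand, and 32
# iterations of "maybe add into the upper half, then shift right".

def mips_multiply(multiplicand, multiplier):
    product = multiplier & 0xFFFFFFFF   # multiplier loaded into low 32 bits
    for _ in range(32):
        if product & 1:                 # control test on the stripped-off bit
            product += (multiplicand & 0xFFFFFFFF) << 32  # add into upper half
        product >>= 1                   # right-shift the product every iteration
    return product                      # 64-bit unsigned result (HI:LO)

assert mips_multiply(6, 7) == 42
assert mips_multiply(0xFFFFFFFF, 0xFFFFFFFF) == 0xFFFFFFFF * 0xFFFFFFFF
```

Note this models unsigned multiplication; Python's arbitrary-precision integers stand in for the 33-bit carry handling the real adder would need.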
- MIPS multiplication details:
    - there are two 32-bit registers for the product, `HI, LO` respectively
    - `mult rs, rt` and `multu rs, rt` build a 64-bit product in `HI/LO`
    - `mfhi rd` and `mflo rd` move from `HI/LO` to `rd`
        - can test `HI` to see if the product overflows 32 bits
    - `mul rd, rs, rt` sets `rd` to the least-significant 32 bits of the product
- different approaches to signed multiplication:
    - make both operands positive, and complement the final product if necessary
        - need to sign-extend partial products and subtract at the end
    - using Booth's algorithm
        - uses the same hardware as before and also saves cycles
- motivation behind Booth's algorithm:
    - want to replace some of the additions in multiplication with subtractions
    - eg. can compose a run of adds `14 = 2 + 4 + 8` as `14 = -2 + 16`:
        - ie. shifting the highest number by 1 and then subtracting the lowest number
        - reduces multiple additions into a single add and a subtract
- ex. To calculate `0b0010 * 0b0110`:
    - would normally add `0b0100` and `0b1000` (after shifts)
    - in Booth's algorithm, would instead first subtract `0b0100` and then add `0b010000` in an extra shift
    - in other examples, would end up saving more adds
        - eg. with a longer run of 1s in the multiplier
- Booth's algorithm in full:
    - in addition to keeping track of the current least significant bit, also need to track the "bit to the right" ie. the previous bit from the last iteration
    - different sequences and their meanings:
        - `10` indicates beginning a run of 1s, so subtract
        - `11` indicates middle of the run, so don't add or subtract
        - `01` indicates end of the run, so add
            - this is already one order of magnitude past where the original multiplicand would have been, so it already accounts for the extra shift
        - `00` indicates middle of a run of 0s, so don't add or subtract
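The four bit-pair cases translate directly into code. This is a hedged Python sketch (unsigned multiplier, 4-bit width for readability; a hardware version would work in two's complement on the product register):

```python
# Sketch of Booth's algorithm: scan (current, previous) multiplier bit
# pairs and turn each run of 1s into one subtract at its start and one add
# just past its end.

def booth_multiply(multiplicand, multiplier, bits=4):
    product = 0
    prev = 0                                  # the "bit to the right"
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if (cur, prev) == (1, 0):             # 10: beginning of a run -> subtract
            product -= multiplicand << i
        elif (cur, prev) == (0, 1):           # 01: end of a run -> add
            product += multiplicand << i
        # 11 and 00: middle of a run -> do nothing
        prev = cur
    if prev == 1:                             # a run extends past the top bit
        product += multiplicand << bits
    return product

assert booth_multiply(0b0010, 0b0110) == 12   # the worked example: 2 * 6
assert booth_multiply(3, 14) == 42            # uses 14 = -2 + 16
```

For `0b0110` the pairs are `00`, `10`, `11`, `01`, producing exactly one subtract of `0b0100` and one add of `0b010000`, matching the example above.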
- IEEE floating point format:
    - sign bit (1 bit)
    - exponent (8 bits single, 11 bits double)
        - a bias ensures the exponent field is unsigned (127 single, 1023 double)
    - fraction (23 bits single, 52 bits double)
        - normalized significand has a leading hidden bit ($1 + F$)
    - $x = (-1)^S \times (1 + F) \times 2^{E-B}$
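The field layout can be inspected with the standard library. A short Python sketch pulling apart the single-precision fields to check the $S$ / $E$ / $F$ decomposition above:

```python
# Sketch: extracting the sign, biased exponent, and fraction fields of a
# 32-bit float via its raw bit pattern.
import struct

def fp32_fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits; the hidden leading 1 is not stored
    return sign, exponent, fraction

s, e, f = fp32_fields(-0.75)         # -0.75 = (-1)^1 * 1.5 * 2^-1
assert (s, e, f) == (1, 126, 0x400000)   # exponent -1 + bias 127 = 126
```

Reapplying the formula recovers the value: $(-1)^1 \times (1 + 0.5) \times 2^{126-127} = -0.75$.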
- the format complicates addition:
    - 3 discrete parts
    - must repackage the result after addition
- ex. $9.999{\times}10^1 + 1.610{\times}10^{-1}$
    - must align decimal points ie. shift the number with the smaller exponent
        - $9.999{\times}10^1 + 0.016{\times}10^1$
    - add significands
        - $10.015{\times}10^1$
    - normalize the result and check for over/underflow:
        - assuming we can only have one digit followed by three digits of precision
        - $1.0015{\times}10^2$
    - round and renormalize if necessary
        - $1.002{\times}10^2$
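The four steps above can be sketched as a small decimal toy (mirroring the worked example, not real binary FP hardware; a full implementation would also renormalize again if rounding overflows the significand):

```python
# Sketch: align -> add significands -> normalize -> round, for decimal
# significand/exponent pairs with an assumed four significant digits.
from decimal import Decimal

def fp_add(sig_a, exp_a, sig_b, exp_b, digits=4):
    # 1. align: shift the number with the smaller exponent right
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b = sig_b / Decimal(10) ** (exp_a - exp_b)
    # 2. add significands
    sig, exp = sig_a + sig_b, exp_a
    # 3. normalize so the significand is in [1, 10)
    while sig >= 10:
        sig, exp = sig / 10, exp + 1
    # 4. round to the available precision
    sig = round(sig, digits - 1)
    return sig, exp

# 9.999 x 10^1 + 1.610 x 10^-1  ->  1.002 x 10^2
assert fp_add(Decimal("9.999"), 1, Decimal("1.610"), -1) == (Decimal("1.002"), 2)
```

The intermediate values match the example: aligning gives 0.0161, adding gives 10.0151, normalizing gives 1.00151 × 10², and rounding gives 1.002 × 10².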
\newpage{}
- multiple ways to optimize processor design:
    - focus on CPI ie. a single-cycle implementation
    - focus on cycle time ie. a pipelined implementation
- logic design basics:
    - information is encoded in binary
        - one wire per bit
        - multi-bit data is encoded on multiple wires
    - in a combinational element, output is a function of input:
        - ie. operates on data
        - note that there is always some output from the element, but it may not stabilize until a certain point in the clock cycle
    - state ie. sequential elements store information:
        - eg. a register or flip-flop
        - uses an edge-triggered clock signal to determine when to update the stored value
        - may have an additional write control input that updates only when write-enable is high
    - combinational logic transforms data during clock cycles:
        - ie. between clock edges
        - input flows from a state element, through a combinational element, to another state element
- instruction execution steps:
    - fetch the instruction from where the program counter points in instruction memory
    - decode the instruction and read values from the register file based on the register operands
    - depending on the instruction class:
        - use the ALU to calculate arithmetic results, memory addresses for loads / stores, branch target addresses, etc.
        - access data memory for loads / stores
        - set the PC to the target address or increment to `PC + 4`
- components of the processor design in Figure \ref{proc}:
    - datapath components:
        - PC
        - instruction memory with an address input and instruction output
        - register component with 3 register numbers as input and several outputs
        - ALU with two numeric inputs and an output
            - input may come from the register file or directly from instruction memory as an immediate
        - data memory with an address and data input
            - may perform either a read or write
        - data flows along the datapath
    - control path components:
        - various MUXs and operation controls
        - controls where the data goes and what operations are performed
- the datapath as seen in Figure \ref{proc} is composed of the elements that process data and addresses in the CPU
    - eg. registers, ALUs, memories, etc.
- instruction fetch components:
    - the PC is a 32-bit register pointing to the next instruction to execute
        - the PC is fed to an adder that increments it by 4 for the next instruction, unless a branch otherwise changes the PC
    - the instruction memory component:
        - takes the 32-bit address on its address port and produces the corresponding instruction within the same cycle
        - ie. it is an unclocked component
    - these operations all occur at once
- the register component:
    - takes in 3 5-bit register numbers
    - two register numbers refer to registers whose values are produced on the two outputs
    - the last register number refers to a writable register:
        - depending on the `RegWrite` control input, analogous to the write-enable signal on flip-flops, the value at the write-data port is written to the register
        - the write-data port is MUXed by the `MemtoReg` control between a value from data memory or an ALU result
- the ALU component is the same as previously discussed in Figure \ref{final}:
    - controlled by a 4-bit operation control signal
    - note that one input is MUXed between either a register value or an immediate that has been sign-extended
- the data memory component contains:
    - two inputs:
        - an address port to specify a location in memory
        - a write data port to optionally specify data to write into memory
    - one optional read data output port
    - a `MemWrite` control to specify whether to write the value on the write data port into memory
    - a `MemRead` control to specify whether to pull a value out of memory onto the read data port
        - note that we did not use a control signal to read registers since they will be read in almost every operation
        - whereas main memory is not read in every instruction
- handling branch instructions eg. `beq`:
    - read register operands
    - compare operands with the ALU and check the zero output to see if they are equal
    - calculate the target address:
        - if jumping, sign-extend the displacement, shift left 2 places, and add to `PC + 4`
        - MUXed by the `Branch` control
    - note that we need additional components to adjust the instruction offset:
        - a sign-extend component that replicates the sign-bit wire
        - a shift-left-2 component that reroutes wires
- the datapath in Figure \ref{proc} executes a single instruction in one clock cycle:
    - each datapath element can only do one function at a time
        - thus instruction and data memory are separate, so we can read an instruction and work with data memory in the same cycle
    - note that it is fine to read and write from the same source in one cycle:
        - the write occurs only on the rising clock edge, after a value has been read out
        - eg. reading and writing the same register, updating the PC, etc.
    - need to use multiplexers to handle alternate data sources:
        - data flows no matter what through the datapath, so we need to use control to indicate which parts are useful
        - eg. immediates vs. registers into the ALU
    - in this single-cycle implementation CPI is constant, but the cycle time will be quite high since especially long instructions eg. `lw, sw` must complete in a single cycle
- the ALU control specifies its different functions:
    - the baseline 6 MIPS functions are `AND`, `OR`, add, subtract, `slt`, and `NOR` (all R-types)
        - each specified by a different ALU control code
    - for loads and stores, the ALU needs to function as an add for effective address computation
        - note that this hardwired addition operation is set by a different `ALUOp` control than the other possible R-type addition, since load and store instructions do not have space to specify a `funct` field
    - for branch, the ALU needs to function as a subtraction to compare register values
    - for other R-types, the function depends on the `funct` field of the instruction
    - Table \ref{ops} indicates how the ALU control codes may be extracted from an opcode
        - specifically, controlled by a 2-bit `ALUOp` derived from the instruction opcode
Opcode | `ALUOp` | Operation | `funct` | ALU Function | ALU Control |
---|---|---|---|---|---|
`lw` | `00` | load word | `XXXXXX` | add | `0010` |
`sw` | `00` | store word | `XXXXXX` | add | `0010` |
`beq` | `01` | branch equal | `XXXXXX` | subtract | `0110` |
R-type | `10` | add | `100000` | add | `0010` |
R-type | `10` | subtract | `100010` | subtract | `0110` |
R-type | `10` | AND | `100100` | AND | `0000` |
R-type | `10` | OR | `100101` | OR | `0001` |
R-type | `10` | slt | `101010` | slt | `0111` |

Table: ALU Control Codes Based on Opcode and Operation \label{ops}
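The two-level decode above (opcode to `ALUOp`, then `ALUOp` plus `funct` to the 4-bit code) can be sketched as a table-driven Python function, with encodings taken from the table:

```python
# Sketch of the ALU controller: combine the 2-bit ALUOp with the funct
# field to produce the 4-bit ALU control code.

FUNCT_TO_ALU = {           # R-type funct field -> ALU control code
    0b100000: 0b0010,      # add
    0b100010: 0b0110,      # subtract
    0b100100: 0b0000,      # AND
    0b100101: 0b0001,      # OR
    0b101010: 0b0111,      # slt
}

def alu_control(alu_op, funct=None):
    if alu_op == 0b00:     # lw / sw: hardwired add for address computation
        return 0b0010
    if alu_op == 0b01:     # beq: subtract to compare register values
        return 0b0110
    return FUNCT_TO_ALU[funct]   # R-type: consult the funct field

assert alu_control(0b00) == 0b0010            # lw, sw ignore funct
assert alu_control(0b01) == 0b0110            # beq
assert alu_control(0b10, 0b101010) == 0b0111  # slt
```

Note how `funct` is only consulted when `ALUOp` is `10`, mirroring the don't-care `XXXXXX` entries in the table.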
- the main control unit must generate control signals based on three main types of instructions:
    - R-type instructions that specify three registers, a `funct` field, etc.
        - the third register `rd` will be written to
    - load / store instructions that specify two registers and an address immediate:
        - for loads, the second register `rt` will be written to
        - note that we want a control signal to specify whether data memory should be read into a register:
            - instead of always reading on every cycle and polluting our memory cache
            - note that this is not needed for the register file due to its speed and the fact that registers are read in almost every operation
    - branch instructions that specify two registers and an address immediate
    - the first register `rs` is always read in all three types
- main control unit signals:
    - `RegDst` specifies whether the second or third register should be written to as a destination through `RegWrite`
    - `Branch` specifies whether the PC will branch
    - `MemRead` determines whether memory is being read
    - `MemtoReg` specifies whether the memory or ALU output is sent to the write register
        - in this datapath, `MemtoReg` should only be high when `MemRead` is high
    - `ALUOp` specifies the operation for the ALU controller to use
        - if R-type, the ALU controller needs to check the `funct` field as well
    - `MemWrite` determines whether memory is written to
        - `MemRead` and `MemWrite` should not be high at the same time in a single-cycle datapath
    - `ALUSrc` specifies whether a sign-extended immediate or register value is fed into the ALU
    - `RegWrite` determines whether the write register specified by `RegDst` is written to
- control signals for the R-type instruction:
    - `RegDst` should be set to 1 to choose the third register as the write destination
    - `Branch` should be 0
    - `MemRead` should be 0
        - R-type instructions do not access memory
    - `MemtoReg` should be 0 to send the ALU instead of memory output to the write register
    - `ALUOp` should be `0b10` to specify that the ALU operation depends on the `funct` field
    - `MemWrite` should be 0
    - `ALUSrc` should be 0 since we want to use the second register value in the ALU
    - `RegWrite` should be 1 since we want to write to the register file
- control signals for the load instruction:
    - `RegDst` should be set to 0 to choose the second register as the write destination
    - `Branch` should be 0
    - `MemRead` should be 1
    - `MemtoReg` should be 1 to send the memory output to the write register
    - `ALUOp` should be hardwired to `0b00` to specify an addition
    - `MemWrite` should be 0
    - `ALUSrc` should be 1 since we want to use the sign-extended immediate to build up an effective address
    - `RegWrite` should be 1 since we want to write to the register file
- control signals for the branch instruction:
    - `RegDst` is a don't-care since we are not writing to registers
    - `Branch` should be 1
    - `MemRead` should be 0
    - `MemtoReg` is a don't-care since we are not writing to registers
    - `ALUOp` should be `0b01` to specify a subtraction
    - `MemWrite` should be 0
    - `ALUSrc` should be 0 since we want to compare register values
    - `RegWrite` should be 0 since we do not write to the register file
Signal | R-type | `lw` | `sw` | `beq` | `j` |
---|---|---|---|---|---|
Opcode | `000000` | `100011` | `101011` | `000100` | `000010` |
`RegDst` | 1 | 0 | X | X | X |
`ALUSrc` | 0 | 1 | 1 | 0 | X |
`MemtoReg` | 0 | 1 | X | X | X |
`RegWrite` | 1 | 1 | 0 | 0 | 0 |
`MemRead` | 0 | 1 | 0 | 0 | 0 |
`MemWrite` | 0 | 0 | 1 | 0 | 0 |
`Branch` | 0 | 0 | 0 | 1 | 0 |
`ALUOp1` | 1 | 0 | 0 | 0 | X |
`ALUOp2` | 0 | 0 | 0 | 1 | X |
`Jump` | 0 | 0 | 0 | 0 | 1 |

Table: Main Controller Outputs
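Since the main control is purely combinational on the opcode, it can be sketched as a lookup table in Python, with signal values copied from the table above (don't-cares shown as `None`):

```python
# Sketch of the main control unit as a lookup on the 6-bit opcode.

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp", "Jump")

CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, 0b10, 0),           # R-type
    0b100011: (0, 1, 1, 1, 1, 0, 0, 0b00, 0),           # lw
    0b101011: (None, 1, None, 0, 0, 1, 0, 0b00, 0),     # sw
    0b000100: (None, 0, None, 0, 0, 0, 1, 0b01, 0),     # beq
    0b000010: (None, None, None, 0, 0, 0, 0, None, 1),  # j
}

def main_control(opcode):
    return dict(zip(SIGNALS, CONTROL[opcode]))

lw = main_control(0b100011)
assert lw["MemRead"] == 1 and lw["MemtoReg"] == 1 and lw["RegWrite"] == 1
```

The `None` entries are where a real implementation is free to output either value, which is exactly what lets logic minimization shrink the decoder.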
- to implement jumps:
    - the jump instruction has a 26-bit address immediate
    - to generate the new PC, need to concatenate the top 4 bits of the old PC with the left-shifted immediate
    - requires an extra control signal `Jump` decoded from the opcode
        - feeds into an additional cascaded MUX with `Branch` to determine the next PC value
- considerations on this single-cycle implementation:
    - pros:
        - relatively simple design
        - CPI is 1
    - cons:
        - cycle time must be long enough for every instruction to complete
            - large disparity between the amount of time different instruction types will take to execute
        - eg. branches only require instruction fetch, register access, and the ALU:
            - while loads require instruction fetch, register access, the ALU, memory access, and another register access for write back
            - loads / stores have to go through the expensive memory hierarchy
            - more disparity as well with slower floating point operations
    - pipelined implementations are the more modern alternative that address this instruction time disparity
Ex. Implement the I-type instruction `law` with the following pseudocode:

    R[rt] = M[R[rs]] + SE(i)

- some elements of the `law` implementation already exist:
    - writing the output of the ALU to register `rt` with `RegDst` and `MemtoReg` both set low
    - bringing `SE(i)` into the second ALU port with `ALUSrc` set high
- need to address:
    - having the memory read address specified entirely by a single register
        - note that we cannot simply add the address at `R[rs]` with zero to generate the address, since we need the ALU in a single-cycle datapath to perform the later addition with the immediate
    - sending data memory back to the ALU
    - also have to ensure for every new instruction implementation that the new opcode for the instruction is unique
Control | | New Control | |
---|---|---|---|
`RegDst` | 0 | `LawC` | 1 |
`ALUSrc` | 1 | | |
`ALUOp` | `00` | | |
`MemtoReg` | 0 | | |
`RegWrite` | 1 | | |
`MemRead` | 1 | | |
`MemWrite` | 0 | | |
`Branch` | 0 | | |
`Jump` | 0 | | |

Table: Main Control Signals for `law`
- thus we need a new signal `LawC` to control two new MUXed modifications as seen in Figure \ref{law}:
    - the `LawC` control is the lowest output from the main control unit
    - short-circuit the data memory address with the contents of register `rs`
    - send the read data memory port into the top ALU port
        - note that this MUX must read out the data memory before the `MemtoReg` MUX, since the output of the ALU will eventually be sent through that MUX to write into register `rt`, as intended
Ex. Implement the I-type instruction `blt` with the following pseudocode:

    if (R[rs] < R[rt])
        PC = PC + 4 + SES(i)
    else
        PC = PC + 4

- very similar to the existing logic for the `beq` instruction:
    - `rs` and `rt` are already compared at the ALU in a `beq`
    - can also reuse the existing calculation of `PC + 4 + SES(i)`
Control | | New / Updated Control | |
---|---|---|---|
`RegDst` | X | `BranchLt` | 1 |
`ALUSrc` | 0 | `ALUOp` | `11` |
`MemtoReg` | X | | |
`RegWrite` | 0 | | |
`MemRead` | 0 | | |
`MemWrite` | 0 | | |
`Branch` | X | | |
`Jump` | 0 | | |

Table: Main Control Signals for `blt`
- need a new mechanism to perform the `slt` without specifying the `funct` field:
    - the R-type `slt` cannot be used because there is no space for the `funct` field in an I-type instruction
        - unless we hardwire the `funct` field using a MUX to perform an `slt` when a `blt` is executed
- augment the main control unit with a new signal `BranchLt` as seen in Figure \ref{blt}:
    - we can check `R[rs] < R[rt]` by passing a new `ALUOp` value of `11` to the ALU controller
        - also program the ALU controller to tell the ALU to perform an `slt` when passed `ALUOp = 11`
    - finally, the lowest bit of the output of the ALU can be ANDed with the `BranchLt` signal and then ORed with the other ANDed branch calculation:
        - alternatively, we can add another MUX controlled by `BranchLt`
        - alternatively, we can use the negated zero output to check the less-than condition
    - note that the `Branch` signal is a don't-care since it will be overridden by the OR gate
Ex. Implement the R-type instruction `cmov` with the following pseudocode:

    if (R[rt] != 0)
        R[rd] = R[rs]

- existing mechanisms:
    - can write to `rd` with `RegDst` set high
- need to address:
    - checking that `R[rt]` is nonzero
    - conditionally writing a register value `R[rs]` back to the write data port ie. `RegWrite` needs to be toggleable
- importantly, note that all R-type instructions will have the same main control signals:
    - since they have the same opcode
    - eg. `ALUSrc` will already be 0 in order to send `R[rt]` to the ALU, and `RegDst` will already be 1 to write to `rd`
    - instead, we can only take advantage of the `funct` field to change the ALU control behavior
        - the `funct` field should be the only unique difference between R-types
Control | | ALU Control | ALU Function | Code |
---|---|---|---|---|
`RegDst` | 1 | `cmov` function field | subtract | `0110` |
`ALUSrc` | 0 | | | |
`ALUOp` | `10` | | | |
`MemtoReg` | 0 | | | |
`RegWrite` | 1 | | | |
`MemRead` | 0 | | | |
`MemWrite` | 0 | | | |
`Branch` | 0 | | | |
`Jump` | 0 | | | |

Table: Main Control and ALU Control Signals for `cmov`
- based on some specific, unique `funct` field in the `cmov` instruction:
    - augment the ALU control to choose an ALU operation of subtraction
        - could alternatively perform an `AND`
    - augment the ALU control with a new outgoing signal `CmovC`
        - should be high whenever the `funct` field indicates a `cmov`
- using the new `CmovC` control signal as seen in Figure \ref{cmov}:
    - `CmovC` controls a MUX that feeds either the register contents `R[rs]` or the ALU output (no memory because of the R-type restriction) back into the write data
    - `CmovC` also controls a MUX that feeds either 0 or `R[rs]` into the top ALU port
        - allows us to test `R[rt] - 0 == 0`
    - finally, we additionally need a mechanism to control `RegWrite`, though it is fixed for an R-type:
        - negate the `Zero` output after the subtraction to see if we need to perform the move
        - use `CmovC` to MUX the `RegWrite` signal with the negated zero so that the write occurs only if `R[rt]` is nonzero, instead of depending solely on `RegWrite`
Ex. Implement the I-type instruction `jalm` with the following pseudocode:

    PC = PC + 4 + M[R[rs]]
    R[r31] = PC + 4

- note that the immediate field is not used for `jalm`, even though it is an I-type instruction
- need to address:
    - adding a memory read `M[R[rs]]` to `PC + 4`
        - already have an adder that is used in `beq` for `PC + 4 + SES(i)`, can we use this?
    - writing the original PC value to a hardwired register `r31`
Control | | New Control | |
---|---|---|---|
`RegDst` | X | `JalmC` | 1 |
`ALUSrc` | X | | |
`ALUOp` | `XX` | | |
`MemtoReg` | X | | |
`RegWrite` | 1 | | |
`MemRead` | 1 | | |
`MemWrite` | 0 | | |
`Branch` | X | | |
`Jump` | 0 | | |

Table: Main Control Signals for `jalm`
- using the new `JalmC` control signal as seen in Figure \ref{jalm} to control several different `jalm`-specific MUXs:
    - a MUX that sets the writing register destination to a hardwired `r31` when enabled
    - a MUX that passes `R[rs]` past the ALU directly into the memory read address port
    - a MUX that sets the write data port of the registers component to the already computed address `PC + 4` when enabled
    - a MUX that sets the lower input of an auxiliary adder to the read data from the memory component:
        - there are multiple ALU adders available, so we are exploiting the adder already wired to perform `PC + 4 + SES(i)` from `beq`
        - feeding the read data from memory into the adder computes the desired `PC + 4 + M[R[rs]]`
    - finally, `JalmC` is also ORed with the previous ANDed branch calculation:
        - thus the `Branch` signal is a don't-care since it is overridden by this gate
        - again we are reusing some `beq` branching logic
\newpage{}
- issues with the single-cycle datapath:
    - the longest delay determines the clock period
        - the critical path in this case is the load instruction, which must go through every component in the datapath
    - not feasible to vary the period for different instructions
    - violates the design principle of making the common case fast (the common case is usually not memory loads)
    - instead, use pipelining to improve performance
- pipelining is a type of parallelism where we overlap execution:
    - improves performance by overlapping discrete steps
    - eg. for some process that takes four discrete steps in a single load, with each step taking half an hour:
        - the naive approach would be to just do loads front to back, which takes 8 hours to complete four loads
        - in the parallel approach, start the second load as soon as the first load goes into the second discrete step, etc.
            - would only take 3.5 hours to complete four loads, which is a 2.3 times speedup
    - in the steady state, every stage of the pipeline is full and thus every stage delay produces one unit of work:
        - eg. for the same example, the speedup will approach 4 times, since a load can be completed at every step once the pipeline is full
        - over time, the steady state throughput will overcome the initial pipeline warmup state ie. pipeline startup before steady state
    - note that while the delay for a particular instruction is its latency, the CPI is a weighted average that gives a measure of throughput:
        - when using pipelining, the total delay of a particular instruction may even increase due to added latches
        - however, the overall CPI ie. average throughput will approach 1 in the steady state, since an instruction is completed every stage
        - single-cycle also has a CPI of 1, but with pipelining the cycle time can also dramatically drop by a factor of the number of stages
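The arithmetic in the example above can be checked in a few lines of Python (times in hours; the 4-stage, half-hour-per-stage load example):

```python
# Sketch: serial vs. pipelined time for 4 loads of 4 half-hour stages.

stages, stage_time, loads = 4, 0.5, 4
serial = loads * stages * stage_time           # one load at a time: 8 hours
pipelined = (stages + loads - 1) * stage_time  # fill the pipe, then 1 per step
assert serial == 8.0 and pipelined == 3.5
assert round(serial / pipelined, 1) == 2.3     # speedup for just 4 loads

# in the limit of many loads, the speedup approaches the stage count
many = 10_000
speedup = (many * stages * stage_time) / ((stages + many - 1) * stage_time)
assert 3.99 < speedup < 4.0
```

The `stages + loads - 1` term is the warmup cost: the first load pays full latency, and every later load finishes one stage time after the previous one.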
- the MIPS ISA is designed for pipelining:
    - all instructions are 32 bits
        - can fetch and decode quickly in a single cycle
    - few and regular instruction formats
        - can decode and read registers in one step
    - load / store addressing becomes simpler
        - alignment of memory operands allows memory accesses to take only one cycle
- in the MIPS pipeline, there are five stages:
    - IF is the instruction fetch from memory, involves the instruction memory
    - ID is the instruction decode and register read, involves reading the register file
    - EX is the operation execution or address calculation, involves the ALU
    - MEM is the actual memory operand access, involves main memory ie. data memory
    - WB is the write back of the result to a register, involves writing the register file
- want to achieve a steady state where a different instruction is being executed in each stage of the pipeline
    - vertical slices through a pipeline diagram indicate which and how many instructions are being pipelined
- although we are reading and writing to the register file in the same cycle when the pipeline is in steady state:
    - a transparent latch is used that separates the reading and writing to registers within the same cycle
        - ie. writing occurs in the first half of the cycle and reading occurs in the second half
    - note that there is no structural hazard in this conflict (different ports are used), just a timing conflict
-
comparing pipeline performance with the single-cycle datapath:
- assume time for stages is 100 ps for register reads or writes and 200 ps for other stages
- ex. For three loads in a row:
- in single-cycle, need a
$T_C = 800$ ps since every load requires 800 ps - when pipelined, need a
$T_C = 200$ ps since the maximal latency of an individual stage is 200 ps- instructions just need to move in lock-step fashion
- in single-cycle, need a
- if all stages are balanced ie. take the same time,
$T_{piped} = \frac{T_{non-piped}}{stages}$ - if not balanced, speedup is less, need to take the maximal delay
- when there are mixed instructions in the pipeline, complications can occur:
- eg. a load word accesses a data memory before writing back to registers, while an addition goes straight to registers after the ALU
- this causes a conflict in the `lw, add` instruction sequence, since the `lw` takes one more stage to get to writeback
- need to regularize the path that instructions take through the pipeline
- adds cannot skip data memory in our current implementation
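As a sanity check on the three-load example, a minimal sketch (the 800 ps and 200 ps cycle times come from the assumed stage latencies above; the function names are made up):

```python
# Sketch: total execution time for n back-to-back loads
def single_cycle_time(n, tc=800):
    return n * tc                    # each instruction takes one long cycle

def pipelined_time(n, tc=200, stages=5):
    return (n + stages - 1) * tc     # pipeline fill, then one per cycle

print(single_cycle_time(3))          # 2400 ps
print(pipelined_time(3))             # 1400 ps
```

For just three loads the speedup is only 2400/1400, but as the instruction count grows the ratio approaches 800/200 = 4, the balanced-stage limit.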
- pipelining principles:
- all instructions that share a pipeline must have the same stages in the same order
- eg. `add` does nothing during the MEM stage and `sw` does nothing during the WB stage
- all intermediate values must be latched ie. saved each cycle:
- ensures instructions don't interfere with each others' signals
- need registers ie. flip-flops to act as banks between stages as seen in Figure \ref{pllatch}
- registers hold information produced in previous cycle
- latches act as separators between the pipelined stages so that the inputs and outputs of each stage can stabilize
- there can be no functional block reuse
- some hazards to consider when pipelining:
- control hazards:
- after a branch, the next instruction that should be run is dynamic
- this is where branch prediction comes into play
- some instructions have dependence on a previous instruction:
- ie. checking for a value after a memory load before changing control
- eg. in `lw, add, sub`, the addition and subtraction wait on the `lw` result
- since the EX stage occurs before WB of the memory read
- structural hazards
- cannot skip over stages ie. paths have to be ordered and regularized
- these hazards prevent the CPI from ideally approaching 1
- instruction flow through the pipeline for `lw`:
- during IF, the instruction corresponding to the PC will be stored into the IF/ID latch, along with the `PC + 4` calculation
- note that we will have to deal with the hazard of instructions changing the PC eg. `beq`
- during ID:
- the IF/ID latches are read to get register numbers
- the two corresponding register values are then saved into the ID/EX latch
- note that the contents of only one register are used in `lw`, but both reads still occur
- the 16-bit immediate will be sign-extended and stored into the ID/EX latch
- `PC + 4` is also propagated into the ID/EX latch
- during EX:
- the sign-extended immediate from the ID/EX latch is added to `PC + 4` and stored into the EX/MEM latch
- this value is not used in `lw`, but the calculation still occurs
- sign-extended immediate from the ID/EX latch is MUXed into the ALU with the top register value
- ALU result is saved into EX/MEM latch
- the lower register value from the ID/EX latch is also propagated into the EX/MEM latch
- this value is not used by `lw`, but the latch still occurs
- during MEM:
- the result of the ALU from the EX/MEM latch is used to address into data memory
- the value read in from data memory is stored into the MEM/WB latch
- the result of the ALU itself is also stored into the MEM/WB latch
- during WB:
- the value from memory in the MEM/WB latch is MUXed into the write data port of the registers
- but this is now the write register destination of a different instruction!
- to fix this, the write register destination must be propagated down the pipeline to WB alongside the memory address calculation and access
- as seen in Figure \ref{pllatch}
- placing latches in specific places may affect the delay in each pipeline stage
- eg. which stage ALU controller or MUXs lie between which latches
- instruction flow through the pipeline for `sw`:
- same steps as `lw` for IF through EX
- during MEM:
- the result of the ALU from the EX/MEM latch is still used to address into data memory
- but the `MemWrite` signal will be high, and the value of the lower register from the EX/MEM latch will be passed into the write data port of the data memory component
- during WB, `RegWrite` is low for `sw`, so nothing occurs in this pipeline stage
- the control inputs for the pipelined datapath are identical to the ones in the single-cycle datapath:
- however, we now have multiple instructions flowing through the pipeline at the same time
- could have up to 5 sets of control operation signals at the same time, one for each pipeline stage
- to allow for this, the control signals are now also latched between stages alongside the instruction data:
- the instruction itself is still read out of the IF/ID latch into an identical main control unit
- controls are divided ie. clustered into the stages control signals refer to:
- EX, MEM, WB specific control signals are passed down the pipeline until they reach their specific stage
- eg. `AluOp` is only stored in the ID/EX latch since it is immediately consumed by the ALU in the EX stage
- each stage can now hold insulated control signals for different instructions
- hazards are situations in the pipeline where performance drops and CPI goes above 1:
- ie. situations that prevent starting the next instruction in the next cycle
- in a data hazard, need to wait for previous instructions to complete their data read or writes
- in a control hazard, deciding on the next control action depends on a previous instruction
- in a structural hazard, a required resource is busy:
- in the given pipeline, there is no conflict for instructions since each instruction is in exactly one stage
- eg. if we only had one single memory for data and instructions, instruction fetch may have to stall for a cycle
- creates a pipeline "bubble" where the pipeline could open up and certain stages cannot move forward
- when a result is needed in the pipeline before it is available, a data hazard occurs:
- note that it is possible to write and then read from a register in the same cycle by using transparent latches
- ie. in the best case, we can line up register reads and the writes they are dependent on in the same cycle
- a read after write (RAW) is an example of a true data dependency:
- while write after read (WAR) and write after write (WAW) are antidependencies that affect scheduling instead of being true data hazards
- antidependencies affect reordering of instructions, but can usually be addressed through register renaming
- ex. Consider the instruction sequence `add r3, r10, r11`, `lw r8, 50(r3)`, and `sub r11, r8, r7`:
:- this is a straightline dependency where each instruction produces a value used by the next
- ie. each instruction depends on the one before it
- when the `add` is in EX adding two register values together, the `lw` is in ID already reading out from register `r3`:
- but `r3` has not been written to with the result of the ALU; `lw` is reading a stale value to form an incorrect address
- similarly for when the `sub` moves to ID and reads another incorrect value for `r8`
- this is an example of a data hazard
- to deal with data hazards, can use one of multiple software or hardware techniques
- in software, can insert independent instructions or no-ops:
- simply insert enough no-ops to effectively stall so that the data hazard doesn't occur:
- eg. shifting a register by 0 is a no-op
- occupies space in the pipeline, but doesn't affect the functionality of the program
- pros:
- hardware thus does not need to expect any hazards:
- no complexity required to add to the hardware
- much easier to patch software
- CPI stays close to 1, since an instruction is still executed every cycle
- hardware thus does not need to expect any hazards:
- cons:
- ruins the portability of the software since pipelines may be extremely machine-dependent
- the no-op insertion or reordering operation depends on the length of the pipeline on the machine, etc.
- increased software complexity
- instruction count does increase and performance decreases
- note that the instruction cache may need to grow to keep track of additional instructions, which can increase latency and thus the CPI
- alternatively, we can reorder code to avoid use of written results in the next instruction:
- this optimization is possible, but difficult when reordering distant instructions
- eg. moving loads above stores since they may alias one another, moving instructions around branches, etc.
- pros:
- same CPI advantages as no-op approach, without any of the IC or latency downsides
- cons:
- not always possible
- even more software complexity
- software vs. hardware handling of hazards is analogous to static vs. dynamic optimization
Example with adding no-ops and reordering:

```asm
; assuming the delay to write and then read a value is 2 instructions
sub r2, r1, r3
and r4, r2, r5
or  r8, r2, r6
add r9, r4, r2
slt r1, r6, r7

; only adding no-ops
sub r2, r1, r3
sll r0, r0, 0       ; no-op: shifting r0 by 0 does nothing
sll r0, r0, 0
and r4, r2, r5
or  r8, r2, r6
sll r0, r0, 0
add r9, r4, r2
slt r1, r6, r7

; reordering instructions removes one no-op
sub r2, r1, r3
slt r1, r6, r7
sll r0, r0, 0
and r4, r2, r5
or  r8, r2, r6
sll r0, r0, 0
add r9, r4, r2
```
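The no-op padding above can be derived mechanically; a sketch under the example's 2-instruction write-to-read delay assumption (all names here are made up for illustration):

```python
# Sketch: mechanically insert `sll r0, r0, 0` no-ops under the assumption
# that a result is usable only 2 instructions after its producer.
# Instructions are (mnemonic, dest, src1, src2) tuples.
NOP = ("sll", "r0", "r0", "0")
DELAY = 2

def insert_noops(program):
    out = []
    for instr in program:
        _, _, src1, src2 = instr
        # pad with no-ops while a source register was written too recently
        # (writes to r0 never count, since r0 is hardwired to zero)
        while any(prev[1] in (src1, src2) and prev[1] != "r0"
                  for prev in out[-DELAY:]):
            out.append(NOP)
        out.append(instr)
    return out

program = [
    ("sub", "r2", "r1", "r3"),
    ("and", "r4", "r2", "r5"),
    ("or",  "r8", "r2", "r6"),
    ("add", "r9", "r4", "r2"),
    ("slt", "r1", "r6", "r7"),
]
padded = insert_noops(program)
print([i[0] for i in padded])   # matches the "only adding no-ops" listing
```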
- in hardware, can insert bubbles ie. stall the pipeline without inserting a no-op, or alternatively perform data forwarding:
- stalls reduce performance, but are required to get correct results
- stalls are necessary even if we are using data forwarding
- inserting bubbles:
- similar to injecting no-ops, but in hardware
- when an instruction realizes a previous instruction still has a register value that it needs, it waits in its current stage
- independently control pipe stages such that later stages may continue executing, but the current stage can stall
- stalling a stage will also stall the younger instructions behind it
- to ensure proper pipeline execution through stalls, we must be able to:
- detect the hazard
- stall the pipeline:
- in the current pipeline, need to prevent the IF and ID stages from making progress
- note that the bubble will only start in IF or ID but the bubbled instruction will still proceed down the pipeline
- ID needs to read the dependent instruction write and we don't want the IF stage to lose any instructions
- detecting hazards ie. what comparisons tell us when to stall between two instructions A and B?
- need to compare the register destination where A will write against the two source operands for B
- can be either `rt` or `rd` depending on the instruction
- as well as the `RegWrite` signal for A to see if A even needs to write
- may also want to check the type of B
- eg. B is a J-type and doesn't use the second read register, so there is no need to stall
- at minimum, need to check `RegWrite` for A and the register destination of A matching one of the read registers for B
Initial stalling conditions:

```
// stall when the instruction in ID reads a register that an older
// instruction still in EX or MEM has yet to write
if ((ID/EX.RegWrite &&
     ((ID/EX.RegisterDest == IF/ID.RegisterRs) ||
      (ID/EX.RegisterDest == IF/ID.RegisterRt))) ||
    (EX/MEM.RegWrite &&
     ((EX/MEM.RegisterDest == IF/ID.RegisterRs) ||
      (EX/MEM.RegisterDest == IF/ID.RegisterRt))))
{
    PCWrite = 0
    IF/IDWrite = 0
    ... insert control no-ops ...
}
```
- to stall ie. prevent the IF and ID stages from proceeding:
- skip writing to the PC
- add a new control signal `PCWrite` that can lock down the instruction in the IF stage
- skip writing the IF/ID latch
- add a new control signal `IF/IDWrite` that can lock down the instruction in the ID stage
- insert no-ops to inject the bubble by setting all control signals propagating to EX/MEM/WB to zero
- eg. doesn't read or write from memory, branch, write to registers, etc.
- pros:
- IC stays the same
- cons:
- CPI increases due to bubbled parts of the pipeline
- hardware complexity
- to perform data forwarding AKA register bypassing:
- use the result when it is computed instead of waiting for it to be stored in a register:
- eg. in an `add`, `sub` sequence, just pass the ALU result of the addition back to the input of the ALU in order to be used just in time for the next subtract instruction
- requires extra connections in the datapath
- specifically, forward values stored in the EX/MEM latch as input into the ALU
- as well as possibly the MEM/WB latch
- pros:
- IC still stays the same
- less increase in CPI by forwarding over stalling
- cons:
- CPI still increases due to unavoidable bubbles
- hardware complexity
Example with forwarding:

```asm
sub r2, r1, r3
and r12, r2, r5   ; r2 has not been written, but available in EX/MEM latch
or  r13, r8, r2   ; r2 has not been written, but available in MEM/WB latch
add r14, r2, r2   ; r2 has now been written, no more forwarding needed
sw  r15, 100(r2)
```
- detecting the need to forward:
- need to know the register input sources for later instructions to detect if they should receive a forward
- also need to compare these input sources with the register destination for older instructions
- need to pass register numbers along the pipeline
- eg. `ID/EX.RegisterRs` is the register number for `rs` sitting in the ID/EX pipeline register
- data hazards occur when:
- `EX/MEM.RegisterDest == ID/EX.RegisterRs`
- ie. check the currently executing instruction in the ID/EX latch against the instruction currently in the memory stage (previous instruction)
- `EX/MEM.RegisterDest == ID/EX.RegisterRt`
- `MEM/WB.RegisterDest == ID/EX.RegisterRs`
- ie. check the currently executing instruction in the ID/EX latch against the instruction currently in the writeback stage (two instructions ago)
- `MEM/WB.RegisterDest == ID/EX.RegisterRt`
- also need to check if the forwarding instruction will even write to a register
- ie. `EX/MEM.RegWrite` and `MEM/WB.RegWrite`
- in addition, only forward if the destination for that instruction is not the zero register, since the zero register value never changes
- actual forwarding paths:
- add additional MUXs in front of each ALU input:
- each MUX has three register inputs:
- original register value that was read and latched in ID/EX
- register value from the EX/MEM latch, from instruction in the MEM stage
- register value from the MEM/WB latch, from instruction in the WB stage
- needs to come after the `MemtoReg` MUX to either get the result from memory or the ALU
- in addition to the existing `ALUSrc` MUX
- add a new control unit ie. the forwarding unit:
- takes in `rs` and `rt` from the ID/EX latch
- as well as `EX/MEM.RegisterDest` and `MEM/WB.RegisterDest`
- controls the two ALU input MUXs with control signals `ForwardA` and `ForwardB`
- the double data hazard:
- in the sequence `add r1, r1, r2`, `add r1, r1, r3`, and `add r1, r1, r4`, both previous instructions are hazards for the third instruction
- could receive the forwarded register `r1` from either previous instruction as a MEM or WB hazard
- could receive the forwarded register
- need to make sure to only use the most recent hazard
- revise the WB hazard condition to only forward if the MEM hazard forward condition is not true
Forwarding conditions:

```
// EX hazard ie. forwarding instruction in MEM stage
if (EX/MEM.RegWrite && (EX/MEM.RegisterDest != 0)
    && (EX/MEM.RegisterDest == ID/EX.RegisterRs))
    ForwardA = 0b10
if (EX/MEM.RegWrite && (EX/MEM.RegisterDest != 0)
    && (EX/MEM.RegisterDest == ID/EX.RegisterRt))
    ForwardB = 0b10

// MEM hazard ie. forwarding instruction in WB stage (naive version)
if (MEM/WB.RegWrite && (MEM/WB.RegisterDest != 0)
    && (MEM/WB.RegisterDest == ID/EX.RegisterRs))
    ForwardA = 0b01
if (MEM/WB.RegWrite && (MEM/WB.RegisterDest != 0)
    && (MEM/WB.RegisterDest == ID/EX.RegisterRt))
    ForwardB = 0b01

// MEM hazard with the double data hazard fix: only forward from MEM/WB
// when the EX hazard condition does not also hold
if (MEM/WB.RegWrite && (MEM/WB.RegisterDest != 0)
    && !(EX/MEM.RegWrite && (EX/MEM.RegisterDest != 0)
         && (EX/MEM.RegisterDest == ID/EX.RegisterRs))
    && (MEM/WB.RegisterDest == ID/EX.RegisterRs))
    ForwardA = 0b01
if (MEM/WB.RegWrite && (MEM/WB.RegisterDest != 0)
    && !(EX/MEM.RegWrite && (EX/MEM.RegisterDest != 0)
         && (EX/MEM.RegisterDest == ID/EX.RegisterRt))
    && (MEM/WB.RegisterDest == ID/EX.RegisterRt))
    ForwardB = 0b01
```
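The conditions above can be restated as plain Python and checked against the double-hazard example (a sketch; the tuple encoding and function name are assumptions, with `0b10`/`0b01`/`0b00` selecting the EX/MEM value, the MEM/WB value, or the register-file value):

```python
# Sketch of the forwarding unit (illustrative, not the notes' hardware)
def forward_controls(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """ex_mem and mem_wb are (RegWrite, RegisterDest) pairs."""
    fwd_a = fwd_b = 0b00

    # EX hazard: forward from the instruction currently in the MEM stage
    if ex_mem[0] and ex_mem[1] != 0:
        if ex_mem[1] == id_ex_rs:
            fwd_a = 0b10
        if ex_mem[1] == id_ex_rt:
            fwd_b = 0b10

    # MEM hazard with the double-data-hazard fix: forward from the WB stage
    # only when the more recent EX/MEM value is not already being forwarded
    if mem_wb[0] and mem_wb[1] != 0:
        if fwd_a == 0b00 and mem_wb[1] == id_ex_rs:
            fwd_a = 0b01
        if fwd_b == 0b00 and mem_wb[1] == id_ex_rt:
            fwd_b = 0b01
    return fwd_a, fwd_b

# double data hazard: add r1,r1,r2 ; add r1,r1,r3 ; add r1,r1,r4
# with the third add in EX, both older adds wrote r1, but only the most
# recent value (sitting in EX/MEM) should be forwarded
print(forward_controls((True, 1), (True, 1), 1, 4))
```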
- however, we cannot always avoid stalls by forwarding:
- for load instructions, the value read out of memory is not ready until the end of the MEM stage
- if we have a load-use instruction sequence, the load will thus not be ready in time for the EX stage of the next instruction
- if the value is not computed when needed, cannot forward back in time
- impossible to avoid adding a bubble to stall the pipeline
- load-use hazard detection:
- need to check when the "using" instruction is decoded in the ID stage:
- in ID stage, need to pass control signals of all zero in order to perform the no-op and stall
- check `IF/ID.RegisterRs` and `IF/ID.RegisterRt` against the load destination
- the destination register for a load is always `rt`
- note that the mechanism to forward the data from an instruction in the WB stage to the EX stage is already in place with the WB hazard condition
- same process as before for performing a stall on the pipeline:
- set all control values in ID/EX latch to 0
- prevent update of PC and IF/ID latch
- note that the "using" instruction will be decoded again and the following instruction will also be fetched again
Load-use hazard detection:

```
if (ID/EX.MemRead &&
    ((ID/EX.RegisterRt == IF/ID.RegisterRs) ||
     (ID/EX.RegisterRt == IF/ID.RegisterRt)))
    ... stall ...
```
- add a new control unit ie. the hazard detection unit:
- takes in `ID/EX.MemRead` and `ID/EX.RegisterRt`
- as well as `IF/ID.RegisterRs` and `IF/ID.RegisterRt`
- controls a big MUX over the control signals that can set them all to zeros as a no-op
- controls the `PCWrite` and `IF/IDWrite` signals to initiate a stall
- Figure \ref{pldh} shows the complete pipeline with data hazard handling
- does not include all control signals and ignores `ALUSrc`
- branches determine flow of control:
- the fetching of the next instruction depends on the branch outcome
- the pipeline cannot always fetch the correct instruction
- handling control hazards in the software:
- if software is aware of the underlying pipeline ISA, software can insert instructions that are not guarded by a branch directly after it, to fill in the branch resolution delay
- ie. have 2 free branch delay slots for instructions after each branch
- can fill with no-ops as well if there are no available independent instructions
- could take an instruction from before the branch, or from one of the paths after the branch:
- instructions from before the branch will always be executed, vs. instructions after the branch may not always be executed
- gives more options for instructions to pull in
- restrictions on the liveness of these chosen instructions as well
- pros:
- never have to flush incorrectly predicted instructions
- compiler will select independent instructions to pull in that do not affect correctness either path the branch takes
- cons:
- software is tightly coupled with specific machine ISA
- adds control complexity
- handling control hazards in the hardware:
- just stall until we know which direction the branch will take
- guess which direction and start executing a chosen path:
- must be prepared to undo any mistakes
- how often and how much time is lost when the prediction is incorrect
- with static branch prediction, base the guess on instruction type
- with dynamic branch prediction, base the guess on execution history
- reduce the branch delay
- difficult in reality to achieve
- stall on branch:
- simply wait until the branch outcome is determined before fetching the next instruction
- need to stall the instruction fetch rather than the instruction decode stage
- forces a stall on every branch
- the stall penalty becomes unacceptable with deeper, longer pipelines
- branch prediction:
- predict outcome of branches and only have a penalty if prediction is incorrect
- note that prediction was not used with data hazards since there is a large range of values to consider
- with branch prediction, only have to predict between two options
- in the MIPS pipeline:
- can more easily predict branches not taken, just take `PC + 4`
- fetch the instruction after the branch, with no delay
- more realistic branch predictions:
- in static branch prediction, we predict based on typical branch behavior:
- predict backward branches are taken eg. in loops
- note that branch-not-taken is an example of static branch prediction
- in dynamic branch prediction, we measure actual branch behavior:
- eg. record recent history of each branch
- assumes future behavior will continue trends
- when incorrect, need to also update the history
- have to consider two drawbacks impacting the CPI:
- how often a branch is incorrect
- impacted by software and hardware
- how expensive is the penalty to correct erroneous predictions
- only impacted by the hardware implementation
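The two factors combine multiplicatively into an effective CPI; a sketch with illustrative numbers (none of the rates or the 3-cycle penalty below are from the notes):

```python
# Sketch: effective CPI given a branch frequency, misprediction rate,
# and flush penalty (all illustrative assumptions)
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

# 20% branches, 10% mispredicted, 3-cycle flush penalty
with_predictor = effective_cpi(1.0, 0.20, 0.10, 3)
# stalling on every branch behaves like a 100% "misprediction" rate
stall_on_branch = effective_cpi(1.0, 0.20, 1.00, 3)
print(with_predictor, stall_on_branch)
```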
- when a prediction is incorrect, we need to perform a flush:
- assume that the branch outcome is determined in the MEM stage
- in order to flush, we set the control signals to 0 of instructions that should have been skipped:
- want to perform no-ops for the remaining stages in these instructions
- but what about the operations that were already performed for these invalid instructions?
- could not have possibly reached the data memory or writeback stages yet
- thus these operations have not done anything to the pipeline that cannot be removed by setting the control bits to 0
- after the flush, instructions will not write to memory or the register file
- the instruction we should have jumped to will execute normally, after the PC is updated when the branch is determined
- not affected by the flush
- flush essentially acts the same as a stall in the pipeline
- but when the prediction is correct, we do not see any penalty at all
- to check if the branch outcome was correct:
- check whether `Branch AND Zero` (the branch-taken signal) matches the predicted outcome
- implemented as a comparator on the branch outcome, and only write the PC if the prediction was incorrect
- reducing branch delay penalty:
- there is no reason to wait until the MEM stage to determine the branch outcome
- can we move the hardware to determine a branch outcome to an earlier stage?
- eg. can move to the EX stage and save an instruction flush
- eg. could even move the hardware all the way to the ID stage:
- add a register comparator and a target address adder in this earlier stage
- reduces the branch delay to just a single cycle
- only have to pay a single cycle penalty when incorrect
- issues with branch prediction using a single cycle delay:
- if a comparison register for the branch is a destination of a 2nd or 3rd preceding ALU instruction:
- there is no current forwarding of data values to the ID stage, only the EX stage
- thus need additional forwarding logic
- if a comparison register for the branch is a destination of a preceding ALU instruction or 2nd preceding load instruction
- cannot use forwarding alone, need an additional stall cycle
- if a comparison register for the branch is a destination of a preceding load instruction
- need two additional stall cycles
- in more advanced pipelines, the branch penalty is much more significant:
- thus, we want to rely on dynamic branch prediction
- uses a branch prediction buffer (BPB) AKA branch / prediction history table:
- indexed by recent branch instruction addresses
- may only index by part of the PC for space reasons eg. the lower 12 bits above the 2 bottom zeros implied by word-address granularity
- stores outcome ie. taken or not taken branch
- to execute a branch, check table and expect the same outcome:
- start fetching from the fall-through or target
- need additional mechanism to keep track of target address
- if wrong, flush pipeline and flip prediction
- with a 1-bit predictor:
- simply stores whether this branch was taken the last time it was executed
- not enough hysteresis ie. inertia to track branch outcomes
- eg. with an if-statement conditioned on whether a loop iteration is even, the 1-bit predictor would never be correct
- eg. with inner and outer loops, mispredictions will occur when both leaving and reentering the inner loop
- instead, use a 2-bit predictor:
- each entry in the table has 2 bits, and the most significant bit is used to predict the branch outcome
- adds hysteresis such that the prediction will only change on two successive mispredictions
- fixes issues with 1-bit predictor
- can also reference global history stored in a shift buffer:
- hashing this with the PC allows us to distribute it along multiple branch locations
- solves the even-odd looping branch
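A minimal sketch contrasting the 1-bit and 2-bit predictors on the even/odd alternating branch described above (Python, illustrative only; the initial predictor states are assumptions):

```python
# Sketch (not the notes' hardware): 1-bit vs 2-bit saturating predictors
class OneBit:
    def __init__(self, taken=True):       # initial state is an assumption
        self.state = taken
    def predict(self):
        return self.state
    def update(self, taken):              # just remember the last outcome
        self.state = taken

class TwoBit:
    def __init__(self, state=0b11):       # start at "strongly taken"
        self.state = state
    def predict(self):
        return self.state >= 0b10         # MSB gives the prediction
    def update(self, taken):              # saturating counter: needs two
        if taken:                         # successive misses to flip
            self.state = min(self.state + 1, 0b11)
        else:
            self.state = max(self.state - 1, 0b00)

def accuracy(predictor, outcomes):
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)

alternating = [True, False] * 50          # branch taken on even iterations
print(accuracy(OneBit(), alternating))    # wrong nearly every time
print(accuracy(TwoBit(), alternating))    # hysteresis: right half the time
```

Note the 2-bit counter only reaches 50% on strict alternation; that is exactly the pattern the notes address with global history hashed with the PC. On a loop-like pattern (many taken, one not taken) the 2-bit predictor mispredicts only once per loop exit.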
- even with a direction predictor, still need to calculate the target address:
- the earliest we can have the effective address calculation done is the ID stage, after IF
- no easy way to extract out the address calculation earlier
- thus, use another table called the branch target buffer (BTB) that caches target addresses:
- indexed by PC when instruction is fetched
- if hit and instruction is branch predicted taken, can fetch target immediately
- the first time we hit a branch, it will not be in the BTB
- afterwards, the branch target addresses will be recorded into this cache
- for indirect branches, we can save the target address taken, but this may not always be correct because the address is held in a register that could have been updated
- note that this simple directional branch prediction ignores more complex problems that must be solved in reality
- eg. predicting indirect branches like jumps that go to an address in a register or function returns
- exceptions and interrupts are more forms of control hazards:
- ie. unexpected events requiring changes in flow of control
- exceptions are said to generally arise within the CPU
- eg. undefined opcode, overflow, system call errors
- interrupts are said to generally arise instead from external components:
- are usually asynchronous
- eg. external I/O controllers
- difficult to handle these without sacrificing performance
- handling exceptions:
- in MIPS, exceptions are managed by a system control coprocessor
- need to save the PC of the offending or interrupted instruction
- in MIPS, saved in the exception program counter (EPC)
- save indication of the problem:
- in MIPS a code is stored in the cause register
- eg. 1 may indicate overflow, 0 may indicate some undefined opcode
- generally more than a single bit
- jump to the handler at a fixed handler address
- alternatively, with vectored interrupts, can have different specific handlers whose addresses are determined by the cause by indexing into a handler table
- thus, exceptions are another form of a control hazard
- eg. need to consider overflow in the EX stage and change the PC accordingly
- handler responsibilities:
- need to read the cause and possibly transfer to a relevant second-level handler
- if restartable, take corrective action and use the EPC to return to the program
- technically, need to adjust the EPC since `PC + 4` is saved
- technically, need to adjust EPC since
- otherwise, terminate the program and report error with the EPC
- ex. Handling overflow in the EX stage from the add instruction `add r1, r2, r1`:
- need to prevent `r1` from being clobbered
- complete previous instructions
- flush the add and any subsequent instructions still in the pipeline
- set the `Cause` and EPC register values
- transfer control to the handler by setting the PC
- very similar to mispredicted branch, except for writing to extra registers:
- uses much of the same hardware
- also need to stop the current instruction and allow it to resume later after the exception
- since the pipeline overlaps multiple instructions, could have multiple exceptions at once:
- eg. overflow along with undefined opcode in another pipeline stage
- in the simplest approach, just deal with the exception from the earliest instruction ie. issue precise exceptions
- flush subsequent instructions
- in more complex pipelines, multiple instructions are issued per cycle or out-of-order completion is utilized
- maintaining precise exceptions is very difficult
- can alternatively issue imprecise exceptions:
- stop the pipeline completely and save its state, including exception causes
- let the handler work out:
- which instructions had exceptions
- which to complete or flush
- simplifies hardware, but much more complex handler software
- not feasible for complex multiple-issue out-of-order pipelines
- there are several possibilities for extending the pipeline in an attempt to decrease cycle time:
- pipelining more deeply may drop the cycle time, but additional forwarding logic could increase cycle time as well if the critical path becomes slower
- in addition, the fundamental tradeoff between $T_C$ and CPI comes into play, since a lower $T_C$ results in increased CPI due to data and control hazards
- split a single add into two staggered adds, separated by latches, eg. 32-bit into two 16-bit adders:
- want to decrease TC while deepening the pipeline
- most functionality in the split adder will still work:
- pass carry-out from lower 16 bits to carry-in of upper 16 bits
- need to pass both parts of the immediate to reach both ALU stages
- SLT complicates things, since the MSB is fed back to the `less` passthrough of the LSB
- need to use an additional MUX at the second ALU to overwrite the least significant bit in the case of an SLT
- in the case of a load-add dependency, now have to stall two cycles instead of just one
- memory results become delayed with the extra ALU stage
- there is a new dependency hazard in the case of an SLT with a dependent following instruction:
- need an additional stall cycle since the lower bits of an SLT result are not finalized until the MUX at the second adder
- unlike any other arithmetic instruction that can correctly forward the calculated lower bits
- branch hazard recovery must flush an additional stage
- data must be forwarded from multiple latches to both the 16-bit lower and upper ALUs:
- to the lower 16-bit ALU, need 3 additional forwarding paths for the lower 16 bits from E1/E2, E2/MEM, MEM/WB after `RegDst`
- to the upper 16-bit ALU, need 2 additional forwarding paths for the upper 16 bits from E2/MEM, MEM/WB:
- but there is no way to access the previous instruction that used to be in MEM/WB
- the value from MEM/WB after `RegDst` needs to be latched into E1/E2 so that it is accessible even after the original instruction leaves the pipeline
- an additional 3rd forwarding path to the upper ALU
- not an issue anywhere else because those forwarding paths all still exist
- split the IM stage into two parts:
- data hazards are not affected since the loop between EX and WB has not been extended
- have to flush extra stage during branch hazards
- split the MEM into two parts:
- load use dependency now requires two cycle delay
- there is also an additional load use dependency on the other MEM component
- additional data forwarding required
-
in instruction level parallelism (ILP), execute independent instructions at the same time:
- in pipelining, already executing multiple instructions in parallel
- some level of ILP
- to increase the ILP:
- can make the pipeline deeper:
- less work per stage and a shorter clock cycle
- but increased hazards
- multiple-issue:
- replicate pipeline stages or add multiple pipelines
- start multiple instructions per clock cycle
- CPI should ideally become less than 1, so instructions per cycle (IPC) may be a better metric
- in reality, dependencies and hazards reduce this CPI
- how effective is multiple-issue?
- works, but has a certain limit and starts to give diminishing returns:
- programs will have real dependencies that limit ILP
- certain dependencies are very hard to eliminate eg. pointer aliasing
- some parallelism is hard to expose eg. limited window size during instruction issue
- difficult to keep pipelines full due to memory delays and limited bandwidth
- in general, there is a turn towards multicore and multithreaded CPUs and applications to avoid the diminishing returns of optimizing ILP
- however, most machines are still multiple-issue and dynamically scheduled
-
for multiple-issue, how do we choose which instructions to issue together?
- in static multiple-issue, compiler groups independent instructions together:
- compiler packages them into issue slots
- compiler detects and avoids hazards
- drawbacks of ISA specificity and non-portability
- compiler may have to use no-ops, which increases IC
- in dynamic multiple-issue, the hardware will examine the instruction stream and chooses instructions to issue in each cycle:
- the more common approach
- CPU has to resolve hazards using complex techniques
-
similarly to branch prediction, in speculation, we can guess what to do with an instruction:
- start an operation as soon as possible
- later, check whether guess was correct:
- if so, complete the operation
- otherwise, roll-back and re-do the correct operation
- unlike in branch prediction, we allow the instructions to actually impact the state of the processor
- more complex roll-back operation
- compiler can perform speculation and reorder instructions:
- eg. move load before branch
- would need "fix-up" instructions to recover from incorrect guesses
- hardware speculation:
- look ahead for instructions to execute
- buffer results until hardware determines they are actually needed
- flush buffers on incorrect speculation
- if an exception occurs during a speculatively executed instruction:
- in software, needs ISA support for deferring exceptions
- in hardware, needs to buffer exceptions until instruction completion
-
compiler groups instructions into issue packets ie. a group of instructions that can be issued in a single cycle:
- essentially creates one very long instruction that specifies multiple concurrent operations:
- AKA very long instruction word (VLIW)
- instructions must be completely independent since they travel down the pipeline together and cannot forward
- determined by pipeline resources required
- pros:
- avoids complex hardware design and implementation
- cons:
- very difficult to predict all stalls in software
- eg. may depend on dynamic input data besides just the static program alone
- can't always schedule around branches since branch outcome is dynamically determined
- eg. limitations on reordering distance
- limited portability, requires ISA-specific and even machine-specific compiler implementations
-
compiler must also remove some or all of the hazards:
- reorder instructions in issue packets
- no dependencies within a packet
- possibly some dependencies between packets
- pad with no-ops if necessary
-
ex. MIPS with static dual-issue:
- allow for two-issue packets:
- 1 ALU / branch instruction and 1 load / store instruction
- 64-bit aligned, back to back instructions
- pad unused instruction with a no-op
- note that the paths are specialized:
- only one instruction can access memory
- the memory instruction will make use of an auxiliary adder to perform its address calculations during EX
- specialization reduces the duplicated resources in the pipeline
- most of the MIPS pipeline is the same in the dual-issue implementation:
- register file functionalities, MUXs, etc. are all extended to support an additional instruction
- heavy cost on extended register file ports
- additional auxiliary adder
- still only one memory component
- note that in the ALU / branch instruction, the ALU output always goes to the register file, while in the load / store the memory output always goes to the register file, so there is some simplification in control complexity
-
hazards with dual-issue MIPS:
- more instructions execute in parallel
- forwarding was used to avoid some stalls in single-issue:
- now, cannot use ALU result in load / store in the same packet
- need to split these into two packets, effectively a stall
- more aggressive scheduling required
Ex. Schedule the following for dual-issue MIPS:

```
Loop: lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      addi s1, s1, -4
      bne  s1, $0, Loop
```

Cycle | ALU / branch | load / store |
---|---|---|
1 | `nop` | `lw t0, 0(s1)` |
2 | `addi s1, s1, -4` | `nop` |
3 | `addu t0, t0, s2` | `nop` |
4 | `bne s1, $0, Loop` | `sw t0, 4(s1)` |

Table: Scheduling Instructions into Packets \label{issuepack}
- scheduling for dual-issue MIPS as seen in Table \ref{issuepack}:
- note that we will always have a load-use dependency between the `lw` and `addu`, so they must be separated by a full cycle
- note that it is possible with our duplicated pipeline to have the `bne` and `sw` in the same cycle
- note that the `addi` can be moved anywhere before the `bne`, because we can adjust the load and store offsets accordingly:
- ie. the offset is known at compile time because an add with an immediate was performed to `s1`
- in our table, the `sw` has an anti-dependency that has to be fixed with the extra offset `4(s1)`
- the IPC is $\frac{5}{4} = 1.25$ with a peak IPC of 2
-
loop unrolling is one optimization that the static ILP approach can benefit from:
- in loop unrolling, the loop body is replicated in order to expose more parallelism:
- reduces loop-control overhead
- allows more instructions to be packed together
- need to use different registers per iteration through register renaming
- determining how much to unroll a loop:
- how many registers are being used so that renaming is possible
- stop after desired parallelism has been extracted
- have an idea of how many times the loop should run, or whether the trip count is a multiple of the unrolling factor
- sometimes reordering an unrolled loop is not possible due to memory aliasing eg. in C
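As a high-level sketch of the transformation (a Python stand-in for the MIPS loops, with illustrative names, not actual compiler output), unrolling by 4 performs the same array update with a quarter of the loop-control overhead:

```python
def rolled(a, s2):
    # one element per iteration, like the original 5-instruction MIPS loop
    for i in range(len(a)):
        a[i] += s2
    return a

def unrolled_by_4(a, s2):
    # loop body replicated 4 times; assumes len(a) is a multiple of 4,
    # just as the MIPS version assumes s1 is a multiple of 16
    i = 0
    while i < len(a):
        a[i]     += s2
        a[i + 1] += s2   # adjusted "offsets" replace the removed addi's
        a[i + 2] += s2
        a[i + 3] += s2
        i += 4           # single consolidated loop-control update
    return a
```

Both versions compute the same result, but the unrolled one executes a branch and induction-variable update once per 4 elements.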
Ex. Unrolling the previous loop example 3 times (4 copies of the loop body):

```
Loop: lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      addi s1, s1, -4
      lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      addi s1, s1, -4
      lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      addi s1, s1, -4
      lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      addi s1, s1, -4
      bne  s1, $0, Loop ; gives 17 instructions

; however, all the addi's can be condensed to reduce the number of instructions
Loop: lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      lw   t0, -4(s1)
      addu t0, t0, s2
      sw   t0, -4(s1)
      lw   t0, -8(s1)
      addu t0, t0, s2
      sw   t0, -8(s1)
      lw   t0, -12(s1)
      addu t0, t0, s2
      sw   t0, -12(s1)
      addi s1, s1, -16
      bne  s1, $0, Loop ; gives 14 instructions
```
Ex. Performing register renaming to reduce dependencies:

```
Loop: lw   t0, 0(s1)
      addu t0, t0, s2
      sw   t0, 0(s1)
      lw   t1, -4(s1)
      addu t1, t1, s2
      sw   t1, -4(s1)
      lw   t2, -8(s1)
      addu t2, t2, s2
      sw   t2, -8(s1)
      lw   t3, -12(s1)
      addu t3, t3, s2
      sw   t3, -12(s1)
      addi s1, s1, -16
      bne  s1, $0, Loop
```
Cycle | ALU / branch | load / store |
---|---|---|
1 | `addi s1, s1, -16` | `lw t0, 0(s1)` |
2 | `nop` | `lw t1, 12(s1)` |
3 | `addu t0, t0, s2` | `lw t2, 8(s1)` |
4 | `addu t1, t1, s2` | `lw t3, 4(s1)` |
5 | `addu t2, t2, s2` | `sw t0, 16(s1)` |
6 | `addu t3, t3, s2` | `sw t1, 12(s1)` |
7 | `nop` | `sw t2, 8(s1)` |
8 | `bne s1, $0, Loop` | `sw t3, 4(s1)` |

Table: Scheduling Unrolled Instructions \label{unrolled}
- in Table \ref{unrolled}:
- the `addi` is placed at the earliest cycle, which forces all the `lw` and `sw` offsets to be adjusted:
- except the very first load, which occurs in the same cycle as the `addi`
- note that it would also have been possible to put the `addi` in cycle 7
- pros:
- the IPC is $\frac{14}{8} = 1.75$
- closer to ideal IPC of 2
- cons:
- more register usage, which can lead to more spills and thus more loads and stores
- in addition, the static code size is increased, which may require more instruction cache space in memory
Ex. Schedule the following for dual-issue MIPS, using compiler branch delay slots:

```
Here: lw   t0, 0(s0)
      add  t0, t0, s1
      lw   t0, 0(t0)
      add  t0, t0, s2
      sw   t0, 4(s0)
      addi s0, s0, 8
      bne  s0, s3, Here
```
Cycle | ALU / branch | load / store |
---|---|---|
1 | `nop` | `lw t0, 0(s0)` |
2 | `nop` | `nop` |
3 | `add t0, t0, s1` | `nop` |
4 | `addi s0, s0, 8` | `lw t0, 0(t0)` |
5 | `bne s0, s3, Here` | `nop` |
6 | `add t0, t0, s2` | `nop` |
7 | `nop` | `sw t0, -4(s0)` |

Table: Scheduling Instructions with Branch Delay \label{issuedel}
- in Table \ref{issuedel}:
- 2 branch delay slots are used
- performance analysis:
- to run 100 iterations of the loop, 704 total cycles are needed:
- 4 cycles for fill time
- 7 cycles for each iteration in the dual-issue pipeline
- 14 instructions, including no-ops, are issued per iteration
- CPI approaches
$\frac{704}{1400} \approx 0.5$
- CPI approaches
- note IC of program had to increase
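The cycle accounting above can be checked directly (numbers taken from the example):

```python
# dual-issue loop: 4 fill cycles, then 7 cycles per iteration of the schedule
iterations = 100
fill_cycles = 4
cycles_per_iter = 7
slots_per_iter = 14          # 7 cycles x 2 issue slots, no-ops included

total_cycles = fill_cycles + cycles_per_iter * iterations   # 704
issued = slots_per_iter * iterations                        # 1400 issued
cpi = total_cycles / issued                                 # approaches 0.5
```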
Ex. Unrolling and renaming registers from the previous loop:

```
Here: lw   t0, 0(s0)
      add  t0, t0, s1
      lw   t0, 0(t0)
      add  t0, t0, s2
      sw   t0, 4(s0)
      lw   t1, 8(s0)
      add  t1, t1, s1
      lw   t1, 0(t1)
      add  t1, t1, s2
      sw   t1, 12(s0)
      addi s0, s0, 16
      bne  s0, s3, Here
```
Cycle | ALU / branch | load / store |
---|---|---|
1 | `nop` | `lw t0, 0(s0)` |
2 | `nop` | `lw t1, 8(s0)` |
3 | `add t0, t0, s1` | `nop` |
4 | `add t1, t1, s1` | `lw t0, 0(t0)` |
5 | `addi s0, s0, 16` | `lw t1, 0(t1)` |
6 | `add t0, t0, s2` | `nop` |
7 | `bne s0, s3, Here` | `sw t0, -12(s0)` |
8 | `add t1, t1, s2` | `nop` |
9 | `nop` | `sw t1, -4(s0)` |

Table: Scheduling Unrolled Instructions with Branch Delay \label{unrolldel}
- in Table \ref{unrolldel}:
- 2 branch delay slots are again used
- loop unrolling allowed a `bne` and `addi` to be removed ie. consolidated
- registers were renamed in order to reduce dependency chains
- note the structural hazard that forces the `bne` to have two delay slots after it:
- the `add`, `sw` chain has to be offset by one cycle to make space for the `bne`
- performance analysis:
- to run 100 iterations of the loop, $50 \times 9 + 4 = 454$ total cycles are needed:
- still 4 cycles for fill time
- 9 cycles to run two iterations
- 18 instructions, including no-ops, are issued per unrolled iteration
- CPI approaches $\frac{454}{900} \approx 0.5$, with total cycles dropping from 704 to 454
- dynamic multiple-issue processors are known as superscalar processors:
- CPU decides whether to issue multiple instructions each cycle
- responsible for avoiding structural and data hazards
- avoids the need for compiler scheduling, though it may still be beneficial
- compiler scheduling can solve longer distance optimization problems
Ex. Dynamically schedule the following:

```
lw   t0, 20(s2)
addu t1, t0, t2
sub  s4, s4, s3
slti t5, s4, 2
```
-
note that the subtraction can be performed alongside the load:
- independent instructions
- thus allow the CPU to execute instructions out of order to avoid stalls
- but still commit results to register in order
-
dynamically scheduled CPU overview:
- a phase where we fetch and decode instructions in-order
- do not read from registers yet
- a resource allocation phase:
- renaming logical registers to physical registers through a register allocation table (RAT):
- though the number of logical registers is bounded by the ISA eg. register number bits, the number of physical registers is only bounded by hardware limitations
- modern machines use around 100 physical registers for renaming
- note that since this is an alternative to static instruction reordering in the compiler, the hardware is essentially performing limited liveness analysis based on the stream of writes it is receiving
- overlapping instructions uses the additional register space in order to free up the antidependencies
- allows us to do more in parallel
- set up an in-order structure called the reorder buffer (ROB)
- provides the illusion to the system that instructions are executed in-order
- a phase of out-of-order execution:
- a series of reservation stations connected to functional units
- speculative execution that doesn't commit any changes to registers
- an in-order commit phase:
- using the ROB to commit in-order
- should handle any incorrect speculative executions and recover by flushing accordingly
-
implementing out-of-order execution:
- have several reservation stations for instructions to sit and wait before being used in a functional unit eg. an ALU:
- ie. staging areas for an instruction to wait until its register operands are ready
- when an operation completes, it will notify all the reservation stations that its value is ready
- instructions are done out-of-order, depending on when dependencies are ready
- on instruction issue to reservation station:
- if operand is available in register file or reorder buffer, copy it to the station
- operand register is no longer required, can be overwritten
- if operand is not yet available, it will be provided later to the reservation station by a function unit
- register update may not be required
-
dynamic speculation:
- CPU can perform aggressive speculation due to the new commit phase
- eg. predict branches, reorder loads
- avoid committing until the correct outcome is determined ie. the speculation is cleared
- ie. rampant speculation in execution stage is validated in the commit phase
\newpage{}
-
the memory hierarchy is typically made up of the following tiers of memory:
-
static RAM (SRAM):
- 0.5ns to 2.5ns delay, between $2000 and $5000 per GB
- cache memory attached to the CPU
- may have multiple cache sizes in the hierarchy, eg. L1, L2, L3 caches
-
dynamic RAM (DRAM):
- 50ns to 70ns delay, between $20 and $75 per GB
- typically stored as charge in a capacitor
- memory reads are destructive, must rewrite after reading
- volatile, loses data when power is lost
- middling latency and capacity
-
magnetic disk:
- 5ms to 20ms delay, up to $2 per GB
- disk is a nonvolatile, rotating magnetic storage
- mechanical latency
- flash storage is a nonvolatile semiconductor storage:
- 100x to 1000x faster than disk and takes lower power, but more expensive per GB
- replacing magnetic disks in some scenarios
- difficult to achieve an ideal memory with access time of SRAM with the capacity and cost efficiency of disk
- no perfect memory solution, so this motivates us to exploit the memory hierarchy to its fullest
-
memory developments over time:
- DRAM cost has decreased significantly, but latency has not scaled as dramatically
- phenomenon known as the memory wall
- CPU transistors are scaling faster than memory development is
- bottleneck of getting data to the processor fast enough
-
in the principle of locality:
- programs access a small proportion of their address space at any time
- in temporal locality:
- items accessed recently are likely to be accessed again soon
- eg. instructions in a loop, induction variables
- in spatial locality:
- items near those accessed recently are likely to be accessed soon
- eg. sequential instruction access, array data
- thus to take advantage of locality:
- when we access something, we can bring it in a certain granularity up a layer of the hierarchy
- addresses spatial and temporal locality
- want to maximize hits ie. accesses satisfied by upper levels, and minimize misses
- note that locality depends on memory access patterns:
- greatly depends on the specific software, eg. algorithm behavior
- in addition to compiler optimizations for memory access
-
caching takes advantage of locality to give the illusion of fast, large memories
- higher levels of the memory hierarchy cache data from lower levels that would be dramatically slower to access
-
types of misses and related cache design tradeoffs:
-
compulsory misses AKA cold start misses occur at the first access to a block
- may be solved with prefetching
-
capacity misses occur due to finite cache size:
- a replaced block is later accessed again
- if a fully associative cache of the same size with the same accesses would also have missed, it is a capacity miss
- may be solved by increasing cache size
-
conflict misses AKA collision misses occur in a non-fully associative cache:
- due to competition for entries in a set
- if a fully associative cache of the same size with the same unique accesses would have hit, it is a conflict miss
- may be solved by increasing associativity
- cache design considerations:
- increasing the cache size
- decreases capacity misses, but may increase access time and power requirements
- increasing the associativity
- decreases conflict misses, but may increase access time and power even more dramatically
- increasing block size
- decreases compulsory misses, but increases the miss penalty and may also increase miss rate due to pollution
-
common principles that apply at all levels of the memory hierarchy:
- block placement:
- determined by associativity
- higher associativity reduces miss rate, but increases complexity, cost, and access time
- finding a block:
- different associativities require a different number of tag comparisons
- reduce comparisons to reduce cost
- note that in virtual memory, we have a fully associative cache that requires zero tag comparisons:
- ie. the page table itself acts as a fully associative cache
- mapping the entire virtual space to a physical address space, so no need for tag checks
- no real miss rate, just a page fault that can occur to bring a block up into main memory
- but requires an additional TLB cache on top of virtual memory, which does use tags
- replacement on a miss:
- eg. LRU, random
- virtual memory typically uses an LRU approximation, since exact LRU would require too much hardware support
- write policies:
- in write-through, update both levels
- simpler replacement, but may require write buffer
- in write-back, update upper level only:
- update lower level when block is replaced
- need to keep more state
- with virtual memory, only write-back is feasible given disk write latency
-
cache memory is the level of the memory hierarchy closest to the CPU:
- caches are implemented with SRAM
- usually three levels of cache L1, L2, L3
- in increasing size and delay
- how do we know if the data is even present, and then where do we look?
- cache is a subset of memory
- on cache hits, the CPU can proceed normally
- however, on a cache miss, the CPU pipeline has to be stalled while the block is fetched from the hierarchy
- misses also occur in instruction cache accesses
- note that advanced, out-of-order CPUs can execute instructions during cache miss:
- much harder to analyze, since effect of miss depends on the program data store
- typically use system simulation
-
multilevel cache considerations:
- in the primary cache, want to achieve the minimal hit time
- for lower caches such as L2:
- focus on low miss rate to avoid main memory access
- hit time has less overall impact since cache capacity is larger than primary cache
- thus L1 cache is usually smaller than other caches
- L1 block size is also typically smaller than the L2 block size
-
block size considerations:
- the block size determines the size of the byte offset field
- eg. for a 32 byte block, need $\log_2 32 = 5$ bits for the offset
- eg. for a 32 byte block, need
- with a larger block size:
- can exploit more spatial locality by bringing in more data on a miss
- however, with a fixed-sized cache, larger blocks means fewer total blocks in the cache at a time:
- leads to more competition and potentially an increased miss rate
- ie. causes pollution of the cache
- larger miss penalty when hardware has to pull in more data into memory
- critical-word-first ie. moving the requested memory words first can help mitigate this
-
example address subdivision of a 32-bit address:
- lower 2 bits are the byte offset:
- indicates which particular byte offset within the block we want to read out
- not used by the cache access, since the cache holds data in a block granularity
- used later when actually reading the block
- next 10 bits are the index:
- number of bits used in the index is based on the size of the cache
- need 10 bits to index into a cache with
$2^{10} = 1024$ elements
- last 20 bits are the tag:
- uses the remaining bits
- compares against the cache tag to see if desired block is already in cache
- also need to AND with the valid bit to check cache hits
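The field extraction above can be sketched in a few lines (a minimal sketch assuming the example's 4-byte blocks and 1024-entry cache; the function name is illustrative):

```python
def split_address(addr, block_bytes=4, num_blocks=1024):
    # field widths for the 32-bit example: 2 offset bits (4-byte blocks),
    # 10 index bits (1024 entries), and the remaining 20 bits as the tag
    offset_bits = (block_bytes - 1).bit_length()   # log2(4) = 2
    index_bits = (num_blocks - 1).bit_length()     # log2(1024) = 10
    offset = addr & (block_bytes - 1)              # which byte in the block
    index = (addr >> offset_bits) & (num_blocks - 1)   # which cache entry
    tag = addr >> (offset_bits + index_bits)       # compared against stored tag
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```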
-
how do we handle stores to data held in cache?
- ie. handling write hits
- the block is held in both cache and somewhere else in the memory hierarchy
- if we only update the block in cache, cache and memory become inconsistent
- we can perform write-through ie. update memory as well:
- writes take much longer, so the CPI tanks
- but no delay on evictions
- a write buffer that holds data waiting to be written to memory can help:
- CPU continues immediately, and only stalls on a write if the write buffer is already full
- alternatively, we can perform write-back:
- set a dirty bit that indicates whether the block has been updated
- ie. accumulate modifications over time
- actually write the block back when it is evicted from the cache
- different layers of the hierarchy may use different write policies
- typically higher levels like the L1 use write-through, while lower levels like the L2 use write-back
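The traffic difference between the two policies can be contrasted with a toy counter sketch (illustrative only: a single-block "cache", no timing model):

```python
def write_through(stores):
    # every store updates memory as well as the cache
    mem_writes = 0
    for _ in stores:
        mem_writes += 1
    return mem_writes

def write_back(stores):
    # stores accumulate in the cache; memory is written once on eviction
    dirty = False
    mem_writes = 0
    for _ in stores:
        dirty = True             # mark the block modified
    if dirty:
        mem_writes += 1          # single write-back when the block is evicted
    return mem_writes

wt = write_through(range(10))    # 10 memory writes
wb = write_back(range(10))       # 1 memory write on eviction
```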
-
what happens on a write miss?
- ie. writing to something that is not in cache
- may not want to bring data up into the cache, since there is not necessarily locality to exploit between loads and stores
- may unnecessarily pollute the cache
- in write-allocate:
- allocate a block in the cache on miss, bring in the data, and modify the data
- data modification will follow the chosen write hit policy for the cache ie. write-through vs. write-back
- useful when there is high storage locality between loads and stores
- in write-around:
- don't bring the data into the cache and instead just write to the next level of the memory hierarchy
- does not pollute the cache
- typically, write-through is paired with write-around
- while write-back will use write-allocate since it needs to mark the dirty bit in the cache
-
ex. Measuring cache performance:
- given the following:
- instruction cache miss rate of 2%
- data cache miss rate of 4%
- miss penalty of 100 cycles
- base CPI of 2
- loads and stores make up 36% of instructions
- we can find the miss cycles per instruction:
$$\textit{memory stall cycles} = \frac{\textit{instructions}}{\textit{program}} \times \frac{\textit{misses}}{\textit{instruction}} \times \textit{miss penalty}$$
- for the instruction cache, $0.02 \times 100 = 2$
- for the data cache, $0.36 \times 0.04 \times 100 = 1.44$
- actual CPI is $2 + 2 + 1.44 = 5.44$
- ideal CPU is $\frac{5.44}{2} = 2.72$ times faster
- we can see that as CPU performance increases, the miss penalty becomes much more significant
- needs to be evaluated when determining system performance
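The miss-cycle accounting above is easy to verify (values taken from the example):

```python
base_cpi = 2
miss_penalty = 100
inst_stalls = 0.02 * miss_penalty          # 2 stall cycles per instruction
data_stalls = 0.36 * 0.04 * miss_penalty   # 1.44 stall cycles per instruction
actual_cpi = base_cpi + inst_stalls + data_stalls   # 5.44
slowdown = actual_cpi / base_cpi           # ideal CPU is 2.72x faster
```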
-
ex. Measuring the average memory access time (AMAT):
- hit time is another important metric
- to calculate AMAT:
$$\textit{AMAT} = \textit{hit time} + \textit{miss rate} \times \textit{miss penalty}$$
- given the following:
- CPU with 1ns clock
- hit time of 1 cycle
- miss penalty of 20 cycles
- instruction cache miss rate of 5%
- $\textit{AMAT} = 1 + 0.05 \times 20 = 2$ nanoseconds
- ie. an average memory access time of 2 cycles
-
ex. Measuring AMAT for multilevel caches:
- given the following:
- block size of 32 bytes
- L1 cache with 20% miss rate, latency of 2 cycles
- L2 cache with 10% miss rate, latency of 12 cycles
- L3 cache with 5% miss rate, latency of 30 cycles
- memory with latency of 100 cycles
-
$\textit{AMAT} = 2 + 0.2 \times (12 + 0.1 \times (30 + 0.05 \times 100)) = 5.1$ cycles
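The nested multilevel calculation above unrolls naturally from the memory upward, since each level's miss penalty is the AMAT of the level below it (a sketch; the function name is illustrative):

```python
def multilevel_amat(levels, memory_latency):
    # levels: (miss_rate, hit_latency) pairs from L1 downward
    # AMAT_i = latency_i + miss_rate_i * AMAT_(i+1), ending at main memory
    cost = memory_latency
    for miss_rate, latency in reversed(levels):
        cost = latency + miss_rate * cost
    return cost

# L1: 20% @ 2 cycles, L2: 10% @ 12, L3: 5% @ 30, memory: 100 -> 5.1 cycles
total = multilevel_amat([(0.2, 2), (0.1, 12), (0.05, 30)], 100)
```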
-
ex. Measuring AMAT for non-unified caches:
- given the following constraints:
- L1 instruction cache with 10% miss rate, latency of 1 cycle
- L1 data cache with 15% miss rate, latency of 1 cycle
- L2 cache with 20% miss rate, latency of 15 cycles
- memory with latency of 150 cycles
- 20% of instructions are loads and stores
- $\textit{AMAT} = (1 + 0.1 \times (15 + 0.2 \times 150) + 0.2 \times (1 + 0.15 \times (15 + 0.2 \times 150))) \times \frac{1}{1.2} \approx 5.88$ cycles
- weighted average over the 1.2 memory accesses per instruction
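The weighted average works out as follows (a sketch using the stated 150-cycle memory latency; variable names are illustrative):

```python
# average cost of an L1 miss: L2 hit time plus 20% chance of going to memory
l1_miss_cost = 15 + 0.2 * 150            # 45 cycles
amat_inst = 1 + 0.10 * l1_miss_cost      # 5.5 cycles per instruction fetch
amat_data = 1 + 0.15 * l1_miss_cost      # 7.75 cycles per data access
# each instruction makes 1 fetch, and 20% make one additional data access,
# so there are 1.2 memory accesses per instruction on average
weighted = (amat_inst + 0.2 * amat_data) / 1.2   # ~5.88 cycles per access
```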
-
ex. Measuring the impact of multilevel caches:
- given the following:
- CPU with base CPI of 1 and a clock rate of 4GHz
- miss rate per instruction of 2%
- main memory access time of 100ns
- with just a primary L1 cache:
\begin{align*}
\textit{miss penalty} &= \frac{100 ns}{0.25 ns} = 400 \textit{ cycles}\\
CPI &= 1 + 0.02 \times 400 = 9
\end{align*}
- with an additional L2 cache with 5ns access time and 10% local miss rate:
\begin{align*}
\textit{miss penalty} &= \frac{5 ns}{0.25 ns} = 20 \textit{ cycles}\\
CPI &= 1 + 0.02 \times (20 + 0.1 \times 400) = 2.2
\end{align*}
- performance increase of $\frac{9}{2.2} \approx 4.1$ times
-
ex. Full CPI calculation:
- pipeline and instruction mix details:
- 5 stage pipeline with full forwarding
- branches resolved in the EX stage, branch predictor predicts not-taken
- 50% R-type, 20% loads, 10% stores, 20% branches
- dependencies follow a load for 20% of loads
- 25% of branches are taken
- store misses do not stall the pipeline
- memory details:
- unified L1 cache for both instruction and data with 1 cycle delay
- L1-I misses 10%, L1-D misses 5%
- L2 misses 5%, has 20 cycle delay
- memory has 100 cycle delay
- need to calculate the base CPI from the pipeline and the CPI incurred by the memory hierarchy to find the total CPI:
\begin{align*}
\textit{BCPI} &= \textit{CPI}_{\textit{peak}} + T_{\textit{load-use hazard penalty}} + T_{\textit{branch hazard penalty}}\\
&= 1 + 0.2 \times 0.2 \times 1 + 0.2 \times 0.25 \times 2 \\
&= 1.14 \\
\textit{MCPI} &= \textit{MCPI}_{\textit{inst}} + \textit{MCPI}_{\textit{data}} \\
&= 1 \times 0.1 \times (20 + 0.05 \times 100) + 0.2 \times 0.05 \times (20 + 0.05 \times 100) \\
&= 2.75 \\
\textit{CPI} &= \textit{BCPI} + \textit{MCPI} = 3.89
\end{align*}
- note that the 1 cycle delay of the L1 cache defines the pipeline stages used
- if the L1 instruction cache required more cycles, IF would have to be pipelined across multiple stages
- in addition, the percentage of loads in the instruction mix is necessary to factor into the MCPI
- importantly, the AMAT calculation is not the same as the MCPI calculation
- MCPI ignores the time taken on a hit
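The BCPI and MCPI terms above can be tallied in code (values taken from the example; note that only the 20% loads contribute to MCPI since store misses do not stall):

```python
# base pipeline CPI: peak + load-use stalls + branch mispredict flushes
bcpi = 1 + 0.2 * 0.2 * 1 + 0.2 * 0.25 * 2      # 1.14

# average cost of an L1 miss: L2 delay, plus 5% chance of going to memory
l1_miss_cost = 20 + 0.05 * 100                 # 25 cycles
mcpi_inst = 1.0 * 0.10 * l1_miss_cost          # every instruction fetches
mcpi_data = 0.2 * 0.05 * l1_miss_cost          # 20% loads, 5% L1-D miss rate

cpi = bcpi + mcpi_inst + mcpi_data             # 3.89
```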
-
ex. Full CPI calculation where the L1 cache now has a 2 cycle delay:
- the IF stage is now broken into two stages
- increased size of the cache and decreased miss rate from 10% to 5%:
- there are now 3 stages to flush for incorrect branch predictions
- load-use hazards are not affected
- given the same statistics from the previous example:
\begin{align*}
\textit{BCPI} &= \textit{CPI}_{\textit{peak}} + T_{\textit{load-use hazard penalty}} + T_{\textit{branch hazard penalty}}\\
&= 1 + 0.2 \times 0.2 \times 1 + 0.2 \times 0.25 \times 3 \\
&= 1.19 \\
\textit{MCPI} &= \textit{MCPI}_{\textit{inst}} + \textit{MCPI}_{\textit{data}} \\
&= 1 \times 0.05 \times (20 + 0.05 \times 100) + 0.2 \times 0.05 \times (20 + 0.05 \times 100) \\
&= 1.5 \\
\textit{CPI} &= \textit{BCPI} + \textit{MCPI} = 2.69
\end{align*}
-
ex. Full CPI calculation where the MEM stage is broken into two stages:
- reduces miss rate for L1 data cache from 5% to 2%
- the load-use hazard now has a two cycle stall penalty:
- introduces a new load-independent-use hazard with a single cycle penalty
- need to know this new hazard percentage eg. 10%
- given the same statistics from the earlier example:
\begin{align*}
\textit{BCPI} &= \textit{CPI}_{\textit{peak}} + T_{\textit{load hazard penalties}} + T_{\textit{branch hazard penalty}}\\
&= 1 + 0.2 \times 0.1 \times 1 + 0.2 \times 0.2 \times 2 + 0.2 \times 0.25 \times 2 \\
&= 1.2 \\
\textit{MCPI} &= \textit{MCPI}_{\textit{inst}} + \textit{MCPI}_{\textit{data}} \\
&= 1 \times 0.1 \times (20 + 0.05 \times 100) + 0.2 \times 0.02 \times (20 + 0.05 \times 100) \\
&= 2.6 \\
\textit{CPI} &= \textit{BCPI} + \textit{MCPI} = 3.8
\end{align*}
-
in a direct mapped cache:
- has the lowest overhead to implement
- the location is determined by the address:
- for a particular address, it is direct mapped to only one choice
- `block_address % blocks_in_cache` ie. similar to a hash table
- here, multiple memory addresses may map to the same cache block
- how to know which particular block is stored in a cache location?
- store part of the block address as well as the data in the tag
- the valid bit determines if there is no data in a location, initially 0
- on a cache miss, we bring in the data from the memory hierarchy and update the metadata in the cache
- on a cache hit after checking the valid bit and tag, we can take the data directly from the cache
- the key limitation of a direct mapped cache:
- we may have a cache miss that evicts ie. overwrites a mapping in the cache because it shares the same mapping
- even though there may still be unused mapping entries in the cache
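The mapping can be sketched concretely; the geometry here (64 blocks of 16 bytes, 32-bit addresses) is a made-up illustration, not a specific machine:

```python
# Sketch of direct-mapped address splitting (hypothetical geometry).
BLOCK_SIZE = 16   # bytes per block
NUM_BLOCKS = 64   # blocks in the cache

def split_address(addr):
    """Split an address into (tag, index, offset) for a direct-mapped cache."""
    offset = addr % BLOCK_SIZE
    block_address = addr // BLOCK_SIZE
    index = block_address % NUM_BLOCKS   # which cache entry to use
    tag = block_address // NUM_BLOCKS    # identifies which block is stored there
    return tag, index, offset

# two addresses that share an index collide and evict each other,
# even if the rest of the cache is empty:
t1, i1, _ = split_address(0x0000)
t2, i2, _ = split_address(0x0400)        # 64 blocks * 16 bytes later
assert i1 == i2 and t1 != t2
```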
-
associativity determines how many entries in the cache can be used for a particular block in memory:
- in a direct map, there is only 1 location in the cache for a given memory block
- less flexibility, but only have to check a single location
- in a fully associative cache:
- a given block can go in any cache entry
- total flexibility, but requires all entries to be searched at once
- requires an expensive comparator per entry as well
- typically only used with very specific, small caches
- thus the index field is not needed, since every entry will be checked
- in an $n$-way set associative cache:
  - the cache is organized into sets:
    - each set contains $n$ entries
    - a particular address can go into any one of these entries
  - fully associative and direct mapped caches lie at either end of the $n$-way set associativity spectrum
  - similar to a hash table with a fixed number of buckets
  - search all entries in a given set at once
    - only need $n$ comparators
  - thus the index maps to a particular set rather than a single cache entry
- increased associativity decreases miss rate
- but with diminishing returns
-
associativity also determines the replacement policy used:
- the replacement policy chooses which entry to evict from the cache
- in a direct map, there is no choice
- must evict out the current cache mapping
- in set associative caches, there are at least two choices for what to evict:
- first and foremost, prefer non-valid entries, if there is one
- in least-recently used (LRU):
- choose the entry unused for the longest time
- simple for 2-way, manageable for 4-way, much more difficult beyond that
- can use the clock algorithm as a more efficient heuristic
- in a random replacement policy
- approximates the same performance as LRU for high associativity
- FIFO, LIFO, etc.
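A minimal sketch of LRU within one 4-way set, keeping a list ordered from most- to least-recently used (tags only, no data; the access pattern is invented for illustration):

```python
# Minimal LRU sketch for one 4-way set.
def access(set_tags, tag, ways=4):
    """Returns (hit, evicted_tag) for an access to `tag`."""
    if tag in set_tags:
        set_tags.remove(tag)
        set_tags.insert(0, tag)        # hit: move to MRU position
        return True, None
    evicted = None
    if len(set_tags) == ways:          # set full: evict the LRU entry
        evicted = set_tags.pop()       # (non-valid slots are used first)
    set_tags.insert(0, tag)
    return False, evicted

s = []
for t in [1, 2, 3, 4]:
    access(s, t)                       # fill the set: misses, no evictions
access(s, 1)                           # touch 1, so 2 is now the LRU entry
hit, evicted = access(s, 5)            # miss in a full set
assert not hit and evicted == 2
```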
-
main memory can act as a cache for secondary ie. disk storage:
- managed jointly by CPU hardware and the OS
- programs are able to share main memory through a technique called virtualization:
- each has a private virtual address space holding its frequently used code and data
- an additional level of indirection
- protected from other programs
- the CPU and OS are responsible for translating virtual addresses to physical addresses:
- a virtual memory (VM) block is called a page
- distinct from the smaller block granularity
- a VM translation miss is called a page fault
- in caching, hardware moved blocks between L1, L2 caches and memory
- in VM translation, the OS explicitly moves pages between memory and disk
-
responsibilities of VM:
- protection between programs
- possible sharing between programs
- independent compilation
- hardware independence
-
through address translation, the virtual addresses assigned to a program by the linker are translated into physical addresses during program execution:
- addresses are made up of a virtual page number and a virtual page offset
- need to convert the virtual page number into a physical page number:
- page offset remains the same after address translation
- may map between different memory sizes eg. from a 32-bit virtual address to a 30-bit physical address
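The split-and-swap of the page number can be sketched as follows; the $2^{12}$-byte page size and the page-table contents are made-up illustration values:

```python
# Sketch of address translation with 2^12-byte pages.
PAGE_OFFSET_BITS = 12

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS                  # virtual page number
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)   # unchanged by translation
    ppn = page_table[vpn]                            # KeyError would be a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset        # physical address

page_table = {0x5: 0x42}                             # maps VPN 0x5 -> PPN 0x42
assert translate(0x5123, page_table) == 0x42123      # offset 0x123 is preserved
```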
-
on a page fault, the page must be fetched from disk:
- takes millions of clock cycles
- handled by the OS vs. caches misses are handled by hardware
- minimizing the page fault rate is done in software:
- fully associative placements
- smart replacement algorithms
- on the other hand, hardware is responsible for making page hits faster
-
the page table (PT) stores placement information for virtual memory:
- is an array of page table entries (PTE), indexed by virtual page number
- is a memory-resident structure that is pointed to by the page table base register (PTBR) in the CPU
- the PTBR is a process-specific pointer that is different for each unique running process
- if the page is present in memory:
- PTE stores the physical page number
- alongside other status bits eg. valid, present, referenced, dirty, etc.
- to access the page table and find a specific entry (when pages are not in the TLB), a hardware structure called a page table walker is used:
- scans PT based on PTBR
- walker may find PTE in cache, or go all the way to memory
- some systems do not cache PTEs outside of the TLB, but caching PTEs in the L1 or L2 cache will cache the page table based on a block granularity, which can benefit from locality
- rather than the TLB caching on a translation level granularity
- if the page is not present, a page fault occurs
- PTE refers to location in swap space on disk
- to reduce the page fault rate, LRU replacement is typically used:
- the referenced bit AKA use bit is set to 1 on page access
- a page with a zero reference bit has not been used recently
- however, the page table itself may not always fit in memory:
- would have to page the page table
- ie. incur multiple loads for a single load operation
-
disk writes take millions of cycles:
- write-through is impractical, and write-back is usually used instead
- amortize writes by collecting them in a write buffer
-
how do we make translation fast?
- address translation appears to require extra memory references, since the PT is held in memory
- one to access the PTE, and then another for the actual data access
- but page table accesses have good locality:
- due to the fact that pages are huge
- use a fast cache of PTEs within the CPU in a translation lookaside buffer (TLB)
- TLBs will typically be smaller than regular caches, since they only have to store translations
- typical systems use multiple levels of TLBs
- TLB is process-specific, and will need to be flushed on context switches
- can sometimes be shared to keep the cache hot without clearing it
- TLB is like any other hardware cache, has certain associativity and size
- use index and tag to find entries
- typical TLB statistics:
- 16-512 PTEs
- 0.5-1 cycle on a hit
- 10-100 cycles on misses
- 0.01-1% miss rate
- handling TLB misses:
- if page is in memory, the page table walker needs to load the PTE from memory and retry
- a page fault will occur if page is not in memory:
- OS handles fetching the page and updating the page table
- then, restart the faulting instruction
- note that TLB accesses, page table accesses, and address translations are handled completely by hardware except for page faults
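The TLB-in-front-of-the-page-table flow above can be sketched with dicts standing in for the hardware structures (a real TLB has a fixed size and associativity, which this ignores):

```python
# Sketch of a TLB lookup backed by the page table.
def translate_vpn(vpn, tlb, page_table):
    if vpn in tlb:
        return tlb[vpn], "tlb hit"         # fast path, ~1 cycle
    if vpn in page_table:
        tlb[vpn] = page_table[vpn]         # walker loads the PTE, then retry
        return tlb[vpn], "tlb miss"
    raise KeyError("page fault: OS must fetch the page")   # handled in software

tlb, pt = {}, {7: 0x1234}                  # made-up VPN -> PPN mapping
assert translate_vpn(7, tlb, pt) == (0x1234, "tlb miss")
assert translate_vpn(7, tlb, pt) == (0x1234, "tlb hit")    # now cached in the TLB
```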
-
TLB miss handler:
- TLB miss needs to indicate to the pipeline that the PTE will come from memory
- ie. need to recognize a TLB miss before the destination register is overwritten
- stop the request coming from the TLB cache from actually writing in the case of a load, etc.
- handler then copies PTE from memory to TLB, then restarts instruction
- if page is not present, a page fault would occur
-
there are several ways to use tags with virtual and physical addresses in the TLB and regular caches:
- regular caches cache physical memory values
- regular cache tag uses a physical address ie. physically indexed physically tagged (PIPT):
- need to translate before regular cache lookup occurs
- pay latency of TLB first
- regular cache tag uses a virtual address ie. virtually indexed virtually tagged (VIVT):
- can directly lookup regular cache with virtual address and access the TLB in parallel
- need physical address in case cache misses and to see protection bits from PTE etc.
- complications due to aliasing where there are different virtual addresses for shared physical addresses
- need to augment regular cache with a process specific ID
- can directly lookup regular cache with virtual address and access the TLB in parallel
- in a hybrid approach, regular cache uses a virtual address for the index, but the physical address after translation for the tag:
- ie. virtually indexed, physically tagged (VIPT)
- works if the regular cache index and offset field widths are subsumed within the page offset field
- page offset remains consistent across translation, so regular cache index and offset will not change between the physical and virtual address
- allows us to immediately access the cache while accessing the TLB in parallel
- essentially PIPT, but performing accesses in parallel
- when TLB completes the translation, we can compare the tags in the cache
- tag does change after translation
-
ex. Control flow on a load instruction using VIPT address translation:
- can immediately index into the L1 cache from the virtual address
- in parallel, check the TLB to get the physical address tag to compare with the L1 cache tag:
- on a TLB hit, we successfully get the tag, and can check the L1 cache tag
  - the L1 cache can hit or miss
  - on a hit we are done, while a miss requires going down the memory hierarchy
- on a TLB miss, the PTE is not in cache, and we need to go to the page table
- the page table walker will check the L1 cache for the desired page table entry:
- on a hit, we get the PTE, add it to the TLB, and get the tag to compare with the L1 cache tag
- on a miss, we need to go down the memory hierarchy, to memory
- if not in L1, the walker will check memory for the desired page table entry:
- on a hit, we get the PTE, add it to the L1 cache and then to the TLB, and get the tag to compare with the L1 cache tag
- on a miss, we have a page fault, and the OS has to handle
-
ex. Examining address field views from paging vs. TLB vs. cache perspectives:
- given the following:
- $2^{64}$-byte virtual address space, $2^{34}$-byte physical address space, $2^{16}$-byte pages
- 8-way associative VIPT TLB with 64 entries
- 4-way L1 cache with $2^{12}$ bytes of space and a block size of $2^5$ bytes
- we can infer:
  - virtual memory thus has $\frac{2^{64}}{2^{16}} = 2^{48}$ pages, and physical memory has $\frac{2^{34}}{2^{16}} = 2^{18}$ pages
  - the TLB has $\frac{64}{8} = 8$ indices
  - the L1 cache has $\frac{2^{12}}{2^5 \times 2^2} = 2^5$ indices
- addresses from paging perspective:
- virtual addresses have 16 bits to specify the page offset and 48 bits for a virtual page number
- physical addresses also have 16 bits to specify the page offset and 18 bits for a physical page number
- each page table entry will need to hold at least 18 bits for the PPN, along with 6 bits for protection information for 24 bits or 3 bytes total:
- the PT overall has $2^{48}$ entries and size $2^{48} \times 3$ bytes
  - very large, would have to be paged
- addresses from the TLB perspective:
- virtual addresses still have the lower 16 bits for the page offset
  - the next 3 bits are the index, and the last 45 bits are the tag
- on a TLB hit, the TLB delivers the 18-bit PPN and propagates the 16-bit offset to form the physical address
- addresses from the L1 cache perspective:
  - virtual addresses have the lower 5 bits for the block offset
    - the next 5 bits are the index; the remaining $48 + (16 - 10) = 54$ bits are left unused since the cache is VIPT
  - physical addresses have the lower 5 bits for the block offset
    - the next 5 bits are the index; the remaining $18 + (16 - 10) = 24$ bits are used as the tag
  - the cache is accessed with either 5-bit index from physical or virtual address (the same value, since the page offset is the same) together with the 24-bit tag from the physical address
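The field widths in this example can be recomputed mechanically; all parameters below come from the example itself:

```python
# Recomputing the address field widths of the example (values in bits).
import math

VA_BITS, PA_BITS, PAGE_OFFSET = 64, 34, 16
vpn_bits = VA_BITS - PAGE_OFFSET                   # 48-bit virtual page number
ppn_bits = PA_BITS - PAGE_OFFSET                   # 18-bit physical page number

tlb_index = int(math.log2(64 // 8))                # 64 entries, 8-way -> 3 bits
tlb_tag = vpn_bits - tlb_index                     # 45 bits

block_offset = 5                                   # 2^5-byte blocks
cache_index = int(math.log2(2**12 // (2**5 * 4)))  # bytes/(block * ways) -> 5
cache_tag = PA_BITS - cache_index - block_offset   # physical tag: 24 bits

assert (vpn_bits, ppn_bits) == (48, 18)
assert (tlb_index, tlb_tag) == (3, 45)
assert (cache_index, cache_tag) == (5, 24)
```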
-
in multiprocessor or multicore shared caches:
- data is migrated to many local caches in a multiprocessor design
- reduces bandwidth for shared memory
- read-shared data is replicated
- reduces contention for access
- these patterns cause cache coherency issues
-
cache coherence deals with the multiple CPU case where multiple cores share a physical address space:
- each has a private cache, while sharing a lower level cache and/or memory
- note that the private caches may have different architectures, but block size will usually be the same
- the changes made by one CPU must be seen by the other, ie. trickle up levels of the memory hierarchy to cores and their private caches:
- ie. the caches must be kept consistent with one another
- eg. if the private cache is write back, it will not flush its changes on write to the lower level of the shared hierarchy
- eg. if the private cache is write through, the shared hierarchy will be updated, but if there is a copy of the value in the cache of the other core, it will not reflect the updated value
- in contrast to cache consistency which requires trickling down levels of the memory hierarchy
- thus in coherence, we want reads to always return the most recently written value:
- if $P$ writes $X$ and then $P$ reads $X$ with no intervening writes, the read should return the written value
- if $P_1$ writes $X$ and then $P_2$ reads $X$ sufficiently later, the read should return the written value
- if $P_1$ writes $X$ and then $P_2$ writes $X$, all processors see the writes in the same order
  - end up with the same final value for $X$
-
cache coherence protocols include snooping, snarfing, and directory-based approaches
-
in a snooping protocol, each cache monitors reads and writes on a snoop bus, thus placing the burden on the caches:
- caches get exclusive access to a block when it is written
- each core cache monitors the snoop bus:
- bus listening does not increase latency; we can add additional ports into the cache to perform writes and modifications depending on inputs from the snoop bus
- has additional power and area requirements for the cache, but no increased latency
- the cache broadcasts an invalidate message for the specific written block and its block address on the snoop bus
- other caches detect the snoop bus message, and set the valid bits of that block to 0
- subsequent read in another cache misses:
- owning cache supplies updated value
- typically the private caches will be write through with snooping so that we end up writing to a shared cache
- on cache miss, can take value from a consistent shared cache level
- to prevent issues with simultaneous writes, an arbiter in the hardware is used to ensure only one core writes at a time
-
pros:
- best suited for a few number of cores and frequent sharing between cores
-
cons:
- major issues with scalability
- with very many cores, may need multiple snoop buses
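The write-invalidate mechanics above can be sketched in a few lines; the block address and values mirror a later example in these notes, and the classes are invented for illustration:

```python
# Sketch of write-invalidate snooping: a writing cache broadcasts the
# block address on the snoop bus; every other cache clears its valid bit.
class SnoopingCache:
    def __init__(self):
        self.blocks = {}                  # block address -> [valid, value]

    def write(self, addr, value, bus):
        self.blocks[addr] = [True, value]
        for other in bus:                 # broadcast the invalidate message
            if other is not self and addr in other.blocks:
                other.blocks[addr][0] = False

c0, c1 = SnoopingCache(), SnoopingCache()
bus = [c0, c1]
c0.blocks[4096] = [True, 42]              # both cores start with a copy
c1.blocks[4096] = [True, 42]
c0.write(4096, 30, bus)                   # core 0 stores a new value
assert c0.blocks[4096] == [True, 30]
assert c1.blocks[4096][0] is False        # core 1's copy is now invalid
```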
-
in another technique called the snarfing protocol, other caches snoop the new data off the bus and update their copy of the overwritten block, rather than just invalidating it
- same scaling issues as snooping protocol
-
in directory-based protocols, caches and memory record the sharing status of blocks in a directory:
- the directory maps cache blocks to a state
- eg. each cache block has an MSI state of modified, shared, or invalid
- handshaking protocol occurs between the directory and its participating core:
  - core can send messages like `GetM`, `GetR` to the directory to request to modify or read a certain address
    - directory must modify the corresponding cache block state accordingly, based on a state diagram
  - directory can send messages like `WB`, `WBI`, `INV`, `ACK`
    - ie. write back, write back invalidate, invalidate, acknowledgement
- key difference with snooping is that in snooping, each core is responsible for their own cache to maintain their coherency:
- with directories, we have an intermediate service that has the responsibility of maintaining all cores' coherency
- the directory approach adds additional latency for shared variables compared to snooping, but makes the unshared cases faster
- the directory maps cache blocks to a state
-
ex. Cache consistency and coherence example:
- given the following:
  - two L1 WT (write-through), WAR (write-around) caches for two different cores
    - initially both hold address `4096` with a value of `42`
  - an L2 WB (write-back), WAL (write-allocate) cache shared between the cores
  - snooping done on the write to L2
- after running `sw t0, 4(s0)` on core 0, if `t0 = 30` and `s0 = 4092`:
  - first L1 sets the value in its cache to `30`
  - L1 is write through, so the L2 cache is also updated:
    - L2 is write back, so we update the corresponding dirty bit
    - memory is not updated
  - writing to L2 is done over the snoop bus
    - on the snoop bus, we send an invalidate message for address `4096`
    - the second L1 sets its valid bit to 0 for `4096`
- after running `sw t0, 8(s1)` on core 1, if `t0 = 60` and `s1 = 4088`:
  - L1 is write around, so we write into the L2 directly
  - writing to L2 is done over the snoop bus
    - on the snoop bus, we send an invalidate message again for address `4096`
    - the first L1 sets its valid bit to 0 for `4096`
  - now, neither core has the correct address value in its cache
    - it is only updated in the L2
\newpage{}
-
the goal is to connect multiple computers to get higher performance:
- go beyond the limits of parallelism in ILP:
- instead use process-level parallelism
- high throughput for independent jobs as well as for single programs running on multiple processors
-
pros:
- scalability
- availability
- power efficiency
- reliability
-
in a multiprocessor system, there are separately packaged chips that communicate via a bus through their pinouts, etc.
- in a multicore system, there is a tighter integration onto the same chip so that multiple cores share parts of the same cache hierarchy
- ie. closer communication with other cores through their caches or through a shared cache
-
with multiprocessor design comes parallel programming:
- pushes the responsibility of extracting parallelism to the programmer
- difficulties:
- partitioning
- coordination
- communications overhead
- note that the sequential portion of the program can limit the speedup, as seen in the following example using Amdahl's law
- ex. With 100 processors, can we achieve 90x speedup?
\begin{align*}
T_{\textit{new}} &= \frac{T_{\textit{parallel}}}{100}+T_{\textit{sequential}} \\
90 &= \frac{1}{1-F_{\textit{parallel}}+\frac{F_{\textit{parallel}}}{100}} \\
F_{\textit{parallel}} &= 0.999
\end{align*}
- only if the sequential part of the program takes up 0.1% of the original time
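The numbers above follow directly from Amdahl's law; a quick check, using the example's own values:

```python
# Amdahl's law: speedup with N processors when a fraction F of the
# original execution time is parallelizable.
def speedup(F, N):
    return 1 / ((1 - F) + F / N)

assert speedup(0.999, 100) > 90    # needs 99.9% parallel for ~90x
assert speedup(0.99, 100) < 51     # even 1% sequential caps speedup near 50x
```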
-
ex. Calculating scaling speedup given the following workload, assuming the processors are load balanced:
- sum of 10 scalars (sequential) and a $10 \times 10$ matrix sum (parallelizable):
  - with a single processor, $T = (10 + 100) \times t_{\textit{add}} = 110 t_{\textit{add}}$
  - with 10 processors: $T = 10 \times t_{\textit{add}} + \frac{100}{10}\times t_{\textit{add}} = 20 t_{\textit{add}}$
    - this gives a speedup of $\frac{110}{20} = 5.5$, or 55% of the potential 10x improvement
  - with 100 processors: $T = 10 \times t_{\textit{add}} + \frac{100}{100}\times t_{\textit{add}} = 11 t_{\textit{add}}$
    - this gives a speedup of $\frac{110}{11} = 10$, or 10% of the potential 100x improvement
- alternatively, with a $100 \times 100$ matrix sum instead:
  - with a single processor, $T = (10 + 10000) \times t_{\textit{add}} = 10010 t_{\textit{add}}$
  - with 10 processors: $T = 10 \times t_{\textit{add}} + \frac{10000}{10}\times t_{\textit{add}} = 1010 t_{\textit{add}}$
    - this gives a speedup of $\frac{10010}{1010} = 9.9$, or 99% of the potential 10x improvement
  - with 100 processors: $T = 10 \times t_{\textit{add}} + \frac{10000}{100}\times t_{\textit{add}} = 110 t_{\textit{add}}$
    - this gives a speedup of $\frac{10010}{110} = 91$, or 91% of the potential 100x improvement
- in strong scaling, the problem size is fixed across different numbers of processors
- in weak scaling, the problem size is proportional to number of processors
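The scaling arithmetic above reduces to one formula; times are in units of $t_{\textit{add}}$:

```python
# Reproducing the scaling arithmetic: sequential part runs as-is,
# parallel part is split evenly across the processors.
def exec_time(seq_ops, par_ops, procs):
    return seq_ops + par_ops / procs

assert exec_time(10, 100, 1) == 110       # 10x10 sum, single processor
assert exec_time(10, 100, 10) == 20       # speedup 5.5
assert exec_time(10, 100, 100) == 11      # speedup 10
assert exec_time(10, 10000, 100) == 110   # 100x100 sum: speedup ~91
```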
- in a shared memory multiprocessor:
- hardware provides a single physical address space for all processors
- ie. all processors are connected to memory via an interconnection network
- shared variables are synchronized using locks
- memory access time may be uniform (UMA) or nonuniform (NUMA)
- with nonuniform access time, load balancing becomes more difficult
Ex. Summing 100,000 numbers on 100 processors with UMA:

```c
// partitioning step:
sum[Pn] = 0;                     // each processor has an ID: 0 <= Pn <= 99
for (i = 1000 * Pn; i < 1000 * (Pn + 1); i++)
    sum[Pn] += A[i];             // each processor sums its 1000 elements

// reduction step:
half = 100;
while (half != 1) {
    synch();                     // barrier: wait for all processors
    if (half % 2 != 0 && Pn == 0)
        sum[0] += sum[half - 1]; // conditional sum when half is odd
    half /= 2;
    if (Pn < half)
        sum[Pn] += sum[Pn + half];
}
```
-
to perform a parallel sum:
- partition numbers per processor and perform an initial summation
- then, in the reduction step, divide and conquer to add together the partial sums
- need to synchronize between reduction steps
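The reduction step can be simulated sequentially to confirm it sums correctly; running the "processors" one after another makes `synch()` implicit:

```python
# Simulating the reduction step above for 100 partial sums.
def reduce_sums(sums):
    half = len(sums)
    while half != 1:
        if half % 2 != 0:                 # conditional sum when half is odd
            sums[0] += sums[half - 1]
        half //= 2
        for pn in range(half):            # processor pn adds its partner's sum
            sums[pn] += sums[pn + half]
    return sums[0]

assert reduce_sums(list(range(100))) == sum(range(100))
```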
-
in a message passing approach:
- each processor has private physical address space
- hardware sends and receives messages between processors
- pros:
- more scaling of memory capacity
- no possible monolithic memory bottleneck
- cons:
- no more shared variables
- to perform the same parallel summing of many numbers:
  - first have to distribute the partitions of the numbers to each computer
  - then after summing, have to send messages while some machines receive
    - eg. half send while half receive and add, etc.
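The halving send/receive pattern can be sketched as follows; the `message_passing_sum` helper is an invented name, and a real system would perform the sends and receives over a network rather than a shared list:

```python
# Sketch of the halving reduction for message passing: in each round, the
# upper half of the machines send their partial sum to a lower-half partner.
def message_passing_sum(partials):
    n = len(partials)                    # assumes n is a power of two
    while n > 1:
        n //= 2
        for pn in range(n):              # receiver pn adds its partner's message
            partials[pn] += partials[pn + n]
    return partials[0]

assert message_passing_sum([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```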
-
loosely coupled clusters are an example of message passing:
- network of independent computers, each with private memory and OS
- connected with I/O system eg. internet
- suitable for applications with independent tasks eg. web servers, databases, etc.
- pros:
- high availability
- scalable
- affordable
- cons:
- administration cost
- low interconnect bandwidth
-
instead of running separate threads on separate processors as in multiprocessing, in multithreading we increase the number of threads that can run on a single processor:
- requires replicating program state such as registers, PC, etc.
- as well as fast switching between threads
- note that this is different from the context switches done between time slices of traditional OS concurrency:
- in OS, we are selecting between multiple threads for execution
- in multithreading, utilizing a faster switch among threads currently in the processor to avoid expensive context switch overhead
- ie. more than one thread executing at once on the same processor
- may decrease cache performance with increased resource utilization, but may also increase overall processor utilization
-
in fine-grain multithreading:
- switch threads after each cycle
- interleave instruction execution between multiple threads
- if one thread stalls, others are executed to exploit thread-level parallelism (TLP)
-
in coarse-grain multithreading:
- only switch on long stalls eg. L2 cache miss
- simplifies hardware, but does not hide short stalls eg. data hazards
- eg. to avoid out-of-order execution, branch prediction, prefetching, etc. just switch threads
-
in simultaneous multithreading AKA hyperthreading:
- use a multiple-issue dynamically scheduled processor
- schedule instructions from multiple threads in the same cycle
- even more fine-grained
- instructions from independent threads execute when function units are available
- within threads, dependencies are handled by scheduling and register renaming
- may have issues where one of the threads can slow down considerably compared to the others
-
we can also classify based on the number of instruction streams and data streams:
- a single instruction, single data (SISD) machine is the sequential, default model
  - one stream of instructions and one stream of data
- a multiple instruction, multiple data (MIMD) machine is a multiprocessor machine
  - can run completely different programs, or cooperating threads of a single program (SPMD, single program multiple data)
- in a single instruction multiple data (SIMD) machine, can perform the same instructions on multiple pieces of data
-
SIMD operates elementwise on vectors of data:
- all processors execute the same instruction at the same time
- simplifies synchronization
- reduced instruction control hardware
- works best for highly data-parallel applications
-
vector processors are an example of SIMD that have highly pipelined functional units:
- stream data from or to vector registers to units
- simplifies data-parallel programming
- significantly reduces instruction-fetch bandwidth
- absence of loop-carried dependencies
- eg. in the vector extension for MIPS:
  - 32 vector registers, each holding 64-bit elements
  - `lv`, `sv`: load and store vectors
  - `addv.d`: add two vectors of doubles
  - `addvs.d`: add a scalar to each element of a vector of doubles
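The elementwise semantics of `addv.d` and `addvs.d` can be sketched with plain lists standing in for vector registers (a loop stands in for the SIMD lanes, which in hardware all execute the same add at once):

```python
# Sketch of SIMD-style elementwise vector operations.
def addv(a, b):
    return [x + y for x, y in zip(a, b)]   # every lane does the same add

def addvs(a, s):
    return [x + s for x in a]              # scalar broadcast to all lanes

assert addv([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]) == [11.0, 22.0, 33.0]
assert addvs([1.0, 2.0, 3.0], 5.0) == [6.0, 7.0, 8.0]
```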
-
graphics processing units (GPUs) are specialized processors oriented to 3D graphics tasks
- 3D graphics processing was originally only used in high-end computers
- along with Moore's law, transistor density increased greatly, and 3D graphics processing became the norm
- eg. vertex and pixel processing, shading, texture mapping, etc.
- as a trend, CPU tends to be used more for sequential code, while GPU is used for parallel code
- programming languages and APIs include DirectX, OpenGL, CUDA, etc.
-
GPU architectures:
- processing is highly data-parallel
- GPUs are thus highly multithreaded
- use thread switching to hide memory latency
- graphics memory is wide and high-bandwidth, amenable to SIMD vectorization
- generally made up of multiple streaming multiprocessors (SM) each composed of streaming processors (SP)
-
streaming processors:
- each is fine-grained multithreaded
- a warp is Nvidia's group of 32 threads executed in parallel, SIMD style
- eg. used in their Tesla GPUs
- however GPUs do not fit exactly into a SIMD or MIMD model:
- GPUs allow for conditional execution in a thread ie. not every thread in a warp affects register state
- thus, similar to how a superscalar processor is a dynamic implementation of ILP when compared to static VLIW
- the GPU multiprocessor is a dynamic implementation of data-level parallelism when compared to static SIMD or vector processors