CS520: Instruction‐level Parallelism
Homework 3 – Spring 2020
RULES
1. You are to work individually on the homework. Cooperation is not allowed.
2. Points are assigned to each question to reflect the amount of work involved or its level of difficulty.
Your homework score will be calculated as: score = ଵ௫௨ ௧௦ ൈ .
HOMEWORK PROBLEMS
1. (20 Points) For the following instructions:
PC1: load F3, 0 (R0) // Result: F3 = 100
PC2: load F1, 45 (R1) // Result: F1 = 200
PC3: subd F2, F1, F3 // F2 = F1 – F3
PC4: adddi F3, F4, #1 // F3 = F4 + 1
PC5: multd F3, F4, F5 // F3 = F4 x F5
Use the diagram for your final answer. Show the state of the pipeline (Register File, Reservation
Stations, and Reorder Buffer including head and tail pointers) given that all of the following conditions
hold:
Instructions prior to the ones shown above have retired.
all instructions have been dispatched, and
the ADDD instruction has issued to a FU but has not completed (it is executing in a FU), and
the first load instruction has completed and retired, and
the second load instruction has completed but has not retired.
Note that R* are integer registers while F* are floating‐point registers. Show the state of the pipeline
by filling in the contents of the Register File, Reservation Stations, and Reorder Buffer as needed. Don’t
forget to complete the arrows for the ROB head and tail pointers. You can ignore the value fields except
when it contains tags. Use the following tables as a template for your answer.
Initial Register Map/Alias Table and Register File:
In RF? Tag Value
R0 yes 0x00F0A000
R1 yes 0x00F0B000
F0 yes 10
F1 yes 20
F2 yes 30
F3 yes 40
F4 yes 50
F5 yes 60
Register Map/Alias Table and Register File:
In RF? Tag Value
R0
R1
F0
F1
F2
F3
F4
F5
Reservation Stations for floating‐point units:
Tag src1 ready src1 tag/value src2 ready src2 tag/value
Reorder Buffer:
Entry Dest Result Exception Completed PC
ROB1 head/tail
ROB2
ROB3
ROB4
ROB5
ROB6
ROB7
ROB8
2. (20 Points) Branch Prediction in Superscalar Architecture.
(a) With superscalar and dynamic scheduling, it is possible to execute and complete several
instructions at once. Assume that we have a 4‐wide superscalar architecture with 25‐stage pipeline,
with instruction fetch as the first stage, and commit as the last stage. Assume an architecture in which
branch misprediction is handled like an exception (i.e., at the retire/commit stage). Also assume that
the processor is able to fill the pipeline if there is no branch misprediction. Suppose that one out of
every five instructions is a conditional branch. If the branch misprediction rate is 10%, is the pipeline
full before it needs to be flushed (due to misprediction handling)? If not, what is the required
misprediction rate needed so that the pipeline is full when a branch misprediction is handled?
(b) For the pipeline in part (a), we need to generate one branch prediction approximately every cycle.
Consider a Gshare branch predictor. When a branch prediction has to be made, the actual outcome of
the previous branches is not known yet, and thus have not updated the history register. This makes
the history register out‐of‐date and may not be used for current branch prediction accurately. A
solution to this is to update the history register early, using branch prediction results instead of the
actual outcome of the branch. Discuss the advantages and disadvantages of this scheme.
3. (30 Points) For the following instructions, give the timing diagram:
(1) set F1, avalue
(2) addi F2, F1, 1
(3) multdi F3, F2, x // F3 = F2 * x
(4) load F2, c
(5) addi F1, F2, 2
(6) addi F2, F3, 1
(7) add F1, F1, F2
Suppose that the pipeline has one‐stage fetch (IF), one‐stage decode and dispatch (ID), some execute
stages (explained later), one‐stage write back (WB), and there is no support for precise exception (i.e.
no retirement/commit stage).
Suppose that all instructions take 2 cycles in the execute stage, except the multd which take 4 cycles
in the execute stage (EX). Suppose that memory instructions (load/store) take 2 cycles in an adder unit
(EX) to calculate target addresses and 2 extra cycles in the memory unit (MEM).
Suppose that we have a 2‐way superscalar processor as follows. In each cycle, the processor can fetch
up to 2 instructions and decode up to 2 instructions. It can issue and execute more than 2 instructions
if the functional units are available.
Assume there are unlimited reservation station entries. There is a full forwarding/bypassing network.
There is no register renaming, hence WAR and WAW dependences are not eliminated. To handle WAW
hazard between two instructions, delay the write‐back of the latter instruction until the earlier one
has written back its result. To handle WAR hazard between two instructions, delay the write‐back of
the latter instruction until the earlier one has read all its operands (i.e. has passed the ID stage).
(a) Add assumption of unlimited number of functional units.
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
(b) Suppose we only have 1 adder unit that performs set/addi/add, and 1 multiplier unit that performs
multd. These units are not pipelined, i.e. there can be only one instruction in a functional unit at any
given time. Also, there is no extra adder unit for memory instructions (only one adder).
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
(c) Suppose we only have 1 adder unit that performs set/addi/add, and 1 multiplier unit that performs
multd. Keep the multiplier unit unpipelined, but now suppose that the adder unit is pipelined, i.e.
there can be at most 2 instructions in the adder at any given time, 1 instruction in EX1 stage, and
another in EX2 stage. An exception to this is when there is true dependence, where the first instruction
produces result only after EX2 stage, while the second instruction needs the produced value before it
can enter the functional unit.
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
(d) Suppose we have unlimited number of functional units. However, assume that WAW is solved at
dispatch stage (ID). Hence, to handle WAW hazard between two instructions, delay the execution of
the latter instruction until the earlier one has written back its result.
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
4. (30 Points) You are given the following instructions, the same as in the previous problem.
(1) set F1, avalue
(2) addi F2, F1, 1
(3) multdi F3, F2, x // F3 = F2 * x
(4) load F2, c
(5) addi F1, F2, 2
(6) addi F2, F3, 1
(7) add F1, F1, F2
Suppose that the pipeline has one‐stage fetch (IF), one‐stage decode‐rename‐dispatch (ID), some
execute stages (explained later), one‐stage write back (WB), and one‐stage retirement stage (RE).
Suppose that all instructions take 2 cycles in the execute stage, except the multd which takes 4 cycles
in the execute stage.
Suppose that we have a 2‐way superscalar processor as follows. In each cycle, the processor can fetch
up to 2 instructions and decode up to 2 instructions. It can issue and execute more than 2 instructions
if the functional units are available, but can commit up to 2 instructions per cycle.
Assume there are unlimited reservation station entries. There is a full forwarding/bypassing network.
There is Tomasulo+ROB style register renaming, hence all WAR and WAW dependences are eliminated
as long as there are free Reorder Buffer (ROB) entries available. If an instruction needs to be dispatched
but there is no free ROB entry available, the instruction cannot be dispatched, and the dispatch is
retried when an entry becomes available.
Suppose that all instructions take 2 cycles to execute except the multd which takes 4 cycles to execute
(i.e., requires 4 EX stages). Suppose that memory instructions (load/store) take 2 cycles in an adder
unit (EX) to calculate target addresses and 2 extra cycles in the memory unit (MEM). Show the pipeline
timing diagram for a 2‐way superscalar processor. Thus, each cycle, it can fetch 2 instructions and
decode 2 instructions. It can issue and execute more than 2 instructions if the functional units are
available, but can commit up to 2 instructions. Assume unlimited reservation station entries and
unlimited functional units.
(a) Suppose now we support register renaming using reorder buffer. Suppose we have 7 reorder buffer
entries (ROB1, ROB2, ...). Assume that all instructions have been dispatched (thus, they all occupy the
reorder buffers). Rename all the instruction destination registers to tags, and update all consumers
appropriately. Use reorder buffer entries as the tags, e.g.:
load R1, a
add R3, R1, R2
becomes
load ROB1, a
add ROB2, ROB1, R2
Show the renamed code that completely eliminate WAW and WAR dependencies.
(b) Assume that we have unlimited number of ROB entries and that no instruction causes an exception.
Show the timing diagram.
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
(c) Now assume that we only have 3 reorder buffer entries. Show the timing diagram.
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14
set F1, avalue IF
addi F2, F1, 1 IF
multdi F3, F2, x
load F2, c
addi F1, F2, 2
addi F2, F3, 1
add F1, F1, F2
15 16 17 18 19 20 21 22 23 24 25 26 27 28
(d) Based on your answer in part (c), summarize the impact of long latency instructions and reorder
buffer size to processor performance. Discuss also the case where an L2 cache miss may take hundreds
of cycles to complete.