CS520: Quantitative Design and Analysis
Homework 1
RULES
1. You are to work individually on the homework. Cooperation is not allowed.
2. Points are assigned to each question to reflect the amount of work involved or its level of difficulty.
Your homework score will be calculated as: score = (points earned / maximum points) × 100.
HOMEWORK PROBLEMS
1. Amdahl’s law [5pt]
Assume you have a sequential program that you want to run on a parallel computer. Suppose that 90%
of the original sequential execution time is parallelizable with a perfect speedup (i.e., n× as fast when
executed on n processors), and the remaining 10% is not parallelizable. What is the minimum number of
processors needed to yield a total/overall speedup of 5?
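For reference, Amdahl's law relates the overall speedup to the fraction f of execution time that is enhanced and the speedup s of that fraction (here f = 0.9 and s = n, the number of processors):

\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
\]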
2. Amdahl’s law [15pt]
Suppose that the typical workload that runs on your processor consists of programs with 50% floating-point
instructions, 30% memory instructions (loads and stores), and 20% other instructions (integer, logic, etc.).
Suppose that each floating-point instruction takes 2 cycles, each memory instruction takes 4 cycles, and
each other instruction takes 1 cycle. For the next generation of your processor, you are choosing between
design A, which improves floating-point instruction performance by 40%, and design B, which improves
memory instruction performance by 35%. Design A increases the processor die area by 10%, while design B
increases it by 12%.
(a) Compute the overall speedups of design A and of design B over the base processor design. Which
design (A or B) gives the higher speedup, and by how much?
(b) If your goal is to improve performance per unit increase in die area, which design (A or B) comes
closer to that goal, and by how much?
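A reminder that may help: if the instruction count and the clock rate are unchanged across designs, execution time is proportional to the average cycles per instruction, so

\[
\text{CPI}_{\text{avg}} = \sum_i f_i \cdot \text{CPI}_i, \qquad
\text{Speedup} = \frac{\text{CPI}_{\text{base}}}{\text{CPI}_{\text{new}}}
\]

where f_i is the fraction of instructions in class i.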
3. Performance Evaluation [10pt]
Suppose that you are choosing between a pipelined and a non-pipelined processor design. With the
pipelined design, you can break the processor down into five pipeline stages, resulting in a five-fold
improvement in clock frequency. However, since you need to make the instructions simpler and more
standardized, you end up with twice as many instructions to represent a typical program. In addition,
imperfections in the pipeline due to dependences, cache misses, etc. reduce throughput, resulting in a
25% increase in the number of cycles needed to execute an instruction.
How much speedup or slowdown does the pipelined design have over the non‐pipelined design?
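For reference, the classic processor performance equation, from which the speedup can be formed as a ratio of CPU times:

\[
\text{CPU time} = \text{IC} \times \text{CPI} \times \text{clock cycle time} = \frac{\text{IC} \times \text{CPI}}{\text{clock frequency}}
\]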
4. Performance Evaluation [15pt]
Suppose that we have two machines, A and B. The following table shows, for machine A, machine B, and
a reference machine X, the execution times of the programs that make up a benchmark suite.
Program   Exe. time of X   Exe. time of A   Exe. time of B
P1        100              40               20
P2        200              100              150
P3        50               40               20
P4        80               10               15
(a) Calculate the overall speedup of machine A versus reference machine X, using geometric mean.
(b) Calculate the overall speedup of machine B versus reference machine X, using geometric mean.
(c) Calculate the overall speedup of machine B versus reference machine A by computing the speedup for
each program, and then taking the geometric mean of the speedups.
(d) Calculate the overall speedup of machine B versus machine A, by taking the ratio of part (b) and part
(a).
(e) Calculate the geometric standard deviation of machine A versus machine X.
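For reference, for per-program speedup ratios r_1, ..., r_n, the geometric mean and one common (population) form of the geometric standard deviation are:

\[
\text{GM} = \left( \prod_{i=1}^{n} r_i \right)^{1/n}, \qquad
\text{GSD} = \exp\!\left( \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \ln r_i - \ln \text{GM} \right)^2 } \right)
\]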
5. Power Wall [10pt]
Suppose that the scaling factor in the next process generation is λ = 0.7 (meaning that the transistor
length is scaled down by the factor λ). With limited voltage scaling, suppose that we want to keep the
dynamic power consumption constant in the next generation while also shrinking the die size by 10%. By
how much should we reduce the clock frequency to achieve this? Note that the capacitance C also scales
linearly with the transistor scaling factor λ.
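For reference, dynamic power is proportional to the switched capacitance, the square of the supply voltage, and the clock frequency (treating the activity factor as constant):

\[
P_{\text{dynamic}} \propto C \cdot V^{2} \cdot f
\]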
6. Chip Cost [10pt]
Compare three chips with the following manufacturing parameters:
Chip   Die Size (mm²)   Defect rate (per cm²)
A      400              0.018
B      350              0.04
C      200              0.04
In all scenarios, assume that N = 13.5, a 300 mm wafer is used, the wafer costs $1000, and the wafer is
the only cost that we consider. Compute the chip cost for each scenario through the following stages:
(a) Compute the number of dies that can be manufactured on one wafer.
(b) Compute the number of good dies that can be manufactured on one wafer.
(c) Compute the chip cost for each scenario.
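For reference, the cost model commonly used for this kind of exercise (assuming 100% wafer yield), with N the process-complexity factor given above:

\[
\text{Dies per wafer} = \frac{\pi \times (\text{wafer diameter}/2)^2}{\text{die area}} - \frac{\pi \times \text{wafer diameter}}{\sqrt{2 \times \text{die area}}}
\]
\[
\text{Die yield} = \frac{1}{\left(1 + \text{defects per unit area} \times \text{die area}\right)^{N}}, \qquad
\text{Cost per die} = \frac{\text{wafer cost}}{\text{dies per wafer} \times \text{die yield}}
\]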
7. Instruction Set Architecture [15pt]
Assume that instruction opcodes are represented in 8 bits, memory addresses are 64 bits, and register
addresses are 6 bits.
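For example, under these widths a Load-Store instruction of the form Load R1, A encodes one opcode, one register address, and one memory address, for a total of

\[
8 + 6 + 64 = 78 \text{ bits.}
\]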
(a) For each instruction set architecture shown in the following figure, how many opcodes, memory
operands, and register operands are there, and what is the total code size?
Stack     Accumulator    Register       Load-Store
Push A    Load A         Load R1, A     Load R1, A
Push B    Add B          Add R1, B      Load R2, B
Add       Store C        Store C, R1    Add R3, R1, R2
Pop C                                   Store C, R3
Answer:
Style                   Num opcodes   Num Mem Addr   Num Reg Addr   Total Code Size
Stack
Accumulator
Register (Reg-Mem)
Register (Load-Store)
(b) Suppose that the code sequence is to compute A = B+C, B = A+C, and D = A-B. Assuming that all of A,
B, C, and D are initially in memory, list the sequence of instructions needed to perform that computation
for the different styles of ISA (Stack, Accumulator, Register (Reg-Mem), and Register (Load-Store)). After
that, count the number of opcodes, memory operands, and register operands, and the total code size. For
the register machines (Reg-Mem and Load-Store), assume that you have an unlimited number of registers.
Code:
Style                   Num opcodes   Num Mem Addr   Num Reg Addr   Total Code Size
Stack
Accumulator
Register (Reg-Mem)
Register (Load-Store)
8. Instruction Set Architecture [20 pt]
Registers that are visible to the software are often referred to the architecture registers. Various
processors have different number of architecture registers. The purpose of this exercise is to assess the
impact of having various number of registers on the code complexity. Suppose that the code sequence is
to compute A=B+C, B=A+C, and D=A‐B. Assume that all of A, B, C, and D are initially in memory. List the
sequence of instructions for Register (load‐store) ISA, and compare the number of loads and stores that
you need in order to do the computation
when:
(a) There are 2 architecture registers.
(b) There are 3 architecture registers.