Part 1 due: Friday, February 12, 5pm. Part 2 due: Friday, February 19, 2pm. No late homework will be accepted.
Work in groups of two. Each group should turn in one homework.
Note: Part 2 is due at the beginning of class on Friday Feb 19!
From Hennessy and Patterson Computer Architecture: A Quantitative Approach Second Edition
This question examines a new application of prediction called "load value prediction." The problem is mostly self-contained (it was an exam problem last year), but you may find it interesting to read the paper "Value Locality and Load Value Prediction" by Mikko H. Lipasti, Christopher B. Wilkerson, and John Paul Shen (Carnegie Mellon University), which appeared in ASPLOS 1996.
Consider an in-order superscalar, superpipelined machine. The machine issues up to 2 instructions per cycle into an 8-stage pipeline (IF: first instruction fetch, IS: second instruction fetch, RF: register fetch, EX: execute, DF: first data fetch, DS: second data fetch, TC: data fetch tag check, WB: write back). As with the MIPS R4000, cache accesses are pipelined over three stages, with the result available at the end of the second stage and the tag checked in the third. E.g.:
instr1 | IF | IS | RF | EX | DF | DS | TC | WB |    |
instr2 | IF | IS | RF | EX | DF | DS | TC | WB |    |
instr3 |    | IF | IS | RF | EX | DF | DS | TC | WB |
instr4 |    | IF | IS | RF | EX | DF | DS | TC | WB |
The following table summarizes the probabilities that the i'th instruction after a load is the first instruction after the load to use the result of the load:
Instruction after load | Use probability |
1 | 35% |
2 | 20% |
3 | 15% |
4 | 10% |
5 | 6% |
6 | 4% |
7 | 2% |
8-10 | 1% |
11-15 | 0.5% |
16-25 | 0.1% |
The base CPI of the machine is 0.5, and for the target workload branch stalls add 0.52 cycles per instruction. 25% of all instructions are loads. We will not consider floating-point stalls in this problem. For simplicity, assume all instruction and data accesses hit in the L1 cache.
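The stall components simply add to the base CPI. As a sketch of the accounting (not the answer), the snippet below combines the use-probability table with a load delay; the delay value D_INSTRS and the half-cycle stall model are illustrative assumptions, since deriving the actual delay is part (a):

```python
# CPI accounting sketch. D_INSTRS is a placeholder assumption for
# illustration -- deriving the real load delay is part (a).

# P(the i-th instruction after a load is the first to use its result)
use_prob = {1: 0.35, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.06, 6: 0.04, 7: 0.02}
use_prob.update({i: 0.01 for i in (8, 9, 10)})
use_prob.update({i: 0.005 for i in range(11, 16)})
use_prob.update({i: 0.001 for i in range(16, 26)})

ISSUE_WIDTH = 2        # instructions issued per cycle
BASE_CPI = 0.5
BRANCH_STALL_CPI = 0.52
LOAD_FRACTION = 0.25
D_INSTRS = 4           # hypothetical load delay in instructions (assumption)

def expected_stall_cycles(delay_instrs):
    """One plausible model: if the first use comes i instructions after
    the load, it waits the remaining (delay - i + 1) issue slots, i.e.
    that many 1/ISSUE_WIDTH fractions of a cycle."""
    return sum(p * max(0, delay_instrs - i + 1) / ISSUE_WIDTH
               for i, p in use_prob.items())

load_stall_cpi = LOAD_FRACTION * expected_stall_cycles(D_INSTRS)
cpi = BASE_CPI + BRANCH_STALL_CPI + load_stall_cpi
print(f"load stall per load: {expected_stall_cycles(D_INSTRS):.3f} cycles")
print(f"CPI: {cpi:.3f}")
```

Note that the table's probabilities sum to about 98.5%; the remaining loads are presumably never used within 25 instructions and contribute no stall.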
a. What is the load delay for this machine?
i. In cycles
ii. In instructions
b. What is the CPI in the base machine?
One proposal to reduce the load delay in such a machine is to take advantage of the fact that a load instruction at a particular address is somewhat likely to load the same value from memory repeatedly. A Load Value Prediction table (LVP) maps the address of load instructions to a predicted value for loads.
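A minimal sketch of such a table (Python; the direct-mapped, untagged organization and word-aligned indexing are illustrative assumptions, not part of the problem statement):

```python
class LoadValuePredictor:
    """Illustrative LVP: maps a load instruction's PC to the last value
    that load fetched from memory.

    Direct-mapped with no tags, so distinct loads may alias -- a
    simplification, not necessarily what the problem intends.
    """
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries   # predicted values, initially 0

    def _index(self, pc):
        return (pc >> 2) % self.entries   # word-aligned instruction PCs

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, actual_value):
        self.table[self._index(pc)] = actual_value

lvp = LoadValuePredictor()
lvp.update(0x4000, 7)             # a load at PC 0x4000 returned 7
assert lvp.predict(0x4000) == 7   # next time, predict 7 again
```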
c. Describe how you could use a LVP to reduce the load delay in this CPU. (For now, do not describe how you would recover from mispredictions; we'll talk about that later.)
d. Suppose we add a LVP to the above pipeline. What is the minimum load delay that can be achieved on a LVP hit?
i. In cycles?
ii. In instructions?
e. In such a design, over how many cycles can we pipeline the LVP lookup?
Dealing with mispredictions

Suppose we access the LVP to predict a load value for every instruction, whether it is a load or not. By the time the value would be used, we know whether the instruction was a load; if it was not, we can throw away the prediction with no harm done. If it was a load, however, we must compare the prediction against the real value and roll back on a mismatch. To simplify this problem (and the pipeline), assume that you will flush and refetch all instructions following a mispredicted load, whether or not they depend on the load.
We can compare the prediction and the result during TC. If the prediction was incorrect, squash all instructions in the pipeline that are after the load. The load instruction does its WB normally, and the two instructions immediately after the load restart in IF. (There are quite a few ways we might make restart more efficient, but there are complexity trade-offs.)
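The recovery rule can be sketched as a simple in-order model (Python; purely illustrative bookkeeping over a list of in-flight instructions, not the problem's machine):

```python
def recover_from_mispredict(pipeline, load_index):
    """Squash everything younger than the mispredicted load.

    `pipeline` is a list of in-flight instruction ids, oldest first.
    The load at `load_index` completes (writes back) normally; all later
    instructions are flushed and must be refetched starting from the
    instruction immediately after the load.
    """
    survivors = pipeline[:load_index + 1]    # the load itself writes back
    refetch_from = (pipeline[load_index + 1]
                    if load_index + 1 < len(pipeline) else None)
    return survivors, refetch_from

pipe = ["i1", "load", "i3", "i4", "i5"]
kept, restart = recover_from_mispredict(pipe, 1)
# kept == ["i1", "load"]; fetch restarts at "i3"
```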
f. What is the misprediction penalty when a load value prediction is found to be incorrect under this scheme? (Consider both the case when the load is in the "top half" and when it is in the "bottom half" of a cycle.)
g. Assume that a 1024-entry LVP predicts loads correctly 70% of the time. What is the CPI of the machine using such a LVP?
To reduce the cost of mispredictions, researchers have proposed adding a load classification table (LCT) that tracks whether the load at a given address is predictable. Each entry of the LCT is a two-bit saturating counter whose four states 0-3 mean "don't predict", "don't predict", "predict", and "predict". The counter is incremented when the predicted value is correct and decremented otherwise. A 256-entry LCT classifies 93% of unpredictable loads as "don't predict" and 82% of predictable loads as "predict".
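The two-bit saturating counter described above can be sketched as follows (Python; hypothetical code, with states 0-1 meaning "don't predict" and 2-3 meaning "predict"):

```python
class TwoBitCounter:
    """Two-bit saturating counter for one LCT entry.

    States 0 and 1 mean "don't predict"; states 2 and 3 mean "predict".
    Incremented on a correct value prediction, decremented on an
    incorrect one, saturating at 0 and 3.
    """
    def __init__(self, state=0):
        self.state = state

    def should_predict(self):
        return self.state >= 2

    def update(self, prediction_correct):
        if prediction_correct:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

A 256-entry LCT would hold one such counter per entry, indexed by the load's PC in the same way as the LVP.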
h. Using the LCT, what is the new CPI of the machine?
Read the two papers assigned for this class: "Superspeculative Microarchitecture for Beyond AD 2000" by Lipasti and Shen, IEEE Computer, September 1997, and "One Billion Transistors, One Uniprocessor, One Chip" by Patt, Patel, Friendly, and Stark, IEEE Computer, September 1997.
For each of the papers, turn in a short (1/2 page max) critique answering the following three questions.
a. What are the three most important concepts in the paper?
b. What is the most glaring flaw in the paper?
c. What is the main conclusion you take away from this paper regarding the future of microprocessor architectures?