Instructions and Processors in MIPS

Three instructions formats:

MIPS instruction format [18]. | Download Scientific Diagram

R-Type Instructions:

J-Type Instructions:

Screenshot 2024-05-10 at 23.55.42

I-Type Instructions:

Screenshot 2024-05-10 at 23.56.20
  • Note that the target address of beq is PC + 4 + SignExt(imm16) * 4.

Exceptions

Exception state is kept in "coprocessor 0":

  • Use mfc0 to read contents of these registers.

  • Every register is 32 bits, but may be only partially defined.

BadVAddr (register 8):

  • Register contained memory address at which memory reference occurred.

Status (register 12):

  • Interrupt mask and enable bits.

Cause (register 13):

  • The cause of the exception.

  • Bits 5 to 2 of this register encodes the exception type (e.g., undefined instruction=10 and arithmetic overflow=12).

EPC (register 14):

  • Address of the affected instruction (register 14 of coprocessor 0).

Control signals to write BadVAddr, Status, Cause, and EPC:

  • Be able to write exception address into PC (8000 0080_hex).

  • May have to undo PC = PC + 4, since we want EPC to point to offending instruction (not its successor): PC = PC - 4.

Single-cycle MIPS Processor

Storage Elements

The clock input (CLK) is a factor only during write operation. During read operation, the storage elements (register file/idealized memory) behave as a combinational logic block. The output data will be valid after "access time".

The clocking methodology of the storage elements:

Screenshot 2024-05-11 at 22.17.39
  • D value must be held stable during the decision window for some period before (setup time) and after (hold time) the clock edge.

  • Clock-to-Q is the time for the output value Q to be the same with the input value D.

  • Hold Time should >> Clock-to-Q

【TODO】

Datapath

image-20240512015905791

Critical Path (Load Operation) = PC's CLK-to-Q + InstMem's Access Time + RegFile’s Access Time + ALU to Perform a 32-bit Add + Data Memory Access Time + Setup Time for RegFile Write + Clock Skew.

Control Signals

Signals
Semantics

nPC_MUX_sel

0 => PC <- PC + 4; 1 => PC <- PC + 4 + SignExt(Im16)

ExtOp

Zero-extension or sign-extension

ALUsrc

0 => regB; 1 => imm

ALUctr

add, sub or or.

MemWr

Write to the memory if asserted.

MemtoReg

0 => ALU; 1 => Mem

RegDst

0 => rt; 1 => rd

RegWr

Write to the register if asserted.

  • The data path also yields Zero signal if the output of ALU is 0.

  • Note that in real implementation, the next PC is determined by the combination of Jump, Branch and Zero signals rather than the simplified nPC_MUX_sel here.

A Summary of the Control Signals:

Screenshot 2024-05-12 at 00.40.04

Local Decoding: A separate decoder will be used for the main control signals and the ALU control. This approach is sometimes called local decoding. Its main advantage is in reducing the size of the main controller. The ALUctr determination is a two-stage process based on the func and op fields:

Diagram of ALU Control Logic

For instance, R-type operations like add, sub, etc., use "R-type" as the symbolic representation in the ALUop field and are further defined by the func field. For other instruction types, the ALU's behavior is directly dictated by the op field.

The decoding for R-type instructions is detailed as follows:

func[5:0]

Instruction Operation

100000

add

100010

subtract

100100

and

100101

or

101010

set-on-less-than

Based on the encoding specification of op and func, we can derive that:

  • ALUctr[2] = !ALUop[2] & ALUop[0] + ALUop[2] & !func[2] & func[1] & !func[0]

  • ALUctr[1] = !ALUop[2] & !ALUop[1] + ALUop[2] & !func[2] & !func[0]

  • ALUctr0 = !ALUop[2] & ALUop[1] + ALUop[2] & !func[3] & func[2] & !func[1] & func[0] + ALUop[2] & func[3] & !func[2] & func[1] & !func[0]

Multi-cycle MIPS Processor

Motivation

  • Long cycle time: all instructions take as much time as the critical path takes.

  • Real memory is not as nice as our idealized memory: we cannot always get the job done in one (short) cycle.

  • Two memories (IM, DM) and three adders (ALU, PC and Branch): we want the design to be smaller.

The idea of multi-cycle design is to do same work in a number of fast cycles rather than the slow one.

Screenshot 2024-05-12 at 02.45.56

The FSM of the Main Controller

image-20240512032031446
States
Effects

S0

PC = PC + 4 and store the instruction to IR.

S1

Decode the instruction (i.e. op, func, rs, rt, rd, imm) Note that the target address of branch instructions can be computed at the end this state if the operation is branch, which saves one cycle.

S2

If the operation is sw or lw, calculate the address based on rs and imm. The calculated address is stored in ALUOut.

S3

If OP is lw, the IorD is asserted. Read the memory and store the result in DR.

S4

Memory write back to register from DR. RegDst=0 stands for rt while MemtoReg=1 stands for DR.

S5

Write to the address specified by the ALUOut.

...

...

Pipelining

Suppose we execute 100 instructions.

  • For a single-cycle machine, it takes 45 ns/cycle x 1 CPI x 100 inst = 4500 ns

  • For a multi-cycle machine, it takes 10 ns/cycle x 4.1 CPI (due to inst mix) x 100 inst = 4100 ns

  • For a idealized 5-stage pipelined machine, it takes 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns.

image-20240512063454626

Performance Metrics

The maximum throughput of a pipeline is constrained by its slowest stage, often referred to as the bottleneck. This can be expressed mathematically as:

TPmax=1max{Δti}T_{\text{Pmax}} = \frac{1}{\max\{\Delta t_i\}}

To complete nn tasks across mm stages, where each stage takes the same amount of time, the total time required is:

T=mΔt+(n1)ΔtT = m\Delta t + (n-1) \Delta t

Therefore, the throughput is:

Tp=nmΔt+(n1)Δt=1Δt(1+m1n)=TPmax1+m1nT_p = \frac{n}{m\Delta t + (n-1) \Delta t} = \frac{1}{\Delta t(1 + \frac{m-1}{n})} = \frac{T_{\text{Pmax}}}{1 + \frac{m-1}{n}}

To approximate the maximum throughput, set m=1m = 1 or let nn \rightarrow \infty. This minimizes the filling and draining phases, enhancing the efficiency of running the pipeline.

Two common approaches for improving the maximum throughput:

  • Parallelize the bottleneck.

  • Split the bottleneck.

Speedup: the ratio of pipelining speedup to equivalent sequential time.

Sp=TlTk=nmm+n1=m1+m1nS_p = \frac{T_l}{T_k} = \frac{nm}{m+n-1} = \frac{m}{1 + \frac{m-1}{n}}

Efficiency: the ratio of the spatiotemporal region of the sequential time to the region of pipelining:

E=nmΔtm(m+n1)Δt=TpΔtE = \frac{nm\Delta t}{m(m+n-1)\Delta t} = T_p \Delta t

Tradeoff

Cost=G+k×L\text{Cost} = G + k \times L

Where GG is the price of non-pipeline processor, kk is the pipeline stages and LL is the price of latch.

Perf=1Tk+S\text{Perf} = \frac{1}{\frac{T}{k} + S}

Where TT is the latency of non-pipeline processor, SS is the latency of latch.

CP=LT+GS+LSk+GTk\frac{C}{P} = LT + GS + LSk + \frac{GT}{k}
kopt=GTLSk_{opt} = \sqrt{\frac{GT}{LS}}
image-20240512154221080

Hazards

Structural hazards

Attempt to use the same resource two different ways at the same time.

Consider a scenario where both R-type and load instructions are processed in a pipeline. R-type instructions, which lack a Write Register (WR) stage, complete in four stages. In contrast, load instructions require all five stages. This setup can lead to structural conflicts, as illustrated below:

Screenshot 2024-05-12 at 17.43.17

Observation 1: every functional unit should be utilized only once by a instruction.

Observation 2: every functional unit should be utilized only at the same stage for all instructions.

Solution 1: insert “bubble” into the pipeline. Drawbacks: (1) the control logic can be complex. (2) lose instruction fetch and issue opportunity.

Screenshot 2024-05-12 at 17.53.15

Solution 2: delay R-type’s write by one cycle. Mem stage of R is a NOOP stage: nothing is being done.

image-20240512175534494

Control hazards

img

Solution 1: stall the pipeline until the branch decision is made (i.e., PCSrc is computed). Because the decision is made in the Memory stage, the pipeline would have to be stalled for three cycles at every branch. For J-type instructions, the decision can be made at the end of the RF stage, which can save us a cycle.

Solution 2: predict the next instruction. 0 lost cycles per branch instruction if right, 1 if wrong (right ­ 50% of time). Assume that the accuracy of predition is 90%, the average CPI of branch is 0.9×1+0.1×2=1.10.9 \times 1 + 0.1 \times 2 = 1.1. Assume that branch accounts for 20% of the instructions, the total CPI might then be 1.1×0.2+1×0.8=1.021.1 \times 0.2 + 1 \times 0.8 = 1.02.

Solution 3: delayed branch. We redefine branch behavior (takes place after next instruction).

Delayed jump / branch technique: the compiler fills the delay slot(s) with instructions that are in logical program order before the branch.The moved instructions within the slots are executed regardless of the branch outcome.

The probability of:

  • Moving one instruction into the delay slot is greater than 60%

  • Moving two instructions is 20%

  • Moving three instructions is less than 10%.

The delayed branching was a popular technique in the first generations of scalar RISC processors, e.g. IBM 801, MIPS, RISC I, SPARC.

In superscalar processors, the delayed branch technique complicates the instruction issue logic and the implementation of precise interrupts. However, due to compatibility reasons it is still often in the lSA of some of today's microprocessors, as e.g. SPARC- or MIPS-based processors.

Data hazards

  • RAW (read after write)

  • WAW (write after write)

    image-20240512165416327
  • WAR (write after read)

    image-20240512165533889

Solution 1: forwarding or bypassing can address most data hazards. However, for load instructions, which fetch data that subsequent instructions depend on, forwarding isn't feasible. In such cases, we must implement delays or stalls for instructions that depend on the load operations.

Name Dependencies

A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.

Write after Read (WAR/Anti/False/Name)

An antidependence between instruction i instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value.

The following code has a WAR dependence on R1.

SUB R7, R1, R8
MUL R1, R5, R6

Write after Write (WAW/Output/False/Name)

An output dependence occurs when instruction i write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.

The following code has a WAW dependence on R1.

ADD R1, R2, R3
MUL R1, R5, R6

Because a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict.

This renaming can be more easily done for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware.

Control Signals

image-20240512180948530

Last updated