Dynamic Scheduling

Dispatch: Act of sending an instruction to a functional unit.

In-Order Pipeline with Reorder Buffer

Motivation

Not all instructions take the same amount of time in the “execute stage” of the pipeline. We can have multiple different functional units that take different number of cycles.

Challenges

Instructions can take different number of cycles in EXECUTE stage:

  • Sequential semantics of the ISA NOT preserved.

  • Imprecise exceptions/interrupts.

Reorder Buffer (ROB)

In-order dispatch/execution, out-of-order completion, in-order retirement.

Advantages: (1) conceptually simple for supporting precise exceptions (2) can eliminate false dependencies.

False Dependencies in In-Order Pipeline with Reorder Buffer

Observation: WAR and WAW are not true dependences. The same register refers to values that have nothing to do with each other.

Solution: The register ID is renamed to the reorder buffer (ROB) entry that will hold the register’s value. This eliminates WAR and WAW dependences.

Disadvantages:reorder buffer needs to be frequently accessed to get the results that are yet to be written to the register file.

Out-of-order Dispatch (Scheduling, or Execution)

Note that ISSUE is not DISPATCH.

Motivation: in the in-order dispatch (scheduling, or execution) design, a non-ready instruction stalls dispatch of younger instructions into functional (execution) units.

What does non-ready mean?

Consider the control flow:

MUL   R3, R1, R2
ADD   R3, R3, R1
ADD   R4, R6, R7
MUL   R5, R6, R8
ADD   R7, R9, R9

In the in-order dispatch design, the first ADD stalls the whole pipeline due to a RAW dependency (true dependency).

Solution: dynamic scheduling (out-of-order dispatch/scheduling/execution). Monitor the source “values” of each instruction in the resting (waiting) area. When all source “values” of an instruction are available, “fire” (i.e., dispatch) the instruction.

Dynamic Scheduling: Scoreboard Algorithm

Out-of-order execution can be achieved in a CPU with multiple functional units. Younger instructions can potentially execute on a different functional unit while another unit is occupied by an older, time-consuming instruction.

In-order issue, out-of-order execution, and out-of-order commit are key features of scoreboard. Out-of-order commit requires careful handling of RAW and WAW hazards. To address this, the scoreboard algorithm delays the dispatch of instructions with ANY data dependencies.

Four stages of scoreboard:

  • Issue: decode instructions & check for structural hazards. WAW hazard is resolved in this stage by not issuing if instruction is output dependent on any previously issued but uncompleted instruction.

  • Read operands: wait until no data hazards, then read operands. All real dependencies (RAW hazards) are resolved in this stage, since we wait for instructions to write back data. Note that there is no forwarding of data in this model.

  • Execution

  • Write result: stall until no WAR hazards with previous instructions.

Three components of scoreboard:

  • Instruction status: which of 4 steps the instruction is in.

  • Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit.

  • Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.

Dynamic Scheduling: Tomasulo’s Algorithm

The key contribution of Tomosulo's algorithm is that it eliminates the false data dependencies with register renaming and increases the performance of OoO.

Legacy Tomasulo: OoO with register renaming invented by Robert Tomasulo. Used in IBM 360/91 Floating Point Units.

Modern Tomasulo: precise exceptions:

  • Patt, Hwu, Shebanow, “HPS, a new microarchitecture: rationale and introduction,” MICRO 1985.

  • Patt et al., “Critical issues regarding HPS, a high performance microarchitecture,” MICRO 1985.

Tomasulo's Algorithm

If reservation station is available before renaming:

  • Insert instruction with renamed operands (source value/tag) into the reservation station.

  • Only rename if a reservation station is available.

  • Otherwise, stall.

While in the reservation station, each instruction:

  • Monitors the common data bus (CDB) for the tags of its sources.

  • When a tag is detected, grabs the value for the source and keeps it in the reservation station.

  • When both operands are available, the instruction is ready to be dispatched.

Dispatch instruction to the Functional Unit when ready.

After the instruction finishes in the Functional Unit:

  • Arbitrate and broadcast the tagged value onto the CDB.

  • The register file is connected to the CDB. Registers have tags indicating the latest writer. If the tag in a register matches the broadcast tag, write the broadcast value into the register and set the valid bit.

  • Reclaim the rename tag (ensure no valid copy of the tag exists in the system).

Exercise

Consider the program:

MUL   R3, R1, R2
ADD   R5, R3, R4
ADD   R7, R2, R6
ADD   R10, R8, R9
MUL   R11, R7, R10
ADD   R5, R5, R11

Assume one ADD inst takes 7 cycles (1F + 1D + 4E + 1W) and one MUL inst takes 11 cycles (1F + 1D + 8E + 1W). There are one adder and one multiplier.

How many cycles:

  1. In a non-pipelined machine: 50 cycles

  2. In an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and forwarding)

  3. In an out-of-order dispatch pipelined machine imprecise exceptions (forwarding)

Solution

Last updated