Intel's P6 Uses Decoupled Superscalar Design
by Linley Gwennap

Intel’s forthcoming P6 processor (see cover story) is designed to outperform all other x86 CPUs by a significant margin. Although it shares some design techniques with competitors such as AMD’s K5, NexGen’s Nx586, and Cyrix’s M1, the new Intel chip has several important advantages over these competitors.

The P6’s deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to clock speeds of about 100 MHz. The new CPU is designed to run at 133 MHz in its initial 0.5-micron BiCMOS implementation; a 0.35-micron version, due next year, could push the speed as high as 200 MHz.

In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs. Intel will combine the P6 CPU and a 256K cache chip into a single PGA package, reducing the time needed for data to move from the cache to the processor.

Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations or uops (pronounced “you-ops”). These uops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Intel has given the name “dynamic execution” to this particular combination of features, which is neither new nor unique, but highly effective in increasing x86 performance.

The P6 also implements a new system bus with increased bandwidth compared to the Pentium bus. The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products.

Not Your Grandfather’s Pentium

While Pentium’s microarchitecture carries a distinct legacy from the 486, it is hard to find a trace of Pentium in the P6. The P6 team threw out most of the design techniques used by the 486 and Pentium and started from a blank piece of paper to build a high-performance x86-compatible processor.

The result is a microarchitecture that is quite radical compared with Intel’s previous x86 designs, but one that draws from the same bag of tricks as competitors’ x86 chips. To this mix, the P6 adds high-performance cache and bus designs that allow even large programs to make good use of the superscalar CPU core.

As Figure 1 (see below) shows, the P6 can be divided into two portions: the in-order and out-of-order sections. Instructions start in order but can be executed out of order. Results flow to the reorder buffer (ROB), which puts them back into the correct order. Like AMD’s K5 (see 081401.PDF), the P6 uses the ROB to hold results that are generated by speculative and out-of-order instructions; if it turns out that these instructions should not have been executed, their results can be flushed from the ROB before they are committed.

The performance increase over Pentium comes largely from the out-of-order execution engine. In Pentium, if an instruction takes several cycles to execute, due to a cache miss or other long-latency operation, the entire processor stalls until that instruction can proceed. In the same situation, the P6 will continue to execute subsequent instructions, coming back to the stalled instruction once it is ready to execute. Intel estimates that the P6, by avoiding stalls, delivers 1.5 SPECint92 per MHz, about 40% better than Pentium.

x86 Instructions Translate to Micro-ops

The P6 CPU includes an 8K instruction cache that is similar in structure to Pentium’s. On each cycle, it can deliver 16 aligned bytes into the instruction byte queue. Unlike Pentium, the P6 cache cannot fetch an unaligned cache line, throttling the decode process when poorly aligned branch targets are encountered. Any hiccups in the fetch stream, however, are generally hidden by the deep queues in the execution engine.

The instruction bytes are fed into three instruction decoders. The first decoder, at the front of the queue, can handle any x86 instruction; the others are restricted to only simple (e.g., register-to-register) instructions. Instructions are always decoded in program order, so if an instruction cannot be handled by a restricted decoder, neither that instruction nor any subsequent ones can be decoded on that cycle; the complex instruction will eventually reach the front of the queue and be decoded by the general decoder.

Assuming that instruction bytes are available, at least one x86 instruction will be decoded per cycle, but more than one will be decoded only if the second (and third) instructions fall into the “restricted” category. Intel refused to list these instructions, but they do not include any that operate on memory. Thus, the P6’s ability to execute more than one x86 instruction per cycle relies
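The in-order decode restriction described above can be sketched in a few lines. This model is illustrative only: Intel did not disclose which instructions the restricted decoders accept, so the simple/complex classification below is hypothetical.

```python
# Sketch of the P6 front end described above: one general decoder that can
# handle any x86 instruction, plus two restricted decoders limited to simple
# (e.g., register-to-register) instructions, all working in program order.

def decode_group(queue):
    """Return the instructions decoded in one cycle.

    `queue` holds (name, is_simple) tuples in program order. Slot 0 is the
    general decoder; slots 1 and 2 are restricted decoders. A complex
    instruction in a restricted slot ends the group for this cycle.
    """
    decoded = []
    for slot, (name, is_simple) in enumerate(queue[:3]):
        if slot > 0 and not is_simple:
            break  # it will be decoded later, by the general decoder
        decoded.append(name)
    return decoded

# Three simple instructions decode together; a complex instruction in the
# second slot stalls itself and everything behind it until the next cycle.
print(decode_group([("add", True), ("mov", True), ("xor", True)]))       # all three
print(decode_group([("add", True), ("add_mem", False), ("xor", True)]))  # only 'add'
```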
Intel’s P6 Uses Decoupled Superscalar Design Vol. 9, No. 2, February 16, 1995 © 1995 MicroDesign Resources
MICROPROCESSOR REPORT
discuss these rules but notes that older uops are given priority over newer ones, speeding the resolution of chains of dependent operations.

The integer units handle most calculations in a single cycle, but integer multiply and divide take longer. The P6 FPU executes adds in three cycles and requires five cycles for multiply operations. FP add and multiply are pipelined and can be executed in parallel with long-latency operations. Table 1 shows the cycle times for long-latency integer and floating-point operations.

                     P6                        Pentium
                     Throughput  Latency       Throughput   Latency
  Integer multiply   1 cycle     4 cycles      4–8 cycles   7–14 cycles
  Integer divide     12–36 cyc   12–36 cyc     42–84 cyc    42–84 cyc
  FP add             1 cycle     3 cycles      1 cycle      3 cycles
  FP multiply        2 cycles    5 cycles      1 cycle      3 cycles
  FP divide          18–38 cyc   18–38 cyc     39 cycles    39 cycles
  FP sq root         29–69 cyc   29–69 cyc     70–140 cyc   70–140 cyc

Table 1. P6 floating-point latencies are similar to Pentium’s, but integer arithmetic is much faster. Latencies are the same for single, double, and extended precision except where ranges are shown.

The FXCH (floating-point exchange) instruction is not handled by the FPU. This instruction swaps the top of the FP register stack with another register in the stack and is frequently used in x86 floating-point applications, since many x86 FP instructions access only the top of the stack. The P6 handles FXCH entirely in the reorder buffer by treating it as a renaming of two registers. Thus, FXCH uops enter the ROB but not the reservation station and never enter the function units.

The dual address-generation units each contain a four-input adder that combines all possible x86 address components (segment, base, index, and immediate) in a single cycle. As in Pentium, each unit also contains a second four-input adder to perform a segment-limit check in parallel. The resulting address (along with, for a store, source data) is then placed in the memory reorder buffer (MOB, in P6-ese) to await availability of the data cache.

The reorder buffer can accept three results per cycle: one from the main arithmetic unit, one from the secondary ALU, and one from a load uop. Because of long-latency operations, two or more function units within the main arithmetic unit can generate results in a single cycle. In this case, the function units must arbitrate to write to the ROB.

After a uop writes its result to the ROB (or, in the case of a store, to the MOB), it is eligible to be retired. The ROB will retire up to three uops per cycle, always in program order, by writing their results to the register file. Uops are never retired until all previous uops have completed successfully. If any exception or error occurs, uncommitted results in the ROB can be flushed, resetting the CPU to its proper state, as if the instructions had been executed in order up to the point of the error.

Nonblocking Caches Reduce Stalls

A key to the P6’s performance is its cache subsystem. The primary data cache, at 8K, is relatively small for a processor of this generation, but the fast level-two cache helps alleviate the lower hit rate of the primary cache. If an access misses the data cache, the cache can continue servicing requests while waiting for the miss data to be returned. This technique, called a nonblocking cache or hit-under-miss, has been used for years by PA-RISC processors to avoid stalls.

The P6 takes advantage of the nonblocking cache with its memory reorder buffer. If the access at the front of the MOB misses, subsequent accesses will continue to execute. Since these accesses can execute out of order, the P6 must take care to avoid incorrect program execution. For example, loads can arbitrarily pass loads, but stores must always be executed in order. Loads can pass stores only if the two addresses are verified to be different. Intel would not reveal the size of the MOB or other details of its function.

The P6 data cache can process one load and one store per cycle as long as they access different banks. The cache is divided into four interleaved banks, half as many as Pentium’s data cache. AMD’s K5 also implements a four-bank data cache, and that company says that there is little benefit to an eight-bank design.

The P6 cache cannot handle two loads at once, in stark contrast to processors such as the K5, M1, and even Pentium. Typical x86 code generates a large number of memory references due to the limited register set, and more of these references will be loads than stores. Intel points out that the entire processor is designed for one load per cycle—even the decoders cannot produce more than one load uop per cycle—and that its simulations show this capability is adequate to attain the desired performance level.

The data cache has a latency of three cycles (including address generation) but is fully pipelined, producing one result per cycle. The unified level-two (L2) cache has a latency of three cycles and, like the data cache, is pipelined and nonblocking. The fast L2 latency is achieved by implementing the L2 cache as a single chip and combining it with the CPU in a single package, as Figure 2 (see below) shows.

Address translation occurs in parallel with the data cache access. If the access misses the data cache, the translated (real) address is sent to the L2 cache. The latency of a load that misses the data cache but hits in the L2 is six cycles, assuming that the L2 cache is not busy returning data from an instruction cache miss or an earlier data cache miss.

The P6 CPU contains a complete L2 cache controller and is Intel’s first x86 processor with a dedicated L2 cache bus. Both NexGen’s 586 and the R4000 use this design style to increase cache-bus bandwidth while
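The memory-ordering rules described above (loads pass loads freely, stores always stay in order, and a load passes a store only when the addresses are known to differ) can be sketched as a simple check over MOB entries. Intel disclosed neither the MOB’s size nor its mechanism, so everything here beyond those three rules is an illustrative assumption.

```python
# Illustrative model of the P6 memory reorder buffer (MOB) ordering rules.
# Each entry is (op, addr, done): op is "load" or "store", addr the access
# address (assumed already computed), done whether it has executed.

def may_execute(mob, index):
    """May the access at mob[index] execute ahead of older, pending entries?"""
    op, addr, _ = mob[index]
    for older_op, older_addr, older_done in mob[:index]:
        if older_done:
            continue       # already executed; no conflict remains
        if op == "store":
            return False   # stores must always execute in program order
        if older_op == "store" and older_addr == addr:
            return False   # a load may not pass a store to the same address
    return True            # loads pass loads (and stores to other addresses)

mob = [("store", 0x100, False), ("load", 0x200, False), ("load", 0x100, False)]
print(may_execute(mob, 1))  # True: load to a different address passes the store
print(may_execute(mob, 2))  # False: load to the same address must wait
```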
Figure 3. In the best case, instructions can flow through the P6 in 12 cycles, but the average is 18 cycles due to delays in the reservation station (RS) or the reorder buffer (ROB). Load instructions take longer and can also be delayed in the memory reorder buffer (MOB).
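The ROB delay that Figure 3 mentions comes from the in-order retirement rule covered earlier: up to three uops retire per cycle, always in program order. A minimal sketch, with only those two constraints taken from the article and the rest invented for illustration:

```python
# Uops complete out of order but retire in order, up to three per cycle.
# Each ROB entry is (name, done).

def retire_cycle(rob):
    """Pop and return up to three completed uops from the head of the ROB."""
    retired = []
    while rob and rob[0][1] and len(retired) < 3:
        retired.append(rob.pop(0)[0])
    return retired

rob = [("u0", True), ("u1", True), ("u2", False), ("u3", True)]
# u3 has finished executing, but it cannot retire ahead of the pending u2.
print(retire_cycle(rob))  # ['u0', 'u1']
print(rob)                # [('u2', False), ('u3', True)]
```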
cache, but it will take an additional three cycles (assuming a cache hit) before the loaded data is available for use. The address-generation unit is probably a critical timing path in the P6, limiting the processor to 133 MHz; RISC processors can typically cycle their simpler ALUs at 200 MHz or better in 0.5-micron technology.

Once the uop executes, it writes its result to the ROB. If all previous uops have been retired, it takes one cycle to retire a uop. If previous instructions are still pending, however, it may take several cycles before the uop is retired. This delay does not impact performance.

Branch Prediction Accuracy Is Critical

The deep pipeline creates extraordinary branch penalties. The outcome of a conditional branch is not known until stage 10. The minimum penalty for a mispredicted branch is 11 cycles. Intel estimates that, on average, a uop spends four cycles in the reservation station, so a mispredicted branch will typically cause a 15-cycle penalty. If the branch spends an unusually long time in the reservation station, the penalty could be even worse.

Thus, the P6 designers spent a lot of effort to reduce the number of mispredicted branches. Like Pentium, the P6 uses a branch target buffer that retains both branch-history information and the predicted target of the branch. This table is accessed by the current program counter (PC). If the PC hits in the BTB, a branch is predicted to the target address indicated by the BTB entry; there is no delay for correctly predicted taken branches.

The BTB has 512 entries organized in a four-way set-associative cache. This size is twice that of Pentium, improving the hit rate. The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method[1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken–taken–not taken. We estimate that the P6 BTB will deliver close to 90% accuracy on programs such as the SPECint92 suite.

A second type of misprediction occurs if a branch misses in the BTB. This situation is not detected until the instruction is fully decoded in stage 6. The branch is predicted to be taken if the offset is negative (indicating a likely loop), and the target address, if available, is used to redirect the fetch stream. At that point, however, seven cycles have been wasted fetching and decoding instructions that are unlikely to be needed. In this case, the long fetch-and-decode pipeline saps performance.

Forward branches that miss the BTB are predicted to be not taken, so the sequential path continues to be fetched with no delay. Branches that miss the BTB are mispredicted more often than those that hit and are subject to the same mispredicted branch penalties.

Conditional Move Added

The P6 instruction set is nearly identical to Pentium’s and, in fact, to that of the 386. The most significant addition is a conditional move (CMOV) instruction. This instruction, which has been added to several RISC instruction sets recently, helps avoid costly mispredictions by eliminating branches. The instruction copies the contents of one register into another only if a particular condition flag is set, replacing a test-and-branch sequence.

Like all current Intel processors, the P6 implements system-management mode. The new CPU also supports all the features in Pentium’s secret Appendix H. Intel says that these features—including CPU ID, large page sizes, and virtual-8086 extensions—will be fully documented when the P6 is released. The P6 also implements performance counters similar to those described in Appendix H, but it uses a new instruction that allows them to be accessed more easily.

It has been widely speculated that the P6 would include new instructions to speed NSP (native signal processing) applications. While the P6’s improved integer multiplier will assist NSP, multimedia extensions such as those in Sun’s UltraSparc (see 081603.PDF) would have an even bigger effect. Intel, however, denies that the P6 has any such extensions. This oversight could give RISC processors like UltraSparc a significant performance advantage when executing increasingly popular multimedia software.

P6 Bus Allows Glueless MP

The P6 system bus is completely redesigned from the Pentium bus. Both use 64 bits of data and operate at
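The adaptive scheme described above can be sketched with a per-branch predictor in the spirit of the Yeh method: four bits of branch history select one of sixteen two-bit saturating counters. The four-bit history is from the article; the counter table and its initialization are illustrative assumptions, not Intel’s disclosed design.

```python
# Two-level adaptive branch prediction sketch: recent outcomes (the history)
# index a table of 2-bit saturating counters, so a repeating pattern such as
# taken-taken-not-taken becomes perfectly predictable after a short warm-up.

class TwoLevelPredictor:
    def __init__(self):
        self.history = 0            # last four outcomes, LSB = most recent
        self.counters = [2] * 16    # 2-bit counters, initialized weakly taken

    def predict(self):
        return self.counters[self.history] >= 2   # True = predict taken

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | taken) & 0xF

p = TwoLevelPredictor()
pattern = [1, 1, 0] * 40            # the taken-taken-not-taken example above
results = []
for taken in pattern:
    results.append(p.predict() == bool(taken))
    p.update(taken)
print(sum(results))                 # 118 of 120: only two warm-up mispredictions
```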
module to be roughly $350. This cost is greater than that of competitive chips, but the entire L2 cache is included. In a 0.35-micron process, the cost of the P6 could drop to $150 in 1997.

Improving System Performance

Intel designed the P6 to achieve high performance at the system level, not just within the CPU core. Thus, the team placed significant emphasis on the cache subsystem and the system bus as well as the CPU. The nonblocking caches, closely coupled L2 cache, and split-transaction bus exemplify this emphasis. These features put the P6 a step beyond competitive x86 processors.

In contrast, AMD’s K5 appears to be a P6-class CPU core trapped in a Pentium pinout. The business decision to target the Pentium interface leaves the K5 with a single bus for both cache and I/O traffic, hampering performance on programs that overflow the K5’s small on-chip caches. This bottleneck will become more severe as AMD increases the K5’s clock speed to 150 MHz or higher, since the secondary cache will continue to be restricted to 66 MHz or less.

Intel’s deeper pipeline should give the P6 a clock-speed advantage over the K5 in comparable manufacturing processes. While the deeper pipeline also increases pipeline penalties, the P6 has much more sophisticated branch prediction and a larger reorder buffer, allowing it to outperform the K5 on a clock-for-clock basis as well. The K5 will, of course, be less expensive and probably consume less power than the P6.

Cyrix’s M1 shares the same system-interface constraints as the K5. Furthermore, its static two-pipeline design is less efficient than the decoupled design of the P6 (and K5). Although Cyrix has access to leading-edge manufacturing technology from IBM, it isn’t clear that the company has the resources to quickly move its CPU to the latest processes, keeping pace with Intel and AMD. Cyrix must start from scratch to develop a decoupled P6-class CPU, a process that will take years.

The P6 is similar to several of the next-generation RISC processors, in particular the MIPS R10000 (see 081403.PDF). Table 3 compares these designs. The R10000 is a four-way superscalar processor, while the P6 is three-way superscalar. The MIPS chip can execute up to 32 instructions out of order, a few fewer than the P6. Both have a dedicated L2 cache bus and support a high-bandwidth MP system bus.

                     Intel        MIPS         AMD          Cyrix        NexGen       Intel
                     P6           R10000       K5           M1           Nx586        Pentium
  Clock Speed        133 MHz      200 MHz      100 MHz      100 MHz      93 MHz       100 MHz
  Cache Size (I/D)   8K/8K        32K/32K      16K/8K       16K          16K/16K      8K/8K
  Dispatch Rate      3 instr      4 instr      2–3 instr    2 instr      1 instr      2 instr
  Function Units     5 units      5 units      7 units      2 units      4 units      3 units
  Predecode Bits     none         4 per 32     5 per 8      none         none         none
  Branch History     512 × 4      512 × 2      1,024 × 1    256 × 2      2,048 × 2    256 × 2
  Out of Order       40 instr     32 instr     16 instr     limited      14 instr     none
  Rename Regs        40 regs      64 regs      16 regs      32 regs      22 regs      none
  L2 Cache Bus?      yes          yes          no           no           yes          no
  Glueless MP?       yes          yes          no           no           no           no
  IC Process         0.5µ 4M      0.5µ 4M      0.5µ 3M      0.65µ 3M     0.65µ 4M     0.5µ 4M
                     BiCMOS       CMOS         CMOS         CMOS         CMOS         BiCMOS
  Logic Transistors  4.5 million  2.3 million  2.4 million  2.1 million  1.6 million  2.4 million
  Total Transistors  5.5 million  5.9 million  4.3 million  3.0 million  3.5 million  3.3 million
  Package Type       387-pin      527-pin      296-pin      296-pin      463-pin      296-pin
                     MCM-C        CPGA         CPGA         CPGA         CPGA         CPGA
  Die Size           306 mm2      298 mm2      225 mm2*     394 mm2      196 mm2      163 mm2
  Est Mfg Cost       $350*†       $320*        $170*        $340*        $200*        $120*
  Power (max)        20 W†        30 W         12 W*        10 W         16 W         10 W
  Availability       3Q95*        4Q95         3Q95*        3Q95*        3Q94         2Q94
  SPECint92 (est)    200 int      >300 int     130 int      120 int*     110 int*     113 int
  SPECfp92 (est)     200 fp*      >600 fp      75 fp*       70 fp*       n/a          82 fp

Table 3. The P6 feature set stacks up well against top x86 competitors and the R10000, a similar RISC implementation. The key differences are clock speed and performance. (Source: vendors except *MDR estimates) †includes L2 cache chip

The P6’s CISC handicap shows in two places. Despite the similar microarchitectures, the P6 requires nearly twice as many logic transistors as the MIPS chip; the extra logic handles x86 decode, uop translation, and the foibles of the x86 instruction set. Since both chips have similar die size and transistor budgets, the R10000 is able to include four times as much on-chip cache as the P6, improving performance on many programs.

Second, the first P6 will run at 133 MHz, while the R10000 is expected to achieve 200 MHz using a similar manufacturing process. To come even this close in clock speed, Intel uses a very deep pipeline, a concept that MIPS tried and rejected for the R10000. The deeper pipeline has greater branch penalties, sapping performance. And, of course, the higher clock speed gives the R10000 an intrinsic performance advantage. As a result, the MIPS chip should achieve at least 50% better integer performance than the P6.

Working within the constraints of the x86 instruction set is always a challenge. The P6 takes a huge step beyond the static Pentium architecture, applying a decoupled superscalar engine to the performance problem. Although this design works around many of the bottlenecks of the x86 instruction set, it doesn’t match the performance of a pure RISC chip. Compared with other x86 processors, the P6 is clearly the best of class and sets a new standard for other vendors to match. ♦

[1] Tse-Yu Yeh and Yale Patt, “Two-Level Adaptive Training Branch Prediction,” 24th International Symposium on Microarchitecture (Nov. 1991), pp. 51–61.
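As a closing cross-check, the performance figures quoted in the article are internally consistent; every number below is one of the article’s own estimates, not an independent measurement.

```python
# Intel's estimate of 1.5 SPECint92 per MHz, at the initial 133-MHz clock,
# matches the roughly 200 SPECint92 listed for the P6 in Table 3.
print(round(1.5 * 133))   # 200

# Likewise the typical mispredict cost: the 11-cycle minimum branch penalty
# plus the four cycles a uop spends, on average, in the reservation station.
print(11 + 4)             # 15 cycles, as stated in the branch-prediction section
```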