
MICROPROCESSOR REPORT

Intel’s P6 Uses Decoupled Superscalar Design


Next Generation of x86 Integrates L2 Cache in Package with CPU

by Linley Gwennap
Vol. 9, No. 2, February 16, 1995 © 1995 MicroDesign Resources

Intel's forthcoming P6 processor (see cover story) is designed to outperform all other x86 CPUs by a significant margin. Although it shares some design techniques with competitors such as AMD's K5, NexGen's Nx586, and Cyrix's M1, the new Intel chip has several important advantages over these competitors.

The P6's deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to clock speeds of about 100 MHz. The new CPU is designed to run at 133 MHz in its initial 0.5-micron BiCMOS implementation; a 0.35-micron version, due next year, could push the speed as high as 200 MHz.

In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs. Intel will combine the P6 CPU and a 256K cache chip into a single PGA package, reducing the time needed for data to move from the cache to the processor.

Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations or uops (pronounced "you-ops"). These uops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Intel has given the name "dynamic execution" to this particular combination of features, which is neither new nor unique, but highly effective in increasing x86 performance.

The P6 also implements a new system bus with increased bandwidth compared to the Pentium bus. The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products.

Not Your Grandfather's Pentium

While Pentium's microarchitecture carries a distinct legacy from the 486, it is hard to find a trace of Pentium in the P6. The P6 team threw out most of the design techniques used by the 486 and Pentium and started from a blank piece of paper to build a high-performance x86-compatible processor.

The result is a microarchitecture that is quite radical compared with Intel's previous x86 designs, but one that draws from the same bag of tricks as competitors' x86 chips. To this mix, the P6 adds high-performance cache and bus designs that allow even large programs to make good use of the superscalar CPU core.

As Figure 1 (see below) shows, the P6 can be divided into two portions: the in-order and out-of-order sections. Instructions start in order but can be executed out of order. Results flow to the reorder buffer (ROB), which puts them back into the correct order. Like AMD's K5 (see 081401.PDF), the P6 uses the ROB to hold results that are generated by speculative and out-of-order instructions; if it turns out that these instructions should not have been executed, their results can be flushed from the ROB before they are committed.

[Figure 1. The P6 combines an in-order front end with a decoupled superscalar execution engine that can process RISC-like micro-ops speculatively and out of order. The in-order section comprises the instruction fetch unit, 8K instruction cache with 32-entry instruction TLB, branch target buffer, three decoders (two simple decoders at 1 uop each, one general decoder at up to 4 uops), uop sequencer, register alias table (RAT), real register file (RRF), and 40-entry reorder buffer. The out-of-order execution engine comprises the 20-entry reservation station; the store-data, store-address, load-address, integer-ALU, FP, and integer units; the memory reorder buffer (MOB); the 64-entry data TLB and 8K dual-ported data cache; and the system bus and L2 cache interfaces.]

The performance increase over Pentium comes largely from the out-of-order execution engine. In Pentium, if an instruction takes several cycles to execute, due to a cache miss or other long-latency operation, the entire processor stalls until that instruction can proceed. In the same situation, the P6 will continue to execute subsequent instructions, coming back to the stalled instruction once it is ready to execute. Intel estimates that the P6, by avoiding stalls, delivers 1.5 SPECint92 per MHz, about 40% better than Pentium.

x86 Instructions Translate to Micro-ops

The P6 CPU includes an 8K instruction cache that is similar in structure to Pentium's. On each cycle, it can deliver 16 aligned bytes into the instruction byte queue. Unlike Pentium, the P6 cache cannot fetch an unaligned cache line, throttling the decode process when poorly aligned branch targets are encountered. Any hiccups in the fetch stream, however, are generally hidden by the deep queues in the execution engine.

The instruction bytes are fed into three instruction decoders. The first decoder, at the front of the queue, can handle any x86 instruction; the others are restricted to only simple (e.g., register-to-register) instructions. Instructions are always decoded in program order, so if an instruction cannot be handled by a restricted decoder, neither that instruction nor any subsequent ones can be decoded on that cycle; the complex instruction will eventually reach the front of the queue and be decoded by the general decoder.

Assuming that instruction bytes are available, at least one x86 instruction will be decoded per cycle, but more than one will be decoded only if the second (and third) instructions fall into the "restricted" category. Intel refused to list these instructions, but they do not include any that operate on memory. Thus, the P6's ability to execute more than one x86 instruction per cycle relies on avoiding long sequences of complex instructions or instructions that operate on memory.

The decoders translate x86 instructions into uops. P6 uops have a fixed length of 118 bits, using a regular structure to encode an operation, two sources, and a destination. The source and destination fields are each wide enough to contain a 32-bit operand. Like RISC instructions, uops use a load/store model; x86 instructions that operate on memory must be broken into a load uop, an ALU uop, and possibly a store uop.

The restricted decoders can produce only one uop per cycle (and thus accept only instructions that translate into a single uop). The generalized decoder is capable of generating up to four uops per cycle. Instructions that require more than four uops are handled by a uop sequencer that generates the requisite series of uops over two or more cycles. Because of x86 constructs such as the string instructions, a single instruction can produce a very long sequence of uops. Many x86 instructions, on the other hand, translate into a single uop, and the average is 1.5–2.0 uops per instruction, according to Intel.

The uops then pass through the reorder buffer. The ROB must log each uop so it can later be retired in program order. Each of the 40 ROB entries also has room to store the result of a load or calculation uop along with the condition codes that could be changed by that uop. As uops execute, they write their results to the ROB.

Closely associated with the ROB is the register alias table (RAT). As uops are logged in the ROB, the RAT determines if their source operands should be taken from the real register file (RRF) or from the ROB. The latter case occurs if the destination register of a previous instruction in the ROB matches the source register; if so, that source register number is replaced by a pointer to the appropriate ROB entry. The RAT is also updated with the destination register of each uop.

In this way, the P6 implements register renaming. When uops are executed, they read their data from either the register file or the ROB, as needed. Renaming, a technique discussed in detail when Cyrix introduced its M1 design (see 071401.PDF), helps break one of the major bottlenecks of the x86 instruction set: the small number of general-purpose registers. The ROB provides the P6 with 40 registers that can hold the contents of any integer or FP register, reducing the number of stalls due to register conflicts.

Out-of-Order Engine Drives Performance

Up to three uops can be renamed and logged in the ROB on each cycle; these three uops then flow into the reservation station (RS). This section of the chip holds up to 20 uops in a single structure. (The RS can be smaller than the ROB because the ROB must track uops that have executed but are not retired.) The uops wait in the reservation station until their source operands are all available. Due to the register renaming, only a single ROB entry (per source) must be checked to determine if the needed value is available. Any uops with all operands available are marked as ready.

Each cycle, up to five uops can be dispatched: two calculations, a load, a store address, and a store data. A store requires a second uop to carry the store data, which, in the case of floating-point data, must be converted before writing it to memory.

There are some restrictions to pairing calculation uops due to the arrangement of the read ports of the reservation station. As Figure 1 shows, only one uop per cycle can be dispatched to the main arithmetic unit, which consists of the floating-point unit and a complete integer unit, which has an ALU, shifter, multiplier, and divider. The second calculation uop goes to the secondary ALU and must be either an integer-ALU (no shifts, multiplies, or divides) or a branch-target-address calculation. Simplifying the second calculation unit reduces the die area with little impact on performance.

In many situations, more uops will be ready to execute than there are function units and read ports. When this happens, the dispatch logic prioritizes the available uops according to a complex set of rules. Intel declines to discuss these rules but notes that older uops are given priority over newer ones, speeding the resolution of chains of dependent operations.

The integer units handle most calculations in a single cycle, but integer multiply and divide take longer. The P6 FPU executes adds in three cycles and requires five cycles for multiply operations. FP add and multiply are pipelined and can be executed in parallel with long-latency operations. Table 1 shows the cycle times for long-latency integer and floating-point operations.

                  ---------- P6 -----------   -------- Pentium ---------
                  Throughput    Latency       Throughput    Latency
Integer multiply  1 cycle       4 cycles      4–8 cycles    7–14 cycles
Integer divide    12–36 cyc     12–36 cyc     42–84 cyc     42–84 cyc
FP add            1 cycle       3 cycles      1 cycle       3 cycles
FP multiply       2 cycles      5 cycles      1 cycle       3 cycles
FP divide         18–38 cyc     18–38 cyc     39 cycles     39 cycles
FP sq root        29–69 cyc     29–69 cyc     70–140 cyc    70–140 cyc

Table 1. P6 floating-point latencies are similar to Pentium's, but integer arithmetic is much faster. Latencies are the same for single, double, and extended precision except where ranges are shown.
The FXCH (floating-point exchange) instruction is not handled by the FPU. This instruction swaps the top of the FP register stack with another register in the stack and is frequently used in x86 floating-point applications, since many x86 FP instructions access only the top of the stack. The P6 handles FXCH entirely in the reorder buffer by treating it as a renaming of two registers. Thus, FXCH uops enter the ROB but not the reservation station and never enter the function units.

The dual address-generation units each contain a four-input adder that combines all possible x86 address components (segment, base, index, and immediate) in a single cycle. As in Pentium, each unit also contains a second four-input adder to perform a segment-limit check in parallel. The resulting address (along with, for a store, source data) is then placed in the memory reorder buffer (MOB, in P6-ese) to await availability of the data cache.

The reorder buffer can accept three results per cycle: one from the main arithmetic unit, one from the secondary ALU, and one from a load uop. Because of long-latency operations, two or more function units within the main arithmetic unit can generate results in a single cycle. In this case, the function units must arbitrate to write to the ROB.

After a uop writes its result to the ROB (or, in the case of a store, to the MOB), it is eligible to be retired. The ROB will retire up to three uops per cycle, always in program order, by writing their results to the register file. Uops are never retired until all previous uops have completed successfully. If any exception or error occurs, uncommitted results in the ROB can be flushed, resetting the CPU to its proper state, as if the instructions had been executed in order up to the point of the error.

Nonblocking Caches Reduce Stalls

A key to the P6's performance is its cache subsystem. The primary data cache, at 8K, is relatively small for a processor of this generation, but the fast level-two cache helps alleviate the lower hit rate of the primary cache. If an access misses the data cache, the cache can continue servicing requests while waiting for the miss data to be returned. This technique, called a nonblocking cache or hit-under-miss, has been used for years by PA-RISC processors to avoid stalls.

The P6 takes advantage of the nonblocking cache with its memory reorder buffer. If the access at the front of the MOB misses, subsequent accesses will continue to execute. Since these accesses can execute out of order, the P6 must take care to avoid incorrect program execution. For example, loads can arbitrarily pass loads, but stores must always be executed in order. Loads can pass stores only if the two addresses are verified to be different. Intel would not reveal the size of the MOB or other details of its function.

The P6 data cache can process one load and one store per cycle as long as they access different banks. The cache is divided into four interleaved banks, half as many as Pentium's data cache. AMD's K5 also implements a four-bank data cache, and that company says that there is little benefit to an eight-bank design.

The P6 cache cannot handle two loads at once, in stark contrast to processors such as the K5, M1, and even Pentium. Typical x86 code generates a large number of memory references due to the limited register set, and more of these references will be loads than stores. Intel points out that the entire processor is designed for one load per cycle—even the decoders cannot produce more than one load uop per cycle—and that its simulations show this capability is adequate to attain the desired performance level.

The data cache has a latency of three cycles (including address generation) but is fully pipelined, producing one result per cycle. The unified level-two (L2) cache has a latency of three cycles and, like the data cache, is pipelined and nonblocking. The fast L2 latency is achieved by implementing the L2 cache as a single chip and combining it with the CPU in a single package, as Figure 2 (see below) shows.

Address translation occurs in parallel with the data cache access. If the access misses the data cache, the translated (real) address is sent to the L2 cache. The latency of a load that misses the data cache but hits in the L2 is six cycles, assuming that the L2 cache is not busy returning data from an instruction cache miss or an earlier data cache miss.

The P6 CPU contains a complete L2 cache controller and is Intel's first x86 processor with a dedicated L2 cache bus. Both NexGen's 586 and the R4000 use this design style to increase cache-bus bandwidth while

allowing the separate system bus to operate at a lower, more manageable speed.

[Figure 2. The P6 CPU and L2 cache are combined in a single 387-pin PGA package that measures 7.75 × 6.25 cm (2.66" × 2.46").]

The L2 cache uses the same 32-byte line size as the on-chip caches. It returns 64 bits of data at a time, taking four cycles to refill a cache line. The cache always returns the requested word within the first transfer, getting the critical data back to the processor as quickly as possible. Table 2 shows other cache parameters.

                      Instruction    Data           Level Two
Cache Size            8K             8K             256K
Line Size             32 bytes       32 bytes       32 bytes
Throughput / Latency  1 / 3 cycles   1 / 3 cycles   1 / 3 cycles
Nonblocking?          yes            yes            yes
Associativity         four-way       two-way        four-way
Access Width          128 bits       64 bits        64 bits
Number of Ports       one            two            one
Number of Banks       n/a            four           n/a
Indexed               virtual        virtual        physical
Tagged                physical       physical       physical
TLB Entries           32 entries     64 entries     —
TLB Associativity     fully          fully          —
TLB Number of Ports   one            two            —

Table 2. The fast nonblocking L2 cache is fully pipelined and helps make up for the higher miss rate of the small primary caches.

This combination of nonblocking caches with a fast L2 cache provides the P6 with better performance on memory accesses than its x86 competitors. The processor stalls less often and has relatively quick access to 256K of memory. The K5, by contrast, has 24K of on-chip cache but will take longer to access its secondary cache.

Deep Pipeline Speeds Clock Rate

Another advantage that the P6 has over its x86 competitors is a higher clock rate. Intel achieves this feat by deeply pipelining the chip. Figure 3 shows the P6 pipeline, which consists of 12 stages. This pipeline represents the case of an instruction that flows through the CPU as quickly as possible. It is more likely that the instruction will be stalled in the reservation station for some number of cycles, a delay represented by the thick black band. The second black band represents another potential delay: completed instructions can spend several cycles waiting in the ROB before retirement.

In the first stage, the next fetch address is calculated by accessing the branch target buffer (BTB). If there is a hit in the BTB, the fetch stream is redirected to the indicated location. Otherwise, the processor continues to the next sequential address.

The instruction cache access is spread across two and one-half cycles. The K5, in contrast, must calculate the next address and read from the instruction cache in a single cycle. By allowing multiple pipeline stages for the cache access, Intel removes this task from the critical timing path and allows the P6 clock to run faster.

Instructions are then fed to the decoders. The complex problem of decoding variable-length x86 instructions is allocated two and one-half cycles as well. Part of the problem in a superscalar x86 processor is identifying the starting point of the second and subsequent instructions in a group. The K5 includes predecode information in its instruction cache to hasten this process, but the P6 does not, to avoid both instruction-cache bloat and the bottleneck of predecoding instructions as they are read from the L2 cache.

The decoding issue is easier for P6 because the second and third decoders handle only simple instructions. The restricted decoders can wait for the general decoder to identify the length of the first instruction and still have time to handle the remaining instructions, assuming that they are simple ones. If they are not simple, the restricted decoders must pass them on to the general decoder in the next cycle.

At the end of stage 6, the x86 instructions have been fully decoded and translated into uops. These uops have their registers renamed in stage 7. In stage 8, the renamed uops are written to the reservation station. If the operands for a particular uop are available, and if no other uops have priority for the needed function unit, that uop is dispatched. Otherwise, the uop will wait for its operands to become available.

It takes one cycle (stage 9) for the reservation station to decide which uops can be dispatched. The P6 implements operand bypassing, so a result can be used on the immediately following cycle. The reservation station will attempt to have the corresponding uop arrive at the function unit at the same time as the necessary data.

Simple integer uops can execute in a single cycle (stage 10). Some integer uops and all FP uops take several cycles at this point. Load and store uops generate their address in one cycle and are written to the MOB; if the MOB is empty, a load will go directly to the data


cache, but it will take an additional three cycles (assuming a cache hit) before the loaded data is available for use. The address-generation unit is probably a critical timing path in the P6, limiting the processor to 133 MHz; RISC processors can typically cycle their simpler ALUs at 200 MHz or better in 0.5-micron technology.

[Figure 3. In the best case, instructions can flow through the P6 in 12 cycles, but the average is 18 cycles due to delays in the reservation station (RS) or the reorder buffer (ROB). Load instructions take longer and can also be delayed in the memory reorder buffer (MOB). The 12 stages cover BTB access, instruction cache access, x86 decode/uop generation, register rename, write to RS, read from RS, execute, and retire; the load pipeline adds address calculation, data cache access, and L2 cache access, with a bypass if the L1 cache hits.]

Once the uop executes, it writes its result to the ROB. If all previous uops have been retired, it takes one cycle to retire a uop. If previous instructions are still pending, however, it may take several cycles before the uop is retired. This delay does not impact performance.

Branch Prediction Accuracy Is Critical

The deep pipeline creates extraordinary branch penalties. The outcome of a conditional branch is not known until stage 10. The minimum penalty for a mispredicted branch is 11 cycles. Intel estimates that, on average, a uop spends four cycles in the reservation station, so a mispredicted branch will typically cause a 15-cycle penalty. If the branch spends an unusually long time in the reservation station, the penalty could be even worse.

Thus, the P6 designers spent a lot of effort to reduce the number of mispredicted branches. Like Pentium, the P6 uses a branch target buffer that retains both branch-history information and the predicted target of the branch. This table is accessed by the current program counter (PC). If the PC hits in the BTB, a branch is predicted to the target address indicated by the BTB entry; there is no delay for correctly predicted taken branches.

The BTB has 512 entries organized in a four-way set-associative cache. This size is twice that of Pentium, improving the hit rate. The P6 rejects the commonly used Smith algorithm, which maintains four states using two bits, in favor of the more recent Yeh method[1]. This adaptive algorithm uses four bits of branch history and can recognize and predict repeatable sequences of branches, for example, taken–taken–not taken. We estimate that the P6 BTB will deliver close to 90% accuracy on programs such as the SPECint92 suite.

A second type of misprediction occurs if a branch misses in the BTB. This situation is not detected until the instruction is fully decoded in stage 6. The branch is predicted to be taken if the offset is negative (indicating a likely loop), and the target address, if available, is used to redirect the fetch stream. At that point, however, seven cycles have been wasted fetching and decoding instructions that are unlikely to be needed. In this case, the long fetch-and-decode pipeline saps performance.

Forward branches that miss the BTB are predicted to be not taken, so the sequential path continues to be fetched with no delay. Branches that miss the BTB are mispredicted more often than those that hit and are subject to the same mispredicted branch penalties.

Conditional Move Added

The P6 instruction set is nearly identical to Pentium's and, in fact, to that of the 386. The most significant addition is a conditional move (CMOV) instruction. This instruction, which has been added to several RISC instruction sets recently, helps avoid costly mispredictions by eliminating branches. The instruction copies the contents of one register into another only if a particular condition flag is set, replacing a test-and-branch sequence.

Like all current Intel processors, the P6 implements system-management mode. The new CPU also supports all the features in Pentium's secret Appendix H. Intel says that these features—including CPU ID, large page sizes, and virtual-8086 extensions—will be fully documented when the P6 is released. The P6 also implements performance counters similar to those described in Appendix H, but it uses a new instruction that allows them to be accessed more easily.

It has been widely speculated that the P6 would include new instructions to speed NSP (native signal processing) applications. While the P6's improved integer multiplier will assist NSP, multimedia extensions such as those in Sun's UltraSparc (see 081603.PDF) would have an even bigger effect. Intel, however, denies that the P6 has any such extensions. This oversight could give RISC processors like UltraSparc a significant performance advantage when executing increasingly popular multimedia software.

P6 Bus Allows Glueless MP

The P6 system bus is completely redesigned from the Pentium bus. Both use 64 bits of data and operate at

a maximum of 66 MHz, but the P6 can sustain much higher bandwidth because it uses a split-transaction protocol. When Pentium reads from its bus, it sends the address, waits for the result, and reads the returned data. This sequence ties up the bus for essentially the entire transaction. The P6 bus, in contrast, continues to conduct transactions while waiting for results. This overlapping of transactions greatly improves overall bus utilization.

All arbitration is conducted on separate signals on the P6 bus in parallel with data transmission. Addresses are sent on their own 36-bit bus, so the bus is capable of sustaining the full 528-Mbyte/s bandwidth for an indefinite period. Since the P6 has a private cache bus, the entire system-bus bandwidth can be devoted to memory and I/O accesses. (Intel will disclose the details of the P6 system bus at a later date.)

A split-transaction bus is ideal for a multiprocessor (MP) system. Intel has designed the P6 bus to support up to four processors without any glue logic; that is, the processor pins can be wired directly to each other with only a single chip set to support several CPUs. Like Pentium, the P6 includes Intel's advanced priority interrupt controller (APIC), simplifying multiprocessor designs. The company believes that the system bus has enough bandwidth to support four P6 processors with little performance degradation. Larger MP systems can be built from clusters of four processors each.

To operate at 66 MHz with up to eight devices (four processors along with two memory controllers and two I/O bridges), the P6 bus operates at modified GTL (Gunning transceiver logic) signal levels. This lower signal voltage (1.5 V) reduces settling time in a complex electrical environment. Intel will roll out the first P6 chip sets along with the processor, but other vendors are expected to produce compatible chip sets as well.

With its integrated cache and APIC, the P6 module offers an easy way to upgrade an MP system by plugging in a new processor. This feature will be most useful in servers, which often sell in MP configurations today. Ultimately, the P6 will be used in multiprocessor PCs, a market being seeded today by the APIC-enabled P54C.

Another Big, Power-Hungry CPU

As always, there is no free lunch: high performance comes at a price. The P6 CPU requires 5.5 million transistors, of which about 4.5 million are for logic and 1.0 million are in the 16K of cache. Even in a 0.5-micron four-layer-metal BiCMOS process, this circuitry requires 306 mm2 of silicon, as Figure 4 shows. To keep this in perspective, however, the die size is only 4% bigger than that of the original Pentium. As with Pentium, a shrink to the next process will make the P6 much smaller and easier to manufacture.

[Figure 4. The P6 CPU measures 17.5 × 17.5 mm when fabricated in a 0.5-micron four-layer-metal BiCMOS process.]

The cache chip consumes 202 mm2 and is built in the same process as the CPU, as it must operate at the same clock rate. It contains tags and data storage for 256K of cache and requires 15 million transistors.

Although the P6 uses the same nominal IC process as the P54C Pentium, it operates from a core voltage of 2.9 V rather than Pentium's 3.3 V. This change could indicate some minor process tweaks to reduce transistor size or gate-oxide thickness; such tweaks can improve performance but reduce the allowable supply voltage.

The lower voltage has the side effect of reducing the power consumption; even so, the P6 CPU has a preliminary power rating of 15 W maximum and 12 W typical. With the 256K L2 cache chip, the total power is expected to be 20 W maximum and 15 W typical. This power rating is slightly greater than that of the original P5 Pentium and is quite reasonable compared with next-generation RISC processors, which start at 30 W. The greater surface area of the P6 package allows it to use shorter heat sinks than the notoriously hot P5.

Intel investigated many types of multichip modules (MCMs) before settling on a simple two-cavity PGA, eschewing more complicated (and costly) flip-chip options. The package is similar to a standard PGA with extra ceramic layers to route the 64-bit cache bus between the CPU and cache chips. The die are attached using standard wire bonding; no unusual substrates are required.

The pin arrangement, shown previously in Figure 2, is asymmetric: interstitial pins surround the CPU, which drives all the signals that leave the module, but not the cache chip. We estimate that this 387-pin dual-cavity package costs Intel about $50 in volume. The MPR Cost Model computes the overall manufacturing cost of the P6


module to be roughly $350. This cost is greater than that of competitive chips, but the entire L2 cache is included. In a 0.35-micron process, the cost of the P6 could drop to $150 in 1997.

                   Intel P6     MIPS R10000  AMD K5       Cyrix M1     NexGen Nx586  Intel Pentium
Clock Speed        133 MHz      200 MHz      100 MHz      100 MHz      93 MHz        100 MHz
Cache Size (I/D)   8K/8K        32K/32K      16K/8K       16K          16K/16K      8K/8K
Dispatch Rate      3 instr      4 instr      2–3 instr    2 instr      1 instr      2 instr
Function Units     5 units      5 units      7 units      2 units      4 units      3 units
Predecode Bits     none         4 per 32     5 per 8      none         none         none
Branch History     512 × 4      512 × 2      1,024 × 1    256 × 2      2,048 × 2    256 × 2
Out of Order       40 instr     32 instr     16 instr     limited      14 instr     none
Rename Regs        40 regs      64 regs      16 regs      32 regs      22 regs      none
L2 Cache Bus?      yes          yes          no           no           yes          no
Glueless MP?       yes          yes          no           no           no           no
IC Process         0.5µ 4M      0.5µ 4M      0.5µ 3M      0.65µ 3M     0.65µ 4M     0.5µ 4M
                   BiCMOS       CMOS         CMOS         CMOS         CMOS         BiCMOS
Logic Transistors  4.5 million  2.3 million  2.4 million  2.1 million  1.6 million  2.4 million
Total Transistors  5.5 million  5.9 million  4.3 million  3.0 million  3.5 million  3.3 million
Package Type       387-pin      527-pin      296-pin      296-pin      463-pin      296-pin
                   MCM-C        CPGA         CPGA         CPGA         CPGA         CPGA
Die Size           306 mm²      298 mm²      225 mm²*     394 mm²      196 mm²      163 mm²
Est Mfg Cost       $350*†       $320*        $170*        $340*        $200*        $120*
Power (max)        20 W†        30 W         12 W*        10 W         16 W         10 W
Availability       3Q95*        4Q95         3Q95*        3Q95*        3Q94         2Q94
SPECint92 (est)    200 int      >300 int     130 int      120 int*     110 int*     113 int
SPECfp92 (est)     200 fp*      >600 fp      75 fp*       70 fp*       n/a          82 fp

Table 3. The P6 feature set stacks up well against top x86 competitors and the R10000, a similar RISC implementation. The key differences are clock speed and performance. (Source: vendors except *MDR estimates; †includes L2 cache chip)

Improving System Performance

Intel designed the P6 to achieve high performance at the system level, not just within the CPU core. Thus, the team placed significant emphasis on the cache subsystem and the system bus as well as the CPU. The nonblocking caches, closely coupled L2 cache, and split-transaction bus exemplify this emphasis. These features put the P6 a step beyond competitive x86 processors.

In contrast, AMD's K5 appears to be a P6-class CPU core trapped in a Pentium pinout. The business decision to target the Pentium interface leaves the K5 with a single bus for both cache and I/O traffic, hampering performance on programs that overflow the K5's small on-chip caches. This bottleneck will become more severe as AMD increases the K5's
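The MPR Cost Model itself is proprietary, but the arithmetic behind any such model is standard: wafer cost divided by good dies per wafer, with yield falling off exponentially as die area grows. The sketch below illustrates that technique only; the wafer cost, defect density, and simple Poisson yield model are assumptions for illustration, not MPR's actual inputs.

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=200):
    """Approximate gross dies per wafer: area term minus an edge-loss term."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def die_yield(die_area_mm2, defects_per_cm2=0.8):
    """Poisson yield model: Y = exp(-area * defect density)."""
    return math.exp(-(die_area_mm2 / 100) * defects_per_cm2)

def die_cost(die_area_mm2, wafer_cost=3000, defects_per_cm2=0.8):
    """Bare-die cost = wafer cost / good dies per wafer."""
    good = dies_per_wafer(die_area_mm2) * die_yield(die_area_mm2, defects_per_cm2)
    return wafer_cost / good

# Illustrative inputs only: a 306 mm2 CPU die on an assumed $3,000 wafer.
print(f"estimated bare-die cost: ${die_cost(306):.0f}")
```

On any plausible inputs, the model reproduces the qualitative claims above: cost climbs steeply with die area, and a process shrink cuts cost more than linearly, since a smaller die both fits more copies per wafer and yields better.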
clock speed to 150 MHz or higher, since the secondary cache will continue to be restricted to 66 MHz or less.

Intel's deeper pipeline should give the P6 a clock-speed advantage over the K5 in comparable manufacturing processes. While the deeper pipeline also increases pipeline penalties, the P6 has much more sophisticated branch prediction and a larger reorder buffer, allowing it to outperform the K5 on a clock-for-clock basis as well. The K5 will, of course, be less expensive and probably consume less power than the P6.

Cyrix's M1 shares the same system-interface constraints as the K5. Furthermore, its static two-pipeline design is less efficient than the decoupled design of the P6 (and K5). Although Cyrix has access to leading-edge manufacturing technology from IBM, it isn't clear that the company has the resources to quickly move its CPU to the latest processes, keeping pace with Intel and AMD. Cyrix must start from scratch to develop a decoupled P6-class CPU, a process that will take years.

The P6 is similar to several of the next-generation RISC processors, in particular the MIPS R10000 (see 081403.PDF). Table 3 compares these designs. The R10000 is a four-way superscalar processor, while the P6 is three-way superscalar. The MIPS chip can execute up to 32 instructions out of order, a few less than the P6. Both have a dedicated L2 cache bus and support a high-bandwidth MP system bus.

The P6's CISC handicap shows in two places. Despite the similar microarchitectures, the P6 requires nearly twice as many logic transistors as the MIPS chip; the extra logic handles x86 decode, uop translation, and the foibles of the x86 instruction set. Since both chips have similar die size and transistor budgets, the R10000 is able to include four times as much on-chip cache as the P6, improving performance on many programs.

Second, the first P6 will run at 133 MHz, while the R10000 is expected to achieve 200 MHz using a similar manufacturing process. To come even this close in clock speed, Intel uses a very deep pipeline, a concept that MIPS tried and rejected for the R10000. The deeper pipeline has greater branch penalties, sapping performance. And, of course, the higher clock speed gives the R10000 an intrinsic performance advantage. As a result, the MIPS chip should achieve at least 50% better integer performance than the P6.

Working within the constraints of the x86 instruction set is always a challenge. The P6 takes a huge step beyond the static Pentium architecture, applying a decoupled superscalar engine to the performance problem. Although this design works around many of the bottlenecks of the x86 instruction set, it doesn't match the performance of a pure RISC chip. Compared with other x86 processors, the P6 is clearly the best of class and sets a new standard for other vendors to match. ♦

[1] Tse-Yu Yeh and Yale Patt, "Two-Level Adaptive Training Branch Prediction," 24th International Symposium on Microarchitecture (Nov. 1991), pp. 51–61.
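The pipeline-depth tradeoff running through these comparisons can be made concrete with a rough effective-CPI model: a deeper pipeline buys clock speed but raises the misprediction penalty. All numbers below are illustrative assumptions, not measured figures for any of these chips.

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """Effective CPI = base CPI + average stall from branch mispredictions."""
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

# Illustrative only: the deeper pipeline pays a larger misprediction penalty,
# so it needs a better predictor just to hold per-clock performance.
shallow = effective_cpi(0.5, 0.20, 0.15, 4)   # shorter pipeline, simpler predictor
deep    = effective_cpi(0.5, 0.20, 0.08, 11)  # deeper pipeline, better predictor
print(f"shallow: {shallow:.3f} CPI, deep: {deep:.3f} CPI")
# Even with the better predictor, per-clock CPI comes out slightly worse;
# the deeper design must win back the difference through clock speed.
```

On these assumed numbers the deeper pipeline loses roughly 9% per clock, which a 33% clock-speed advantage more than recovers — the shape of the deep-pipeline argument made above.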
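The two-level adaptive scheme of Yeh and Patt cited in [1] is the style of branch prediction at issue here. Below is a minimal global-history sketch of the technique; the history length, table size, and indexing are illustrative and far simpler than any shipping implementation.

```python
class TwoLevelPredictor:
    """Two-level adaptive branch predictor in the style of Yeh and Patt:
    a branch-history shift register selects a 2-bit saturating counter
    in a pattern-history table."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0                       # last N outcomes, newest in LSB
        self.pht = [2] * (1 << history_bits)   # 2-bit counters, start weakly taken

    def predict(self):
        return self.pht[self.history] >= 2     # counter values 2,3 mean "taken"

    def update(self, taken):
        ctr = self.pht[self.history]
        self.pht[self.history] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# A repeating taken/taken/taken/not-taken pattern holds a lone 2-bit counter
# to 75% accuracy, but is learned exactly once the history register warms up.
p = TwoLevelPredictor()
hits = 0
pattern = [True, True, True, False] * 50
for taken in pattern:
    hits += (p.predict() == taken)
    p.update(taken)
print(f"accuracy on repeating pattern: {hits / len(pattern):.0%}")
```

The second level is what buys the accuracy: each distinct recent-history value gets its own counter, so periodic patterns that confuse a single counter map to stable, separately trained table entries.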