A Case For CXL-Centric Server Processors

Our proposed server design, dubbed CoaXiaL, replaces all of the processor's direct DDR interfaces with CXL.

By evaluating CoaXiaL with a wide range of workloads, we highlight how a CXL-based memory system's unique characteristics (i.e., increased bandwidth and higher unloaded latency) positively impact the performance of processors whose memory system is typically loaded. Our analysis relies on a simple but often overlooked fact about memory system behavior and its impact on overall performance: a loaded memory system's effective latency is dominated by queuing effects and therefore differs significantly from the unloaded system's latency, as we demonstrate in §3.1. A memory system that offers higher parallelism reduces queuing effects, which in turn results in lower average latency and variance, even if its unloaded access latency is higher than that of existing systems. We argue that CXL-based memory systems offer exactly this design trade-off, which is favorable for loaded server processors handling memory-intensive applications, offering strong motivation for a radical change in memory system design that departs from two decades of DDR and enables scalable high-performance server architectures.

In summary, we make the following contributions:
• We make the radical proposal of using high-bandwidth CXL as a complete replacement of pin-inefficient DDR interfaces on server processors, showcasing a ground-breaking shift that disrupts decades-long memory system design practices.
• We show that, despite its higher unloaded memory access latency, CoaXiaL reduces the effective memory access time in typical scenarios where the memory system is loaded.
• We demonstrate the promise of CoaXiaL with a study of a wide range of workloads for various CXL bandwidth and latency design points that are likely in the near future.
• We identify limitations imposed on CXL by the current PCIe standard, and highlight opportunities a revised standard could leverage for 20% additional speedup.

Paper outline: §2 motivates the replacement of DDR with CXL in server processors. §3 highlights the critical impact of queuing delays on a memory system's performance, and §4 provides an overview of our proposed CoaXiaL server design, which leverages CXL to mitigate detrimental queuing. We outline the methodology to evaluate CoaXiaL against a DDR-based system in §5 and analyze performance results in §6. We discuss related work in §7 and conclude in §8.

2. Background

In this section, we highlight how DRAM memory bandwidth is bottlenecked by the processor-attached DDR interface and processor pin count. We then discuss how CXL can bridge this gap by using PCIe as its underlying physical layer.

2.1. Low-latency DDR-based Memory

Servers predominantly access DRAM over the Double Data Rate (DDR) parallel interface. The interface's processor pin requirement is determined by the width of the data bus, command/address bus, and configuration pins. A DDR4 and DDR5 [21] interface is 288 pins wide. While several of those pins are terminated at the motherboard, most of them (160+ for an ECC-enabled DDR4 channel [20], likely more for DDR5 [45]) are driven to the processor chip.

The DDR interface's 64 data bits directly connect to the processor and are bit-wise synchronous with the memory controller's clock, enabling a worst-case (unloaded) access latency of about 50ns. Scaling a DDR-based memory system's bandwidth requires either clocking the channels at a higher rate or attaching more channels to the processor. The former approach results in signal integrity challenges [39] and a reduction in supported ranks per channel, limiting rank-level parallelism and memory capacity. Accommodating more channels requires more on-chip pins, which cost significant area and power, and complicate placement, routing, and packaging [62]. Consequently, the pin count on processor packages has only been doubling about every six years [51].

Thus, reducing the number of cores that contend over a memory channel is difficult without clean-slate technologies, which we discuss in §7. The emerging CXL interconnect is poised to bridge this gap by leveraging a widely deployed high-bandwidth serial interface, as we discuss next.

2.2. The High-bandwidth CXL Memory Interconnect

The Compute Express Link (CXL) is a recent interconnect standard, designed to present a unified solution for coherent accelerators, non-coherent devices, and memory expansion devices. It represents the industry's concerted effort for a standardized interconnect to replace a motley collection of proprietary solutions (e.g., OpenCAPI [55], Gen-Z [11]). CXL is rapidly garnering industry adoption and is bound to become a dominant interconnect, as PCIe has been for peripheral devices over the past twenty years.

CXL brings load-store semantics and coherent memory access to high-capacity, high-bandwidth memory for processors and accelerators alike. It also enables attaching DDR-based memory ("Type-3" CXL devices) over PCIe to the processor with strict timing constraints. In this work, we focus on this capability of CXL. CXL's underlying PCIe physical layer affords higher bandwidth per pin at the cost of increased latency. Therefore, most recent works perceive CXL as a technology enabling an auxiliary, slower memory tier directly attached to the processor. In contrast, we argue that despite its associated latency overhead, CXL can play a central role in future memory system design, replacing, rather than simply augmenting, DDR-based memory in server processors.

2.3. Scaling the Memory Bandwidth Wall with CXL

CXL's high bandwidth owes to its underlying PCIe physical layer. PCIe [47] is a high-speed serial interface featuring multiple independent lanes capable of bi-directional communication using just 4 pins per lane: two for transmitting data and two for receiving data. Data is sent over each lane as a serial bit stream at very high bit rates in an encoded format.

Figure 1: Bandwidth per processor pin for DDR and CXL (PCIe) interfaces, normalized to PCIe 1.0. Note that the y-axis is in log scale.

Fig. 1 illustrates the bandwidth per pin for PCIe and DDR. The normalized bandwidth per pin is derived by dividing each interface's peak bandwidth on JEDEC's and PCI-SIG's roadmap, respectively, by the processor pins required: 160 for DDR and 4 per lane for PCIe.

The 4× bandwidth gap is where we are today (PCIe 5.0 vs. DDR5-4800). The comparison is conservative, given that PCIe's stated bandwidth is per direction, while DDR5-4800 requires about 160 processor pins for a theoretical 38.4GB/s peak of combined read and write bandwidth. With a third of the pins, 12 PCIe 5.0 lanes (over which CXL operates) offer 48GB/s per direction—i.e., a theoretical peak of 48GB/s for reads and 48GB/s for writes. Furthermore, Fig. 1's roadmaps suggest that the bandwidth gap will grow to 8× by 2025.
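To make Fig. 1's normalization concrete, the short sketch below recomputes today's per-pin comparison. The 160-pin DDR channel and 4-pins-per-lane PCIe counts follow the text; treating a PCIe 5.0 lane as roughly 4GB/s per direction is our simplifying assumption.

```python
# Per-pin bandwidth comparison for today's design points (PCIe 5.0 vs. DDR5-4800).
# Assumption: a PCIe 5.0 lane carries ~4GB/s per direction over 4 pins.

DDR5_4800_GBPS = 38.4       # combined read+write peak of one channel
DDR5_PINS = 160             # processor pins for an ECC-enabled channel (approx.)

PCIE5_LANE_GBPS = 4.0       # per direction, per lane
PINS_PER_LANE = 4           # 2 TX + 2 RX pins

ddr_per_pin = DDR5_4800_GBPS / DDR5_PINS        # ~0.24 GB/s per pin
pcie_per_pin = PCIE5_LANE_GBPS / PINS_PER_LANE  # ~1.0 GB/s per pin, per direction

print(f"DDR5-4800: {ddr_per_pin:.2f} GB/s per pin")
print(f"PCIe 5.0 : {pcie_per_pin:.2f} GB/s per pin (per direction)")
print(f"gap      : {pcie_per_pin / ddr_per_pin:.1f}x")  # ~4x, matching Fig. 1
```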
2.4. CXL Latency Concerns

CXL's increased bandwidth comes at the cost of increased latency compared to DDR. There is a widespread assumption that this latency cost is significantly higher than the DRAM access latency itself. For instance, recent work on CXL-pooled memory reinforces that expectation by reporting a latency overhead of 70ns [25]. The expectation of such high added latency has reasonably led memory system researchers and designers to predominantly focus on CXL as a technology for enabling a secondary tier of slower memory that augments conventional DDR-attached memory. However, such a high latency overhead does not represent the minimum attainable latency of the simplest CXL-attached memory and is largely an artifact of more complex functionality, such as multiplexing multiple memory devices, enforcing coherence between the host and the memory device, etc.

In this work, we argue that CXL is a perfect candidate to completely replace DDR-attached memory for server processors that handle memory-intensive workloads. The CXL 3.0 standard sets an 80ns pin-to-pin load latency target for a CXL-attached memory device [9, Table 13-2], which in turn implies that the interface-added latency over DRAM access in upcoming CXL memory devices should be about 30ns. Early implementations of the CXL 2.0 standard demonstrated a 25ns latency overhead per direction [42], and in 2021 PLDA announced a commercially available CXL 2.0 controller that only adds 12ns per direction [43]. Such low latency overheads are attainable with the simplest CXL Type-3 devices that are not multiplexed across multiple hosts and do not need to initiate any coherence transactions. Our key insight is that a memory access latency penalty on the order of 30ns often pales in comparison to the queuing delays at the memory controller that are common in server systems, and such queuing delays can be curtailed by CXL's considerable bandwidth boost.

3. Pitfalls of Unloaded and Average Latency

It is evident from current technological trends that systems with CXL-attached memory can enjoy significantly higher bandwidth availability compared to conventional systems with DDR-attached memory. A key concern hindering broad adoption—and particularly our proposed replacement of DDR interfaces on-chip with CXL—is CXL's increased memory access latency. However, in any system with a loaded memory subsystem, queuing effects play a significant role in determining effective memory access latency. On a loaded system, queuing (i) dominates the effective memory access latency, and (ii) introduces variance in accessing memory, degrading performance. We next demonstrate the impact of both effects.

3.1. Queuing Dictates Effective Memory Access Latency

Figure 2: (a) Average and p90 memory access latency in a DDR5-4800 channel (38.4GB/s) at varying bandwidth utilization points. p90 grows faster than the average latency. (b) Memory latency breakdown (DRAM access time and queuing delay) and memory bandwidth utilization for a range of workloads. Higher utilization increases queuing delay.

Fig. 2a shows a DDR5-4800 channel's memory access latency as its load increases. We model the memory using DRAMSim [46] and control the load with random memory accesses of configurable arrival rate. The resulting load-latency curve is shaped by queuing effects at the memory controller.

When the system is unloaded, a hypothetical CXL interface adding 30ns to each memory access would correspond to a seemingly prohibitive 75% latency overhead compared to the approximate unloaded latency of 40ns. However, as the memory load increases, latency rises exponentially, with average latency increasing by 3× and 4× at 50% and 60% load, respectively. p90 tail latency grows even faster, rising by 4.7× and 7.1× at the same load points. In a loaded system, trading off additional interface latency for considerably higher bandwidth availability can yield a significant net latency gain.

To illustrate, consider a baseline DDR-based system operating at 60% memory bandwidth utilization, corresponding to 160ns average and 285ns p90 memory access latency. A CXL-based alternative offering a 4× memory bandwidth boost would shrink the system's bandwidth utilization to 15%, corresponding to 50% lower average and 68% lower p90 memory access latency compared to the baseline, despite the CXL interface's 30ns latency premium.
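The figures above come from the measured load-latency curve in Fig. 2a. The sketch below only illustrates the shape of the trade-off with a textbook M/M/1 queue; that model is our simplifying assumption, not the DRAMSim setup used in the paper, so the absolute numbers differ.

```python
# First-order illustration of trading interface latency for lower utilization,
# using an M/M/1 queue as a stand-in for the memory controller. This is a
# simplification; the paper's numbers come from DRAMSim, which models banks,
# ranks, and scheduling in detail.

def avg_latency_ns(unloaded_ns: float, utilization: float, interface_ns: float = 0.0) -> float:
    """M/M/1 average time in system, plus a fixed interface penalty."""
    assert 0.0 <= utilization < 1.0
    return unloaded_ns / (1.0 - utilization) + interface_ns

UNLOADED_NS = 40.0  # approximate unloaded DRAM access latency (Fig. 2a)

# DDR baseline: no interface penalty, but the channel runs at 60% utilization.
ddr = avg_latency_ns(UNLOADED_NS, utilization=0.60)
# CXL alternative: 4x the bandwidth drops utilization to 15%, at a +30ns penalty.
cxl = avg_latency_ns(UNLOADED_NS, utilization=0.15, interface_ns=30.0)

print(f"DDR @ 60% utilization: {ddr:.0f} ns")  # ~100 ns in this toy model
print(f"CXL @ 15% utilization: {cxl:.0f} ns")  # ~77 ns, despite the 30 ns premium
```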
Fig. 2a shows that a system with bandwidth utilization as low as 20% experiences queuing effects, which are initially reflected in tail latency; beyond 40% utilization, queuing effects also noticeably affect average latency. Utilization beyond that level is common, as we show with our simulation of a 12-core processor with one DDR5 memory channel over a range of server and desktop applications (methodological details in §5). Fig. 2b shows that with all processor cores in use, the vast majority of workloads exceed 30% memory bandwidth utilization, and most exceed 50% utilization (except several workloads from the SPEC and PARSEC benchmarks).

Fig. 2b also breaks down the average memory access time seen from the LLC miss register into DRAM service time and queuing delay at the memory controller. We observe a trend of high bandwidth consumption leading to long queuing delays, although queuing delay is not a direct function of bandwidth utilization. Queuing delay is also affected by application characteristics such as the read/write pattern and the spatial and temporal distribution of accesses. For example, in an access pattern where the processor issues the majority of its memory requests in a short amount of time, followed by a period of low memory activity, the system would temporarily be in a high-bandwidth-utilization state when memory requests are made, experiencing contention and high queuing delay, even though the average bandwidth consumption would not be as high. Even in such cases, provisioning more bandwidth would lead to better performance, as it would mitigate contention from the temporary bursts. In Fig. 2b's workloads, queuing delay constitutes 72% of the memory access latency on average, and up to 91% in the case of lbm.

3.2. Memory Latency Variance Impacts Performance

In addition to their effect on average memory access latency, spurious queuing effects at the memory controller introduce higher memory access latency fluctuations (i.e., variance). Such variance is closely related to the queuing delay stemming from high utilization, as discussed in §3.1. To demonstrate the impact of memory access latency variance on performance, we conduct a controlled experiment where the average memory access latency is kept constant, but the latency fluctuation around the average grows. The baseline is a toy memory system with a 150ns fixed access latency, and we evaluate three additional memory systems where memory access latency follows a bimodal distribution with 80%/20% probability of being lower/higher than the average. We keep average latency constant in all cases (80% × low_lat + 20% × high_lat = 150ns) and evaluate (low_lat, high_lat) pairs of (100ns, 350ns), (75ns, 450ns), and (50ns, 550ns), resulting in distributions with increasing standard deviations (stdev) of 100ns, 150ns, and 200ns. Variance is the square of the stdev and denotes how spread out the latency is from the average.
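The quoted means and standard deviations follow directly from the bimodal parameters; the snippet below is just that arithmetic.

```python
# Check the mean and stdev of each bimodal latency distribution used in the
# Fig. 3 experiment: 80% of accesses see low_lat, 20% see high_lat.
import math

P_LOW, P_HIGH = 0.8, 0.2

for low, high in [(100, 350), (75, 450), (50, 550)]:
    mean = P_LOW * low + P_HIGH * high
    stdev = math.sqrt(P_LOW * (low - mean) ** 2 + P_HIGH * (high - mean) ** 2)
    print(f"({low}ns, {high}ns): mean = {mean:.0f}ns, stdev = {stdev:.0f}ns")

# All three keep a 150ns mean, while stdev grows from 100ns to 150ns to 200ns.
```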
Figure 3: Performance of workloads for synthetic memory access latency following three (X, Y) bimodal distributions with a 4:1 X:Y ratio, all with 150ns average latency, normalized to a memory system with fixed 150ns latency. "gm" refers to geometric mean. Higher latency variance degrades performance.

Fig. 3 shows the relative performance of these memory systems for five workloads of decreasing memory bandwidth intensity. As variance increases, the average performance relative to the fixed-latency baseline noticeably drops to 86%, 78%, and 71%. This experiment highlights that solely relying on typical average metrics like Average Memory Access Time (AMAT) is an incomplete method of evaluating a memory system's performance. In addition to average values, the variance of memory access latency is a major performance determinant and therefore an important quality criterion for a memory system.
…where queuing dictates the memory system's effective access latency. By decreasing queuing, a CXL-based memory system reduces average memory access time and variance, both of which improve performance.

4. The CoaXiaL Server Design

We leverage CXL's per-pin bandwidth advantage to replace all of the DDR interfaces with PCIe-based CXL interfaces in our proposed CoaXiaL server. Fig. 4b depicts our architecture, where each on-chip DDR5 channel is replaced by several CXL channels, providing 2–4× higher aggregate memory bandwidth to the processor. Fig. 4a shows the baseline DDR-based server design for comparison. Each CXL channel is attached to a "Type-3" CXL device, which features a memory controller that manages a regular DDR5 channel connecting to DRAM. The processor implements the CXL.mem protocol of the CXL standard, which orchestrates data consistency and memory semantics management. The implementation of the caches and cores remains unchanged, as the memory controller still supplies 64B cache lines.

Figure 4: Overview of the baseline and CoaXiaL systems. (a) Baseline DDR-based server. (b) CoaXiaL replaces each DDR channel with several CXL channels. Each CXL channel connects to a Type-3 device with one DDR memory channel.

Table 1: Area of processor components at TSMC 7nm (relative to 1MB of L3 cache).
L3 cache (1MB): 1
Zen 3 core (incl. 512 KB L2): 6.5
x8 PCIe (PHY + ctrl): 5.9
DDR channel (PHY + ctrl): 10.8

Table 2: DDR-based versus alternative CoaXiaL server configurations.
| Server design | Core count | LLC per core | Memory interfaces | Relative mem. BW | Relative area | Comment |
| DDR-based | 144 | 2 MB | 12 DDR | 1× | 1 | baseline |
| CoaXiaL-5× | 144 | 2 MB | 60 x8 CXL | 5× | 1.17 | iso-pin |
| CoaXiaL-2× | 144 | 2 MB | 24 x8 CXL | 2× | 1.01 | iso-LLC, iso-area |
| CoaXiaL-4× | 144 | 1 MB | 48 x8 CXL | 4× | 1.01 | balanced, iso-area |
| CoaXiaL-asym | 144 | 1 MB | 48 x8 CXL-asym | asym. R/W, max BW | 1.01 | see §4.3 |

4.1. Processor Pin Considerations

A DDR5-4800 channel features a peak uni-directional bandwidth of 38.4GB/s and requires more than 160 processor pins to account for data and ECC bits, the command/address bus, data strobes, clock, feature modes, etc., as described in §2.1. A full 16-lane PCIe connection delivers 64GB/s of bi-directional bandwidth. Moreover, PCIe is modular, and higher-bandwidth channels can be constructed by grouping independent lanes together. Each lane requires just four processor pins: two each for transmitting and receiving data.

The PCIe standard currently only allows groupings of 1, 2, 4, 8, 12, 16, or 32 lanes. To match DDR5's bandwidth of 38.4GB/s, we opt for an x8 configuration, which requires 32 pins for a peak bandwidth of 32GB/s, 5× fewer pins than the 160 required for the DDR5 channel. As PCIe can sustain 32GB/s in each direction, the peak aggregate bandwidth of 8 lanes is 64GB/s, much higher than DDR5's 38.4GB/s. Considering a typical 2:1 read:write ratio, only 25.6GB/s of a DDR5 channel's bandwidth would be used in the DRAM-to-CPU direction, and about 13GB/s in the opposite direction. Furthermore, DDR controllers typically sustain only around 70% to 90% of the theoretical peak bandwidth. Thus, even after factoring in PCIe and CXL's header overheads, which reduce the practically attainable bandwidth [48] to 26GB/s in the DRAM-to-CPU direction and 13GB/s in the other direction, the x8 configuration supports a full DDR5 channel without becoming a choke point.
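The pin and bandwidth comparison above can be summarized in a few lines; the 26/13 GB/s goodput figures are the ones quoted from [48] in the text.

```python
# One x8 CXL channel vs. one DDR5-4800 channel, using the figures from the text.

DDR5_PEAK_GBPS = 38.4            # combined read+write peak
DDR5_PINS = 160
READ_FRACTION = 2 / 3            # typical 2:1 read:write ratio

ddr_read_demand = DDR5_PEAK_GBPS * READ_FRACTION         # ~25.6 GB/s DRAM->CPU
ddr_write_demand = DDR5_PEAK_GBPS * (1 - READ_FRACTION)  # ~12.8 GB/s CPU->DRAM

CXL_X8_PINS = 32
CXL_X8_RX_GOODPUT = 26.0         # GB/s after PCIe/CXL header overheads [48]
CXL_X8_TX_GOODPUT = 13.0

print(f"reads : need {ddr_read_demand:.1f} GB/s, x8 CXL offers {CXL_X8_RX_GOODPUT} GB/s")
print(f"writes: need {ddr_write_demand:.1f} GB/s, x8 CXL offers {CXL_X8_TX_GOODPUT} GB/s")
print(f"pins  : {DDR5_PINS} (DDR5) vs {CXL_X8_PINS} (x8 CXL), i.e. {DDR5_PINS // CXL_X8_PINS}x fewer")
```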
4.2. Silicon Area Considerations

When it comes to processor pin requirements, CoaXiaL allows the replacement of each DDR channel (i.e., PHY and memory controller) with five x8 PCIe PHYs and controllers, for a 5× memory bandwidth boost. However, the relative pin requirements of DDR and PCIe are not directly reflected in their relative on-chip silicon area requirements. Lacking publicly available information, we derive the relative size of DDR and PCIe PHYs and controllers from AMD Rome and Intel Golden Cove die shots [29, 53].

Table 1 shows the relative silicon die area that different key components of the processor account for. Assuming linear scaling of PCIe area with the number of lanes, as appears to be the case from the die shots, an x8 PCIe controller accounts for 54% of a DDR controller's area. Hence, replacing each DDR controller with four x8 PCIe controllers requires 2.19× more silicon area than what is allocated to DDR. However, DDR controllers account for a small fraction of the total CPU die.
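The 54% and 2.19× figures follow from Table 1's area estimates:

```python
# Area arithmetic behind Section 4.2, using Table 1's estimates
# (normalized to 1MB of L3 cache).

DDR_CHANNEL_AREA = 10.8   # DDR PHY + controller
PCIE_X8_AREA = 5.9        # x8 PCIe PHY + controller

print(f"one x8 PCIe vs. one DDR channel : {PCIE_X8_AREA / DDR_CHANNEL_AREA:.1%}")
print(f"four x8 PCIe vs. one DDR channel: {4 * PCIE_X8_AREA / DDR_CHANNEL_AREA:.2f}x")
# ~54.6% (quoted as 54% in the text) and ~2.19x, respectively.
```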
Leveraging Table 1's information, we now consider a number of alternative CoaXiaL server designs, shown in Table 2. We focus on high-core-count servers optimized for throughput, such as the upcoming AMD EPYC Bergamo (128 cores) [37], and Intel Granite Rapids (128 cores) and Sierra Forest (144 cores) [38]. All of them feature 12 DDR5 channels, resulting in a core-to-memory-controller (core:MC) ratio of 10.7:1 to 12:1. A common design choice to accommodate such high core counts is a reduced LLC capacity; e.g., moving from the 96-core Genoa [52] to the 128-core Bergamo, AMD halves the LLC per core to 2MB. We thus consider a 144-core baseline server processor with 12 DDR5 channels and 2MB of LLC per core (Table 2, first row).

With pin count as its only limitation, CoaXiaL-5× replaces each DDR channel with 5 x8 CXL interfaces, for a 5× bandwidth increase. Unfortunately, that results in a 17% increase in die area to accommodate all the PCIe PHYs and controllers. Hence, we also consider two iso-area alternatives. CoaXiaL-2× leverages CXL to double memory bandwidth without any microarchitectural changes. CoaXiaL-4× quadruples the available memory bandwidth compared to the baseline CPU by halving the LLC from 288MB to 144MB.
4.3. CoaXiaL Asymmetric Interface Optimization

A key difference between CXL and DDR is that the former provisions dedicated pins and wires for each data movement direction (RX and TX). The PCIe standard defines a one-to-one match of TX and RX pins: e.g., an x8 PCIe configuration implies 8 TX and 8 RX lanes. We observe that while uniform bandwidth provisioning in each direction is reasonable for a peripheral device like a NIC, it is not for memory traffic. Because (i) most workloads read more data than they write and (ii) every cache block that is written must typically be read first, read:write ratios are usually in the 3:1 to 2:1 range rather than 1:1. Thus, in the current 1:1 design, read bandwidth becomes the bottleneck while write bandwidth is underutilized. Given this observation, and that serial interfaces do not fundamentally require 1:1 RX:TX bandwidth provisioning [59], we consider a CoaXiaL design with asymmetric RX/TX lane provisioning to better match memory traffic characteristics. While the PCIe standard currently disallows such a configuration, we investigate the potential performance benefits of revisiting that restriction. We call such a channel CXL-asym.

We consider a system leveraging such CXL-asym channels to compose an additional CoaXiaL-asym configuration. An x8 CXL channel consists of 32 pins, 16 each way. Without the current 1:1 PCIe restriction, CXL-asym repurposes the same pin count to use 20 RX pins and 12 TX pins, resulting in 40GB/s RX and 24GB/s TX of raw bandwidth. Accounting for PCIe and CXL's header overheads, the realized bandwidth is approximately 32GB/s for reads (compared to 26GB/s for a symmetric x8 CXL channel) and 10GB/s for writes [48]. To utilize the additional read bandwidth, we provision two DDR controllers per CXL-asym channel on the Type-3 device. Therefore, the number of CXL channels on the processor (as well as their area overhead) remains unchanged. While the 32GB/s read bandwidth of CXL-asym is insufficient to support two DDR channels at their combined read bandwidth of about 52GB/s (assuming a 2:1 R:W ratio), queuing delays at the DDR controller typically become significant at a much lower utilization point, as shown in Fig. 2a. Therefore, CoaXiaL-asym still provides sufficient bandwidth to eliminate contention at the queues by lowering the overall bandwidth utilization, while providing higher aggregate bandwidth.
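The raw-bandwidth figures for the asymmetric split follow from the per-pair rate; below is a minimal check, assuming ~4GB/s per PCIe 5.0 differential pair per direction.

```python
# Raw bandwidth of the CXL-asym pin split described above, assuming each
# direction needs one differential pair (2 pins) per ~4GB/s PCIe 5.0 lane.

GBPS_PER_PAIR = 4.0
TOTAL_PINS = 32             # pin budget of a standard x8 channel
RX_PINS, TX_PINS = 20, 12   # asymmetric split chosen in Section 4.3
assert RX_PINS + TX_PINS == TOTAL_PINS

rx_raw = (RX_PINS // 2) * GBPS_PER_PAIR  # 40 GB/s toward the CPU (reads)
tx_raw = (TX_PINS // 2) * GBPS_PER_PAIR  # 24 GB/s toward memory (writes)
print(f"raw: {rx_raw:.0f} GB/s RX, {tx_raw:.0f} GB/s TX")
# With header overheads, the text estimates ~32 GB/s read and ~10 GB/s write
# goodput, versus 26/13 GB/s for the symmetric x8 channel.
```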
4.4. Additional Benefits of CoaXiaL

Our analysis focuses on the performance impact of a CXL-based memory system. While a memory capacity and cost analysis is beyond the scope of this paper, CoaXiaL can have additional positive effects on those fronts that are noteworthy. Servers provisioned for maximum memory capacity deploy two high-density DIMMs per DDR channel. The implications are two-fold. First, two-DIMMs-per-channel (2DPC) configurations increase capacity over 1DPC at the cost of ∼15% memory bandwidth. Second, DIMM cost grows super-linearly with density; for example, 128GB/256GB DIMMs cost 5×/20× more than 64GB DIMMs. By enabling more DDR channels, CoaXiaL allows the same or higher DRAM capacity with 1DPC and lower-density DIMMs.

5. Evaluation Methodology

System configurations. We compare our CoaXiaL server design, which replaces the processor's DDR channels with CXL channels, to a typical DDR-based server processor.
• DDR-based baseline. We simulate 12 cores and one DDR5-4800 memory channel as a scaled-down version of Table 2's 144-core CPU.
• CoaXiaL servers. We evaluate several servers that replace the on-chip DDR interfaces with CXL: CoaXiaL-2×, CoaXiaL-4×, and CoaXiaL-asym (Table 2).
We simulate the above system configurations using ChampSim [1] coupled with DRAMsim3 [26]. Table 3 summarizes the configuration parameters used.

Table 3: System parameters used for simulation on ChampSim.
CPU: 12 OoO cores, 2GHz, 4-wide, 256-entry ROB
L1: 32KB L1-I & L1-D, 8-way, 64B blocks, 4-cycle access
L2: 512KB, 8-way, 12-cycle access
LLC: shared & non-inclusive, 16-way, 46-cycle access; 2MB/core for the DDR baseline, 1–2MB/core for CoaXiaL-* (see Table 2)
Memory: DDR5-4800 [36], 128GB per channel, 2 sub-channels per channel, 1 rank per sub-channel, 32 banks per rank; 1 channel for the DDR baseline, 2–4 CXL-attached channels for CoaXiaL-* (see Table 2), 8 channels for CoaXiaL-asym (see §4.3)

CXL performance modeling. For CoaXiaL, we model CXL controllers and the PCIe bus on both the processor and the Type-3 device. Each CXL controller comprises a CXL port that incurs a fixed delay of 12ns, accounting for flit-packing, encoding/decoding, packet processing, etc. [43]. The PCIe bus incurs traversal latency due to the limited channel bandwidth and bus width. For an x8 channel, the peak 32GB/s bandwidth results in 26/13 GB/s RX/TX goodput when header overheads are factored in, and 32/10 GB/s RX/TX in the case of CXL-asym channels. The corresponding link traversal latency is 2.5/5.5 ns RX/TX for an x8 channel and 2/9 ns RX/TX for CXL-asym. Additionally, the CXL controller maintains message queues to buffer requests. Therefore, in addition to the minimum latency overhead of about 30ns (or more, in our sensitivity analysis), queuing effects at the CXL controller are also captured and reflected in the performance.
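A minimal sketch of how such a fixed interface overhead can be composed per read request is shown below; the two-port-plus-link decomposition and the function name are our assumptions, not ChampSim code.

```python
# Sketch of the fixed (unloaded) CXL interface overhead per read round trip,
# under our assumption that a request crosses a 12ns CXL port on each side of
# the link, plus one link traversal per direction. Queuing at the CXL
# controller's message queues is modeled separately by the simulator.

PORT_DELAY_NS = 12.0                 # flit-packing, encode/decode, processing [43]
LINK_TRAVERSAL_NS = {                # (RX, TX) per-direction traversal latency
    "x8": (2.5, 5.5),
    "cxl-asym": (2.0, 9.0),
}

def interface_overhead_ns(channel: str = "x8") -> float:
    rx_ns, tx_ns = LINK_TRAVERSAL_NS[channel]
    # request: CPU -> device (TX direction); response: device -> CPU (RX).
    return 2 * PORT_DELAY_NS + tx_ns + rx_ns

print(interface_overhead_ns("x8"))        # 32.0 ns -> the ~30ns premium in the text
print(interface_overhead_ns("cxl-asym"))  # 35.0 ns
```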
Workloads. We evaluate 35 workloads from various benchmark suites. We deploy the same workload instance on all cores and simulate 200 million instructions per core after fast-forwarding each application to a region of interest.
• Graph analytics: We use 12 workloads from the Ligra benchmark suite [49].
• STREAM: We run the four kernels (copy, scale, add, triad) from the STREAM benchmark [34] to represent bandwidth-intensive matrix operations in which ML workloads spend a significant portion of their execution time.
• SPEC & PARSEC: We evaluate 13 workloads from the SPEC-speed 2017 [50] benchmark suite in ref mode, as well as five PARSEC workloads [5].
• KVS & data analytics: We evaluate masstree [32] and kmeans [28] to represent key-value store and data analytics workloads, respectively.
Table 4 summarizes all our evaluated workloads, along with their IPC and MPKI as measured on the DDR-based baseline.

Table 4: Workload summary (IPC and LLC MPKI on the DDR-based baseline).
| Ligra | IPC | LLC MPKI |
| PageRank | 0.36 | 40 |
| PageRank Delta | 0.31 | 27 |
| Components-shortcut | 0.34 | 48 |
| Components | 0.36 | 48 |
| BC | 0.33 | 34 |
| Radii | 0.41 | 33 |
| BFSCC | 0.68 | 17 |
| BFS | 0.69 | 15 |
| BFS-Bitvector | 0.84 | 15 |
| BellmanFord | 0.86 | 9 |
| Triangle | 0.65 | 21 |
| MIS | 1.37 | 8 |

| STREAM | IPC | LLC MPKI |
| Stream-copy | 0.17 | 58 |
| Stream-scale | 0.21 | 48 |
| Stream-add | 0.16 | 69 |
| Stream-triad | 0.18 | 59 |

| KVS & Data analytics | IPC | LLC MPKI |
| Masstree | 0.37 | 21 |
| Kmeans | 0.50 | 36 |

| SPEC | IPC | LLC MPKI |
| lbm | 0.14 | 64 |
| bwaves | 0.33 | 14 |
| cactusBSSN | 0.68 | 8 |
| fotonik3d | 0.33 | 22 |
| cam4 | 0.87 | 6 |
| wrf | 0.61 | 11 |
| mcf | 0.793 | 13 |
| roms | 0.783 | 6 |
| pop2 | 1.55 | 3 |
| omnetpp | 0.51 | 10 |
| xalancbmk | 0.55 | 12 |
| gcc | 0.31 | 19 |

| PARSEC | IPC | LLC MPKI |
| fluidanimate | 0.78 | 7 |
| facesim | 0.74 | 6 |
| raytrace | 1.17 | 5 |
| streamcluster | 0.99 | 14 |
| canneal | 0.66 | 7 |

6. Evaluation Results

We first compare our main CoaXiaL design, CoaXiaL-4×, with the DDR-based baseline by analyzing the impact of reduced bandwidth utilization and queuing delays on performance in §6.1. §6.2 highlights the effect of memory access pattern and distribution on performance. §6.3 presents the performance of alternative CoaXiaL designs, CoaXiaL-2× and CoaXiaL-asym, and §6.4 demonstrates the impact of a more conservative 50ns CXL latency penalty. §6.5 evaluates CoaXiaL at different server utilization points, and §6.6 analyzes CoaXiaL's power implications. Note that, throughout §6's evaluation, a reference to "CoaXiaL" without a following qualifier implies the CoaXiaL-4× configuration.

6.1. From Queuing Reduction to Performance Gains

Fig. 5 (top) shows the performance of CoaXiaL-4× relative to the baseline DDR-based system. Most workloads exhibit significant speedup, up to 3× for lbm and 1.52× on average. 10 of the 35 workloads experience more than 2× speedup. Four workloads lose performance, with gcc most significantly impacted at 26% IPC loss. The workloads most likely to suffer a performance loss are those with low to moderate memory traffic and heavy dependencies among memory accesses.
Figure 5: Normalized performance of CoaXiaL over the DDR-based baseline (top), memory access latency breakdown (middle), and memory bandwidth utilization (bottom). Workloads are grouped into their benchmark suites. "gm" refers to geometric mean. CoaXiaL offers a 1.52× average speedup due to 4× higher bandwidth, lowering utilization and mitigating queuing effects.

Fig. 5 (bottom) shows memory bandwidth utilization for the DDR-based baseline and CoaXiaL-4×, which provides 4× higher bandwidth than the baseline. CoaXiaL distributes memory requests over more channels, which reduces the bandwidth utilization of the system, in turn reducing contention for the memory bus. The lower bandwidth utilization and contention drastically reduce the queuing delay in CoaXiaL for memory-intensive workloads. Fig. 5 (middle) demonstrates this reduction with a breakdown of the average memory access latency (as measured from the LLC miss register) into DRAM service time, queuing delay, and CXL interface delay (only applicable to CoaXiaL).

In many cases, CoaXiaL enables the workload to drive significantly more aggregate bandwidth from the system. For instance, stream-copy is bottlenecked by the baseline system's constrained bandwidth, resulting in an average queuing delay exceeding 300ns that largely dictates the overall access latency (the total height of the stacked bars). CoaXiaL reduces queuing delay to just 55ns for this workload, more than compensating for the 30ns CXL interface latency overhead. The overall average access latency for stream-copy drops from 348ns in the baseline to just 120ns, enabling CoaXiaL to drive memory requests at a 2.9× higher rate than the baseline, thus achieving commensurate speedup.
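The 2.9× figure is what Little's law predicts from the latency reduction alone, assuming the cores keep a roughly constant number of misses in flight; a back-of-the-envelope check:

```python
# Little's law check for stream-copy: with a roughly constant number of
# outstanding misses, the sustainable request rate scales as 1/latency.
# (A simplification: it ignores any change in memory-level parallelism.)

BASELINE_LATENCY_NS = 348.0   # average memory access latency, DDR baseline
COAXIAL_LATENCY_NS = 120.0    # average memory access latency, CoaXiaL-4x

print(f"request-rate gain: {BASELINE_LATENCY_NS / COAXIAL_LATENCY_NS:.1f}x")  # ~2.9x
```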
Despite provisioning 4× more bandwidth, CoaXiaL reduces average bandwidth utilization from 54% to 34% for workloads that have more than a 2× performance improvement, highlighting that the extra bandwidth is indeed utilized by these workloads. For most of the other workloads, CoaXiaL's average memory access latency is much lower than the baseline's, despite the CXL interface's latency overhead.

On average, workloads experience 144ns of queuing delay on top of ∼40ns of DRAM service time. By slashing queuing delay to just 31ns on average, CoaXiaL reduces average memory access latency, thereby boosting performance. Overall, Fig. 5's results confirm our key insight (see §3.1): queuing delays largely dictate the average memory access latency.

Takeaway #1: CoaXiaL drastically reduces queuing delays, resulting in lower effective memory access latency for bandwidth-hungry workloads.

6.2. Beyond Average Bandwidth Utilization and Access Latency

While most of CoaXiaL's performance gains can be justified by the achieved reduction in average memory latency, a compounding positive effect is the reduction in latency variance, as evidenced in §3.2. For each of the four evaluated workload groups, Fig. 6a shows the mean average latency and standard deviation (stdev) for CoaXiaL and the DDR-based baseline. As already seen in §6.1, CoaXiaL delivers a 45–60% reduction in average memory access latency. Fig. 6a shows that CoaXiaL also achieves a similar reduction in stdev, indicating lower dispersion and fewer extreme high-latency values.

To further demonstrate the impact of access latency distribution and temporal effects, we study a few workloads in more depth. Streamcluster presents an interesting case because its performance improves despite a slightly higher average memory access latency of 76ns compared to the baseline's 69ns (see Fig. 5). Fig. 6b shows the Cumulative Distribution Function (CDF) of Streamcluster's memory access latencies, illustrating that the baseline results in a higher variance than CoaXiaL (stdev of 88 versus 76), due to imbalanced queuing across DRAM banks. The tighter distribution of memory access latency allows CoaXiaL to outperform the baseline despite a 10% higher average memory access latency.

Some workloads benefit from CoaXiaL more than other workloads with similar or higher memory bandwidth utilization (Fig. 5, bottom). For example, bwaves uses a mere 32% of the baseline's available bandwidth but suffers an overwhelming 390ns queuing delay. Even though bwaves uses less bandwidth on average compared to other workloads (e.g., radii, with 65% bandwidth utilization), it exhibits bursty behavior that incurs queuing spikes, which can be more effectively absorbed by CoaXiaL. Kmeans exhibits the opposite case. Despite having the highest bandwidth utilization in the baseline system, it experiences a relatively low average queuing delay of 50ns and exhibits one of the lowest latency variance values across workloads, indicating an even distribution of accesses…
Figure 6: Memory access latency distribution. (a) Average memory access latency per workload group, with stdev shown as error bars. (b) Cumulative Distribution Function (CDF) of memory access time for Streamcluster (mean/stdev: Baseline 69/88 ns, CoaXiaL 76/76 ns).

…from four with 30ns latency penalty) take a performance hit. These results imply that while a CoaXiaL with a higher CXL latency is still worth pursuing, it should be used selectively for memory-intensive workloads. Deploying different classes of servers for different optimization goals is common practice not only in public clouds [15] but also in private clouds (e.g., different web and backend server configurations) [12, 18].

Takeaway #3: Even with a 50ns CXL latency overhead, CoaXiaL achieves a considerable 1.3× average speedup across all workloads.
Figure 7: CoaXiaL's performance at different design points, normalized to the DDR-based server baseline. CoaXiaL-4× outperforms CoaXiaL-2×, despite its halved LLC size. CoaXiaL-asym considerably outperforms our default CoaXiaL-4× design.
Figure 8: CoaXiaL's performance for different CXL latency premiums, normalized to the DDR-based server. Even with a 50ns interface latency penalty, CoaXiaL yields a 1.33× average speedup.
…consumes less energy to complete the same work, even if it operates at a higher power.

We model power for a manycore processor similar to AMD EPYC Bergamo (128 cores) [37] or Sierra Forest (144 cores) [38]. The latter is expected to have a 500W TDP, which is in line with current processors (e.g., the 96-core AMD EPYC Genoa [52] has a TDP of 360W). While the memory controller and interface require negligible power compared to the processor, we include them for completeness. We estimate controller and interface power per DDR5 channel to be 0.5W and 0.6W, respectively [57], or 13W in total for a baseline processor with 12 channels. Similarly, PCIe 5.0's interface power is ∼0.2W per lane [4], or 77W for the 384 lanes required to support CoaXiaL's 48 DDR5 channels.

A significant fraction of a large-scale server's power is attributed to memory. We use Micron's power calculator tool [35] to compute our baseline's and the CXL system's DRAM power requirements, taking into account the observed average memory bandwidth utilization of 52% for the baseline and 21% for CoaXiaL. As this tool only computes power for modules up to DDR4-3200MT/s, we model a 64GB 2-rank DDR4-3200 DIMM (16GB 2-rank module for CXL) and double the power to obtain the power consumption of a 128GB DDR5 channel (32GB channel for CXL). While CoaXiaL employs 4× more DIMMs than the baseline, its power consumption is only 1.75× higher due to lower memory utilization.

Table 5 summarizes the key power components for the baseline and CoaXiaL systems. The overall system power consumption is 713W for the baseline system and 1.18kW for CoaXiaL, a 66% increase. Crucially, CoaXiaL massively boosts performance, reducing CPI by 34%. As a result, CoaXiaL reduces the baseline's EDP by a considerable 28%.

Table 5: Energy-Delay Product (EDP = system power × CPI²) comparison for the target 144-core server. Lower EDP is better.
| EDP component | Baseline | CoaXiaL |
| Processor package power | 500W | 500W |
| DDR5 MC & PHY power (all) | 13W | 52W |
| DDR5 DIMM power (static and access) | 200W | 551W |
| CXL interface power (idle and dynamic) | N/A | 77W |
| Total system power | 713W | 1,180W |
| Average CPI (all workloads) | 2.02 | 1.33 |
| EDP (all workloads) | 2,909 | 2,087 (0.72×) |
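Table 5's bottom lines follow directly from its power and CPI entries:

```python
# Reproduce Table 5's bottom lines: EDP = system power x CPI^2.

baseline = {"power_w": 713.0, "cpi": 2.02}
coaxial = {"power_w": 1180.0, "cpi": 1.33}

def edp(cfg: dict) -> float:
    return cfg["power_w"] * cfg["cpi"] ** 2

print(f"baseline EDP : {edp(baseline):.0f}")                        # ~2909
print(f"CoaXiaL EDP  : {edp(coaxial):.0f}")                         # ~2087
print(f"EDP ratio    : {edp(coaxial) / edp(baseline):.2f}x")        # ~0.72x (28% lower)
print(f"CPI reduction: {1 - coaxial['cpi'] / baseline['cpi']:.0%}") # ~34%
```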
Figure 9: Performance of CoaXiaL as a function of active cores, normalized to the DDR-based server baseline at the same number of active cores.

Takeaway #5: In addition to boosting performance, CoaXiaL affords a more efficient system with a 28% lower energy-delay product.

6.7. Evaluation Summary
CXL-based memory systems hold great promise for manycore server processors. Replacing DDR with CXL-based memory that offers 4× higher bandwidth at a 30ns latency premium achieves a 1.52× average speedup across various workloads. Furthermore, a CoaXiaL-asym design demonstrates opportunity for additional gain (1.67× average speedup), assuming a modification to the PCIe standard that departs from the rigid 1:1 read:write bandwidth provisioning and allows an asymmetric, workload-aware one. Even if CoaXiaL incurs a 50ns latency premium, it promises substantial performance improvement (1.33× on average). We show that our benefits stem from reduced memory contention: by reducing the utilization of available bandwidth resources, CoaXiaL mitigates queuing effects, thus reducing both average memory access latency and its variance.

7. Related Work

We discuss recent works investigating CXL-based memory system solutions, prior memory systems leveraging serial interfaces, as well as circuit-level and alternative techniques to improve bandwidth and optimize the memory system.

Emerging CXL-based memory systems. Industry is rapidly adopting CXL and already investigating its deployment in production systems to reap the benefits of memory expansion and memory pooling. Microsoft leverages CXL to pool memory across servers, improving utilization and thus reducing cost [25]. In the same vein, Gouk et al. [16] leverage CXL to prototype a practical instance of disaggregated memory [27]. Aspiring to use CXL as a memory expansion technique that will enable a secondary memory tier of higher capacity than DDR, Meta's recent work optimizes data placement in this new type of two-tier memory hierarchy [33]. Using an FPGA-based prototype of a CXL Type-3 memory device, Ahn et al. evaluate database workloads on a hybrid DDR/CXL memory system and demonstrate minimal performance degradation, suggesting that CXL-based memory expansion is cost-efficient and performant [3]. Instead of using CXL-attached memory as a memory system extension, our work stands out as the first to propose CXL-based memory as a complete replacement of DDR-attached memory for server processors handling memory-intensive workloads.

Memory systems leveraging serial interfaces. There have been several prior memory system proposals leveraging serial links for high-bandwidth, energy-efficient data transfers. Micron's HMC was connected to the host over 16 SerDes lanes, delivering up to 160GB/s [41]. IBM's Centaur is a memory capacity expansion solution, where the host uses SerDes to connect to a buffer-on-board, which in turn hosts several DDR channels [54]. FBDIMM [14] leverages a concept similar to Centaur's buffer-on-board to increase memory bandwidth and capacity. An advanced memory buffer (AMB) acts as a bridge between the processor and the memory modules, connecting to the processor over serial links and featuring an abundance of pins to enable multiple parallel interfaces to DRAM modules. Similar to CXL-attached memory, a key concern with FBDIMM is its increased latency. The Open Memory Interface (OMI) is a recent high-bandwidth memory interface leveraging serial links, delivering bandwidth comparable to HBM but without HBM's tight capacity limitations [7]. Originally a subset of OpenCAPI, OMI is now part of the CXL Consortium.

Researchers have also proposed memory system architectures making use of high-bandwidth serial interfaces. In MeSSOS' two-stage memory system, high-bandwidth serial links connect to a high-bandwidth DRAM cache, which is then chained to planar DRAM over DDR [58]. Ham et al. propose disintegrated memory controllers attached over SerDes, aiming to make the memory system more modular and to facilitate supporting heterogeneous memory technologies [17]. Alloy combines parallel and serial interfaces to access memory, maintaining the parallel interfaces for lower-latency memory access [59]. Unlike our proposal of fully replacing DDR processor interfaces with CXL for memory-intensive servers, Alloy's approach is closer to the hybrid DDR/CXL memory systems that most ongoing CXL-related research envisions.

Circuit-level techniques to boost memory bandwidth. HBM [23] and die-stacked DRAM caches offer an order of magnitude higher bandwidth than planar DRAM, but suffer from limited capacity [22, 30, 44]. BOOM [60] buffers outputs from multiple LPDDR ranks to reduce power and sustain server-level performance, but offers modest gains due to low-frequency LPDDR and limited bandwidth improvement. Chen et al. [6] propose dynamic reallocation of power pins to boost data transfer capability from memory during memory-intensive phases, during which processors are memory bound and hence draw less power. Pal et al. [40] propose packageless processors to mitigate pin limitations and boost the memory bandwidth that can be routed to the processor. Unlike these proposals, we focus on conventional processors, packaging, and commodity DRAM, aiming to reshape the memory system of server processors by leveraging the widely adopted, up-and-coming CXL interconnect.

Other memory system optimizations. Transparent memory compression techniques are a compelling approach to increasing effective memory bandwidth [61]. Malladi et al. [31] leverage mobile LPDDR DRAM devices to design a more energy-efficient memory system for servers without performance loss. These works are orthogonal to our proposed approach. Storage-class memory, like Phase-Change Memory [13] or Intel's Optane [19], has attracted significant interest as a way to boost a server's memory capacity, triggering research activity on transforming the memory hierarchy to best accommodate such new memories [2, 10, 24, 56]. Unlike our work, such systems often trade off bandwidth for capacity.
8. Conclusion

Technological trends motivate a server processor design where all memory is attached to the processor over the emerging CXL interconnect instead of DDR. CXL's superior bandwidth per pin helps bandwidth-hungry server processors scale the bandwidth wall. By distributing memory requests over 4× more memory channels, CXL reduces queueing effects on the memory bus. Because queuing delay dominates access latency in loaded memory systems, such reduction more than compensates for the interface latency overhead introduced by CXL. Our evaluation on a diverse range of memory-intensive workloads shows that our proposed CoaXiaL server delivers a 1.52× speedup on average, and up to 3×.

References

[1] "ChampSim." [Online]. Available: https://github.com/ChampSim/ChampSim
[2] N. Agarwal and T. F. Wenisch, "Thermostat: Application-transparent Page Management for Two-tiered Main Memory," in Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXII), 2017, pp. 631–644.
[3] M. Ahn, A. Chang, D. Lee, J. Gim, J. Kim, J. Jung, O. Rebholz, V. Pham, K. T. Malladi, and Y.-S. Ki, "Enabling CXL Memory Expansion for In-Memory Database Management Systems," in Proceedings of the 18th International Workshop on Data Management on New Hardware (DaMoN), 2022, pp. 8:1–8:5.
[4] M. Bichan, C. Ting, B. Zand, J. Wang, R. Shulyzki, J. Guthrie, K. Tyshchenko, J. Zhao, A. Parsafar, E. Liu, A. Vatankhahghadim, S. Sharifian, A. Tyshchenko, M. D. Vita, S. Rubab, S. Iyer, F. Spagna, and N. Dolev, "A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol," in Proceedings of the 2020 IEEE Custom Integrated Circuits Conference, 2020, pp. 1–4.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architecture and Compilation Techniques (PACT), 2008, pp. 72–81.
[6] S. Chen, Y. Hu, Y. Zhang, L. Peng, J. Ardonne, S. Irving, and A. Srivastava, "Increasing off-chip bandwidth in multi-core processors with switchable pins," in Proceedings of the 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 385–396.
[7] T. M. Coughlin and J. Handy, "Higher Performance and Capacity with OMI Near Memory," in Proceedings of the 2021 Annual Symposium on High-Performance Interconnects, 2021, pp. 68–71.
[8] D. I. Cutress and A. Frumusanu, "AMD 3rd Gen EPYC Milan review: A peak vs per core performance balance," 2021. [Online]. Available: https://www.anandtech.com/show/16529/amd-epyc-milan-review
[9] CXL Consortium, "Compute Express Link (CXL) Specification, Revision 3.0, Version 1.0," 2022. [Online]. Available: https://www.computeexpresslink.org/_files/ugd/0c1418_1798ce97c1e6438fba818d760905e43a.pdf
[10] S. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, "Data tiering in heterogeneous memory systems," in Proceedings of the 2016 EuroSys Conference, 2016, pp. 15:1–15:16.
[11] EE Times, "CXL will absorb Gen-Z," 2021. [Online]. Available: https://www.eetimes.com/cxl-will-absorb-gen-z/
[12] Engineering at Meta, "Introducing "Yosemite": the first open source modular chassis for high-powered microservers," 2015. [Online]. Available: https://engineering.fb.com/2015/03/10/core-data/introducing-yosemite-the-first-open-source-modular-chassis-for-high-powered-microservers/
[13] S. W. Fong, C. M. Neumann, and H.-S. P. Wong, "Phase-change memory—towards a storage-class memory," IEEE Transactions on Electron Devices, vol. 64, no. 11, pp. 4374–4385, 2017.
[14] B. Ganesh, A. Jaleel, D. Wang, and B. L. Jacob, "Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling," in Proceedings of the 13th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2007, pp. 109–120.
[15] Google Cloud, "Machine families resource and comparison guide." [Online]. Available: https://cloud.google.com/compute/docs/machine-resource
[16] D. Gouk, S. Lee, M. Kwon, and M. Jung, "Direct Access, High-Performance Memory Disaggregation with DirectCXL," in Proceedings of the 2022 USENIX Annual Technical Conference (ATC), 2022, pp. 287–294.
[17] T. J. Ham, B. K. Chelepalli, N. Xue, and B. C. Lee, "Disintegrated control for energy-efficient and heterogeneous memory systems," in Proceedings of the 19th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2013, pp. 424–435.
[18] K. M. Hazelwood, S. Bird, D. M. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective," in Proceedings of the 24th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2018, pp. 620–629.
[19] Intel Corporation, "Intel Optane DC Persistent Memory." [Online]. Available: https://www.intel.com/content/www/us/en/products/memory-storage/optane-dc-persistent-memory.html
[20] J. Jaffari, A. Ansari, and R. Beraha, "Systems and methods for a hybrid parallel-serial memory access," 2015, US Patent 9747038B2.
[21] JEDEC, "DDR5 SDRAM standard (JESD79-5B)," 2022.
[22] D. Jevdjic, S. Volos, and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache," in Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013, pp. 404–415.
[23] J. Kim and Y. Kim, "HBM: Memory solution for bandwidth-hungry processors," in Hot Chips Symposium, 2014, pp. 1–24.
[24] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in Proceedings of the 36th International Symposium on Computer Architecture (ISCA), 2009, pp. 2–13.
[25] H. Li, D. S. Berger, S. Novakovic, L. Hsu, D. Ernst, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini, "Pond: CXL-Based Memory Pooling Systems for Cloud Platforms," in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXVIII), 2023.
[26] S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. L. Jacob, "DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator," IEEE Comput. Archit. Lett., vol. 19, no. 2, pp. 110–113, 2020.
[27] K. T. Lim, J. Chang, T. N. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," in Proceedings of the 36th International Symposium on Computer Architecture (ISCA), 2009, pp. 267–278.
[28] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–136, 1982.
[29] Locuza, "Die walkthrough: Alder Lake-S/P and a touch of Zen 3," 2022. [Online]. Available: https://locuza.substack.com/p/die-walkthrough-alder-lake-sp-and
[30] G. H. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die-stacked DRAM caches," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 454–464.
[31] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile DRAM," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 37–48.
[32] Y. Mao, E. Kohler, and R. T. Morris, "Cache craftiness for fast multicore key-value storage," in Proceedings of the 2012 EuroSys Conference, 2012, pp. 183–196.
[33] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. O. Kanaujia, and P. Chauhan, "TPP: Transparent Page Placement for CXL-Enabled Tiered Memory," in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXVIII), 2023.
[34] J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 1995.
[35] Micron Technology Inc., "System Power Calculators," https://www.micron.com/support/tools-and-utilities/power-calc.
[36] Micron Technology Inc., "DDR5 SDRAM Datasheet," 2022. [Online]. Available: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr5/ddr5_sdram_core.pdf
[37] H. Mujtaba, "AMD EPYC Bergamo 'Zen 4C' CPUs Being Deployed In 1H 2023 To Tackle Arm CPUs, Instinct MI300 APU Back In Labs," 2022. [Online]. Available: https://wccftech.com/amd-epyc-bergamo-zen-4c-cpus-being-deployed-in-1h-2023-tackle-arm-instinct-mi300-apu-back-in-labs/amp/
[38] H. Mujtaba, "Intel Granite Rapids & Sierra Forest Xeon CPU Detailed In Avenue City Platform Leak: Up To 500W TDP & 12-Channel DDR5," 2023. [Online]. Available: https://wccftech.com/intel-granite-rapids-sierra-forest-xeon-cpu-detailed-in-avenue-city-platform-leak-up-to-500w-tdp-12-channel-ddr5/
[39] B. Nitin, W. Randy, I. Shinichiro, F. Eiji, R. Shibata, S. Yumiko, and O. Megumi, "DDR5 design challenges," in 2018 IEEE 22nd Workshop on Signal and Power Integrity (SPI), 2018, pp. 1–4.
[40] S. Pal, D. Petrisko, A. A. Bajwa, P. Gupta, S. S. Iyer, and R. Kumar, "A Case for Packageless Processors," in Proceedings of the 24th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2018, pp. 466–479.
[41] J. T. Pawlowski, "Hybrid memory cube (HMC)," in Hot Chips Symposium, 2011, pp. 1–24.
[42] PLDA, "Breaking the PCIe Latency Barrier with CXL," 2020. [Online]. Available: https://www.brighttalk.com/webcast/18357/434922
[43] PLDA and AnalogX, "PLDA and AnalogX Announce Market-leading CXL 2.0 Solution featuring Ultra-low Latency and Power," 2021. [Online]. Available: https://www.businesswire.com/news/home/20210602005484/en/PLDA-and-AnalogX-Announce-Market-leading-CXL-2.0-Solution-featuring-Ultra-low-Latency-and-Power
[44] M. K. Qureshi and G. H. Loh, "Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012, pp. 235–246.
[45] R. Rooney and N. Koyle, "Micron DDR5 SDRAM: New Features," Micron Technology Inc., Tech. Rep., 2019.
[46] P. Rosenfeld, E. Cooper-Balis, and B. L. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, 2011.
[47] D. D. Sharma, "PCI Express 6.0 Specification at 64.0 GT/s with PAM-4 signaling: a low latency, high bandwidth, high reliability and cost-effective interconnect," in Proceedings of the 2020 Annual Symposium on High-Performance Interconnects, 2020, pp. 1–8.
[48] D. D. Sharma, "Compute Express Link: An open industry-standard interconnect enabling heterogeneous data-centric computing," in Proceedings of the 2022 Annual Symposium on High-Performance Interconnects, 2022, pp. 5–12.
[49] J. Shun and G. E. Blelloch, "Ligra: a lightweight graph processing framework for shared memory," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2013, pp. 135–146.
[50] Standard Performance Evaluation Corporation, "SPEC CPU2017 Benchmark Suite." [Online]. Available: http://www.spec.org/cpu2017/
[51] P. Stanley-Marbell, V. C. Cabezas, and R. P. Luijten, "Pinned to the walls: impact of packaging and application properties on the memory and power walls," in Proceedings of the 2011 International Symposium on Low Power Electronics and Design, 2011, pp. 51–56.
[52] StorageReview, "4th Gen AMD EPYC Review (AMD Genoa)," 2022. [Online]. Available: https://www.storagereview.com/review/4th-gen-amd-epyc-review-amd-genoa
[53] TechPowerUp, "AMD "Matisse" and "Rome" IO Controller Dies Mapped Out," 2020. [Online]. Available: https://www.techpowerup.com/266287/amd-matisse-and-rome-io-controller-dies-mapped-out
[54] The Next Platform, "IBM POWER Chips Blur the Lines to Memory and Accelerators," 2018. [Online]. Available: https://www.nextplatform.com/2018/08/28/ibm-power-chips-blur-the-lines-to-memory-and-accelerators/#:~:text=The%20Centaur%20memory%20adds%20about
[55] The Register, "CXL absorbs OpenCAPI on the road to interconnect dominance," 2022. [Online]. Available: https://www.theregister.com/2022/08/02/cxl_absorbs_opencapi/
[56] D. Ustiugov, A. Daglis, J. Picorel, M. Sutherland, E. Bugnion, B. Falsafi, and D. N. Pnevmatikatos, "Design guidelines for high-performance SCM hierarchies," in Proceedings of the 2018 International Symposium on Memory Systems (MEMSYS), 2018, pp. 3–16.
[57] S. Volos, "Memory Systems and Interconnects for Scale-Out Servers," Ph.D. dissertation, EPFL, Switzerland, 2015.
[58] S. Volos, D. Jevdjic, B. Falsafi, and B. Grot, "Fat Caches for Scale-Out Servers," IEEE Micro, vol. 37, no. 2, pp. 90–103, 2017.
[59] H. Wang, C.-J. Park, G. Byun, J. H. Ahn, and N. S. Kim, "Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems," in Proceedings of the 21st IEEE Symposium on High-Performance Computer Architecture (HPCA), 2015, pp. 296–308.
[60] D. H. Yoon, J. Chang, N. Muralimanohar, and P. Ranganathan, "BOOM: Enabling mobile memory based low-power server DIMMs," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 25–36.
[61] V. Young, S. Kariyappa, and M. K. Qureshi, "Enabling Transparent Memory-Compression for Commodity Memory Systems," in Proceedings of the 25th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2019, pp. 570–581.
[62] Q. Zhu, S. Venkataraman, C. Ye, and A. Chandrasekhar, "Package design challenges and optimizations in density efficient (Intel Xeon processor D) SoC," in 2016 IEEE Electrical Design of Advanced Packaging and Systems (EDAPS), 2016.