A Case For CXL-Centric Server Processors

Our proposed server design, dubbed CoaXiaL, replaces all of the processor's direct DDR interfaces with CXL.

By evaluating CoaXiaL with a wide range of workloads, we highlight how a CXL-based memory system's unique characteristics (i.e., increased bandwidth and higher unloaded latency) positively impact the performance of processors whose memory system is typically loaded. Our analysis relies on a simple but often overlooked fact about memory system behavior and its impact on overall performance: a loaded memory system's effective latency is dominated by queuing effects and therefore differs significantly from the unloaded system's latency, as we demonstrate in §3.1. A memory system that offers higher parallelism reduces queuing effects, which in turn results in lower average latency and variance, even if its unloaded access latency is higher than that of existing systems. We argue that CXL-based memory systems offer exactly this design trade-off, which is favorable for loaded server processors handling memory-intensive applications, offering strong motivation for a radical change in memory system design that departs from two decades of DDR and enables scalable high-performance server architectures.

In summary, we make the following contributions:
• We make the radical proposal of using high-bandwidth CXL as a complete replacement of pin-inefficient DDR interfaces on server processors, showcasing a ground-breaking shift that disrupts decades-long memory system design practices.
• We show that, despite its higher unloaded memory access latency, CoaXiaL reduces the effective memory access time in typical scenarios where the memory system is loaded.
• We demonstrate the promise of CoaXiaL with a study of a wide range of workloads for various CXL bandwidth and latency design points that are likely in the near future.
• We identify limitations imposed on CXL by the current PCIe standard, and highlight opportunities a revised standard could leverage for 20% additional speedup.

Paper outline: §2 motivates the replacement of DDR with CXL in server processors. §3 highlights the critical impact of queuing delays on a memory system's performance, and §4 provides an overview of our proposed CoaXiaL server design, which leverages CXL to mitigate detrimental queuing. We outline the methodology to evaluate CoaXiaL against a DDR-based system in §5 and analyze performance results in §6. We discuss related work in §7 and conclude in §8.

2. Background

In this section, we highlight how DRAM memory bandwidth is bottlenecked by the processor-attached DDR interface and processor pin count. We then discuss how CXL can bridge this gap by using PCIe as its underlying physical layer.

2.1. Low-latency DDR-based Memory

Servers predominantly access DRAM over the Double Data Rate (DDR) parallel interface. The interface's processor pin requirement is determined by the width of the data bus, command/address bus, and configuration pins. A DDR4 and DDR5 [21] interface is 288 pins wide. While several of those pins are terminated at the motherboard, most of them (160+ for an ECC-enabled DDR4 channel [20], likely more for DDR5 [45]) are driven to the processor chip.

The DDR interface's 64 data bits directly connect to the processor and are bit-wise synchronous with the memory controller's clock, enabling a worst-case (unloaded) access latency of about 50ns. Scaling a DDR-based memory system's bandwidth requires either clocking the channels at a higher rate or attaching more channels to the processor. The former approach results in signal integrity challenges [39] and a reduction in supported ranks per channel, limiting rank-level parallelism and memory capacity. Accommodating more channels requires more on-chip pins, which cost significant area and power, and complicate placement, routing, and packaging [62]. Consequently, the pin count on processor packages has only been doubling about every six years [51].

Thus, reducing the number of cores that contend over a memory channel is difficult without clean-slate technologies, which we discuss in §7. The emerging CXL interconnect is poised to bridge this gap by leveraging a widely deployed high-bandwidth serial interface, as we discuss next.

2.2. The High-bandwidth CXL Memory Interconnect

The Compute Express Link (CXL) is a recent interconnect standard, designed to present a unified solution for coherent accelerators, non-coherent devices, and memory expansion devices. It represents the industry's concerted effort for a standardized interconnect to replace a motley collection of proprietary solutions (e.g., OpenCAPI [55], Gen-Z [11]). CXL is rapidly garnering industry adoption and is bound to become a dominant interconnect, as PCIe has been for peripheral devices over the past twenty years.

CXL brings load-store semantics and coherent memory access to high-capacity, high-bandwidth memory for processors and accelerators alike. It also enables attaching DDR-based memory ("Type-3" CXL devices) over PCIe to the processor with strict timing constraints. In this work, we focus on this capability of CXL. CXL's underlying PCIe physical layer affords higher bandwidth per pin at the cost of increased latency. Therefore, most recent works perceive CXL as a technology enabling an auxiliary, slower memory tier directly attached to the processor. In contrast, we argue that despite its associated latency overhead, CXL can play a central role in future memory system design, replacing, rather than simply augmenting, DDR-based memory in server processors.

2.3. Scaling the Memory Bandwidth Wall with CXL

CXL's high bandwidth owes to its underlying PCIe physical layer. PCIe [47] is a high-speed serial interface featuring multiple independent lanes capable of bi-directional communication using just 4 pins per lane: two for transmitting data and two for receiving data. Data is sent over each lane as a serial bit stream at very high bit rates in an encoded format.

Figure 1: Bandwidth per processor pin for DDR and CXL (PCIe) interfaces, normalized to PCIe 1.0. Note that the y-axis is in log scale.

Fig. 1 illustrates the bandwidth per pin for PCIe and DDR. The normalized bandwidth per pin is derived by dividing each interface's peak bandwidth on JEDEC's and PCI-SIG's roadmap, respectively, by the processor pins required: 160 for DDR and 4 per lane for PCIe.

The 4× bandwidth gap is where we are today (PCIe 5.0 vs. DDR5-4800). The comparison is conservative, given that PCIe's stated bandwidth is per direction, while DDR5-4800 requires about 160 processor pins for a theoretical 38.4GB/s peak of combined read and write bandwidth. With a third of the pins, 12 PCIe 5.0 lanes (over which CXL operates) offer 48GB/s per direction—i.e., a theoretical peak of 48GB/s for reads and 48GB/s for writes. Furthermore, Fig. 1's roadmaps suggest that the bandwidth gap will grow to 8× by 2025.
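To make Fig. 1's normalization concrete, the short sketch below recomputes today's per-pin comparison. The 160-pin DDR channel and 4-pins-per-lane PCIe counts follow the text; treating a PCIe 5.0 lane as roughly 4GB/s per direction is our simplifying assumption.

```python
# Per-pin bandwidth comparison for today's design points (PCIe 5.0 vs. DDR5-4800).
# Assumption: a PCIe 5.0 lane carries ~4GB/s per direction over 4 pins.

DDR5_4800_GBPS = 38.4       # combined read+write peak of one channel
DDR5_PINS = 160             # processor pins for an ECC-enabled channel (approx.)

PCIE5_LANE_GBPS = 4.0       # per direction, per lane
PINS_PER_LANE = 4           # 2 TX + 2 RX pins

ddr_per_pin = DDR5_4800_GBPS / DDR5_PINS        # ~0.24 GB/s per pin
pcie_per_pin = PCIE5_LANE_GBPS / PINS_PER_LANE  # ~1.0 GB/s per pin, per direction

print(f"DDR5-4800: {ddr_per_pin:.2f} GB/s per pin")
print(f"PCIe 5.0 : {pcie_per_pin:.2f} GB/s per pin (per direction)")
print(f"gap      : {pcie_per_pin / ddr_per_pin:.1f}x")  # ~4x, matching Fig. 1
```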
2.4. CXL Latency Concerns

CXL's increased bandwidth comes at the cost of increased latency compared to DDR. There is a widespread assumption that this latency cost is significantly higher than the DRAM access latency itself. For instance, recent work on CXL-pooled memory reinforces that expectation by reporting a latency overhead of 70ns [25]. The expectation of such high added latency has reasonably led memory system researchers and designers to predominantly focus on CXL as a technology for enabling a secondary tier of slower memory that augments conventional DDR-attached memory. However, such a high latency overhead does not represent the minimum attainable latency of the simplest CXL-attached memory and is largely an artifact of more complex functionality, such as multiplexing multiple memory devices, enforcing coherence between the host and the memory device, etc.

In this work, we argue that CXL is a perfect candidate to completely replace DDR-attached memory for server processors that handle memory-intensive workloads. The CXL 3.0 standard sets an 80ns pin-to-pin load latency target for a CXL-attached memory device [9, Table 13-2], which in turn implies that the interface-added latency over DRAM access in upcoming CXL memory devices should be about 30ns. Early implementations of the CXL 2.0 standard demonstrated a 25ns latency overhead per direction [42], and in 2021 PLDA announced a commercially available CXL 2.0 controller that only adds 12ns per direction [43]. Such low latency overheads are attainable with the simplest CXL Type-3 devices that are not multiplexed across multiple hosts and do not need to initiate any coherence transactions. Our key insight is that a memory access latency penalty on the order of 30ns often pales in comparison to the queuing delays at the memory controller that are common in server systems, and such queuing delays can be curtailed by CXL's considerable bandwidth boost.

3. Pitfalls of Unloaded and Average Latency

It is evident from current technological trends that systems with CXL-attached memory can enjoy significantly higher bandwidth availability compared to conventional systems with DDR-attached memory. A key concern hindering broad adoption—and particularly our proposed replacement of DDR interfaces on-chip with CXL—is CXL's increased memory access latency. However, in any system with a loaded memory subsystem, queuing effects play a significant role in determining effective memory access latency. On a loaded system, queuing (i) dominates the effective memory access latency, and (ii) introduces variance in accessing memory, degrading performance. We next demonstrate the impact of both effects.

3.1. Queuing Dictates Effective Memory Access Latency

Figure 2: (a) Average and p90 memory access latency in a DDR5-4800 channel (38.4GB/s) at varying bandwidth utilization points. p90 grows faster than the average latency. (b) Memory latency breakdown (DRAM access time and queuing delay) and memory bandwidth utilization for a range of workloads. Higher utilization increases queuing delay.

Fig. 2a shows a DDR5-4800 channel's memory access latency as its load increases. We model the memory using DRAMSim [46] and control the load with random memory accesses of configurable arrival rate. The resulting load-latency curve is shaped by queuing effects at the memory controller.

When the system is unloaded, a hypothetical CXL interface adding 30ns to each memory access would correspond to a seemingly prohibitive 75% latency overhead compared to the approximate unloaded latency of 40ns. However, as the memory load increases, latency rises exponentially, with average latency increasing by 3× and 4× at 50% and 60% load, respectively. p90 tail latency grows even faster, rising by 4.7× and 7.1× at the same load points. In a loaded system, trading off additional interface latency for considerably higher bandwidth availability can yield a significant net latency gain.

To illustrate, consider a baseline DDR-based system operating at 60% memory bandwidth utilization, corresponding to 160ns average and 285ns p90 memory access latency. A CXL-based alternative offering a 4× memory bandwidth boost would shrink the system's bandwidth utilization to 15%, corresponding to 50% lower average and 68% lower p90 memory access latency compared to the baseline, despite the CXL interface's 30ns latency premium.
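The figures above come from the measured load-latency curve in Fig. 2a. The sketch below only illustrates the shape of the trade-off with a textbook M/M/1 queue; that model is our simplifying assumption, not the DRAMSim setup used in the paper, so the absolute numbers differ.

```python
# First-order illustration of trading interface latency for lower utilization,
# using an M/M/1 queue as a stand-in for the memory controller. This is a
# simplification; the paper's numbers come from DRAMSim, which models banks,
# ranks, and scheduling in detail.

def avg_latency_ns(unloaded_ns: float, utilization: float, interface_ns: float = 0.0) -> float:
    """M/M/1 average time in system, plus a fixed interface penalty."""
    assert 0.0 <= utilization < 1.0
    return unloaded_ns / (1.0 - utilization) + interface_ns

UNLOADED_NS = 40.0  # approximate unloaded DRAM access latency (Fig. 2a)

# DDR baseline: no interface penalty, but the channel runs at 60% utilization.
ddr = avg_latency_ns(UNLOADED_NS, utilization=0.60)
# CXL alternative: 4x the bandwidth drops utilization to 15%, at a +30ns penalty.
cxl = avg_latency_ns(UNLOADED_NS, utilization=0.15, interface_ns=30.0)

print(f"DDR @ 60% utilization: {ddr:.0f} ns")  # ~100 ns in this toy model
print(f"CXL @ 15% utilization: {cxl:.0f} ns")  # ~77 ns, despite the 30 ns premium
```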
Fig. 2a shows that a system with bandwidth utilization as low as 20% experiences queuing effects, which are initially reflected in tail latency; beyond 40% utilization, queuing effects also noticeably affect average latency. Utilization beyond that level is common, as we show with our simulation of a 12-core processor with one DDR5 memory channel over a range of server and desktop applications (methodological details in §5). Fig. 2b shows that with all processor cores in use, the vast majority of workloads exceed 30% memory bandwidth utilization, and most exceed 50% utilization (except several workloads from the SPEC and PARSEC benchmarks).

Fig. 2b also breaks down the average memory access time seen from the LLC miss register into DRAM service time and queuing delay at the memory controller. We observe a trend of high bandwidth consumption leading to long queuing delays, although queuing delay is not a direct function of bandwidth utilization. Queuing delay is also affected by application characteristics such as the read/write pattern and the spatial and temporal distribution of accesses. For example, in an access pattern where the processor issues the majority of its memory requests in a short amount of time, followed by a period of low memory activity, the system would temporarily be in a high-bandwidth-utilization state when memory requests are made, experiencing contention and high queuing delay, even though the average bandwidth consumption would not be as high. Even in such cases, provisioning more bandwidth would lead to better performance, as it would mitigate contention from the temporary bursts. In Fig. 2b's workloads, queuing delay constitutes 72% of the memory access latency on average, and up to 91% in the case of lbm.

3.2. Memory Latency Variance Impacts Performance

In addition to their effect on average memory access latency, spurious queuing effects at the memory controller introduce higher memory access latency fluctuations (i.e., variance). Such variance is closely related to the queuing delay stemming from high utilization, as discussed in §3.1. To demonstrate the impact of memory access latency variance on performance, we conduct a controlled experiment where the average memory access latency is kept constant, but the latency fluctuation around the average grows. The baseline is a toy memory system with a 150ns fixed access latency, and we evaluate three additional memory systems where memory access latency follows a bimodal distribution with 80%/20% probability of being lower/higher than the average. We keep average latency constant in all cases (80% × low_lat + 20% × high_lat = 150ns) and evaluate (low_lat, high_lat) pairs of (100ns, 350ns), (75ns, 450ns), and (50ns, 550ns), resulting in distributions with increasing standard deviations (stdev) of 100ns, 150ns, and 200ns. Variance is the square of the stdev and denotes how spread out the latency is from the average.
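The quoted means and standard deviations follow directly from the bimodal parameters; the snippet below is just that arithmetic.

```python
# Check the mean and stdev of each bimodal latency distribution used in the
# Fig. 3 experiment: 80% of accesses see low_lat, 20% see high_lat.
import math

P_LOW, P_HIGH = 0.8, 0.2

for low, high in [(100, 350), (75, 450), (50, 550)]:
    mean = P_LOW * low + P_HIGH * high
    stdev = math.sqrt(P_LOW * (low - mean) ** 2 + P_HIGH * (high - mean) ** 2)
    print(f"({low}ns, {high}ns): mean = {mean:.0f}ns, stdev = {stdev:.0f}ns")

# All three keep a 150ns mean, while stdev grows from 100ns to 150ns to 200ns.
```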
Figure 3: Performance of workloads for synthetic memory access latency following three (X, Y) bimodal distributions with a 4:1 X:Y ratio, all with 150ns average latency, normalized to a memory system with fixed 150ns latency. "gm" refers to geometric mean. Higher latency variance degrades performance.

Fig. 3 shows the relative performance of these memory systems for five workloads of decreasing memory bandwidth intensity. As variance increases, the average performance relative to the fixed-latency baseline noticeably drops to 86%, 78%, and 71%. This experiment highlights that solely relying on typical average metrics like Average Memory Access Time (AMAT) is an incomplete method of evaluating a memory system's performance. In addition to average values, the variance of memory access latency is a major performance determinant and therefore an important quality criterion for a memory system.
…where queuing dictates the memory system's effective access latency. By decreasing queuing, a CXL-based memory system reduces average memory access time and variance, both of which improve performance.

4. The CoaXiaL Server Design

We leverage CXL's per-pin bandwidth advantage to replace all of the DDR interfaces with PCIe-based CXL interfaces in our proposed CoaXiaL server. Fig. 4b depicts our architecture, where each on-chip DDR5 channel is replaced by several CXL channels, providing 2–4× higher aggregate memory bandwidth to the processor. Fig. 4a shows the baseline DDR-based server design for comparison. Each CXL channel is attached to a "Type-3" CXL device, which features a memory controller that manages a regular DDR5 channel connecting to DRAM. The processor implements the CXL.mem protocol of the CXL standard, which orchestrates data consistency and memory semantics management. The implementation of the caches and cores remains unchanged, as the memory controller still supplies 64B cache lines.

Figure 4: Overview of the baseline and CoaXiaL systems. (a) Baseline DDR-based server. (b) CoaXiaL replaces each DDR channel with several CXL channels. Each CXL channel connects to a Type-3 device with one DDR memory channel.

Table 1: Area of processor components at TSMC 7nm (relative to 1MB of L3 cache).
L3 cache (1MB): 1
Zen 3 core (incl. 512 KB L2): 6.5
x8 PCIe (PHY + ctrl): 5.9
DDR channel (PHY + ctrl): 10.8

Table 2: DDR-based versus alternative CoaXiaL server configurations.
| Server design | Core count | LLC per core | Memory interfaces | Relative mem. BW | Relative area | Comment |
| DDR-based | 144 | 2 MB | 12 DDR | 1× | 1 | baseline |
| CoaXiaL-5× | 144 | 2 MB | 60 x8 CXL | 5× | 1.17 | iso-pin |
| CoaXiaL-2× | 144 | 2 MB | 24 x8 CXL | 2× | 1.01 | iso-LLC, iso-area |
| CoaXiaL-4× | 144 | 1 MB | 48 x8 CXL | 4× | 1.01 | balanced, iso-area |
| CoaXiaL-asym | 144 | 1 MB | 48 x8 CXL-asym | asym. R/W, max BW | 1.01 | see §4.3 |

4.1. Processor Pin Considerations

A DDR5-4800 channel features a peak uni-directional bandwidth of 38.4GB/s and requires more than 160 processor pins to account for data and ECC bits, the command/address bus, data strobes, clock, feature modes, etc., as described in §2.1. A full 16-lane PCIe connection delivers 64GB/s of bi-directional bandwidth. Moreover, PCIe is modular, and higher-bandwidth channels can be constructed by grouping independent lanes together. Each lane requires just four processor pins: two each for transmitting and receiving data.

The PCIe standard currently only allows groupings of 1, 2, 4, 8, 12, 16, or 32 lanes. To match DDR5's bandwidth of 38.4GB/s, we opt for an x8 configuration, which requires 32 pins for a peak bandwidth of 32GB/s, 5× fewer pins than the 160 required for the DDR5 channel. As PCIe can sustain 32GB/s in each direction, the peak aggregate bandwidth of 8 lanes is 64GB/s, much higher than DDR5's 38.4GB/s. Considering a typical 2:1 read:write ratio, only 25.6GB/s of a DDR5 channel's bandwidth would be used in the DRAM-to-CPU direction, and about 13GB/s in the opposite direction. Furthermore, DDR controllers typically sustain only around 70% to 90% of the theoretical peak bandwidth. Thus, even after factoring in PCIe and CXL's header overheads, which reduce the practically attainable bandwidth [48] to 26GB/s in the DRAM-to-CPU direction and 13GB/s in the other direction, the x8 configuration supports a full DDR5 channel without becoming a choke point.
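The pin and bandwidth comparison above can be summarized in a few lines; the 26/13 GB/s goodput figures are the ones quoted from [48] in the text.

```python
# One x8 CXL channel vs. one DDR5-4800 channel, using the figures from the text.

DDR5_PEAK_GBPS = 38.4            # combined read+write peak
DDR5_PINS = 160
READ_FRACTION = 2 / 3            # typical 2:1 read:write ratio

ddr_read_demand = DDR5_PEAK_GBPS * READ_FRACTION         # ~25.6 GB/s DRAM->CPU
ddr_write_demand = DDR5_PEAK_GBPS * (1 - READ_FRACTION)  # ~12.8 GB/s CPU->DRAM

CXL_X8_PINS = 32
CXL_X8_RX_GOODPUT = 26.0         # GB/s after PCIe/CXL header overheads [48]
CXL_X8_TX_GOODPUT = 13.0

print(f"reads : need {ddr_read_demand:.1f} GB/s, x8 CXL offers {CXL_X8_RX_GOODPUT} GB/s")
print(f"writes: need {ddr_write_demand:.1f} GB/s, x8 CXL offers {CXL_X8_TX_GOODPUT} GB/s")
print(f"pins  : {DDR5_PINS} (DDR5) vs {CXL_X8_PINS} (x8 CXL), i.e. {DDR5_PINS // CXL_X8_PINS}x fewer")
```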
4.2. Silicon Area Considerations

When it comes to processor pin requirements, CoaXiaL allows the replacement of each DDR channel (i.e., PHY and memory controller) with five x8 PCIe PHYs and controllers, for a 5× memory bandwidth boost. However, the relative pin requirements of DDR and PCIe are not directly reflected in their relative on-chip silicon area requirements. Lacking publicly available information, we derive the relative size of DDR and PCIe PHYs and controllers from AMD Rome and Intel Golden Cove die shots [29, 53].

Table 1 shows the relative silicon die area that different key components of the processor account for. Assuming linear scaling of PCIe area with the number of lanes, as appears to be the case from the die shots, an x8 PCIe controller accounts for 54% of a DDR controller's area. Hence, replacing each DDR controller with four x8 PCIe controllers requires 2.19× more silicon area than what is allocated to DDR. However, DDR controllers account for a small fraction of the total CPU die.
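The 54% and 2.19× figures follow from Table 1's area estimates:

```python
# Area arithmetic behind Section 4.2, using Table 1's estimates
# (normalized to 1MB of L3 cache).

DDR_CHANNEL_AREA = 10.8   # DDR PHY + controller
PCIE_X8_AREA = 5.9        # x8 PCIe PHY + controller

print(f"one x8 PCIe vs. one DDR channel : {PCIE_X8_AREA / DDR_CHANNEL_AREA:.1%}")
print(f"four x8 PCIe vs. one DDR channel: {4 * PCIE_X8_AREA / DDR_CHANNEL_AREA:.2f}x")
# ~54.6% (quoted as 54% in the text) and ~2.19x, respectively.
```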
Leveraging Table 1's information, we now consider a number of alternative CoaXiaL server designs, shown in Table 2. We focus on high-core-count servers optimized for throughput, such as the upcoming AMD EPYC Bergamo (128 cores) [37], and Intel Granite Rapids (128 cores) and Sierra Forest (144 cores) [38]. All of them feature 12 DDR5 channels, resulting in a core-to-memory-controller (core:MC) ratio of 10.7:1 to 12:1. A common design choice to accommodate such high core counts is a reduced LLC capacity; e.g., moving from the 96-core Genoa [52] to the 128-core Bergamo, AMD halves the LLC per core to 2MB. We thus consider a 144-core baseline server processor with 12 DDR5 channels and 2MB of LLC per core (Table 2, first row).

With pin count as its only limitation, CoaXiaL-5× replaces each DDR channel with 5 x8 CXL interfaces, for a 5× bandwidth increase. Unfortunately, that results in a 17% increase in die area to accommodate all the PCIe PHYs and controllers. Hence, we also consider two iso-area alternatives. CoaXiaL-2× leverages CXL to double memory bandwidth without any microarchitectural changes. CoaXiaL-4× quadruples the available memory bandwidth compared to the baseline CPU by halving the LLC from 288MB to 144MB.
4.3. CoaXiaL Asymmetric Interface Optimization

A key difference between CXL and DDR is that the former provisions dedicated pins and wires for each data movement direction (RX and TX). The PCIe standard defines a one-to-one match of TX and RX pins: e.g., an x8 PCIe configuration implies 8 TX and 8 RX lanes. We observe that while uniform bandwidth provisioning in each direction is reasonable for a peripheral device like a NIC, it is not for memory traffic. Because (i) most workloads read more data than they write and (ii) every cache block that is written must typically be read first, read:write ratios are usually in the 3:1 to 2:1 range rather than 1:1. Thus, in the current 1:1 design, read bandwidth becomes the bottleneck while write bandwidth is underutilized. Given this observation, and that serial interfaces do not fundamentally require 1:1 RX:TX bandwidth provisioning [59], we consider a CoaXiaL design with asymmetric RX/TX lane provisioning to better match memory traffic characteristics. While the PCIe standard currently disallows such a configuration, we investigate the potential performance benefits of revisiting that restriction. We call such a channel CXL-asym.

We consider a system leveraging such CXL-asym channels to compose an additional CoaXiaL-asym configuration. An x8 CXL channel consists of 32 pins, 16 each way. Without the current 1:1 PCIe restriction, CXL-asym repurposes the same pin count to use 20 RX pins and 12 TX pins, resulting in 40GB/s RX and 24GB/s TX of raw bandwidth. Accounting for PCIe and CXL's header overheads, the realized bandwidth is approximately 32GB/s for reads (compared to 26GB/s for a symmetric x8 CXL channel) and 10GB/s for writes [48]. To utilize the additional read bandwidth, we provision two DDR controllers per CXL-asym channel on the Type-3 device. Therefore, the number of CXL channels on the processor (as well as their area overhead) remains unchanged. While the 32GB/s read bandwidth of CXL-asym is insufficient to support two DDR channels at their combined read bandwidth of about 52GB/s (assuming a 2:1 R:W ratio), queuing delays at the DDR controller typically become significant at a much lower utilization point, as shown in Fig. 2a. Therefore, CoaXiaL-asym still provides sufficient bandwidth to eliminate contention at the queues by lowering the overall bandwidth utilization, while providing higher aggregate bandwidth.
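The raw-bandwidth figures for the asymmetric split follow from the per-pair rate; below is a minimal check, assuming ~4GB/s per PCIe 5.0 differential pair per direction.

```python
# Raw bandwidth of the CXL-asym pin split described above, assuming each
# direction needs one differential pair (2 pins) per ~4GB/s PCIe 5.0 lane.

GBPS_PER_PAIR = 4.0
TOTAL_PINS = 32             # pin budget of a standard x8 channel
RX_PINS, TX_PINS = 20, 12   # asymmetric split chosen in Section 4.3
assert RX_PINS + TX_PINS == TOTAL_PINS

rx_raw = (RX_PINS // 2) * GBPS_PER_PAIR  # 40 GB/s toward the CPU (reads)
tx_raw = (TX_PINS // 2) * GBPS_PER_PAIR  # 24 GB/s toward memory (writes)
print(f"raw: {rx_raw:.0f} GB/s RX, {tx_raw:.0f} GB/s TX")
# With header overheads, the text estimates ~32 GB/s read and ~10 GB/s write
# goodput, versus 26/13 GB/s for the symmetric x8 channel.
```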
4.4. Additional Benefits of CoaXiaL

Our analysis focuses on the performance impact of a CXL-based memory system. While a memory capacity and cost analysis is beyond the scope of this paper, CoaXiaL can have additional positive effects on those fronts that are noteworthy. Servers provisioned for maximum memory capacity deploy two high-density DIMMs per DDR channel. The implications are two-fold. First, two-DIMMs-per-channel (2DPC) configurations increase capacity over 1DPC at the cost of ∼15% memory bandwidth. Second, DIMM cost grows super-linearly with density; for example, 128GB/256GB DIMMs cost 5×/20× more than 64GB DIMMs. By enabling more DDR channels, CoaXiaL allows the same or higher DRAM capacity with 1DPC and lower-density DIMMs.

5. Evaluation Methodology

System configurations. We compare our CoaXiaL server design, which replaces the processor's DDR channels with CXL channels, to a typical DDR-based server processor.
• DDR-based baseline. We simulate 12 cores and one DDR5-4800 memory channel as a scaled-down version of Table 2's 144-core CPU.
• CoaXiaL servers. We evaluate several servers that replace the on-chip DDR interfaces with CXL: CoaXiaL-2×, CoaXiaL-4×, and CoaXiaL-asym (Table 2).
We simulate the above system configurations using ChampSim [1] coupled with DRAMsim3 [26]. Table 3 summarizes the configuration parameters used.

Table 3: System parameters used for simulation on ChampSim.
CPU: 12 OoO cores, 2GHz, 4-wide, 256-entry ROB
L1: 32KB L1-I & L1-D, 8-way, 64B blocks, 4-cycle access
L2: 512KB, 8-way, 12-cycle access
LLC: shared & non-inclusive, 16-way, 46-cycle access; 2MB/core for the DDR baseline, 1–2MB/core for CoaXiaL-* (see Table 2)
Memory: DDR5-4800 [36], 128GB per channel, 2 sub-channels per channel, 1 rank per sub-channel, 32 banks per rank; 1 channel for the DDR baseline, 2–4 CXL-attached channels for CoaXiaL-* (see Table 2), 8 channels for CoaXiaL-asym (see §4.3)

CXL performance modeling. For CoaXiaL, we model CXL controllers and the PCIe bus on both the processor and the Type-3 device. Each CXL controller comprises a CXL port that incurs a fixed delay of 12ns, accounting for flit-packing, encoding/decoding, packet processing, etc. [43]. The PCIe bus incurs traversal latency due to the limited channel bandwidth and bus width. For an x8 channel, the peak 32GB/s bandwidth results in 26/13 GB/s RX/TX goodput when header overheads are factored in, and 32/10 GB/s RX/TX in the case of CXL-asym channels. The corresponding link traversal latency is 2.5/5.5 ns RX/TX for an x8 channel and 2/9 ns RX/TX for CXL-asym. Additionally, the CXL controller maintains message queues to buffer requests. Therefore, in addition to the minimum latency overhead of about 30ns (or more, in our sensitivity analysis), queuing effects at the CXL controller are also captured and reflected in the performance.
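A minimal sketch of how such a fixed interface overhead can be composed per read request is shown below; the two-port-plus-link decomposition and the function name are our assumptions, not ChampSim code.

```python
# Sketch of the fixed (unloaded) CXL interface overhead per read round trip,
# under our assumption that a request crosses a 12ns CXL port on each side of
# the link, plus one link traversal per direction. Queuing at the CXL
# controller's message queues is modeled separately by the simulator.

PORT_DELAY_NS = 12.0                 # flit-packing, encode/decode, processing [43]
LINK_TRAVERSAL_NS = {                # (RX, TX) per-direction traversal latency
    "x8": (2.5, 5.5),
    "cxl-asym": (2.0, 9.0),
}

def interface_overhead_ns(channel: str = "x8") -> float:
    rx_ns, tx_ns = LINK_TRAVERSAL_NS[channel]
    # request: CPU -> device (TX direction); response: device -> CPU (RX).
    return 2 * PORT_DELAY_NS + tx_ns + rx_ns

print(interface_overhead_ns("x8"))        # 32.0 ns -> the ~30ns premium in the text
print(interface_overhead_ns("cxl-asym"))  # 35.0 ns
```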
Workloads. We evaluate 35 workloads from various benchmark suites. We deploy the same workload instance on all cores and simulate 200 million instructions per core after fast-forwarding each application to a region of interest.
• Graph analytics: We use 12 workloads from the Ligra benchmark suite [49].
• STREAM: We run the four kernels (copy, scale, add, triad) from the STREAM benchmark [34] to represent bandwidth-intensive matrix operations in which ML workloads spend a significant portion of their execution time.
• SPEC & PARSEC: We evaluate 13 workloads from the SPEC-speed 2017 [50] benchmark suite in ref mode, as well as five PARSEC workloads [5].
• KVS & data analytics: We evaluate masstree [32] and kmeans [28] to represent key-value store and data analytics workloads, respectively.
Table 4 summarizes all our evaluated workloads, along with their IPC and MPKI as measured on the DDR-based baseline.

Table 4: Workload summary (IPC and LLC MPKI on the DDR-based baseline).
| Ligra | IPC | LLC MPKI |
| PageRank | 0.36 | 40 |
| PageRank Delta | 0.31 | 27 |
| Components-shortcut | 0.34 | 48 |
| Components | 0.36 | 48 |
| BC | 0.33 | 34 |
| Radii | 0.41 | 33 |
| BFSCC | 0.68 | 17 |
| BFS | 0.69 | 15 |
| BFS-Bitvector | 0.84 | 15 |
| BellmanFord | 0.86 | 9 |
| Triangle | 0.65 | 21 |
| MIS | 1.37 | 8 |

| STREAM | IPC | LLC MPKI |
| Stream-copy | 0.17 | 58 |
| Stream-scale | 0.21 | 48 |
| Stream-add | 0.16 | 69 |
| Stream-triad | 0.18 | 59 |

| KVS & Data analytics | IPC | LLC MPKI |
| Masstree | 0.37 | 21 |
| Kmeans | 0.50 | 36 |

| SPEC | IPC | LLC MPKI |
| lbm | 0.14 | 64 |
| bwaves | 0.33 | 14 |
| cactusBSSN | 0.68 | 8 |
| fotonik3d | 0.33 | 22 |
| cam4 | 0.87 | 6 |
| wrf | 0.61 | 11 |
| mcf | 0.793 | 13 |
| roms | 0.783 | 6 |
| pop2 | 1.55 | 3 |
| omnetpp | 0.51 | 10 |
| xalancbmk | 0.55 | 12 |
| gcc | 0.31 | 19 |

| PARSEC | IPC | LLC MPKI |
| fluidanimate | 0.78 | 7 |
| facesim | 0.74 | 6 |
| raytrace | 1.17 | 5 |
| streamcluster | 0.99 | 14 |
| canneal | 0.66 | 7 |

6. Evaluation Results

We first compare our main CoaXiaL design, CoaXiaL-4×, with the DDR-based baseline by analyzing the impact of reduced bandwidth utilization and queuing delays on performance in §6.1. §6.2 highlights the effect of memory access pattern and distribution on performance. §6.3 presents the performance of alternative CoaXiaL designs, CoaXiaL-2× and CoaXiaL-asym, and §6.4 demonstrates the impact of a more conservative 50ns CXL latency penalty. §6.5 evaluates CoaXiaL at different server utilization points, and §6.6 analyzes CoaXiaL's power implications. Note that, throughout §6's evaluation, a reference to "CoaXiaL" without a following qualifier implies the CoaXiaL-4× configuration.

6.1. From Queuing Reduction to Performance Gains

Fig. 5 (top) shows the performance of CoaXiaL-4× relative to the baseline DDR-based system. Most workloads exhibit significant speedup, up to 3× for lbm and 1.52× on average. 10 of the 35 workloads experience more than 2× speedup. Four workloads lose performance, with gcc most significantly impacted at 26% IPC loss. The workloads most likely to suffer a performance loss are those with low to moderate memory traffic and heavy dependencies among memory accesses.
Figure 5: Normalized performance of CoaXiaL over the DDR-based baseline (top), memory access latency breakdown (middle), and memory bandwidth utilization (bottom). Workloads are grouped into their benchmark suites. "gm" refers to geometric mean. CoaXiaL offers a 1.52× average speedup due to 4× higher bandwidth, lowering utilization and mitigating queuing effects.

Fig. 5 (bottom) shows memory bandwidth utilization for the DDR-based baseline and CoaXiaL-4×, which provides 4× higher bandwidth than the baseline. CoaXiaL distributes memory requests over more channels, which reduces the bandwidth utilization of the system, in turn reducing contention for the memory bus. The lower bandwidth utilization and contention drastically reduce the queuing delay in CoaXiaL for memory-intensive workloads. Fig. 5 (middle) demonstrates this reduction with a breakdown of the average memory access latency (as measured from the LLC miss register) into DRAM service time, queuing delay, and CXL interface delay (only applicable to CoaXiaL).

In many cases, CoaXiaL enables the workload to drive significantly more aggregate bandwidth from the system. For instance, stream-copy is bottlenecked by the baseline system's constrained bandwidth, resulting in an average queuing delay exceeding 300ns that largely dictates the overall access latency (the total height of the stacked bars). CoaXiaL reduces queuing delay to just 55ns for this workload, more than compensating for the 30ns CXL interface latency overhead. The overall average access latency for stream-copy drops from 348ns in the baseline to just 120ns, enabling CoaXiaL to drive memory requests at a 2.9× higher rate than the baseline, thus achieving commensurate speedup.
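The 2.9× figure is what Little's law predicts from the latency reduction alone, assuming the cores keep a roughly constant number of misses in flight; a back-of-the-envelope check:

```python
# Little's law check for stream-copy: with a roughly constant number of
# outstanding misses, the sustainable request rate scales as 1/latency.
# (A simplification: it ignores any change in memory-level parallelism.)

BASELINE_LATENCY_NS = 348.0   # average memory access latency, DDR baseline
COAXIAL_LATENCY_NS = 120.0    # average memory access latency, CoaXiaL-4x

print(f"request-rate gain: {BASELINE_LATENCY_NS / COAXIAL_LATENCY_NS:.1f}x")  # ~2.9x
```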
Despite provisioning 4× more bandwidth, CoaXiaL reduces average bandwidth utilization from 54% to 34% for workloads that have more than a 2× performance improvement, highlighting that the extra bandwidth is indeed utilized by these workloads. For most of the other workloads, CoaXiaL's average memory access latency is much lower than the baseline's, despite the CXL interface's latency overhead.

On average, workloads experience 144ns of queuing delay on top of ∼40ns of DRAM service time. By slashing queuing delay to just 31ns on average, CoaXiaL reduces average memory access latency, thereby boosting performance. Overall, Fig. 5's results confirm our key insight (see §3.1): queuing delays largely dictate the average memory access latency.

Takeaway #1: CoaXiaL drastically reduces queuing delays, resulting in lower effective memory access latency for bandwidth-hungry workloads.

6.2. Beyond Average Bandwidth Utilization and Access Latency

While most of CoaXiaL's performance gains can be justified by the achieved reduction in average memory latency, a compounding positive effect is the reduction in latency variance, as evidenced in §3.2. For each of the four evaluated workload groups, Fig. 6a shows the mean average latency and standard deviation (stdev) for CoaXiaL and the DDR-based baseline. As already seen in §6.1, CoaXiaL delivers a 45–60% reduction in average memory access latency. Fig. 6a shows that CoaXiaL also achieves a similar reduction in stdev, indicating lower dispersion and fewer extreme high-latency values.

To further demonstrate the impact of access latency distribution and temporal effects, we study a few workloads in more depth. Streamcluster presents an interesting case because its performance improves despite a slightly higher average memory access latency of 76ns compared to the baseline's 69ns (see Fig. 5). Fig. 6b shows the Cumulative Distribution Function (CDF) of Streamcluster's memory access latencies, illustrating that the baseline results in a higher variance than CoaXiaL (stdev of 88 versus 76), due to imbalanced queuing across DRAM banks. The tighter distribution of memory access latency allows CoaXiaL to outperform the baseline despite a 10% higher average memory access latency.

Some workloads benefit from CoaXiaL more than other workloads with similar or higher memory bandwidth utilization (Fig. 5, bottom). For example, bwaves uses a mere 32% of the baseline's available bandwidth but suffers an overwhelming 390ns queuing delay. Even though bwaves uses less bandwidth on average compared to other workloads (e.g., radii, with 65% bandwidth utilization), it exhibits bursty behavior that incurs queuing spikes, which can be more effectively absorbed by CoaXiaL. Kmeans exhibits the opposite case. Despite having the highest bandwidth utilization in the baseline system, it experiences a relatively low average queuing delay of 50ns and exhibits one of the lowest latency variance values across workloads, indicating an even distribution of accesses…
Figure 6: Memory access latency distribution. (a) Average memory access latency per workload group, with stdev shown as error bars. (b) Cumulative Distribution Function (CDF) of memory access time for Streamcluster (mean/stdev: Baseline 69/88 ns, CoaXiaL 76/76 ns).

…from four with 30ns latency penalty) take a performance hit. These results imply that while a CoaXiaL with a higher CXL latency is still worth pursuing, it should be used selectively for memory-intensive workloads. Deploying different classes of servers for different optimization goals is common practice not only in public clouds [15] but also in private clouds (e.g., different web and backend server configurations) [12, 18].

Takeaway #3: Even with a 50ns CXL latency overhead, CoaXiaL achieves a considerable 1.3× average speedup across all workloads.
Figure 7: CoaXiaL's performance at different design points, normalized to the DDR-based server baseline. CoaXiaL-4× outperforms CoaXiaL-2×, despite its halved LLC size. CoaXiaL-asym considerably outperforms our default CoaXiaL-4× design.
Figure 8: CoaXiaL's performance for different CXL latency premiums, normalized to the DDR-based server. Even with a 50ns interface latency penalty, CoaXiaL yields a 1.33× average speedup.
…consumes less energy to complete the same work, even if it operates at a higher power.

We model power for a manycore processor similar to AMD EPYC Bergamo (128 cores) [37] or Sierra Forest (144 cores) [38]. The latter is expected to have a 500W TDP, which is in line with current processors (e.g., the 96-core AMD EPYC Genoa [52] has a TDP of 360W). While the memory controller and interface require negligible power compared to the processor, we include them for completeness. We estimate controller and interface power per DDR5 channel to be 0.5W and 0.6W, respectively [57], or 13W in total for a baseline processor with 12 channels. Similarly, PCIe 5.0's interface power is ∼0.2W per lane [4], or 77W for the 384 lanes required to support CoaXiaL's 48 DDR5 channels.

A significant fraction of a large-scale server's power is attributed to memory. We use Micron's power calculator tool [35] to compute our baseline's and the CXL system's DRAM power requirements, taking into account the observed average memory bandwidth utilization of 52% for the baseline and 21% for CoaXiaL. As this tool only computes power for modules up to DDR4-3200MT/s, we model a 64GB 2-rank DDR4-3200 DIMM (16GB 2-rank module for CXL) and double the power to obtain the power consumption of a 128GB DDR5 channel (32GB channel for CXL). While CoaXiaL employs 4× more DIMMs than the baseline, its power consumption is only 1.75× higher due to lower memory utilization.

Table 5 summarizes the key power components for the baseline and CoaXiaL systems. The overall system power consumption is 713W for the baseline system and 1.18kW for CoaXiaL, a 66% increase. Crucially, CoaXiaL massively boosts performance, reducing CPI by 34%. As a result, CoaXiaL reduces the baseline's EDP by a considerable 28%.

Table 5: Energy-Delay Product (EDP = system power × CPI²) comparison for the target 144-core server. Lower EDP is better.
| EDP component | Baseline | CoaXiaL |
| Processor package power | 500W | 500W |
| DDR5 MC & PHY power (all) | 13W | 52W |
| DDR5 DIMM power (static and access) | 200W | 551W |
| CXL interface power (idle and dynamic) | N/A | 77W |
| Total system power | 713W | 1,180W |
| Average CPI (all workloads) | 2.02 | 1.33 |
| EDP (all workloads) | 2,909 | 2,087 (0.72×) |
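Table 5's bottom lines follow directly from its power and CPI entries:

```python
# Reproduce Table 5's bottom lines: EDP = system power x CPI^2.

baseline = {"power_w": 713.0, "cpi": 2.02}
coaxial = {"power_w": 1180.0, "cpi": 1.33}

def edp(cfg: dict) -> float:
    return cfg["power_w"] * cfg["cpi"] ** 2

print(f"baseline EDP : {edp(baseline):.0f}")                        # ~2909
print(f"CoaXiaL EDP  : {edp(coaxial):.0f}")                         # ~2087
print(f"EDP ratio    : {edp(coaxial) / edp(baseline):.2f}x")        # ~0.72x (28% lower)
print(f"CPI reduction: {1 - coaxial['cpi'] / baseline['cpi']:.0%}") # ~34%
```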
Figure 9: Performance of CoaXiaL as a function of active cores, normalized to the DDR-based server baseline at the same number of active cores.

Takeaway #5: In addition to boosting performance, CoaXiaL affords a more efficient system with a 28% lower energy-delay product.

6.7. Evaluation Summary
CXL-based memory systems hold great promise for manycore server processors. Replacing DDR with CXL-based memory that offers 4× higher bandwidth at a 30ns latency premium achieves a 1.52× average speedup across various workloads. Furthermore, a CoaXiaL-asym design demonstrates opportunity for additional gain (1.67× average speedup), assuming a modification to the PCIe standard that departs from the rigid 1:1 read:write bandwidth provisioning and allows an asymmetric, workload-aware one. Even if CoaXiaL incurs a 50ns latency premium, it promises substantial performance improvement (1.33× on average). We show that our benefits stem from reduced memory contention: by reducing the utilization of available bandwidth resources, CoaXiaL mitigates queuing effects, thus reducing both average memory access latency and its variance.

7. Related Work

We discuss recent works investigating CXL-based memory system solutions, prior memory systems leveraging serial interfaces, as well as circuit-level and alternative techniques to improve bandwidth and optimize the memory system.

Emerging CXL-based memory systems. Industry is rapidly adopting CXL and already investigating its deployment in production systems to reap the benefits of memory expansion and memory pooling. Microsoft leverages CXL to pool memory across servers, improving utilization and thus reducing cost [25]. In the same vein, Gouk et al. [16] leverage CXL to prototype a practical instance of disaggregated memory [27]. Aspiring to use CXL as a memory expansion technique that will enable a secondary memory tier of higher capacity than DDR, Meta's recent work optimizes data placement in this new type of two-tier memory hierarchy [33]. Using an FPGA-based prototype of a CXL Type-3 memory device, Ahn et al. evaluate database workloads on a hybrid DDR/CXL memory system and demonstrate minimal performance degradation, suggesting that CXL-based memory expansion is cost-efficient and performant [3]. Instead of using CXL-attached memory as a memory system extension, our work stands out as the first to propose CXL-based memory as a complete replacement of DDR-attached memory for server processors handling memory-intensive workloads.

Memory systems leveraging serial interfaces. There have been several prior memory system proposals leveraging serial links for high-bandwidth, energy-efficient data transfers. Micron's HMC was connected to the host over 16 SerDes lanes, delivering up to 160GB/s [41]. IBM's Centaur is a memory capacity expansion solution, where the host uses SerDes to connect to a buffer-on-board, which in turn hosts several DDR channels [54]. FBDIMM [14] leverages a concept similar to Centaur's buffer-on-board to increase memory bandwidth and capacity. An advanced memory buffer (AMB) acts as a bridge between the processor and the memory modules, connecting to the processor over serial links and featuring an abundance of pins to enable multiple parallel interfaces to DRAM modules. Similar to CXL-attached memory, a key concern with FBDIMM is its increased latency. The Open Memory Interface (OMI) is a recent high-bandwidth memory interface leveraging serial links, delivering bandwidth comparable to HBM but without HBM's tight capacity limitations [7]. Originally a subset of OpenCAPI, OMI is now part of the CXL Consortium.

Researchers have also proposed memory system architectures making use of high-bandwidth serial interfaces. In MeSSOS' two-stage memory system, high-bandwidth serial links connect to a high-bandwidth DRAM cache, which is then chained to planar DRAM over DDR [58]. Ham et al. propose disintegrated memory controllers attached over SerDes, aiming to make the memory system more modular and to facilitate supporting heterogeneous memory technologies [17]. Alloy combines parallel and serial interfaces to access memory, maintaining the parallel interfaces for lower-latency memory access [59]. Unlike our proposal of fully replacing DDR processor interfaces with CXL for memory-intensive servers, Alloy's approach is closer to the hybrid DDR/CXL memory systems that most ongoing CXL-related research envisions.

Circuit-level techniques to boost memory bandwidth. HBM [23] and die-stacked DRAM caches offer an order of magnitude higher bandwidth than planar DRAM, but suffer from limited capacity [22, 30, 44]. BOOM [60] buffers outputs from multiple LPDDR ranks to reduce power and sustain server-level performance, but offers modest gains due to low-frequency LPDDR and limited bandwidth improvement. Chen et al. [6] propose dynamic reallocation of power pins to boost data transfer capability from memory during memory-intensive phases, during which processors are memory bound and hence draw less power. Pal et al. [40] propose packageless processors to mitigate pin limitations and boost the memory bandwidth that can be routed to the processor. Unlike these proposals, we focus on conventional processors, packaging, and commodity DRAM, aiming to reshape the memory system of server processors by leveraging the widely adopted, up-and-coming CXL interconnect.

Other memory system optimizations. Transparent memory compression techniques are a compelling approach to increasing effective memory bandwidth [61]. Malladi et al. [31] leverage mobile LPDDR DRAM devices to design a more energy-efficient memory system for servers without performance loss. These works are orthogonal to our proposed approach. Storage-class memory, like Phase-Change Memory [13] or Intel's Optane [19], has attracted significant interest as a way to boost a server's memory capacity, triggering research activity on transforming the memory hierarchy to best accommodate such new memories [2, 10, 24, 56]. Unlike our work, such systems often trade off bandwidth for capacity.
8. Conclusion

Technological trends motivate a server processor design where all memory is attached to the processor over the emerging CXL interconnect instead of DDR. CXL's superior bandwidth per pin helps bandwidth-hungry server processors scale the bandwidth wall. By distributing memory requests over 4× more memory channels, CXL reduces queueing effects on the memory bus. Because queuing delay dominates access latency in loaded memory systems, such reduction more than compensates for the interface latency overhead introduced by CXL. Our evaluation on a diverse range of memory-intensive workloads shows that our proposed CoaXiaL server delivers a 1.52× speedup on average, and up to 3×.

References

[1] "ChampSim." [Online]. Available: https://github.com/ChampSim/ChampSim
[2] N. Agarwal and T. F. Wenisch, "Thermostat: Application-transparent Page Management for Two-tiered Main Memory," in Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXII), 2017, pp. 631–644.
[3] M. Ahn, A. Chang, D. Lee, J. Gim, J. Kim, J. Jung, O. Rebholz, V. Pham, K. T. Malladi, and Y.-S. Ki, "Enabling CXL Memory Expansion for In-Memory Database Management Systems," in Proceedings of the 18th International Workshop on Data Management on New Hardware (DaMoN), 2022, pp. 8:1–8:5.
[4] M. Bichan, C. Ting, B. Zand, J. Wang, R. Shulyzki, J. Guthrie, K. Tyshchenko, J. Zhao, A. Parsafar, E. Liu, A. Vatankhahghadim, S. Sharifian, A. Tyshchenko, M. D. Vita, S. Rubab, S. Iyer, F. Spagna, and N. Dolev, "A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol," in Proceedings of the 2020 IEEE Custom Integrated Circuits Conference, 2020, pp. 1–4.
[5] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architecture and Compilation Techniques (PACT), 2008, pp. 72–81.
[6] S. Chen, Y. Hu, Y. Zhang, L. Peng, J. Ardonne, S. Irving, and A. Srivastava, "Increasing off-chip bandwidth in multi-core processors with switchable pins," in Proceedings of the 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 385–396.
[7] T. M. Coughlin and J. Handy, "Higher Performance and Capacity with OMI Near Memory," in Proceedings of the 2021 Annual Symposium on High-Performance Interconnects, 2021, pp. 68–71.
[8] D. I. Cutress and A. Frumusanu, "AMD 3rd Gen EPYC Milan review: A peak vs per core performance balance," 2021. [Online]. Available: https://www.anandtech.com/show/16529/amd-epyc-milan-review
[9] CXL Consortium, "Compute Express Link (CXL) Specification, Revision 3.0, Version 1.0," 2022. [Online]. Available: https://www.computeexpresslink.org/_files/ugd/0c1418_1798ce97c1e6438fba818d760905e43a.pdf
[10] S. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson, and K. Schwan, "Data tiering in heterogeneous memory systems," in Proceedings of the 2016 EuroSys Conference, 2016, pp. 15:1–15:16.
[11] EE Times, "CXL will absorb Gen-Z," 2021. [Online]. Available: https://www.eetimes.com/cxl-will-absorb-gen-z/
[12] Engineering at Meta, "Introducing "Yosemite": the first open source modular chassis for high-powered microservers," 2015. [Online]. Available: https://engineering.fb.com/2015/03/10/core-data/introducing-yosemite-the-first-open-source-modular-chassis-for-high-powered-microservers/
[13] S. W. Fong, C. M. Neumann, and H.-S. P. Wong, "Phase-change memory—towards a storage-class memory," IEEE Transactions on Electron Devices, vol. 64, no. 11, pp. 4374–4385, 2017.
[14] B. Ganesh, A. Jaleel, D. Wang, and B. L. Jacob, "Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling," in Proceedings of the 13th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2007, pp. 109–120.
[15] Google Cloud, "Machine families resource and comparison guide." [Online]. Available: https://cloud.google.com/compute/docs/machine-resource
[16] D. Gouk, S. Lee, M. Kwon, and M. Jung, "Direct Access, High-Performance Memory Disaggregation with DirectCXL," in Proceedings of the 2022 USENIX Annual Technical Conference (ATC), 2022, pp. 287–294.
[17] T. J. Ham, B. K. Chelepalli, N. Xue, and B. C. Lee, "Disintegrated control for energy-efficient and heterogeneous memory systems," in Proceedings of the 19th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2013, pp. 424–435.
[18] K. M. Hazelwood, S. Bird, D. M. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective," in Proceedings of the 24th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2018, pp. 620–629.
[19] Intel Corporation, "Intel Optane DC Persistent Memory." [Online]. Available: https://www.intel.com/content/www/us/en/products/memory-storage/optane-dc-persistent-memory.html
[20] J. Jaffari, A. Ansari, and R. Beraha, "Systems and methods for a hybrid parallel-serial memory access," 2015, US Patent 9747038B2.
[21] JEDEC, "DDR5 SDRAM standard (JESD79-5B)," 2022.
[22] D. Jevdjic, S. Volos, and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache," in Proceedings of the 40th International Symposium on Computer Architecture (ISCA), 2013, pp. 404–415.
[23] J. Kim and Y. Kim, "HBM: Memory solution for bandwidth-hungry processors," in Hot Chips Symposium, 2014, pp. 1–24.
[24] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in Proceedings of the 36th International Symposium on Computer Architecture (ISCA), 2009, pp. 2–13.
[25] H. Li, D. S. Berger, S. Novakovic, L. Hsu, D. Ernst, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini, "Pond: CXL-Based Memory Pooling Systems for Cloud Platforms," in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXVIII), 2023.
[26] S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. L. Jacob, "DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator," IEEE Comput. Archit. Lett., vol. 19, no. 2, pp. 110–113, 2020.
[27] K. T. Lim, J. Chang, T. N. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," in Proceedings of the 36th International Symposium on Computer Architecture (ISCA), 2009, pp. 267–278.
[28] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–136, 1982.
[29] Locuza, "Die walkthrough: Alder Lake-S/P and a touch of Zen 3," 2022. [Online]. Available: https://locuza.substack.com/p/die-walkthrough-alder-lake-sp-and
[30] G. H. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die-stacked DRAM caches," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 454–464.
[31] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, "Towards energy-proportional datacenter memory with mobile DRAM," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 37–48.
[32] Y. Mao, E. Kohler, and R. T. Morris, "Cache craftiness for fast multicore key-value storage," in Proceedings of the 2012 EuroSys Conference, 2012, pp. 183–196.
[33] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. O. Kanaujia, and P. Chauhan, "TPP: Transparent Page Placement for CXL-Enabled Tiered Memory," in Proceedings of the 28th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XXVIII), 2023.
[34] J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 1995.
[35] Micron Technology Inc., "System Power Calculators," https://www.micron.com/support/tools-and-utilities/power-calc.
[36] Micron Technology Inc., "DDR5 SDRAM Datasheet," 2022. [Online]. Available: https://media-www.micron.com/-/media/client/global/documents/products/data-sheet/dram/ddr5/ddr5_sdram_core.pdf
[37] H. Mujtaba, "AMD EPYC Bergamo 'Zen 4C' CPUs Being Deployed In 1H 2023 To Tackle Arm CPUs, Instinct MI300 APU Back In Labs," 2022. [Online]. Available: https://wccftech.com/amd-epyc-bergamo-zen-4c-cpus-being-deployed-in-1h-2023-tackle-arm-instinct-mi300-apu-back-in-labs/amp/
[38] H. Mujtaba, "Intel Granite Rapids & Sierra Forest Xeon CPU Detailed In Avenue City Platform Leak: Up To 500W TDP & 12-Channel DDR5," 2023. [Online]. Available: https://wccftech.com/intel-granite-rapids-sierra-forest-xeon-cpu-detailed-in-avenue-city-platform-leak-up-to-500w-tdp-12-channel-ddr5/
[39] B. Nitin, W. Randy, I. Shinichiro, F. Eiji, R. Shibata, S. Yumiko, and O. Megumi, "DDR5 design challenges," in 2018 IEEE 22nd Workshop on Signal and Power Integrity (SPI), 2018, pp. 1–4.
[40] S. Pal, D. Petrisko, A. A. Bajwa, P. Gupta, S. S. Iyer, and R. Kumar, "A Case for Packageless Processors," in Proceedings of the 24th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2018, pp. 466–479.
[41] J. T. Pawlowski, "Hybrid memory cube (HMC)," in Hot Chips Symposium, 2011, pp. 1–24.
[42] PLDA, "Breaking the PCIe Latency Barrier with CXL," 2020. [Online]. Available: https://www.brighttalk.com/webcast/18357/434922
[43] PLDA and AnalogX, "PLDA and AnalogX Announce Market-leading CXL 2.0 Solution featuring Ultra-low Latency and Power," 2021. [Online]. Available: https://www.businesswire.com/news/home/20210602005484/en/PLDA-and-AnalogX-Announce-Market-leading-CXL-2.0-Solution-featuring-Ultra-low-Latency-and-Power
[44] M. K. Qureshi and G. H. Loh, "Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design," in Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012, pp. 235–246.
[45] R. Rooney and N. Koyle, "Micron DDR5 SDRAM: New Features," Micron Technology Inc., Tech. Rep., 2019.
[46] P. Rosenfeld, E. Cooper-Balis, and B. L. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, 2011.
[47] D. D. Sharma, "PCI Express 6.0 Specification at 64.0 GT/s with PAM-4 signaling: a low latency, high bandwidth, high reliability and cost-effective interconnect," in Proceedings of the 2020 Annual Symposium on High-Performance Interconnects, 2020, pp. 1–8.
[48] D. D. Sharma, "Compute Express Link: An open industry-standard interconnect enabling heterogeneous data-centric computing," in Proceedings of the 2022 Annual Symposium on High-Performance Interconnects, 2022, pp. 5–12.
[49] J. Shun and G. E. Blelloch, "Ligra: a lightweight graph processing framework for shared memory," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2013, pp. 135–146.
[50] Standard Performance Evaluation Corporation, "SPEC CPU2017 Benchmark Suite." [Online]. Available: http://www.spec.org/cpu2017/
[51] P. Stanley-Marbell, V. C. Cabezas, and R. P. Luijten, "Pinned to the walls: impact of packaging and application properties on the memory and power walls," in Proceedings of the 2011 International Symposium on Low Power Electronics and Design, 2011, pp. 51–56.
[52] StorageReview, "4th Gen AMD EPYC Review (AMD Genoa)," 2022. [Online]. Available: https://www.storagereview.com/review/4th-gen-amd-epyc-review-amd-genoa
[53] TechPowerUp, "AMD "Matisse" and "Rome" IO Controller Dies Mapped Out," 2020. [Online]. Available: https://www.techpowerup.com/266287/amd-matisse-and-rome-io-controller-dies-mapped-out
[54] The Next Platform, "IBM POWER Chips Blur the Lines to Memory and Accelerators," 2018. [Online]. Available: https://www.nextplatform.com/2018/08/28/ibm-power-chips-blur-the-lines-to-memory-and-accelerators/#:~:text=The%20Centaur%20memory%20adds%20about
[55] The Register, "CXL absorbs OpenCAPI on the road to interconnect dominance," 2022. [Online]. Available: https://www.theregister.com/2022/08/02/cxl_absorbs_opencapi/
[56] D. Ustiugov, A. Daglis, J. Picorel, M. Sutherland, E. Bugnion, B. Falsafi, and D. N. Pnevmatikatos, "Design guidelines for high-performance SCM hierarchies," in Proceedings of the 2018 International Symposium on Memory Systems (MEMSYS), 2018, pp. 3–16.
[57] S. Volos, "Memory Systems and Interconnects for Scale-Out Servers," Ph.D. dissertation, EPFL, Switzerland, 2015.
[58] S. Volos, D. Jevdjic, B. Falsafi, and B. Grot, "Fat Caches for Scale-Out Servers," IEEE Micro, vol. 37, no. 2, pp. 90–103, 2017.
[59] H. Wang, C.-J. Park, G. Byun, J. H. Ahn, and N. S. Kim, "Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems," in Proceedings of the 21st IEEE Symposium on High-Performance Computer Architecture (HPCA), 2015, pp. 296–308.
[60] D. H. Yoon, J. Chang, N. Muralimanohar, and P. Ranganathan, "BOOM: Enabling mobile memory based low-power server DIMMs," in Proceedings of the 39th International Symposium on Computer Architecture (ISCA), 2012, pp. 25–36.
[61] V. Young, S. Kariyappa, and M. K. Qureshi, "Enabling Transparent Memory-Compression for Commodity Memory Systems," in Proceedings of the 25th IEEE Symposium on High-Performance Computer Architecture (HPCA), 2019, pp. 570–581.
[62] Q. Zhu, S. Venkataraman, C. Ye, and A. Chandrasekhar, "Package design challenges and optimizations in density efficient (Intel Xeon processor D) SoC," in 2016 IEEE Electrical Design of Advanced Packaging and Systems (EDAPS), 2016.