0% found this document useful (0 votes)
17 views9 pages

Ebpf, Fpga

Uploaded by

bk2838
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views9 pages

Ebpf, Fpga

Uploaded by

bk2838
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

research highlights

DOI:10.1145/ 35 43 6 6 8

hXDP: Efficient
To view the accompanying Technical Perspective,
visit doi.acm.org/10.1145/3543844 tp

Software Packet Processing


on FPGA NICs
By Marco Spaziani Brunella, Giacomo Belocchi, Marco Bonola, Salvatore Pontarelli, Giuseppe Siracusano,
Giuseppe Bianchi, Aniello Cammarano, Alessandro Palumbo, Luca Petrucci, and Roberto Bifulco

Abstract performance,17 new architectural solutions are being intro-


The network interface cards (NICs) of modern computers duced to handle these growing workloads.
are changing to adapt to faster data rates and to help with the The inclusion of programmable accelerators on the
scaling issues of general-purpose CPU technologies. Among NIC is one of the promising approaches to offload the
the ongoing innovations, the inclusion of programmable resource-intensive packet processing tasks from the CPU,
accelerators on the NIC’s data path is particularly interest- thereby saving its precious cycles for tasks that cannot be
ing, since it provides the opportunity to offload some of the performed elsewhere. Nonetheless, achieving programma-
CPU’s network packet processing tasks to the accelerator. bility for high-performance network packet processing tasks
Given the strict latency constraints of packet processing is an open research problem, with solutions exploring differ-
tasks, accelerators are often implemented leveraging plat- ent areas of the solution space that compromise in different
forms such as Field-Programmable Gate Arrays (FPGAs). ways between performance, flexibility, and ease-of-use.25 As
FPGAs can be re-programmed after deployment, to adapt a result, today’s accelerators are implemented using differ-
to changing application requirements, and can achieve ent technologies, including Application-Specific Integrated
both high throughput and low latency when implement- Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs),
ing packet processing tasks. However, they have limited and many-core System-on-Chip.
resources that may need to be shared among diverse appli- FPGA-based NICs are especially interesting since they
cations, and programming them is difficult and requires provide good performance together with a high degree
hardware design expertise. of flexibility, which enables programmers to define vir-
We present hXDP, a solution to run on FPGAs software tually any function, provided that it fits in the available
packet processing tasks described with the eBPF technol- hardware resources. Compared to other accelerators for
ogy and targeting the Linux’s eXpress Data Path. hXDP NICs, such as network processing ASICs4 or many-core
uses only a fraction of the available FPGA resources, while System-on-Chip SmartNICs,27 the FPGA NICs flexibility
matching the performance of high-end CPUs. The iterative gives also the additional benefit of supporting diverse
execution model of eBPF is not a good fit for FPGA accelera- accelerators for a wider set of applications. For instance,
tors. Nonetheless, we show that many of the instructions Microsoft employs them in datacenters for both network
of an eBPF program can be compressed, parallelized, or and machine learning tasks,8, 11 and in telecom networks,
completely removed, when targeting a purpose-built FPGA they are used also to perform radio signal processing
design, thereby significantly improving performance. tasks.20, 31, 26 Nonetheless, programming FPGAs is difficult,
We implement hXDP on an FPGA NIC and evaluate it run- often requiring the establishment of a dedicated team of
ning real-world unmodified eBPF programs. Our implemen- hardware specialists,11 which interacts with software and
tation runs at 156.25MHz and uses about 15% of the FPGA operating system developers to integrate the offloading
resources. Despite these modest requirements, it can run solution with the system. Furthermore, previous work
dynamically loaded programs, achieves the packet process- that simplifies network functions programming on FPGAs
ing throughput of a high-end CPU core, and provides a 10× assumes that a large share of the FPGA is fully dedicated
lower packet forwarding latency. to packet processing,1, 30, 28 reducing the ability to share the
FPGA with other accelerators.
Our goal is to provide a more general and easy-to-use
1. INTRODUCTION solution to program packet processing on FPGA NICs, using
Computers in datacenter and telecom operator networks little FPGA resources, while seamlessly integrating with
employ a large fraction of their CPU’s resources to pro- existing operating systems. We build toward this goal by pre-
cess network traffic coming from their network interface senting hXDP, a set of technologies that enables the efficient
cards (NICs). Enforcing security, for example, using a fire-
wall function, monitoring network-level performance, and
routing packets toward their intended destinations are just The original version of this paper is was published in the
few examples of the tasks being performed by these sys- Proceedings of the 14th USENIX Symposium on Operating
tems. With NICs’ port speeds growing beyond 100Gigabit/s Systems Design and Implementation, November 2020.
(Gbps), and given the limitations in further scaling CPUs’

92 COM MUNICATIO NS O F TH E ACM | AU GU ST 2022 | VO L . 65 | NO. 8


execution of the Linux’s eXpress Data Path (XDP)19 on Figure 1. The hXDP concept. hXDP provides an easy-to-use network
FPGA. XDP leverages the eBPF technology to provide secure accelerator that shares the FPGA NIC resources with other
programmable packet processing within the Linux kernel, application-specific accelerators.
and it is widely used by the Linux’s community in produc-
tion environments. hXDP provides full XDP support, allow- Available FPGA resources
CPU RAM
ing users to dynamically load and run their unmodified XDP ML Accelerator hXDP
programs on the FPGA. Cache memory
Data Pre-processing
The challenge is therefore to run XDP programs effec- PCIe Accelerator
tively on FPGA, since the eBPF technology is originally Host machine FPGA NIC
designed for sequential execution on a high-performance
RISC-like register machine. That is, eBPF is designed for
server CPUs with high clock frequency and the ability to
execute many of the sequential eBPF instructions per A little use of the FPGA resources is especially important,
second. Instead, FPGAs favor a widely parallel execution since it enables extra consolidation by packing different
model with clock frequencies that are 5–10× lower than application accelerators on the same FPGA.
those of high-end CPUs. As such, a straightforward imple- The choice of supporting XDP is instead motivated by
mentation of the eBPF iterative execution model on FPGA a twofold benefit brought by the technology: It readily
is likely to provide low packet forwarding performance. enables NIC offloading for already deployed XDP pro-
Furthermore, the hXDP design should implement arbitrary grams, and it provides an on-NIC programming model
XDP programs while using little hardware resources, in that is already familiar to a large community of Linux
order to keep FPGA’s resources free for other accelerators. programmers; thus, developers do not need to learn new
We address the challenge performing a detailed analy- programming paradigms, such as those introduced by
sis of the eBPF Instruction Set Architecture (ISA) and of P43 or FlowBlaze.28
the existing XDP programs, to reveal and take advantage Non-Goals: unlike previous work targeting FPGA NICs,30, 1, 28
of opportunities for optimization. First, we identify eBPF hXDP does not assume the FPGA to be dedicated to network
instructions that can be safely removed, when not running processing tasks. Because of that, hXDP adopts an iterative
in the Linux kernel context. For instance, we remove data processing model, which is in stark contrast to the pipe-
boundary checks and variable zero-ing instructions by lined processing model supported by previous work. The
providing targeted hardware support. Second, we define iterative model requires a fixed amount of resources, no
extensions to the eBPF ISA to introduce 3-operand instruc- matter the complexity of the program being implemented.
tions, new 6B load/store instructions, and a new parame- Instead, in the pipeline model the resource requirement is
terized program exit instruction. Finally, we leverage eBPF dependent on the implemented program complexity, since
instruction-level parallelism, performing a static analysis programs are effectively “unrolled” in the FPGA. In fact,
of the programs at compile time, which allows us to exe- hXDP provides dynamic runtime loading of XDP programs,
cute several eBPF instructions in parallel. We design hXDP whereas solutions such as P4->NetFPGA30 or FlowBlaze
to implement these optimizations, and to take full advan- need to often load a new FPGA bitstream when changing
tage of the on-NIC execution environment, for example, application. As such, hXDP is not designed to be faster
avoiding unnecessary PCIe transfers. Our design includes at processing packets than those designs. Instead, hXDP
the following: (i) a compiler to translate XDP programs’ aims at freeing precious CPU resources, which can then be
bytecode to the extended hXDP ISA; (ii) a self-contained dedicated to workloads that cannot run elsewhere, while
FPGA IP Core module that implements the extended ISA providing similar or better performance than the CPU.
alongside several other low-level optimizations; (iii) and Likewise, hXDP cannot be directly compared to
the toolchain required to dynamically load and interact SmartNICs dedicated to network processing. Such NICs’
with XDP programs running on the FPGA NIC. resources are largely, often exclusively, devoted to network
To evaluate hXDP, we provide an open-source implemen- packet processing. Instead, hXDP leverages only a fraction
tation for the NetFPGA.32 We test our implementation using of an FPGA resources to add packet processing with good
the XDP example programs provided by the Linux source code performance, alongside other application-specific accelera-
and using two real-world applications: a simple stateful fire- tors, which share the same chip’s resources.
wall and Facebook’s Katran load balancer. hXDP can match Requirements: given the above discussion, we can derive
the packet forwarding throughput of a multi-GHz server CPU three high-level requirements for hXDP:
core, while providing a 10× lower forwarding latency. This is
achieved despite the low clock frequency of our prototype 1. It should execute unmodified compiled XDP programs
(156MHz) and using less than 15% of the FPGA resources. and support the XDP frameworks’ toolchain, for exam-
hXDP sources are at https://axbryd.io/technology. ple, dynamic program loading and userspace access to
maps;
2. CONCEPT AND OVERVIEW 2. It should provide packet processing performance at
Our main goal is to provide the ability to run XDP programs least comparable to that of a high-end CPU core;
efficiently on FPGA NICs, while using little FPGA’s hardware 3. It should require a small amount of the FPGA’s hard-
resources (see Figure 1). ware resources.

AU G U ST 2 0 2 2 | VO L. 6 5 | N O. 8 | C OM M U N IC AT ION S OF T HE ACM 93
research highlights

Before presenting a more detailed description of the hXDP solutions the FPGA-based executor would be 2–3× slower
concept, we now give a brief background about XDP. than the CPU core.
Furthermore, existing solutions to speed-up sequential
2.1. XDP primer code execution, for example, superscalar architectures,
XDP allows programmers to inject programs at the NIC are too expensive in terms of hardware resources to be
driver level, so that such programs are executed before a adopted in this case. In fact, in a superscalar architecture the
network packet is passed to the Linux’s network stack. XDP speed-up is achieved leveraging instruction-level parallel-
programs are based on the Linux’s eBPF technology. eBPF ism at runtime. However, the complexity of the hardware
provides an in-kernel virtual machine for the sandboxed required to do so grows exponentially with the number
execution of small programs within the kernel context. In of instructions being checked for parallel execution. This
its current version, the eBPF virtual machine has 11 64b rules out re-using general-purpose soft-core designs, such
registers: r0 holds the return value from in-kernel functions as those based on RISC-V.16, 14
and programs, r1 – r5 are used to store arguments that are
passed to in-kernel functions, r6 – r9 are registers that are 2.3. hXDP Overview
preserved during function calls, and r10 stores the frame hXDP addresses the outlined challenge by taking a soft-
pointer to access the stack. The eBPF virtual machine has ware-hardware co-design approach. In particular, hXDP
a well-defined ISA composed of more than 100 fixed length provides both a compiler and the corresponding hardware
instructions (64b). Programmers usually write an eBPF pro- module. The compiler takes advantage of eBPF ISA optimi-
gram using the C language with some restrictions, which zation opportunities, leveraging hXDP’s hardware module
simplify the static verification of the program. features that are introduced to simplify the exploitation of
eBPF programs can also access kernel memory areas such opportunities. Effectively, we design a new ISA that
called maps, that is, kernel memory locations that essen- extends the eBPF ISA, specifically targeting the execution
tially resemble tables. For instance, eBPF programs can use of XDP programs.
maps to implement arrays and hash tables. An eBPF pro- The compiler optimizations perform transformations
gram can interact with map’s locations by means of pointer at the eBPF instruction level: remove unnecessary instruc-
deference, for un-structured data access, or by invoking tions; replace instructions with newly defined more con-
specific helper functions for structured data access, for cise instructions; and parallelize instruction execution. All
example, a lookup on a map configured as a hash table. the optimizations are performed at compile time, moving
Maps are especially important since they are the only mean most of the complexity to the software compiler, thereby
to keep state across program executions, and to share infor- reducing the target hardware complexity. Accordingly, the
mation with other eBPF programs and with programs run- hXDP hardware module implements an infrastructure to
ning in user space. run up to 4 instructions in parallel, implementing a Very
Long Instruction Word (VLIW) soft processor. The VLIW
2.2. Challenges soft processor does not provide any runtime program opti-
To grasp an intuitive understanding of the design challenge mization, for example, branch prediction, instruction re-
involved in supporting XDP on FPGA, we now consider the ordering. We rely entirely on the compiler to optimize XDP
example of an XDP program that implements a simple state- programs for high-performance execution, thereby freeing
ful firewall for checking the establishment of bi-directional the hardware module of complex mechanisms that would
TCP or UDP flows. A C program describing this simple fire- use more hardware resources.
wall function is compiled to 71 eBPF instructions. Ultimately, the hXDP hardware component is deployed
We can build a rough idea of the potential best-case as a self-contained IP core module to the FPGA. The mod-
speed of this function running on an FPGA-based eBPF ule can be interfaced with other processing modules if
executor, assuming that each eBPF instruction requires 1 needed, or just placed as a bump-in-the-wire between the
clock cycle to be executed, that clock cycles are not spent for
any other operation, and that the FPGA has a 156MHz clock Figure 2. An overview of the XDP workflow and architecture,
rate, which is common in FPGA NICs.32 In such a case, a including the contribution of this article.
naive FPGA implementation that implements the sequen-
tial eBPF executor would provide a maximum throughput eBPF BCC toolstack Control
Program Program
of 2.8 Million packets per second (Mpps), under optimistic
ELF
assumptions, for example, assuming no additional over- Object file Compiler
This paper contribution

User-space bpf syscall


heads due to queues management. For comparison, when bpf syscall

running on a single core of a high-end server CPU clocked


Maps
at 3.7GHz, and including also operating system overhead Verifier
JIT
and the PCIe transfer costs, the XDP simple firewall pro- Maps Helper
Sephirot Functions
gram achieves a throughput of 7.4Mpps.a Since it is often Helper Functions
hXDP
undesired or not possible to increase the FPGA clock rate; eBPF runtime
environment NIC Driver
for example, due to power constraints, in the lack of other XDP hook
Kernel NIC

a Intel Xeon E5–1630v3, Linux kernel v.5.6.4.

94 COMMUNICATIO NS O F TH E AC M | AU GU ST 2022 | VO L . 65 | NO. 8


NIC’s port and its PCIe driver toward the host system. The need two instructions with just a single instruction.
hXDP software toolchain, which includes the compiler, Load/store size. The eBPF ISA includes byte-aligned
provides all the machinery to use hXDP within a Linux memory load/store operations, with sizes of 1B, 2B, 4B, and
operating system. 8B. While these instructions are effective for most cases, we
From a programmer perspective, a compiled eBPF pro- noticed that during packet processing the use of 6B load/
gram could be therefore interchangeably executed in-kernel store can reduce the number of instructions in common
or on the FPGA (see Figure 2). cases. In fact, 6B is the size of an Ethernet MAC address,
which is a commonly accessed field. Extending the eBPF
3. hXDP COMPILER ISA with 6B load/store instructions often halves the required
We now describe the hXDP instruction-level optimizations instructions.
and the compiler design to implement them. Parameterized exit. The end of an eBPF program
Instructions reduction: the eBPF technology is designed is marked by the exit instruction. In XDP, programs set the
to enable execution within the Linux kernel, for which it r0 to a value corresponding to the desired forwarding
requires programs to include a number of extra instruc- action (that is, DROP, TX); then, when a program exits,
tions, which are then checked by the kernel’s verifier. When the framework checks the r0 register to finally perform the
targeting a dedicated eBPF executor implemented on FPGA, forwarding action. While this extension of the ISA only saves
most of such instructions could be safely removed, or they one (runtime) instruction per program, as we will see in
can be replaced by cheaper embedded hardware checks. Section 4, it will also enable more significant hardware opti-
Two relevant examples are instructions for memory bound- mizations (Figure 4).
ary checks and memory zero-ing (Figure 3). Instruction parallelism. Finally, we explore the
Boundary checks are required by the eBPF verifier to opportunity to perform parallel processing of an eBPF
ensure programs only read valid memory locations, when- program’s instructions. Since our target is to keep the
ever a pointer operation is involved. In hXDP, we can safely hardware design as simple as possible, we do not intro-
remove these instructions, implementing the check directly duce runtime mechanisms for that and instead perform
in hardware. only a static analysis of the instruction-level parallelism
Zero-ing is the process of setting a newly created vari- of eBPF programs at compile time. We therefore design
able to zero, and it is a common operation performed by pro- a custom compiler to implement the optimizations out-
grammers both for safety and for ensuring correct execution lined in this section and to transform XDP programs
of their programs. A dedicated FPGA executor can provide into a schedule of parallel instructions that can run with
hard guarantees that all relevant memory areas are zero-ed hXDP. The compiler analyzes eBPF bytecode, considering
at program start, therefore making the explicit zero-ing of both (i) the Data & Control Flow dependencies and (ii) the
variables during initialization redundant. hardware constraints of the target platform. The schedule
ISA extension: to effectively reduce the number of instruc- can be visualized as a virtually infinite set of rows, each
tions, we define an ISA that enables a more concise descrip- with multiple available spots, which need to be filled
tion of the program. Here, there are two factors at play to with instructions. The number of spots corresponds to
our advantage. First, we can extend the ISA without account- the number of execution lanes of the target executor. The
ing for constraints related to the need to support efficient compiler fits the given XDP program’s instructions in
Just-In-Time compilation. Second, our eBPF programs are the smallest number of rows, while respecting the three
part of XDP applications, and as such, we can expect packet Bernstein conditions that ensure the ability to run the
processing as the main program task. Leveraging these two selected instructions in parallel.2
facts we define a new ISA that changes in three main ways
the original eBPF ISA. 4. HARDWARE MODULE
Operands number. The first significant change has to We design hXDP as an independent IP core, which can be
deal with the inclusion of three-operand operations, in place added to a larger FPGA design as needed. Our IP core com-
of eBPF’s two-operand ones. Here, we believe that the eBPF’s prises the elements to execute all the XDP functional blocks
ISA selection of two-operand operations was mainly dictated on the NIC, including helper functions and maps.
by the assumption that an ×86 ISA would be the final com-
pilation target. Instead, using three-operand instructions 4.1. Architecture and components
allows us to implement an operation that would normally The hXDP hardware design includes five components
(see Figure 5): the Programmable Input Queue (PIQ); the
Figure 3. Examples of instructions removed by hXDP. Active Packet Selector (APS); the Sephirot processing

if (data+sizeof (*eth))>data_end) r4 = r2 Figure 4. Examples of hXDP ISA extensions.


goto EOP; r4 += 14
if r4 > r3 goto +60 <LBB0_17>
l4 = data + nh_off ; r4 = r2 r4 = r2 + 42
struct flow_ctx_table_leaf r4 = 0 r4 += 42
new_flow = {0}; *(u32 *)(r10 -4) = r4
struct flow_ctx_table_key *(u64 *)(r10 -16) = r4 return XDP_DROP; r0 = 1 exit_drop
flow_key = {0}; *(u64 *)(r10 -24) = r4 exit

AU G U ST 2 0 2 2 | VO L. 6 5 | N O. 8 | C OM M U N IC AT ION S OF T HE ACM 95
research highlights

Figure 5. The logic architecture of the hXDP hardware design.


operands. The memory access unit abstracts the access to
the different memory areas, that is, the stack, the packet
hXDP Sephirot data stored in the APS, and the maps memory. The control
Active Packet Selector Stack
Load/store
1, 2, 4, 6, 8
Register file unit provides the logic to modify the program counter, for
bytes example, to perform a jump and to invoke helper functions.

Instruction
Packet
Select. Decoder
Finally, during the commit stage the results of the IE phase

fetch
logic write Instr. Instr.
1, 2, 4, 6, 8
Decoder bytes
Exec decode are stored back to the register file, or to one of the memory
packet selection signal

read Write
1, 2, 4, 6, 8 Buffer
start Helper jump Program areas. Sephirot terminates execution when it finds an exit
bytes exit function counter
Read Packet Write Scratch call instruction, in which case it signals to the APS the packet
logic Buffer logic memory Instr.
Memory forwarding decision.
32B

Helper BUS
Prog.
32B

Structured
4.2. Pipeline optimizations
Data BUS

Input access to
Helper Functions maps
Queue We now list a subset of notable architectural optimizations
Output Maps
New packet
input frames Queue Memory Configurator
we applied to our design.
Program state self-reset. As we have seen in Section 3,
eBPF programs may perform zero-ing of the variables they
are going to use. We provide automatic reset of the stack
core; the Helper Functions Module (HF); and the Memory and of the registers at program initialization. This is an
Maps Module (MM). All the modules work in the same inexpensive feature in hardware, which improves security9
clock frequency domain. Incoming data is received by and allows us to remove any such zero-ing instruction from
the PIQ. The APS reads a new packet from the PIQ into the program.
its internal packet buffer. In doing so, the APS provides Parallel branching. The presence of branch instructions
a byte-aligned access to the packet data through a data may cause performance problems with architectures that
bus, which Sephirot uses to selectively read/write the lack branch prediction, and speculative and out-of-order
packet content. When the APS makes a packet available execution. For Sephirot, this forces a serialization of
to the Sephirot core, the execution of a loaded eBPF the branch instructions. However, in XDP programs there
program starts. Instructions are entirely executed within are often series of branches in close sequence, especially
Sephirot, using four parallel execution lanes, unless during header parsing. We enabled the parallel execu-
they call a helper function or read/write to maps. In such tion of such branches, establishing a priority ordering of
cases, the corresponding modules are accessed using the the Sephirot’s lanes. That is, all the branch instructions
helper bus and the data bus, respectively. We detail the are executed in parallel by the VLIW’s lanes. If more than
architecture’s core component, that is, the Sephirot one branch is taken, the highest priority one is selected to
eBPF processor, next. update the program counter. The compiler takes that into
Sephirot is a VLIW processor with four parallel lanes account when scheduling instructions, ordering the branch
that execute eBPF instructions. Sephirot is designed as a instructions accordingly.b
pipeline of four stages: instruction fetch (IF); instruction Early processor exit. The processor stops when an exit
decode (ID); instruction execute (IE); and commit. A pro- instruction is executed. The exit instruction is recognized
gram is stored in a dedicated instruction memory, from during the IF phase, which allows us to stop the processor
which Sephirot fetches the instructions in order. The pipeline early and save the three remaining clock cycles.
processor has another dedicated memory area to imple- This optimization improves also the performance gain
ment the program’s stack, which is 512B in size, and 11 64b obtained by extending the ISA with parameterized exit
registers stored in the register file. These memory and reg- instructions, as described in Section 3. In fact, XDP pro-
ister locations match one-to-one the eBPF virtual machine grams usually perform a move of a value to r0, to define the
specification. Sephirot begins execution when the APS forwarding action, before calling an exit. Setting a value to
has a new packet ready for processing, and it gives the pro- a register always needs to traverse the entire Sephirot
cessor start signal. pipeline. Instead, with a parameterized exit we remove the
On processor start (IF stage), a VLIW instruction is read need to assign a value to r0, since the value is embedded in
and the 4 extended eBPF instructions that compose it are a newly defined exit instruction.
statically assigned to their respective execution lanes. In
this stage, the operands of the instructions are pre-fetched 4.3 Implementation
from the register file. The remaining 3 pipeline stages are We prototyped hXDP using the NetFPGA,32 a board embed-
performed in parallel by the four execution lanes. During ID, ding 4 10Gb ports and a Xilinx Virtex7 FPGA. The hXDP
memory locations are pre-fetched, if any of the eBPF instruc- implementation uses a frame size of 32B and is clocked at
tions is a load, while at the IE stage the relevant subunit is 156.25MHz. Both settings come from the standard configu-
activated, using the related pre-fetched values. The subunits ration of the NetFPGA reference NIC design.
are the Arithmetic and Logic Unit (ALU), the Memory Access The hXDP FPGA IP core takes 9.91% of the FPGA logic
Unit, and the Control Unit. The ALU implements all the resources, 2.09% of the register resources, and 3.4% of the
operations described by the eBPF ISA, with the notable dif-
ference that it is capable of performing operations on three b This applies equally to a sequence of if...else or goto statements.

96 COMM UNICATIO NS O F THE AC M | AU GU ST 2022 | VO L . 65 | NO. 8


FPGA’s available BRAM. The APS and Sephirot are the the Linux’s eBPF JIT compiler. In this figure, we report the
components that need more logic resources, since they gain for instruction parallelization, and the additional gain
are the most complex ones. Interestingly, even somewhat from code movement, which is the gain obtained by antici-
complex helper functions, for example, a helper function pating instructions from control equivalent blocks. As we
to implement a hashmap lookup, have just a minor con- can see, the compiler is capable of providing a number of
tribution in terms of required logic, which confirms that VLIW instructions that is often 2–3× smaller than the origi-
including them in the hardware design comes at little cost nal program’s number of instructions. Notice that, by con-
while providing good performance benefits. When includ- trast, the output of the JIT compiler for ×86 usually grows the
ing the NetFPGA’s reference NIC design, that is, to build a number of instructions.c
fully functional FPGA-based NIC, the overall occupation of Instructions per cycle. We compare the paralleliza-
resources grows to 18.53%, 7.3%, and 14.63% for logic, regis- tion level obtained at compile time by hXDP, with the run-
ters, and BRAM, respectively. This is a relatively low occupa- time parallelization performed by the ×86 CPU core. Table
tion level, which enables the use of the largest share of the 2 shows that the static hXDP parallelization achieves often
FPGA for other accelerators. a parallelization level as good as the one achieved by the
complex ×86 runtime machinery.d
5. EVALUATION Hardware performance. We compare hXDP with XDP
We use a selection of the Linux’s XDP example applications running on a server machine and with the XDP offload-
and two real-world applications to perform the hXDP evalu- ing implementation provided by a SoC-based Netronome
ation. The Linux examples are described in Table 1. The NFP4000 SmartNIC. The NFP4000 has 60 programmable
real-world applications are the simple firewall described in network processing cores (called microengines), clocked
Section 2 and the Facebook’s Katran server load balancer.10
Katran is a high-performance software load balancer that c This is also due to the overhead of running on a shared executor, for ex-
translates virtual addresses to actual server addresses using ample, calling helper functions requires several instructions.
a weighted scheduling policy and providing per-flow consis- d The x86 IPC should be understood as a coarse-grained estimation of
the XDP instruction-level parallelism since, despite being isolated, the
tency. Furthermore, Katran collects several flow metrics and CPU runs also the operating system services related to the eBPF virtual
performs IPinIP packet encapsulation. machine, and its IPC is also affected by memory access latencies, which
Using these applications, we perform an evaluation of more significantly impact the IPC for high clock frequencies.
the impact of the compiler optimizations on the programs’
number of instructions and the achieved level of parallel- Figure 6. Number of VLIW instructions, and impact of optimizations
ism. Then, we evaluate the performance of our NetFPGA on its reduction.
implementation. We use the microbenchmarks also to com- 350
hXDP optimized
pare the hXDP prototype performance with a Netronome code-motion
NFP4000 SmartNIC. Although the two devices target differ- instructions-parallelization
no-zeroing
ent deployment scenarios, this can provide further insights
# VLIW instructions

250 bound-checks
on the effect of the hXDP design choices. Unfortunately, the 6B load/store
NFP4000 offers only limited eBPF support, which does not parametrized-exit
3-operands
allow us to run a complete evaluation. We further include a 150 ×86 JIT
comparison of hXDP to other FPGA NIC programming solu-
tions, before concluding the section with a brief discussion
of the evaluation results. 50

5.1. Test results


p1

p2

fo

ck

an
i

ne

al
ta

Compiler. Figure 6 shows the number of VLIW instructions


pv

o
xd

tr
xd

ew
n
_

_i

_s
_i

Ka
tu
st

q
r

fir
p
rx
te
ju

p_

produced by the compiler. We show the reduction provided


xd

e_
p_
ad

_i
ro

pl
xd
p_

tx
p_

sim

by each optimization as a stacked column and report also


p_
xd

xd

xd

the number of ×86 instructions, which result as output of

Table 1. Tested Linux XDP example programs. Table 2. Programs’ number of instructions, ×86 runtime instruction-
per-cycle (IPC) and hXDP static IPC mean rates.
Program Description
xdp1 parse pkt headers up to IP, and XDP_DROP Program # instr. ×86 IPC hXDP IPC
xdp2 parse pkt headers up to IP, and XDP_TX xdp1 61 2.20 1.70
xdp_adjust_tail receive pkt, modify pkt into ICMP pkt and XDP_TX xdp2 78 2.19 1.81
router_ipv4 parse pkt headers up to IP, look up in routing table and xdp_adjust_tail 117 2.37 2.72
forward (redirect) router_ipv4 119 2.38 2.38
rxq_info (drop) increment counter and XDP_DROP rxq_info 81 2.81 1.76
rxq_info (tx) increment counter and XDP_TX tx_ip_tunnel 283 2.24 2.83
tx_ip_tunnel parse pkt up to L4, encapsulate and XDP_TX simple_firewall 72 2.16 2.66
redirect_map output pkt from a specified interface (redirect) Katran 268 2.32 2.62

AU G U ST 2 0 2 2 | VO L. 6 5 | N O. 8 | C OM M U N IC AT ION S OF T HE ACM 97
research highlights

at 800MHz. The server machine is equipped with an Intel executors with a very high clock frequency are advantaged,
Xeon E5-1630 v3 @3.70GHz, an Intel XL710 40GbE NIC, since they can run more instructions per second. However,
and running Linux v.5.6.4 with the i40e Intel NIC drivers. notice the clock frequencies of the CPUs deployed at, for
During the tests, we use different CPU frequencies, that example, Facebook’s datacenters15 have frequencies close to
is, 1.2GHz, 2.1GHz, and 3.7GHz, to cover a larger spec- 2.1GHz, favoring many-core deployments in place of high-
trum of deployment scenarios. In fact, many deployments frequency ones. hXDP clocked at 156MHz is still capable of
favor CPUs with lower frequencies and a higher number outperforming a CPU core clocked at that frequency.
of cores.15 We use a DPDK packet generator to perform Linux examples. We finally measure the performance
throughput and latency measurements. The packet gen- of the Linux’s XDP examples listed in Table 1. These applica-
erator is capable of generating a 40Gbps throughput with tions allow us to better understand the hXDP performance
any packet size and it is connected back-to-back with the with programs of different types (see Figure 9). We can iden-
system-under-test, that is, the hXDP prototype running tify three categories of programs. First, programs that for-
on the NetFPGA, the Netronome SmartNIC, or the Linux ward packets to the NIC interfaces are faster when running
server running XDP. Delay measurements are performed on hXDP. These programs do not pass packets to the host
using hardware packet timestamping at the traffic genera- system, and thus, they can live entirely in the NIC. For such
tor’s NIC and measure the round-trip time. Unless differ- programs, hXDP usually performs at least as good as a single
ently stated, all the tests are performed using packets with ×86 core clocked at 2.1GHz. In fact, processing XDP on the
size 64B belonging to a single network flow. This is a chal- host system incurs the additional PCIe transfer overhead
lenging workload for the systems under test. to send the packet back to the NIC. Second, programs that
Applications performance. In Section 2, we mentioned always drop packets are usually faster on ×86, unless the pro-
that an optimistic upper-bound for the simple firewall cessor has a low frequency, such as 1.2GHz. Here, it should
performance would have been 2.8Mpps. When using be noted that such programs are rather uncommon, for
hXDP with all the compiler and hardware optimizations example, programs used to gather network traffic statistics
described in this paper, the same application achieves a receiving packets from a network tap. Finally, programs that
throughput of 6.53Mpps, as shown in Figure 7. This is only are long, for example, tx_ip_tunnel has 283 instructions,
12% slower than the same application running on a pow- are faster on ×86. Like we noticed in the case of Katran, with
erful ×86 CPU core clocked at 3.7GHz and 55% faster than longer programs the hXDP’s implementation low clock fre-
the same CPU core clocked at 2.1GHz. In terms of latency, quency can become problematic.
hXDP provides about 10× lower packet processing latency, 5.1.1. Comparison to other FPGA solutions. hXDP
for all packet sizes (see Figure 8). This is the case since provides a more flexible programming model than previous
hXDP avoids crossing the PCIe bus and has no software- work for FPGA NIC programming. However, in some cases,
related overheads. We omit latency results for the remain- simpler network functions implemented with hXDP could
ing applications, since they are not significantly different. be also implemented using other programming approaches
While we are unable to run the simple firewall applica- for FPGA NICs, while keeping functional equivalence. One
tion using the Netronome’s eBPF implementation, Figure such example is the simple firewall presented in this article,
8 shows also the forwarding latency of the Netronome which is supported also by FlowBlaze.28
NFP4000 (nfp label) when programmed with an XDP pro- Throughput. Leaving aside the cost of reimplementing the
gram that only performs packet forwarding. Even in this function using the FlowBlaze abstraction, we can gener-
case, we can see that hXDP provides a lower forwarding ally expect hXDP to be slower than FlowBlaze at process-
latency, especially for packets of smaller sizes. ing packets. In fact, in the simple firewall case, FlowBlaze
When measuring Katran we find that hXDP is instead can forward about 60Mpps vs. 6.5Mpps of hXDP. The
38% slower than the ×86 core at 3.7GHz and only 8% faster FlowBlaze design is clocked at 156MHz, like hXDP, and
than the same core clocked at 2.1GHz. The reason for this its better performance is due to the high level of special-
relatively worse hXDP performance is the overall program ization. FlowBlaze is optimized to process only packet
length. Katran’s program has many instructions, as such headers, using statically defined functions. This requires
loading a new bitstream on the FPGA when the function
Figure 7. Throughput for real-world applications. hXDP is faster than
Figure 8. Packet forwarding latency for different packet sizes.
a high-end CPU core clocked at over 2GHz.
50 ×[email protected]
×[email protected]
7 nfp
×[email protected] hXDP
6 ×[email protected] 40
latency [us]

hXDP
5 30
Mpps

4
20
3
2 10
1
0
0 64 256 512 1518
simple firewall katran packet size [bytes]

98 COMM UNICATIO NS O F THE ACM | AU GU ST 2022 | VO L . 65 | NO. 8


Figure 9. Throughput of Linux’s XDP programs. hXDP is faster for
should be clear that the memory area dedicated to maps
programs that perform TX or redirection. reduces the memory resources available to other accelera-
tors on the FPGA. As such, the memory requirements of
30
×[email protected]
×[email protected]
XDP programs, which are anyway known at compile time,
×[email protected] are another important factor to consider when taking pro-
hXDP
20 gram offloading decisions.
Mpps

10
6. RELATED WORK
0 NIC programming. AccelNet11 is a match-action offload-
ing engine used in large cloud datacenters to offload vir-
p_ er )
p_ _in 4

xd rec tx)

ju ap

l
p1

p_ op)

el
ai
xd out (tx
xd rxq _ipv

nn
_t

xd
p_ t_m
di fo (

tx (dr
2

tu
st
p_ dp

tual switching and firewalling functions, implemented


p_ o
x

nf
ad

_i
i
q_
r

re

on top of the Catapult FPGA NIC.6 FlexNIC23 is a design


rx
xd

p_

xd
xd

based on the RMT4 architecture, which provides a flexible


network DMA interface used by operating systems and
changes, but it enables the system to achieve the reported applications to offload stateless packet parsing and clas-
high performance.e Conversely, hXDP has to pay a sig- sification. P4->NetFPGA1 and P4FPGA30 provide high-level
nificant cost to provide full XDP compatibility, including synthesis from the P43 domain-specific language to an
dynamic network function programmability and process- FPGA NIC platform. FlowBlaze28 implements a finite-state
ing of both packet headers and payloads. machine abstraction using match-action tables on an
Hardware resources. A second important difference FPGA NIC, to implement simple but high-performance
is the amount of hardware resources required by the two network functions. Emu29 uses high-level synthesis to
approaches. hXDP needs about 18% of the NetFPGA logic implement functions described in C# on the NetFPGA.
resources, independently from the particular network Compared to these works, instead of match-action or
function being implemented. Conversely, FlowBlaze higher-level abstractions, hXDP leverages abstractions
implements a packet processing pipeline, with each of defined by the Linux’s kernel and implements network
the pipeline’s stage requiring about 16% of the NetFPGA’s functions described using the eBPF ISA.
logic resources. For example, the simple firewall func- The Netronome SmartNICs implement a limited form of
tion implementation requires two FlowBlaze pipeline’s eBPF offloading.24 Unlike hXDP that implements a solution
stages. More complex functions, such as a load balancer, specifically targeted to XDP programs, the Netronome solu-
may require 4 or 5 stages, depending on the implemented tion is added on top of their network processor as an after-
load-balancing logic.12 thought, and therefore, it is not specialized for the execution
In summary, the FlowBlaze’s pipeline leverages hard- of XDP programs.
ware parallelism to achieve high performance. However, NIC hardware. Previous work presenting VLIW core
it has the disadvantage of often requiring more hardware designs for FPGAs did not focus on network process-
resources than a sequential executor, like the one imple- ing.21, 22 Brunella5 is the closest to hXDP. It employs a non-
mented by hXDP. Because of that, hXDP is especially help- specialized MIPS-based ISA and a VLIW architecture for
ful in scenarios where a small amount of FPGA resources is packet processing. hXDP has an ISA design specifically
available, for example, when sharing the FPGA among dif- targeted to network processing using the XDP abstrac-
ferent accelerators. tions. Forencich13 presents an open source 100Gbps FPGA
NIC design. hXDP can be integrated in such design to
5.2. Discussion implement an open source FPGA NIC with XDP offloading
Suitable applications. hXDP can run XDP programs with support.
no modifications; however, the results presented in this
section show that hXDP is especially suitable for pro- 7. CONCLUSION
grams that can process packets entirely on the NIC, and This paper presented the design and implementation
which are no more than a few 10s of VLIW instructions of hXDP, a system to run Linux’s XDP programs on FPGA
long. This is a common observation made also for other NICs. hXDP can run unmodified XDP programs on FPGA
offloading solutions.18 matching the performance of a high-end ×86 CPU core
FPGA sharing. At the same time, hXDP succeeds in clocked at more than 2GHz. Designing and implement-
using little FPGA resources, leaving space for other accel- ing hXDP required a significant research and engineer-
erators. For instance, we could co-locate on the same ing effort, which involved the design of a processor and
FPGA several instances of the VLDA accelerator design its compiler, and while we believe that the performance
for neural networks presented in Chen.7 Here, one impor- results for a design running at 156MHz are already remark-
tant note is about the use of memory resources (BRAM). able, we also identified several areas for future improve-
Some XDP programs may need larger map memories. It ments. In fact, we consider hXDP a starting point and a tool
to design future interfaces between operating systems/
e FlowBlaze allows the programmer to perform some runtime reconfig-
applications and NICs/accelerators. To foster work in this
uration of the functions, however this a limited feature. For instance, direction, we make our implementations available to the
packet parsers are statically defined. research community.

AU G U ST 2 0 2 2 | VO L. 6 5 | N O. 8 | C OM M U N IC AT ION S OF T HE ACM 99
research highlights

Acknowledgments 23. Kaufmann, A., Peter, S., Anderson, T., Design and Implementation (NSDI 19).
The research leading to these results has received fund- Krishnamurthy, A. Flexnic: Rethinking Boston, MA, USENIX Association, 2019,
network DMA. In 15th Workshop on Hot 531–548
ing from the ECSEL Joint Undertaking in collaboration Topics in Operating Systems (HotOS 29. Sultana, N., Galea, S., Greaves, D.,
with the European Union’s H2020 Framework Programme XV), Kartause Ittingen, Switzerland, Wojcik, M., Shipton, J., Clegg, R., Mai, L.,
USENIX Association, 2015. Bressana, P., Soulé, R., Mortier, R.,
(H2020/2014–2020) and National Authorities, under grant 24. Kicinski, J., Viljoen, N. eBPF hardware Costa, P., Pietzuch, P., Crowcroft, J.,
agreement n. 876967 (Project “BRAINE”). offload to SmartNICs: cls bpf and Moore, A.W., Zilberman, N. Emu: Rapid
XDP. Proc. Netdev 1, 2016. prototyping of networking services.
25. Michel, O., Bifulco, R., Rétvári, G., In 2017 USENIX Annual Technical
Schmid, S. The programmable data Conference (USENIX ATC 17), Santa
References USENIX Symposium on Networked plane: Abstractions, architectures, Clara, CA, USENIX Association, 2017,
1. P4-NetFPGA. https://github.com/ Systems Design and Implementation algorithms, and applications. 54, 4 459–471.
NetFPGA/P4-NetFPGA-public/wiki. (NSDI 18), Renton, WA, USENIX (2021). 30. Wang, H., Soulé, R., Dang, H.T.,
2. Bernstein, A.J. Analysis of programs Association, 2018, 51–66. 26. NEC. Building an Open vRAN Lee, K.S., Shrivastav, V., Foster, N.,
for parallel processing. IEEE Trans. 12. FlowBlaze. Repository with FlowBlaze Ecosystem White Paper. 2020. https:// Weatherspoon, H. P4fpga: A rapid
Electron. Comput EC-15, 5 (1966), source code and additional material. www.nec.com/en/global/solutions/5g/ prototyping framework for p4. In
757–763. http://axbryd.com/FlowBlaze.html. index.html. Proceedings of the Symposium on
3. Bosshart, P., Daly, D., Gibb, G., 13. Forencich, A., Snoeren, A.C., Porter, G., 27. Netronome. AgilioTM CX 2x40GbE SDN Research, SOSR ’17. New York,
Izzard, M., McKeown, N., Rexford, J., Papen, G. Corundum: An open- intelligent server adapter. https:// NY, USA, Association for Computing
Schlesinger, C., Talayco, D., Vahdat, A., source 100-Gbps NIC. In 28th IEEE www.netronome.com/media/ Machinery, 2017, 122–135.
Varghese, G., Walker, D. P4: International Symposium on Field- redactor_files/PB_Agilio_ 31. Xilinx. 5G Wireless Solutions Powered
Programming protocol-independent Programmable Custom Computing CX_2x40GbE.pdf. by Xilinx. 2020. https://www.xilinx.com/
packet processors. SIGCOMM Machines, 2020. 28. Pontarelli, S., Bifulco, R., Bonola, M., applications/megatrends/5g.html
Comput. Commun. Rev 44, 3 (2014), 14. Gautschi, M., Schiavone, P.D., Cascone, C., Spaziani, M., Bruschi, V., 32. Zilberman, N., Audzevich, Y.,
87–95. Traber, A., Loi, I., Pullini, A., Rossi, D., Sanvito, D., Siracusano, G., Capone, A., Covington, G.A., Moore, A.W. NetFPGA
4. Bosshart, P., Gibb, G., Kim, H.-S., Flamand, E., Gürkaynak, F.K., Benini, L.. Honda, M., Huici, F., Siracusano, G. SUME: Toward 100 Gbps as Research
Varghese, G., McKeown, N., Izzard, M., Near-threshold risc-v core with dsp Flowblaze: Stateful packet processing Commodity. IEEE Micro ’14 34, 5
Mujica, F., Horowitz, M.. Forwarding extensions for scalable iot endpoint in hardware. In 16th USENIX (2014), 32–41.
metamorphosis: Fast programmable devices. IEEE Trans. Very Large Symposium on Networked Systems
match-action processing in hardware Scale Integr. VLSI Syst 25, 10 (2017),
for sdn. In Proceedings of the ACM 2700–2713.
SIGCOMM 2013 Conference on 15. Hazelwood, K, Bird, S., Brooks, D., Marco Spaziani Brunella and Giacomo Giuseppe Bianchi and Luca Petrucci
SIGCOMM, SIGCOMM ‘13 (New Chintala, S., Diril, U., Dzhulgakov, D., Belocchi ({spaziani, belocchi}@axbryd.com), ({giuseppe.bianchi, luca.petrucci@}@
York, NY, USA, 2013). Association for Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, Axbryd/University of Rome Tor Vergata, uniroma2.it), University of Rome Tor
Computing Machinery, 99–110. J., Lee, K., Lu, J., Noordhuis, P., Rome, Italy. Vergata, Rome, Italy.
5. Brunella, M.S., Pontarelli, S., Bonola, M., Smelyanskiy, M., Xiong, L., Wang, X.
Bianchi, G. V-PMP: A VLIW packet Applied machine learning at Facebook: Salvatore Pontarelli (salvatore@axbryd. Aniello Cammarano (cammarano@
manipulator processor. In 2018 a datacenter infrastructure perspective. com), Axbryd/University of Rome La axbryd.com), Axbryd, Rome, Italy.
European Conference on Networks In High Performance Computer Sapienza, Rome, Italy.
and Communications (EuCNC), IEEE, Architecture (HPCA). IEEE, 2018. Alessandro Palumbo (palumbo@ing.
2018, 1–9. 16. Heinz, C., Lavan, Y., Hofmann, J., Marco Bonola ([email protected]), uniroma2.it), University of Rome Tor
6. Caulfield, A.M., Chung, E.S., Putnam, A., Koch, A. A catalog and in-hardware Axbryd/CNIT, Rome, Italy. Vergata, Rome, Italy.
Angepat, H., Fowers, J., Haselman, M., evaluation of open-source drop-in
Heil, S., Humphrey, M., Kaur, P., Kim, J., compatible risc-v softcore processors. Giuseppe Siracusano and Roberto
Lo, D., Massengill, T., Ovtcharov, K., In 2019 International Conference Bifulco ({giuseppe.siracusano, roberto.
Papamichael, M., Woods, L., Lanka, S., on ReConFigurable Computing and bifulco}@neclab.eu), NEC Laboratories
Chiou, D., Burger, D. A cloud-scale FPGAs (ReConFig), IEEE, 2019, 1–8. Europe, Heidelberg, Germany.
acceleration architecture. In 2016 17. Hennessy, J.L., Patterson, D.A. A new
49th Annual IEEE/ACM International golden age for computer architecture.
Symposium on Microarchitecture Commun. ACM 62, 2 (2019), 48–60.
(MICRO), 2016, 1–13. 18. Hohlfeld, O., Krude, J., Reelfs, J.H., This work is licensed under a Creative Commons
7. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Rüth, J., Wehrle, K. Demystifying Attribution-NonCommercial-ShareAlike International 4.0 License.
Yan, E., Shen, H., Cowan, M., Wang, L., the performance of XDP BPF. In
Hu, Y., Ceze, L., Guestrin, C., 2019 IEEE Conference on Network
Krishnamurthy, A. TVM: An automated Softwarization (NetSoft), IEEE, 2019,
end-to-end optimizing compiler 208–212.
for deep learning. In 13th USENIX 19. Høiland-Jørgensen, T., Brouer, J.D.,
Symposium on Operating Systems Borkmann, D., Fastabend, J., Herbert, T.,
Design and Implementation (OSDI Ahern, D., Miller, D. The express data
18), USENIX Association, Carlsbad, path: Fast programmable packet
CA, 2018, 578–594. processing in the operating system
8. Chiou, D. The microsoft catapult kernel. In Proceedings of the 14th
project. In 2017 IEEE International International Conference on Emerging
Symposium on Workload Networking EXperiments and
Characterization (IISWC), IEEE, 2017, Technologies, CoNEXT ’18. New York,
124–124. NY, USA, Association for Computing
9. Dumitru, M.V., Dumitrescu, D., Raiciu, C.. Machinery, 2018, 54–66.
Can we exploit buggy p4 programs? In 20. Intel. 5g wireless. 2020 https://
Proceedings of the Symposium on SDN www.intel.com/content/www/
Research, SOSR ’20, Association for us/en/communications/products/
Computing Machinery, New York, NY, programmable/applications/
USA, 2020, 62–68. baseband.html.
10. Facebook. Facebook. Katran source 21. Iseli, C., Sanchez, E. Spyder: A
code repository, 2018. https: //github. reconfigurable vliw processor using
com/facebookincubator/katran. FPGAs. In [1993] Proceedings IEEE
11. Firestone, D., Putnam, A., Mundkur, S., Workshop on FPGAs for Custom
Chiou, D., Dabagh, A., Andrewartha, M., Computing Machines. IEEE, 1993,
Angepat, H., Bhanu, V., Caulfield, A., 17–24.
Chung, E., Chandrappa, H.K., 22. Jones, A.K., Hoare, R., Kusic, D.,
Chaturmohta, S., Humphrey, M., Lavier, Fazekas, J., Foster, J. An fpga-
J., Lam, N., Liu, F., Ovtcharov, K., based vliw processor with cust om
Padhye, J., Popuri, G., Raindel, S., hardware execution. In Proceedings
Sapre, T., Shaw, M., Silva, G., Sivakumar, of the 2005 ACM/SIGDA 13th
M., Srivastava, N., Verma, A., International Symposium on
Zuhair, Q., Bansal, D., Burger, D., Field-Programmable Gate Arrays, Watch the authors discuss
Vaid, K., Maltz, D.A., Greenberg, FPGA ’05. New York, NY, USA, this work in the exclusive
A. Azure accelerated networking: Association for Computing Communications video.
Smartnics in the public cloud. In 15th Machinery, 2005, 107–117. https://cacm.acm.org/videos/hxdp

100 CO M MUNICATIO NS O F TH E ACM | AU GU ST 2022 | VO L . 65 | NO. 8

You might also like