AutoDSE: Enabling Software Programmers To Design Efficient FPGA Accelerators
ATEFEH SOHRABIZADEH∗ , Computer Science Department, University of California, Los Angeles, USA
CODY HAO YU∗ , Computer Science Department, University of California, Los Angeles, USA
MIN GAO, Falcon-computing Inc., USA
JASON CONG, Computer Science Department, University of California, Los Angeles, USA
Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard
to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator
designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve the optimal performance.
While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability
of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated
DSE framework—𝐴𝑢𝑡𝑜𝐷𝑆𝐸— that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point.
𝐴𝑢𝑡𝑜𝐷𝑆𝐸 detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental
results show that 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 is able to identify the design point that achieves, on the geometric mean, 19.9× speedup over one CPU core
for Machsuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in Xilinx Vitis libraries, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸
can reduce their optimization pragmas by 26.38× while achieving similar performance. With less than one optimization pragma per
design on average, we are making progress towards democratizing customizable computing by enabling software programmers to
design efficient FPGA accelerators.
Additional Key Words and Phrases: Bottleneck Optimizer, Customized Computing, HLS, Merlin
1 Introduction
Due to the rapid growth of datasets in recent years, the demand for scalable, high-performance computing continues
to increase. However, the breakdown of Dennard scaling [19] has made energy efficiency an important concern
in datacenters, and has spawned exploration into using accelerators such as field-programmable gate arrays (FPGAs)
to alleviate power consumption. For example, Microsoft has adopted CPU-FPGA systems in its datacenter to help
accelerate the Bing search engine [35]; Amazon introduced the F1 instance [2], a compute instance equipped with FPGA
boards, in its commercial Elastic Compute Cloud (EC2).
Although the interest in customized computing using FPGAs is growing, they are more difficult to program compared
to CPUs and GPUs because the traditional register-transfer level (RTL) programming model is more like circuit design
rather than software implementation. To improve the programmability, high-level synthesis (HLS) [13, 56] has attracted
a large amount of attention over the past decades. Currently, both FPGA vendors have their commercial HLS products—
Xilinx Vitis [50] and Intel FPGA SDK for OpenCL [26]. With the help of HLS, one can program the FPGA more easily by
controlling how the design should be synthesized from a high-level view. The main enabler of this feature is the ability
to iteratively re-optimize the micro-architecture quickly just by inserting synthesis directives in the form of pragmas
∗ Both authors contributed equally to this research.
Authors’ addresses: Atefeh Sohrabizadeh, [email protected], Computer Science Department, University of California, Los Angeles, Los Angeles,
CA, USA; Cody Hao Yu, [email protected], Computer Science Department, University of California, Los Angeles, Los Angeles, CA, USA; Min Gao,
Falcon-computing Inc., Los Angeles, USA, [email protected]; Jason Cong, Computer Science Department, University of California, Los
Angeles, Los Angeles, CA, USA, [email protected].
instead of re-writing the low-level behavioral description of the design. Because of the reduced code development cycle
and the shorter turn-around times, HLS has been rapidly adopted by both academia and industry [3, 20, 29, 42, 46, 62].
In fact, Code 1 shows an intuitive HLS C implementation of one forward path of a Convolutional Neural Network (CNN)
on Xilinx FPGAs. Xilinx Vitis generates about 5800 lines of RTL kernel from Code 1 with the same functionality. As a
result, it is much more convenient and productive for designers to evaluate and improve their designs in HLS C/C++.
Even though HLS is suitable for hardware experts to quickly implement an optimal design, it is not friendly for most
of the general software designers who have limited FPGA domain knowledge. Since the hardware architecture inferred
from a syntactic C implementation could be ambiguous, current commercial HLS tools usually generate architecture
structures according to specific HLS C/C++ code patterns. As a result, even though it was shown in [13] that the HLS
tool is capable of generating FPGA designs with a performance as competitive as the one in RTL, not every C program
gives a good performance and designers must manually reconstruct the HLS C/C++ kernel with specific code patterns
and hardware specific pragmas to achieve high performance. As a matter of fact, the generated FPGA accelerator from
Code 1 is 80× slower than a single-thread CPU. However, the optimized code (shown in Code 3 in Appendix A.1) is able
to achieve around 7,041× speedup after we analyze and resolve several performance bottlenecks listed in Table 1 by
applying code transformations and inserting 28 pragmas.
It turns out that the bottlenecks presented in Table 1 occur for most C/C++ programs developed by software
programmers, and similar optimizations have to be repeated for each new application, which makes HLS C/C++ design
not scalable. In general, there are three levels of optimization that one needs to employ to get to a high-performance FPGA
design. The first level increases data reuse or reduces/removes data dependencies by loop transformations,
which is common in CPU performance optimization as well (e.g., for cache locality); therefore, it is well understood
by software programmers and we expect them to apply such transformations manually without any problems. The
second level is required to enable repetitive architectural optimizations that most of the designs benefit from, such as
memory burst and memory coalescing, as mentioned in reasons 1-2 in Table 1. Fortunately, the recently developed
Merlin Compiler¹ [11, 12, 21] from Falcon Computing Solutions [21], which was acquired by Xilinx in late 2020 [48],
can automatically take care of this kind of code transformations.
The final and the most critical level deals with FPGA-specific architectural optimizations, detailed in reasons 3-5 in
Table 1, that vary from application to application. Although the Merlin Compiler also helps alleviate this problem to
some extent by introducing a few high-level optimization pragmas and applying source-to-source code transformation
to enable them, these optimizations are much more difficult for software programmers to learn and apply effectively.
More specifically, choosing the right part of the program to optimize, deciding the type of optimization and the pragmas
to apply for enabling it, and tuning the pragmas to reach the design with the highest quality all complicate this level.
Apparently, the requirement of mastering all three levels of optimizations makes the bar for general software
programmers to use FPGAs extremely high. Hence, general software programmers tend to lean toward other popular
accelerators, such as power-hungry GPUs or high-cost ASICs, instead of FPGAs. These obstacles
consequently create huge barriers to the adoption of FPGAs in datacenters, the expansion of the FPGA user community,
and the advancement of FPGA technology. One possible solution is to apply automated micro-architecture optimization,
so that everyone with a decent knowledge of programming is able to try customized computing with minimal effort. In
order to free accelerator designers from the iterations of HLS design improvement, automated design space exploration
(DSE) for HLS attracts more and more attention. However, existing DSE methods face the following challenges:
Challenge 1: The large solution space: The solution space grows exponentially by the number of candidate
pragmas. In fact, only applying pipeline, unroll, and array partition pragmas to Code 1 produces $10^{20}$ design points.
This huge number of combinations creates a serious impediment to exploring the whole design space.
Challenge 2: Non-monotonic effect of design parameters on performance/area: As pointed out by Nigam
et al. [33], we cannot assume that an individual design parameter will affect the performance/area in a smooth and/or
monotonic way.
Challenge 3: Correlation of different characteristics of a design: When different pragmas are employed to-
gether in a design, they do not affect only one characteristic of the design. We will use the convolution part of Code 1
as an example. If we apply fine-grained (fg) pipeline to w loop and parallelize the loop with a factor of 2, it results in a
loop with initiation interval (II) of 2 synthesized by Vivado HLS [45]. However, when we change the parallel factor to
4, the HLS tool increases the II to 3 to optimize resource consumption by reusing some of the logic units instead of
doubling the resource utilization. The analytical models usually fail to capture these cases. Furthermore, pipelining the j
loop is part of the best design configuration; however, it does not improve the performance until after the fg pipelining
is applied on the w loop. This suggests that the order of applying the pragmas is crucial in designing the explorer.
¹The Merlin Compiler will be open-sourced in the near future after passing Xilinx's legal review.
Challenge 4: Implementation disparity of HLS tools: The HLS tools from different vendors employ different
implementation strategies. Even within the same vendor, the optimization and implementation rules keep changing
across different versions. For example, the past Xilinx SDAccel versions consistently utilize registers to implement array
partitions with small sizes to save BRAMs. However, the latest ones use dual-port BRAMs for implementation to support
two reads in one cycle for achieving full pipelining, or II = 1, even if the array size is small. Such implementation details
are hard to capture and maintain in analytical models and make it difficult to port an analytical model built on a specific
tool to the other.
Challenge 5: Long synthesis time of HLS tools: HLS tools usually take 5-30 minutes to generate RTL and
estimate the performance—and even longer if the design has a high performance. This emphasizes the need for a DSE
that can find the Pareto-optimal design points in fewer iterations.
In this paper, as our first step to lowering the bar for general software programmers to make the FPGA programming
universally accessible, we focus on automating the final level of optimization. To solve Challenges 2 to 4 mentioned
above, instead of developing an analytical model, we treat the HLS tool as a black-box. Challenges 1 and 5 imply that we
need to explore the solution space intelligently. For that, we first apply the coordinate descent with the finite difference
method to guide the explorer. However, we show that the general application-oblivious approaches fail to perform
well for the HLS DSE problem. As a result, we present the 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 framework that adapts a bottleneck-guided
coordinate optimizer to systematically search for better configurations. We incorporate a flexible list-comprehension
syntax to represent a grid design space with all invalid points marked. In addition, we also partition the design space
systematically to address the local optimum problem caused by Challenge 2.
In summary, this paper makes the following contributions:
• We propose two strategies to guide DSE. One adapts the commonly used coordinate descent with the finite
difference method and the other exploits a bottleneck-guided coordinate optimizer.
• We incorporate list-comprehension to represent a smooth, grid design space with all invalid points marked.
• We develop the 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 framework on top of the Merlin Compiler to automatically perform DSE using the
bottleneck optimizer to systematically close in on high-QoR design points.
• To the best of our knowledge, we are the first ones to evaluate our tool using the Xilinx optimized vision
library [49]. Evaluation results indicate that 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 is able to achieve the same performance, yet with 26.38×
reduction of their optimization pragmas resulting in less than one required optimization pragma per kernel, on
the geometric mean.
• We evaluate 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 on 11 computational kernels from Machsuite [36] and Rodinia [8] benchmarks and one
convolution layer of Alexnet [28], showing that we are able to achieve, on the geometric mean, 19.9× speedup
over a single-thread CPU—only a 7% performance gap compared to manual designs.
2 Problem Formulation
Our goal is to expedite the hardware design by automating its exploration process. In general, there are two types of
pragmas (using Vivado HLS as an example) that are applied to a program. One type is the non-optimization pragmas,
which are relatively easy for software programmers to learn and apply. The other type is optimization pragmas, including
PIPELINE and UNROLL pragmas. These pragmas require knowledge of FPGA devices and micro-architecture optimization
experience, which are usually much more challenging for a software programmer to learn and master as explained in
Section 1. The goal of this research is to minimize or eliminate the need to apply optimization pragmas manually and
let 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 insert them automatically. More formally, we formulate the HLS DSE problem as the following:
Problem 1: Identify Design Space. Given a C program P as the FPGA accelerator kernel, construct a design space
$R^{K}_{P}$ with 𝐾 parameters that contains possible combinations of HLS pragmas for P as design configurations.
Problem 2: Find the Optimal Configuration. Given a C program P, we would like to insert a minimal number of
optimization pragmas manually to get a new program P′ as the FPGA accelerator kernel along with its design space
set $R^{K}_{P'}$ identified in Problem 1, and we let the DSE tool insert the rest of the pragmas automatically. More
specifically, having a vendor HLS tool H that estimates the execution cycle $Cycle(H, P')$ and the resource utilization
$Util(H, P')$ of the given P′ as a black-box evaluation function, the DSE must find a configuration $\theta \in R^{K}_{P'}$ within a given
search time limit so that the generated design P′(𝜃) can fit in the FPGA and the execution cycle is minimized.
Formally, our objective is:

$\theta^{*} = \arg\min_{\theta \in R^{K}_{P'}} Cycle(H, P'(\theta)) \quad \text{s.t.} \quad u_{r}(H, P'(\theta)) \le 1$ for every FPGA resource $r$,

where $u_r$ denotes the utilization of resource $r$ (i.e., the fit constraint).
3 Related Work
There are a number of previous works that propose an automated framework to explore the HLS design space, and they
can be summarized in two categories: model-based and model-free techniques.
Several model-based works build a simple analytical model for performance and area estimation. However, they assume that the performance/area
changes monotonically by modifying an individual design parameter, which is not a valid assumption as we explained
in Challenge 2 of Section 1. To increase the accuracy of the estimation model, a number of other studies restrict the
target application to those that have a well-defined accelerator micro-architecture template [9, 14, 15, 37, 42, 55], a
specific application [52, 58], or a particular computation pattern [10, 27, 34]; hence, they lose generality.
To the same end, there are other studies that build the predictive model using learning algorithms. They train a
model by iteratively synthesizing a set of sample designs and updating the model until it gets to the desired accuracy.
Later on, they use the trained model for estimating the quality of design instead of invocations of the HLS tool. To learn
the behavior of the HLS tool, these works adapt supervised learning algorithms to better capture uncertainty of HLS
tools [27, 30, 31, 40, 53, 60]. While this technique increases the accuracy of the model, it is still hard to port the model
to another HLS tool in a different vendor or version. Often by changing the HLS tool or the target FPGA, new samples
should be collected which can be an expensive step. After that, for each of them, a new model should be trained to
include the new dataset.
Table 2 summarizes the Merlin pragmas along with the architecture structures they enable. Note that the fg option in the fine-grained pipeline mode refers to the code
transformation that tries to apply fine-grained pipelining to a loop nest by fully unrolling all its sub-loops; whereas, the
cg option in the coarse-grained pipelining transforms the code to enable double buffering. Based on these user-specified
pragmas, the Merlin Compiler performs source-to-source code transformation and automatically generates the related
HLS pragmas such as PIPELINE, UNROLL, and ARRAY_PARTITION to apply the corresponding architecture optimization.
To reduce the size of the solution space, we chose to utilize the Merlin Compiler as the backend of our tool. Since the
number of pragmas required by the Merlin Compiler is much smaller (as it performs source level code reconstruction
and generates most of the HLS required pragmas), it defines a more compact design space, which makes it a better fit for
developing a DSE, as shown in [15, 54]. For instance, Code 2 shows the CNN kernel with Merlin pragmas. By inserting
only four lines of pragmas and no further manual code transformation, the Merlin Compiler is able to transform Code 2
to a high-performance HLS kernel with the same performance as the manually optimized design written in HLS C
which has 28 pragmas as mentioned in Section 1.
The Merlin Compiler, by default, applies code transformations to address the bottlenecks 1 and 2 listed in Table 1 and
provides high-level optimization pragmas for the rest of them. For example, instead of rewriting Code 1 to test whether
double buffering would help the performance as described in reason 3 in Table 1, we just need to use the cg PIPELINE
pragma and the Merlin Compiler will rewrite the code to satisfy it. As a result, our focus in this work is on finding the
best location of each of these high-level pragmas and tuning them automatically; hence, we can address reasons 3-5 in
Table 1 as well by enabling the architectural optimizations along with the best pipelining and parallelization attributes.
As a result, our solution to Problem 1 is defined as in Table 3. We identify the design space for each kernel by
analyzing the kernel abstract syntax tree (AST) to gather loop trip-counts, available bit-widths, etc. The rules we enforce
in building this design space are listed in Section 5.4.
Now that we have defined the design space in Table 3 for Problem 1, we focus on Problem 2 in the remainder of
this paper. Although, to some extent, Merlin pragmas alleviate the manual code reconstruction overhead, a designer
still has to manually search for the best option for each pragma, including position, type, and factors. In fact, choices for
the CNN design in Code 1 contain four DRAM buffers and thirteen loops, which result in $\sim 10^{16}$ design configurations.
The large design space motivates us to develop an efficient approach to find the best configuration.
5 AutoDSE Methodology
In this section, we first examine the efficiency of application-oblivious heuristics, which were considered in our initial
study, in Section 5.1. As we will discuss, the main drawback of these heuristics for the HLS DSE problem is that
they have no knowledge of the semantics of the program parameters. This problem can prolong
the DSE process, since the explorer may waste a lot of time on parameters with no impact on the results at that stage
of optimization. As a result, in Section 5.2, we present a bottleneck-guided coordinate optimizer that can mimic an
expert’s optimization method and generate high-QoR design points in fewer number of iterations. We propose several
optimizations in Sections 5.3 to 5.5 to further improve the performance of our framework.
[Fig. 1. Overview of the 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 framework. A C kernel is fed to the Design Space Generator and the Design Space Partitioner (with a space profiler and seed generation that select representative partitions); the Explorer runs the bottleneck optimization algorithm, sending design configurations through a waiting queue to the Evaluator and querying the Result Database; the output is the C kernel with the optimized design configuration.]
[Figure: normalized speedup over the manual design (y-axis) versus DSE time in hours (x-axis) for each benchmark kernel.]
Fig. 2. Speedup Over the Manual Design Using S2FA [54]
Coordinate descent is another well-known iterative optimization algorithm for finding a locally minimum point. It is
based on the idea that one can minimize a multi-variable function by minimizing it along one direction at a time and
solving single-variable optimization problems. At each iteration, we generate a set of candidates, Θ𝑐𝑎𝑛𝑑 , as the input to
the algorithm. Each candidate is generated by advancing the value of each parameter in the current configuration by
one step. Formally, the 𝑐-th candidate generated from design point 𝜃𝑖 is:

$\theta_{c} = (p_{1}, \dots, p_{c} + 1, \dots, p_{K})$
where 𝐾 is the total number of parameters, 𝑝𝑐 is the value of 𝑐-th parameter in 𝜃𝑖 , 𝑝𝑐 + 1 denotes the next value of this
parameter (the next numeric factor for PARALLEL and TILING pragma and the next mode of pipelining for PIPELINE
pragma). Accordingly, we will generate 𝐾 candidates at each iteration, which means we run HLS 𝐾 times to determine
the next configuration as follows:

$\theta_{i+1} = \arg\max_{\theta_{j} \in \Theta_{cand}} g(\theta_{j}, \theta_{i})$
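To make this step concrete, the following is a minimal Python sketch of the candidate-generation procedure, under the assumption that every parameter exposes an ordered list of legal options; the Param container and helper names are illustrative, not 𝐴𝑢𝑡𝑜𝐷𝑆𝐸's actual API.

from copy import deepcopy

class Param:
    """Illustrative tuning parameter: an ordered list of legal options,
    e.g., [1, 2, 4, 8] for PARALLEL or ['off', 'cg', 'fg'] for PIPELINE."""
    def __init__(self, name, options):
        self.name, self.options = name, options

def next_value(param, value):
    """Return the next option of `param` after `value`, or None if exhausted."""
    idx = param.options.index(value)
    return param.options[idx + 1] if idx + 1 < len(param.options) else None

def gen_candidates(theta, params):
    """One candidate per parameter, advancing that parameter by one step."""
    candidates = []
    for param in params:
        nxt = next_value(param, theta[param.name])
        if nxt is not None:                    # skip exhausted coordinates
            cand = deepcopy(theta)
            cand[param.name] = nxt
            candidates.append(cand)
    return candidates

Each candidate then costs one HLS run, so 𝐾 candidates per iteration translate directly into 𝐾 synthesis jobs—which is exactly the inefficiency that motivates the bottleneck-guided alternative in Section 5.2.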
We leverage the finite difference method to approximate the coordinate value by treating the HLS tool as a black-box.
That is, given a candidate configuration 𝜃 𝑗 deviated from the current configuration 𝜃𝑖 , the coordinate value is defined
as:
$g(\theta_{j}, \theta_{i}) \sim \dfrac{Cycle(H, P(\theta_{j})) - Cycle(H, P(\theta_{i}))}{Util(H, P(\theta_{j})) - Util(H, P(\theta_{i}))}$ (5)
We calculate 𝑈 𝑡𝑖𝑙 (H, P (𝜃 )) by taking into account all the different types of resources using the following formula:
$Util(H, P(\theta)) = \sum_{u} 2^{\frac{1}{1-u}}$ (6)
where 𝑢 is the utilization of one of the FPGA resources. We use an exponential function to penalize the over-utilization
of FPGA more seriously. Note that Eq. 5 considers not only performance gain but also resource efficiency, so it could
reduce the possibility of being trapped in a local optimum. For example, we may reduce 10% execution cycle by spending
30% more area if we increase the parallel factor of a loop (configuration 𝜃 1 ); we can also reduce 5% execution cycle by
spending 10% more area if we enlarge the bit-width of a certain buffer (configuration 𝜃 2 ). Although 𝜃 1 seems better
in terms of the execution cycle, it may be more easily trapped by a locally optimal point because it has a relatively
limited resources left to be further improved. On the other hand, the finite difference values for the two configurations
indicate that 𝜃2 is the better choice, as it saves more cycles per unit of additional area.
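A small Python sketch of Eqs. (5) and (6) follows; the sign convention (treating cycle reduction per unit of added penalized area as the quality to maximize) and the clamping of 𝑢 away from 1 are our assumptions, made to keep the sketch well defined.

def util(resource_usage):
    """Eq. (6): penalized total utilization over FPGA resources,
    e.g., {'LUT': 0.4, 'FF': 0.3, 'BRAM': 0.7, 'DSP': 0.2}.
    2^(1/(1-u)) blows up as u -> 1, punishing over-utilization."""
    return sum(2 ** (1.0 / (1.0 - min(u, 0.999))) for u in resource_usage.values())

def quality(cycle_j, util_j, cycle_i, util_i):
    """Eq. (5): finite-difference value of candidate j w.r.t. current point i."""
    gain = cycle_i - cycle_j            # cycles saved (positive is better)
    cost = util_j - util_i              # penalized area spent
    if cost <= 0:                       # faster *and* cheaper: always take it
        return float('inf') if gain > 0 else 0.0
    return gain / cost

# The theta_1 / theta_2 example from the text (in arbitrary units):
print(quality(90, 130, 100, 100))       # theta_1: -10% cycles, +30% area -> ~0.33
print(quality(95, 110, 100, 100))       # theta_2: -5% cycles, +10% area -> 0.5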
Two main inefficiencies of the approaches reviewed in the previous section are 1) they must evaluate many design
points to identify the performance bottleneck, 2) they have no knowledge of the semantics of the parameters, so
they have no way to differentiate them and prioritize the important ones. Identifying the key parameters is not
straightforward. Although the HLS report may provide the cycle breakdown for the loop and function statements, it is
hard to map them to tuning parameters due to the several code transformations applied by the Merlin
Compiler. Fortunately, the Merlin Compiler includes a feature that transmits the performance breakdown reported by
the HLS tool to the user input code, allowing us to identify the performance bottleneck by traversing the Merlin report
and mapping the bottleneck statement to one or few tuning parameters.
Fig. 3 illustrates how the Merlin Compiler generates its report of the cycle breakdown. When performing code
transformation, the Merlin Compiler records the code change step by step so that it is able to propagate the latency
estimated by the HLS tool back to the user input code. In this example, the i loop corresponds to the compute unit
in the transformed code, so the latency of this unit is assigned to it. Note that the latency of all load, compute, and
store units are included in the task_batch loop which will determine the latency of task loop in both the original
and transformed codes. This feature is helpful for us to analyze the performance bottleneck and identify the key tuning
parameter by running HLS once at each iteration instead of evaluating the effect of all 𝐾 parameters.
By exploiting the cycle breakdown, we can resolve the issues mentioned above by developing a bottleneck analyzer.
We first build a map from the loop or function statements in the user input code to design parameters so that we
know which parameters should be focused on for a particular statement. To identify the critical path and type, we start
with the kernel top function statement and build hierarchy paths of the design by traversing the Merlin report using
depth-first search (DFS). More specifically, for each hierarchy level, we first check to see if the current statement has
child loop statements and sort them by their latency. Then, we traverse each of the child loops and repeat this process.
In case of a function call statement, we dive into the function implementation to further check its child statements
for building the hierarchy paths. Finally, we return a list of paths in order. Note that since we sort all loop statements
according to their latency by checking the Merlin report, the hierarchy paths we created will also be sorted by their
latency.
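A condensed Python sketch of this traversal is shown below; the report accessors (children, is_call, callee, latency) are hypothetical stand-ins for the Merlin report interface, not its real API.

def build_paths(stmt, report, prefix=(), paths=None):
    """DFS over the statement hierarchy. Children are visited in
    decreasing-latency order, so `paths` comes out with the most
    latency-critical hierarchy path first."""
    if paths is None:
        paths = []
    children = list(report.children(stmt))        # child loop statements
    if report.is_call(stmt):                      # dive into function bodies
        children += list(report.children(report.callee(stmt)))
    children.sort(key=report.latency, reverse=True)
    path = prefix + (stmt,)
    if not children:
        paths.append(path)                        # leaf: record a complete path
    for child in children:
        build_paths(child, report, path, paths)
    return paths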
Subsequently, for each statement, we check the Merlin report again to determine whether its performance bottleneck
is memory transfer or computation. The Merlin Compiler obtains this information by analyzing the transformed kernel
code along with the HLS report. A cycle is considered to be a memory transfer cycle if it is consumed by communicating
to global memory. As a result, we can not only figure out the performance bottleneck for each design point, but also
identify a small set of effective design parameters to focus on. Therefore, we are able to significantly improve the
efficiency of our searching algorithm.
When we obtain an ordered list of critical hierarchy paths from the bottleneck analyzer, we start from the innermost
loop statement (because of the DFS traversal) of the most critical entry and identify its corresponding parameters as
candidate parameters to explore, if they are not already tuned. Based on the bottleneck type provided by the bottleneck
analysis (i.e., memory transfer or computation), we pick a subset of the parameters mapped to that statement to work
on. For example, we may have design parameters of PARALLEL and TILING at the same loop level. When the bottleneck
type of the loop is memory transfer, we focus on the TILING parameter for the loop; otherwise, we focus on PARALLEL
parameter. In other words, we reduce the number of candidate design parameters not only by the bottleneck statement
but also by the bottleneck type.
We define each design point as a data structure containing the following information:
curr_point = DesignPoint(configuration, tuned, result, quality, children)
where configuration contains the value of all the parameters and tuned lists the parameters which the algorithm has
explored for the current point. quality stores the quality of design measured by finite difference value and result includes
all the related information gathered from the HLS tool including the resource utilization and the cycle count. Finally,
each design point stores a stack of the configurations for its unexplored children where each child is generated by
advancing one of the parameters by one step. The children are pushed to the stack in the order of their importance
(from least to most important) as computed by the bottleneck analyzer so that by popping the stack, we get to work
with the child that changed the parameter with the most promising impact.
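In Python terms, this structure could look like the following dataclass; the inverted __lt__ (so that heapq's min-heap surfaces the highest-quality point first) is an implementation choice we assume, not something the text specifies.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

@dataclass
class DesignPoint:
    configuration: Dict[str, Any]                 # value of every design parameter
    tuned: Set[str] = field(default_factory=set)  # parameters already explored
    result: Any = None                            # HLS report: cycles, utilization, ...
    quality: float = 0.0                          # finite-difference value (Eq. 5)
    # Stack of (child configuration, focused parameter) pairs, pushed from
    # least to most important so that pop() yields the most promising child.
    children: List[Tuple[Dict[str, Any], str]] = field(default_factory=list)

    def __lt__(self, other: "DesignPoint") -> bool:
        return self.quality > other.quality       # best quality on top of the heap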
We define level n as a point where we have fixed the value of n parameters, so the maximum level in our algorithm
is equal to the total number of parameters. For each level, we define a heap of the pending design points that can be
further explored and push the design points by their quality into the heap. Since new design points are sorted by their
quality values when they were pushed into the heap, the design point with a better quality value will be chosen for
tuning more of its parameters prior to other points. As mentioned above, the next point to be explored is chosen by
popping the stack of the unexplored children of this design point so that at each step, we get to evaluate the most
promising design point.
Algorithm 1 presents our exploring strategy. As we will explain in Section 5.5, we partition the design space to
alleviate the local optimum problem. For each partition, we first get its default point and initialize the heap of the first
level (lines 3 to 9). Then, at each iteration of the algorithm, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 gets the heap with the highest level, peeks the first
node of the heap, and pops its stack of unexplored children to get the new candidate (lines 11 to 14). Next, each option
of the new focused parameter will be evaluated and the result will be passed to the bottleneck analyzer to generate
a new set of focused parameters for making new children (lines 16 to 21). Since the number of fixed parameters is
increased by one, it will be pushed to the heap of the next level if there is still a parameter left that has not been tuned
yet (lines 22 to 26). When the stack of unexplored children of the current design point is empty, it will be popped out
of the heap (lines 28 to 30). The algorithm continues either until all the heaps are empty or until the DSE has reached a
runtime threshold (Line 10).
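The Python sketch below condenses the main loop of Algorithm 1 for one design-space partition; default_point, evaluate, and analyzer.make_point are hypothetical stand-ins for the evaluator and the bottleneck analyzer described above.

import heapq
import time

def explore(default_point, params, analyzer, evaluate, timeout_s):
    """Condensed sketch of Algorithm 1 for one design-space partition."""
    heaps = {0: [default_point]}                 # level -> heap of pending points
    deadline = time.time() + timeout_s
    while any(heaps.values()) and time.time() < deadline:
        level = max(l for l, h in heaps.items() if h)
        point = heaps[level][0]                  # peek the best pending point
        if not point.children:                   # fully expanded: retire it
            heapq.heappop(heaps[level])
            continue
        child_cfg, focus = point.children.pop()  # most promising unexplored child
        best = None
        for option in params[focus].options:     # evaluate every option of the
            cfg = dict(child_cfg, **{focus: option})   # focused parameter
            result = evaluate(cfg)                     # one HLS run per option
            cand = analyzer.make_point(cfg, result)    # sets quality and children
            if best is None or cand.quality > best.quality:
                best = cand
        if best is not None and len(best.tuned) < len(params):
            heapq.heappush(heaps.setdefault(level + 1, []), best)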
As an example, when 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 optimizes Code 1, it will see that the convolution part of the code takes 85.2% of the
overall cycle counts. Since that section of the code is a series of nested loops, the parameters of the inner-most loop will
take the top of the list produced by the bottleneck analyzer. We explain in Section 5.4 that we do not consider loops
with trip count of less than or equal to 16 in our DSE since the HLS tool can automatically optimize these loops well.
As a result, the w loop on Line 15 would be the innermost loop with parameters, and the Merlin report tells us it is
a computation-bound loop. As we describe in Section 5.3, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 first tries to apply fg PIPELINE, which would be a
successful attempt. In the next iteration, the last level heap will contain the design point that was just optimized and
since the convolution part is still the bottleneck, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 would try parallelizing the w loop and will choose factor=4
since it achieves the highest quality value. Although factor=8 can reduce the cycle count by 11%, it increases the
overall area (Eq.6) by 63% which results in a worse quality; therefore, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 picks factor=4 to make room for
further improvement. By adopting Algorithm 1, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 can improve the performance by 218× very quickly, only
after 2 iterations of the algorithm.
We treat the PIPELINE pragma as two different parameters based on its mode and choose the order of applying the pragmas to be fg PIPELINE,
PARALLEL, and cg PIPELINE which is a heuristic approach to improve the performance by utilizing more fine-grained
parallelization units since the HLS tool handles such optimizations better. Here, measuring the quality of design points
with the finite difference value (Eq. 5) helps 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 not to over-utilize the FPGA. For a configuration, when the gain
of the achieved speedup is not comparable to the loss of available resources, the quality of design decreases; hence,
𝐴𝑢𝑡𝑜𝐷𝑆𝐸 will not tune that parameter and the resources are left for applying a design parameter with higher impact.
Moreover, as mentioned in Challenge 3 of Section 1, the order of applying the pragmas is crucial in order to get to
the best design. Our experiments show that evaluating the fine-grained optimizations first helps 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 reach the
best design point in fewer iterations. This is mainly because HLS tools schedule fine-grained optimizations better than
the coarse-grained ones. Table 4 shows how the performance and resource utilization change when fg PIPELINE and
Table 4. Performance and Area Compared to The Base Design When Parameters of Line 15 in Code 1 Change. TIMEOUT is set to 60
minutes. The results suggest that applying fine-grained optimization first lets the HLS tool synthesize the design easier.
PARALLEL pragmas are applied to line 15 in Code 1 compared to the base design where all the pragmas are off. The
time limit to run the HLS tool is set to 60 minutes. The results suggest that in order to get to the optimal configuration
for this loop, we must first apply the fine-grained pipelining. This way, the HLS tool can better schedule the loop
when parallelization is further applied and its synthesis will finish in 28 minutes. However, if we first apply the other
pragma, which results in a coarse-grained parallelization, the synthesis will time out and 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 will not tune
this pragma at this stage.
Note that we do not prune the other design parameters. We just change the order of the parameters to be explored, as
these rules cannot be generalized to all cases due to the unpredictability of the HLS tools. If the bottleneck of a design
point is memory transfer, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 prioritizes the cg PIPELINE over the TILING pragma. The Merlin Compiler, by default,
caches the data, and the former will further overlap the communication time with computation by applying double
buffering; the latter, however, can be used to change the size of the cached data.
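These ordering rules can be captured in a few lines of Python; the parameter-kind names below are illustrative, not 𝐴𝑢𝑡𝑜𝐷𝑆𝐸's internal identifiers.

def order_focus_params(params_at_stmt, bottleneck_type):
    """Order (not prune) the candidate parameters of a bottleneck statement.
    Compute-bound: fg PIPELINE, then PARALLEL, then cg PIPELINE.
    Memory-bound:  cg PIPELINE before TILING."""
    priority = (['fg_pipeline', 'parallel', 'cg_pipeline']
                if bottleneck_type == 'compute'
                else ['cg_pipeline', 'tiling'])
    rank = {kind: i for i, kind in enumerate(priority)}
    # Unlisted kinds keep a lower priority instead of being dropped.
    return sorted(params_at_stmt, key=lambda p: rank.get(p.kind, len(priority)))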
[Fig. 4. Proposed Design Space Representation and Its Impact on DSE. The design space is a grid with P1 ∈ {off, cg, fg} on one axis and P2 ∈ {1, 2, 4, 8, 16, 32, 64} on the other, where P1 and P2 denote the PIPELINE and PARALLEL pragmas, respectively; the points with P1 = cg and P2 > 1 are marked invalid.]
Fig. 4 illustrates the goal of an efficient design space representation. In this example, we attempt to explore the best
parameter with the best option for loop j of Code 1 with pragma P1 and P2 denoting the PIPELINE and PARALLEL
pragmas, respectively. Pragma 𝑃1 and 𝑃2 are exclusive when 𝑃1 is used in cg mode; therefore, only one of them
should be inserted at a time. A good design space representation must preserve the grid design space but invalidate
infeasible points. An example of such representation is presented in Fig. 4. Assume that we are at the configuration
(𝑃1, 𝑃2) = (cg, 1), we only have two candidates to explore in the next step because the configuration (𝑃1, 𝑃2) = (cg, 2)
is invalid. This representation is exploration friendly, and it is easy to enforce rules on the infeasible points.
To represent a grid design space with invalid points, we introduce a Python list comprehension syntax to 𝐴𝑢𝑡𝑜𝐷𝑆𝐸.
The Python list comprehension is a concise approach for creating lists with conditions. It has the following syntax:
list_name = [expression for item in list if condition]
Formally, we define the design space representation for Merlin pragmas with list comprehensions as follows:
#pragma ACCEL <pragma-type> <attribute-key>=auto{
options: parameter_name=list-comprehension-expression;
default: default-value }
For our example, the design space can be represented using list comprehensions as follows:
1 // Skip the rest due to page limit
2 #pragma ACCEL PIPELINE auto{
3 options: P1 = [x for x in [off, cg, fg]];
4 default: 'off' }
5 #pragma ACCEL PARALLEL factor=auto{
6 options: P2 = [x for x in [1, 2, 4, 8, 16, 32, 64] if P1!=cg];
7 default: 1 }
8 for (int j = 0; j < NumIn; ++j) {
9 // Skip the rest due to page limit
where line 6 indicates that the two pragmas are exclusive. In other words, when we set 𝑃1 = cg, the available option for
𝑃2 is only the default value, which is 1 in this case. Note that the default value of each pragma turns it off.
There are three main advantages to adopting list comprehension-based design space representations. First, we are
able to represent a design space with exclusive rules to greatly reduce its size. Second, the Python list comprehension
is general. It provides a friendly and comprehensive interface for higher layers such as polyhedral analysis [63] and
domain-specific languages to generate an effective design space in the future. Third, the syntax of this representation is
Python compatible. This means we can leverage the Python interpreter to evaluate the design space and improve overall
stability of the DSE framework.
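As a small illustration of the third advantage, the snippet below evaluates the P2 options of the running example directly with Python's eval, given a partial configuration; this mirrors how such a design space can be interpreted, though 𝐴𝑢𝑡𝑜𝐷𝑆𝐸's internals may differ.

def legal_options(options_expr, partial_config):
    """Evaluate a pragma's list-comprehension options under the values
    already chosen for other pragmas (exposed as global names)."""
    return eval(options_expr, dict(partial_config))

expr = "[x for x in [1, 2, 4, 8, 16, 32, 64] if P1 != 'cg']"
print(legal_options(expr, {'P1': 'cg'}))   # []  -> only the default (1) remains
print(legal_options(expr, {'P1': 'fg'}))   # [1, 2, 4, 8, 16, 32, 64]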
The Design Space Generator, depicted in Fig. 1, adapts the Rose Compiler [1] to analyze the kernel AST and extract
the required information for running the DSE such as the loops in the design, their trip-count, and available bit-width.
Artisan [44] employs a similar approach for analyzing the code. However, it only considers unroll pragma in code
instrumentation. Our approach, on the other hand, considers a wider set of pragmas as mentioned in Table 2 and
exploits the following rules to prune the design space (a sketch of these rules in code follows the list):
• Ignore the fine-grained loops with trip count (TC) of less than or equal to 16 as the HLS tool can schedule these
loops well.
• Tiling factors (TF) should be integer divisors of their loop TC.
• The allowed parallel factors (PF) for a loop are all sub-divisors of the loop TC up to 𝑚𝑖𝑛(128,𝑇𝐶), plus the TC
itself. A PF larger than 128 causes the HLS tool to run for a long time and usually does not result in good
performance.
• For each loop, we should have 𝑇 𝐹 ∗ 𝑃𝐹 ≤ 𝑇𝐶.
• When fg PIPELINE is applied on a loop, no other pragma is allowed for the inner loops since this option fully
unrolls all the inner loops.
• A parallel pragma is invalid for a loop nest when cg PIPELINE is applied on that loop.
• A tiling pragma is added only to the loops with an inner loop.
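Under the stated rules, the per-loop factor generation might look like the following Python sketch (factor lists only; pipeline modes are handled separately, and the function name is ours).

def loop_factor_space(tc, has_inner_loop):
    """Candidate (TF, PF) pairs for a loop with trip count `tc`,
    following the pruning rules above."""
    if tc <= 16:
        return []                                  # HLS schedules these loops well
    divisors = [d for d in range(2, tc + 1) if tc % d == 0]
    tf_opts = [1] + divisors if has_inner_loop else [1]   # tiling needs an inner loop
    pf_opts = sorted({1, tc} | {d for d in divisors if d <= min(128, tc)})
    # keep only (TF, PF) pairs whose product fits in the trip count
    return [(tf, pf) for tf in tf_opts for pf in pf_opts if tf * pf <= tc]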
Exploring every one of these partitions would take many hours to finish the entire process. On the other hand, some partitions that are based on an insignificant pipeline pragma
may have a similar performance, so it is more efficient to explore only one of them. As a result, we profile each partition
by running HLS with minimized parameter values to obtain the minimum area and performance, and use K-means
clustering with performance and area as features to identify 𝑡 representative partitions among all $2^m$ partitions.
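Assuming scikit-learn is available, the representative-partition selection could be sketched as follows; the feature normalization and nearest-to-center selection are our additions, not necessarily what 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 does internally.

import numpy as np
from sklearn.cluster import KMeans

def pick_representatives(partitions, profiles, t):
    """Cluster all 2^m partitions by profiled (performance, area) and
    keep the partition closest to each of the t cluster centers."""
    feats = np.array([[profiles[p]['perf'], profiles[p]['area']]
                      for p in partitions], dtype=float)
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-9)  # normalize
    km = KMeans(n_clusters=t, n_init=10).fit(feats)
    reps = []
    for c in range(t):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        reps.append(partitions[members[dists.argmin()]])
    return reps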
6 Evaluation
Fig. 5. Speedup of the Merlin Compiler without any Pragmas, Proposed Approach with Different Optimizations, and the Manual
Design over an Intel Xeon CPU Core
We first measure the performance of the Merlin Compiler without any pragmas and without the help of AutoDSE to
get the impact of its default optimizations. The 1𝑠𝑡 bar of each case in Fig. 5 depicts the speedup gained by the Merlin
Compiler with respect to CPU. Then, we evaluate the original coordinate descent (CD) method described in Section 5.1
and the proposed optimization strategies explained in Sections 5.4 and 5.5. The 2𝑛𝑑 to 4𝑡ℎ bars in Fig. 5 show the
speedup gained after tuning the pragmas by each of these optimizations. Note that the chart is in logarithmic scale. We
can see that the default optimizations of the Merlin Compiler are not enough and after applying the candidate pragmas
generated by the Original CD, we get 13.52× speedup, on the geometric mean. Moreover, each of the proposed strategies
benefits at least one case in our benchmark and together further bring a 2.47× speedup. The list-based design space
representation keeps the search space smooth by invalidating infeasible combinations. As a result, we can investigate
more design points in a fixed amount of time. This helps AES, NW, KMP, PATHFINDER, KMEANS, and KNN. Design space
partition benefits the designs with many loop nests in which the coordinate process is easily trapped by the local
optimum when changing pipeline modes—such as AES, GEMM, NW, STENCIL-2D, and STENCIL-3D.
The 5𝑡ℎ bar shows the speedup of 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 when the bottleneck-guided coordinate optimizer detailed in Section 5.2
is adapted along with the parameter ordering explained in Section 5.3, design space representation introduced in
Section 5.4, and design space partitioning described in Section 5.5. With this setup, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 further improves the result
by 5.5× on the geometric mean, bringing the overall speedup to 182.92× compared to when no pragmas are applied. As
a result, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 is able to achieve a speedup of 19.9× over the CPU and reach 0.93× the performance of the manual designs
while running only for 1.1 hours on the geometric mean. The manual designs, depicted by the 6𝑡ℎ bar, are optimized by
applying the Merlin pragmas manually without changing the source programs.
Fig. 6 depicts the 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 process for four cases where the bottleneck-guided optimizer showed significant improvement
in the performance, demonstrating that our approach can rapidly achieve a high-performance design. 𝐴𝑢𝑡𝑜𝐷𝑆𝐸
does not exactly match the performance of manual designs for all of the cases because the HLS report may not reflect
the accurate computation cycles when the kernels contain many unbounded loops or while-loops, which in turn affects
the Merlin report. In order to get the importance of the parameters, the bottleneck analyzer (explained in Section 5.2)
needs to receive the accurate cycle estimation of the design. In the absence of the true cycle breakdown, it cannot
determine the high-impact design parameters. Therefore, our search algorithm may focus on unimportant parameters.
Fig. 6. Speedup Over the Manual Design Using AutoDSE for the 4 Cases that the Bottleneck-guided Optimizer had Significant Impact
Table 5. Speedup of Our Approach Compared to S2FA [54], Lattice-traversing DSE [23], and Gaussian process-based Bayesian
Optimization [43]
As we discussed in Section 5.1, the deficiency of S2FA stems from how hard it is for the problem-independent learning
algorithm to find the key parameters. Lattice-traversing DSE needs an initial sampling step to learn the design space.
This takes a long time for our benchmark due to the size of the design space even though the authors only consider
unrolling the loops and function inlining. This constraint makes it hard for the tool to start the exploration process
before the time limit for DSE is met. The Gaussian process-based Bayesian optimization also has to spend some time to
sample the design space and build an initial surrogate model. However, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 can learn the high-impact directives
by exploring the performance breakdown and thus, is able to find a high-performance design in a few iterations.
Moreover, adopting the Merlin Compiler as the backend gives further advantage to 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 compared to other
DSE tools. This allows the tool to exploit the automatic code transformations for applying the common optimization
techniques such as memory burst, memory coalescing, and double buffering; and focus only on high-level hardware
changes. Nonetheless, the performance comparison with S2FA demonstrates that adopting the Merlin Compiler is not
enough and we still need to explore the design space more efficiently.
Table 6. Average (Geometric Mean) Speedup of the Vitis tool, the Merlin Compiler, and 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 over the Manually Optimized
Kernels from Xilinx Vitis Libraries. The manual designs are the original kernels from the library. The performance of those designs are
compared to when the optimization pragmas we search for (UNROLL, PIPELINE, ARRAY_PARTITION, DEPENDENCE, LOOP_FLATTEN, and
INLINE) are removed and the code is passed to three different tools.
To better understand the effect of our optimizer, we tested the performance of the Vitis tool and the Merlin Compiler
on the input to 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 (which does not include the optimization pragmas mentioned above). The performance
comparisons are summarized in Table 6. As the results show, while the Merlin Compiler can get to a speedup of 3.29×
compared to the Vitis tool, it still needs the help of 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 to get to the manually optimized kernels in the library. In
fact, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 could achieve a further speedup of 2.74× by automatically inserting 3.2 Merlin pragmas per kernel, on
the geometric mean. As a result, it improves on the performance of the Vitis tool by 9.04× when Vitis is given the code
with the reduced set of pragmas, and by 1.04× when Vitis is given the manual code.
Fig. 7 in Appendix A.2 depicts the performance comparison of the design points 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 generated with respect to
Xilinx results along with the number of pragmas that we removed in detail. The results show that 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 is able
to achieve the same or better performance, yet with a 26.38× reduction of their optimization pragmas, in 0.3 hours on
the geometric mean; this proves the effectiveness of our bottleneck-based approach and shows that it can
mimic the method an expert would take. For the cases where 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 does not exactly match the performance of Vitis,
𝐴𝑢𝑡𝑜𝐷𝑆𝐸 still finds the best combination of the pragmas. The gap lies in the different II that Merlin achieves.
For example, the histEqualize, histogram, and otsuthreshold kernels all have a loop that requires the II to be set
to 2 when the PIPELINE pragma is used; otherwise, Vivado HLS achieves an II of 3. However, it is not possible to change the
II using the Merlin Compiler. On the other hand, 𝐴𝑢𝑡𝑜𝐷𝑆𝐸 is able to significantly outperform Vitis on the customConv and
reduce kernels by better detecting the choices and locations for pipelining and parallelization.
Acknowledgments
The authors would like to thank Dr. Peichen Pan for his invaluable support with the Merlin Compiler and Dr. Lorenzo
Ferretti and Qi Sun for helping with the comparison to their work. This work is supported by the ICN-WEN award jointly
funded by the NSF (CNS-1719403) and Intel (34627365), the CAPA award also jointly funded by NSF (CCF-1723773) and
Intel (36888881), and CDSC industrial partners.³
References
[1] [n.d.]. Rose Compiler Infrastructure. http://rosecompiler.org/.
[2] Amazon EC2 F1 Instance. [n.d.]. https://aws.amazon.com/ec2/instance-types/f1/.
[3] Joao Andrade, Nithin George, Kimon Karras, David Novo, Frederico Pratas, Leonel Sousa, Paolo Ienne, Gabriel Falcao, and Vitor Silva. 2017. Design
space exploration of LDPC decoders using high-level synthesis. IEEE Access 5 (2017), 14600–14615.
³https://cdsc.ucla.edu/partners/
[4] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014.
Opentuner: An extensible framework for program autotuning. In PACT. 303–316.
[5] Gary Bradski. 2000. The opencv library. Dr Dobb’s J. Software Tools 25 (2000), 120–125.
[6] Cadence Stratus High-Level Synthesis. [n.d.]. https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/synthesis/stratus-high-level-synthesis.html.
[7] Catapult High-Level Synthesis. [n.d.]. https://eda.sw.siemens.com/en-US/ic/ic-design/high-level-synthesis-and-verification-platform/.
[8] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for
heterogeneous computing. In IISWC. 44–54.
[9] Yuze Chi, Jason Cong, Peng Wei, and Peipei Zhou. 2018. SODA: stencil with optimized dataflow architecture. In ICCAD. 1–8.
[10] Young-kyu Choi and Jason Cong. 2018. HLS-based optimization and design space exploration for applications with variable loop bounds. In ICCAD.
1–8.
[11] Jason Cong, Muhuan Huang, Peichen Pan, Yuxin Wang, and Peng Zhang. 2016. Source-to-source optimization for HLS. In FPGAs for Software
Programmers. 137–163.
[12] Jason Cong, Muhuan Huang, Peichen Pan, Di Wu, and Peng Zhang. 2016. Software infrastructure for enabling FPGA-based accelerations in data
centers. In ISLPED. 154–155.
[13] Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-level synthesis for FPGAs: From prototyping
to deployment. In TCAD, Vol. 30. 473–491.
[14] Jason Cong and Jie Wang. 2018. PolySA: polyhedral-based systolic array auto-compilation. In ICCAD. 1–8.
[15] Jason Cong, Peng Wei, Cody Hao Yu, and Peng Zhang. 2018. Automated accelerator generation and optimization with composable, parallel and
pipeline architecture. In DAC.
[16] CyberWorkBench. [n.d.]. https://www.nec.com/en/global/prod/cwb/index.html.
[17] Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE computational science and
engineering 5, 1 (1998), 46–55.
[18] Steve Dai, Yuan Zhou, Hang Zhang, Ecenur Ustun, Evangeline FY Young, and Zhiru Zhang. 2018. Fast and accurate estimation of quality of results in
high-level synthesis with machine learning. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines
(FCCM). IEEE, 129–132.
[19] Robert H Dennard, Fritz H Gaensslen, Hwa-Nien Yu, V Leo Rideout, Ernest Bassous, and Andre R LeBlanc. 1974. Design of ion-implanted MOSFET’s
with very small physical dimensions. IEEE Journal of Solid-State Circuits 9, 5, 256–268.
[20] Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan
Tran, et al. 2018. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation 13, 07 (2018), P07027.
[21] Falcon Computing Solutions, Inc. [n.d.]. http://www.falcon-computing.com.
[22] Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi. 2018. Cluster-based heuristic for high level synthesis design space exploration. IEEE
Transactions on Emerging Topics in Computing.
[23] Lorenzo Ferretti, Giovanni Ansaloni, and Laura Pozzi. 2018. Lattice-traversing design space exploration for high level synthesis. In ICCD. 210–217.
[24] Álvaro Fialho, Luis Da Costa, Marc Schoenauer, and Michèle Sebag. 2010. Analyzing bandit-based adaptive operator selection mechanisms. Annals
of Mathematics and Artificial Intelligence 60, 1-2, 25–64.
[25] Intel. [n.d.]. https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html.
[26] Intel SDK for OpenCL Applications. [n.d.]. https://software.intel.com/en-us/intel-opencl.
[27] David Koeplinger, Raghu Prabhakar, Yaqi Zhang, Christina Delimitrou, Christos Kozyrakis, and Kunle Olukotun. 2016. Automatic generation of
efficient accelerators for reconfigurable hardware. In ISCA. 115–127.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
[29] Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, and Zhiru Zhang. 2019. HeteroCL: A multi-paradigm
programming infrastructure for software-defined reconfigurable computing. In Proceedings of the 2019 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. 242–251.
[30] Hung-Yi Liu and Luca P Carloni. 2013. On learning-based methods for design-space exploration with high-level synthesis. In DAC. 1–7.
[31] Shuangnan Liu, Francis CM Lau, and Benjamin Carrion Schafer. 2019. Accelerating fpga prototyping through predictive model-based hls design
space exploration. In DAC. 1–6.
[32] Anushree Mahapatra and Benjamin Carrion Schafer. 2014. Machine-learning based simulated annealer method for high level synthesis design space
exploration. In ESLsyn. 1–6.
[33] Rachit Nigam, Sachille Atapattu, Samuel Thomas, Zhijing Li, Theodore Bauer, Yuwei Ye, Apurva Koti, Adrian Sampson, and Zhiru Zhang. 2020.
Predictable accelerator design with time-sensitive affine types. arXiv preprint arXiv:2004.04852 (2020).
[34] Raghu Prabhakar, David Koeplinger, Kevin J Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, and Kunle Olukotun. 2016. Generating
configurable hardware from parallel patterns. ASPLOS 51, 4, 651–665.
[35] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers,
Gopi Prashanth Gopal, Jan Gray, et al. 2014. A reconfigurable fabric for accelerating large-scale datacenter services. In ISCA. 13–24.
[36] Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and
customized architectures. In IISWC. 110–119.
[37] Enrico Reggiani, Marco Rabozzi, Anna Maria Nestorov, Alberto Scolari, Luca Stornaiuolo, and Marco Santambrogio. 2019. Pareto optimal design
space exploration for accelerated CNN on FPGA. In IPDPSW. 107–114.
[38] Benjamin Carrion Schafer. 2017. Parallel high-level synthesis design space exploration for behavioral ips of exact latencies. TODAES 22, 4, 1–20.
[39] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. 2012. Divide and conquer high-level synthesis design space exploration. TODAES 17, 3,
1–19.
[40] B Carrion Schafer and Kazutoshi Wakabayashi. 2012. Machine learning predictive modelling high-level synthesis design space exploration. In IET
computers & digital techniques, Vol. 6. 153–159.
[41] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams.
2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning. PMLR, 2171–2180.
[42] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. 2020. End-to-End Optimization of Deep Learning Applications. In FPGA. 133–139.
[43] Qi Sun, Tinghuan Chen, Siting Liu, Jin Miao, Jianli Chen, Hao Yu, and Bei Yu. 2021. Correlated Multi-objective Multi-fidelity Optimization for HLS
Directives Design. In IEEE/ACM Proceedings Design, Automation and Test in Europe (DATE). 01–05.
[44] Jessica Vandebon, Jose GF Coutinho, Wayne Luk, Eriko Nurvitadhi, and Tim Todman. 2020. Artisan: a Meta-Programming Approach For Codifying
Optimisation Strategies. In FCCM. 177–185.
[45] Vivado HLS. [n.d.]. www.xilinx.com/products/design-tools/vivado.
[46] Jie Wang, Licheng Guo, and Jason Cong. 2021. AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA. In Proceedings of the
2021 ACM/SIGDA international symposium on Field-programmable gate arrays.
[47] Shuo Wang, Yun Liang, and Wei Zhang. 2017. Flexcl: An analytical performance model for opencl workloads on flexible fpgas. In DAC. 1–6.
[48] Xilinx. [n.d.]. https://www.xilinx.com/about/xilinx-ventures/falcon-computing.html.
[49] Xilinx Vitis Libraries. [n.d.]. www.github.com/Xilinx/Vitis_Libraries.
[50] Xilinx Vitis Platform. [n.d.]. https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html.
[51] Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, and Zhiru Zhang. 2017. A parallel bandit-based approach for autotuning fpga compilation.
In FPGA. 157–166.
[52] Pengfei Xu, Xiaofan Zhang, Cong Hao, Yang Zhao, Yongan Zhang, Yue Wang, Chaojian Li, Zetong Guan, Deming Chen, and Yingyan Lin. 2020.
AutoDNNchip: An automated dnn chip predictor and builder for both FPGAs and ASICs. In FPGA. 40–50.
[53] Sotirios Xydis, Gianluca Palermo, Vittorio Zaccaria, and Cristina Silvano. 2014. SPIRIT: Spectral-Aware pareto iterative refinement optimization for
supervised high-level synthesis. In TCAD, Vol. 34. 155–159.
[54] Cody Hao Yu, Peng Wei, Max Grossman, Peng Zhang, Vivek Sarker, and Jason Cong. 2018. S2FA: an accelerator automation framework for
heterogeneous computing in datacenters. In DAC. 1–6.
[55] Georgios Zacharopoulos, Lorenzo Ferretti, Giovanni Ansaloni, Giuseppe Di Guglielmo, Luca Carloni, and Laura Pozzi. 2019. Compiler-assisted
selection of hardware acceleration candidates from application source code. In ICCD. 129–137.
[56] Zhiru Zhang, Yiping Fan, Wei Jiang, Guoling Han, Changqi Yang, and Jason Cong. 2008. AutoPilot: A platform-based ESL synthesis system. In
High-Level Synthesis. 99–112.
[57] Jieru Zhao, Liang Feng, Sharad Sinha, Wei Zhang, Yun Liang, and Bingsheng He. 2017. COMBA: A comprehensive model-based analysis framework
for high level synthesis of real applications. In ICCAD. 430–437.
[58] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. FlexTensor: An Automatic Schedule Exploration and Optimization
Framework for Tensor Computation on Heterogeneous System. In ASPLOS. 859–873.
[59] Guanwen Zhong, Alok Prakash, Yun Liang, Tulika Mitra, and Smail Niar. 2016. Lin-analyzer: a high-level performance analysis tool for FPGA-based
accelerators. In DAC. 1–6.
[60] Guanwen Zhong, Alok Prakash, Siqi Wang, Yun Liang, Tulika Mitra, and Smail Niar. 2017. Design Space exploration of FPGA-based accelerators
with multi-level parallelism. In DATE. 1141–1146.
[61] Guanwen Zhong, Vanchinathan Venkataramani, Yun Liang, Tulika Mitra, and Smail Niar. 2014. Design space exploration of multiple loops on
FPGAs using high level synthesis. In ICCD. 456–463.
[62] Hamid Reza Zohouri, Artur Podobas, and Satoshi Matsuoka. 2018. Combined spatial and temporal blocking for high-performance stencil computation
on FPGAs using OpenCL. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 153–162.
[63] Wei Zuo, Peng Li, Deming Chen, Louis-Noël Pouchet, Shunan Zhong, and Jason Cong. 2013. Improving polyhedral code generation for high-level
synthesis. In CODES+ ISSS. 1–10.
A Appendix
Fig. 7. Speedup and Number of Reduced Pragmas Using AutoDSE Compared to Vision Kernels of Xilinx Vitis libraries [49]