Locality Exists in Graph Processing:
Workload Characterization on an Ivy Bridge Server
Scott Beamer, Krste Asanović, David Patterson
Electrical Engineering & Computer Sciences Department
University of California
Berkeley, California
{sbeamer,krste,pattrsn}@eecs.berkeley.edu
Abstract—Graph processing is an increasingly important ap-
plication domain and is typically communication-bound. In this
work, we analyze the performance characteristics of three high-
performance graph algorithm codebases using hardware per-
formance counters on a conventional dual-socket server. Unlike
many other communication-bound workloads, graph algorithms
struggle to fully utilize the platform’s memory bandwidth and
so increasing memory bandwidth utilization could be just as
effective as decreasing communication. Based on our observations
of simultaneous low compute and bandwidth utilization, we find
there is substantial room for a different processor architecture to
improve performance without requiring a new memory system.
I. INTRODUCTION
Graph processing is experiencing a surge of interest, as
applications in social networks and their analysis have grown
in importance [25], [31], [45]. Additionally, graph algorithms
have found new applications in recognition [28], [46] and the
sciences [39].
Graph algorithms are notoriously difficult to execute ef-
ficiently, and so there has been considerable recent effort in
improving the performance of processing large graphs for these
important applications. Their inefficiency is due to the large
volume and irregular pattern of communication between com-
putations at each vertex or edge. When executed on a shared-
memory multiprocessor, this large volume of communication
is translated into loads and stores in the memory hierarchy.
When executed in parallel on a large-scale distributed cluster,
this communication is translated into messages across the
inter-processor network. Because message-passing is far less
efficient than accessing memory in contemporary systems,
distributed clusters are a poor match to graph processing. For
example, on Graph500, a world ranking of the fastest super-
computers for graph algorithms, the efficiency of each core in
a cluster is on average one to two orders-of-magnitude lower
than cores in shared-memory nodes [21]. This communication-
bound behavior has led to surprising results, where a single
Mac Mini operating on a large graph stored on an SSD is able
to outperform a medium-sized cluster [26].
Due to the inefficiency of message-passing communication,
the only reason to use a cluster for graph processing is if the
data is too large to fit on a single node [30]. However, many
interesting graph problems are not large enough to justify a
cluster. For example, the entire Facebook friend graph requires
only 1.5 TB uncompressed [3], which can reside in a single
high-end server node’s memory today.
In this paper, we focus on the performance of a shared-
memory multiprocessor node executing optimized graph algo-
rithms. We analyze the performance of three high-performance
graph processing codebases each using a different parallel
runtime, and we measure results for these graph libraries using
five different graph kernels and a variety of input graphs. We
use microbenchmarks and hardware performance counters to
analyze the bottlenecks these optimized codes experience when
executed on a modern Intel Ivy Bridge server. We derive the
following insights from our analysis:
• Memory bandwidth is not fully utilized - Surprisingly,
the other bottlenecks described below prevent the off-
chip memory system from achieving full utilization on
well-tuned parallel graph codes. In other words, there
is the potential for significant performance improve-
ment on graph codes with current off-chip memory
systems.
• Graph codes exhibit substantial locality - Optimized
graph codes experience a moderately high last-level
cache (LLC) hit rate.
• Reorder buffer size limits achievable memory throughput -
The relatively high LLC hit rate implies many instructions
are executed for each LLC miss. These instructions fill the
reorder buffer in the core, preventing future loads that will
miss in the LLC from issuing early, resulting in unused
memory bandwidth; a back-of-the-envelope bound illustrating
this appears after this list.
• Multithreading has limited potential for graph pro-
cessing - In the context of a large superscalar out-
of-order multicore, we see only modest room for
performance improvement on graph codes from mul-
tithreading and that is likely achievable with only a
modest number (2) of threads per core.
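To see how a full reorder buffer caps memory throughput, a Little's Law estimate helps: the number of LLC misses that can be outstanding at once is at most roughly the ROB size divided by the instructions executed per miss, and throughput is outstanding misses times line size divided by memory latency. The following sketch works through this bound; the ROB size, instructions per miss, and latency are assumed round numbers for an Ivy Bridge-class core, not measurements from this study.

// Back-of-the-envelope bound on per-core memory bandwidth (illustrative values).
#include <algorithm>
#include <cstdio>

int main() {
  const double rob_entries        = 168.0;  // assumed ROB size for an Ivy Bridge-class core
  const double insts_per_llc_miss = 100.0;  // assumed: follows from a moderately high LLC hit rate
  const double miss_latency_ns    = 80.0;   // assumed DRAM access latency
  const double line_bytes         = 64.0;   // cache line size

  // The ROB holds at most rob_entries instructions, so at most this many
  // LLC misses can be in flight at once (memory-level parallelism).
  double mlp = std::max(1.0, rob_entries / insts_per_llc_miss);

  // Little's Law: throughput = outstanding requests / latency.
  double bytes_per_ns = mlp * line_bytes / miss_latency_ns;   // bytes/ns == GB/s
  std::printf("~%.2f outstanding misses -> ~%.2f GB/s per core\n", mlp, bytes_per_ns);
  return 0;
}

Under these assumed numbers a core sustains only a little over 1 GB/s of demand misses, so even all cores together request far less than a dual-socket memory system can deliver, consistent with the underutilization noted above.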
We also confirm conventional wisdom that the most effi-
cient algorithms are often the hardest to parallelize, and that
these algorithms have their scaling hampered by load imbal-
ance, synchronization overheads, and non-uniform memory
access (NUMA) penalties. Additionally, we find that different
input graph sizes and topologies can lead to very different
conclusions for algorithms and architectures, so it is important
to consider a range of input graphs in any analysis.
Based on our empirical results, we make recommendations
for future work in both hardware and software to improve
graph algorithm performance.

II. GRAPH BACKGROUND
Graph applications are characterized not only by the algo-
rithms used, but also by the structure of the graphs that make
up their workload. A graph’s diameter is the largest number
of vertices that must be traversed in order to travel from one
vertex to another when paths that backtrack, detour, or loop
are excluded from consideration. The degree of a node in a
graph is the number of connections it has to other nodes, and
the degree distribution is the probability distribution of these
degrees over the whole graph.
Commonly used graphs can be divided into two broad
categories named for their most emblematic members: meshes
and social networks [7]. Meshes tend to be derived from
physically spatial sources, such as road maps or the finite-
element mesh of a simulated car body, so they can be relatively
readily partitioned along the few original spatial dimensions.
Due to their physical origin, they usually have a high diameter
and a degree distribution that is both bounded and low.
Conversely, social networks come from non-spatial sources,
and consequently are difficult to partition using any reasonable
number of dimensions. Additionally, social networks have a
low diameter (“small-world”) and a power-law degree distri-
bution (“scale-free”). In a small-world graph, most nodes are
not neighbors of one another, but most nodes can be reached
from every other by a small number of hops [42]. A scale-
free graph has a degree distribution that follows a power law,
at least asymptotically [5]. The fraction of nodes in a scale-
free graph having k connections to other nodes is P(k) ∼ k^(−γ),
where γ is a parameter typically in the range 2 < γ < 3.
Meshes are perhaps the most common mental model for
graphs, since they are typically used in textbook figures.
Unfortunately, they do not capture the challenges posed by
the small-world and scale-free properties of social network
topologies. The small-world property makes them difficult to
partition (no partition yields few cut edges relative to enclosed edges), while the
scale-free property makes it difficult to load balance a parallel
execution since there can be many orders of magnitude dif-
ference between the work for different vertices. Although the
highest degree vertices are rare, their incident edges constitute
a large fraction of the graph.
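To make this imbalance concrete, the sketch below draws an illustrative power-law degree sequence (γ = 2.5, a value in the typical range; the continuous Pareto approximation and the vertex count are assumptions, not the generators used in this study) and reports how much of the total edge work lands on the top 1% of vertices.

// Illustration of scale-free skew: the share of edge endpoints held by the top 1% of vertices.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const double gamma = 2.5;        // assumed exponent in the typical 2..3 range
  const int n = 1000000;           // assumed vertex count
  std::mt19937_64 rng(42);
  std::uniform_real_distribution<double> uni(0.0, 1.0);

  // Continuous (Pareto) approximation of P(k) ~ k^(-gamma) with minimum degree 1,
  // sampled by inverting the CDF.
  std::vector<double> degree(n);
  for (double &d : degree)
    d = std::pow(1.0 - uni(rng), -1.0 / (gamma - 1.0));

  std::sort(degree.begin(), degree.end(), std::greater<double>());
  double total = std::accumulate(degree.begin(), degree.end(), 0.0);
  double top1  = std::accumulate(degree.begin(), degree.begin() + n / 100, 0.0);
  std::printf("top 1%% of vertices hold %.0f%% of all edge endpoints\n",
              100.0 * top1 / total);
  return 0;
}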
III. METHODOLOGY
To provide a representative graph workload, we chose five
popular graph kernels and exercised them with five different
input graphs using three high-performance graph codebases
running on a modern high-end server. Unless otherwise stated,
we measure the full workload of all combinations of code-
bases, kernels, and graphs (75 data points) for each system
configuration.
A. Graph Kernels and Input Graphs
We selected five graph kernels based on their popularity in
applications as well as in other research papers:
1) Breadth-First Search (BFS) is actually only a traver-
sal order, but it is so fundamental to graph algorithms
that we include it in our suite. We convert BFS into a
kernel by tracking the parent vertex in the traversal (a sketch of such a kernel appears after this list).
2) Single-Source Shortest Paths (SSSP) computes the
distance to all reachable vertices from a start vertex.
3) PageRank (PR) is a way of determining influence
within a graph, and was initially used to rank search
results [38].
4) Connected Components (CC) attaches the same
label to all vertices in the same connected component.
5) Betweenness Centrality (BC) is commonly used
in social network analysis to measure the influence
a vertex has on a graph. A vertex’s betweenness-
centrality score is related to the fraction of shortest
paths between all vertices that pass through the ver-
tex. To keep runtimes tractable, our BC benchmark
starts from only a few root vertices instead of every
vertex.
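As a concrete illustration of the first kernel, a minimal top-down BFS that records each vertex's parent might look like the sketch below, written in the plain OpenMP/C++ style these codebases favor; the CSR layout and all names are simplified assumptions, not code taken from Galois, Ligra, or GAPBS.

// Minimal top-down BFS recording each vertex's parent (simplified CSR sketch).
#include <atomic>
#include <cstdint>
#include <vector>

struct CSRGraph {                        // compressed sparse row adjacency
  std::vector<int64_t> offsets;          // offsets[v]..offsets[v+1] bound v's neighbor range
  std::vector<int64_t> neighbors;
  int64_t num_vertices() const { return (int64_t)offsets.size() - 1; }
};

std::vector<int64_t> BFSParents(const CSRGraph &g, int64_t source) {
  std::vector<std::atomic<int64_t>> parent(g.num_vertices());
  for (auto &p : parent) p.store(-1);    // -1 marks unvisited
  parent[source] = source;

  std::vector<int64_t> frontier{source};
  while (!frontier.empty()) {
    std::vector<int64_t> next;
    #pragma omp parallel
    {
      std::vector<int64_t> local_next;   // per-thread buffer, merged below
      #pragma omp for nowait
      for (int64_t i = 0; i < (int64_t)frontier.size(); i++) {
        int64_t u = frontier[i];
        for (int64_t e = g.offsets[u]; e < g.offsets[u + 1]; e++) {
          int64_t v = g.neighbors[e];
          int64_t unvisited = -1;
          // Claim v atomically so exactly one thread records its parent.
          if (parent[v].compare_exchange_strong(unvisited, u))
            local_next.push_back(v);
        }
      }
      #pragma omp critical
      next.insert(next.end(), local_next.begin(), local_next.end());
    }
    frontier.swap(next);
  }
  return std::vector<int64_t>(parent.begin(), parent.end());
}

A tuned implementation would additionally switch traversal direction (push versus pull) and use bitmap frontiers on large frontiers, which is the kind of optimization that makes the measured codes fast on low-diameter graphs.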
We selected the input graphs used in our evaluation to be
topologically diverse and Table I lists them. Twitter, road, and
web are all from real-world data, while kron and uniform are
synthetic. Twitter, web, and kron all have the “social network”
topology, as they have low effective diameters and a power-law
degree distribution. Road is an example of a “mesh” topology,
with its high diameter, low average degree, and low maximum
degree. Even though our graph suite includes some of the
largest publicly available real-world graphs, they do not fully
use the memory capacity of our system. As is done in the
Graph500 benchmark, we generate arbitrarily large synthetic
graphs to fill our memory capacity. Our parameters for kron
are chosen to match those of Graph500 [21]. Uniform is low
diameter, like a social network, but its degree distribution is
normal rather than a power law. Hence, in our uniform graph
each vertex tends to be accessed roughly the same number
of times, unlike social networks where a few vertices are
accessed disproportionately often. Uniform represents the most
adversarial graph since by design it has no locality; however, it
is also the most unrealistic and serves as a lower bound on
performance.
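For reference, the uniform graph's construction amounts to drawing both endpoints of every edge uniformly at random, which gives every vertex the same expected degree. A minimal sketch follows; the vertex count and average degree are parameters, not the settings used in this study.

// Sketch of a uniform random edge list: both endpoints drawn uniformly at random,
// so every vertex has the same expected degree (no power-law skew).
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, int64_t>> MakeUniformEdgeList(int64_t num_vertices,
                                                             int64_t avg_degree,
                                                             uint64_t seed = 1) {
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<int64_t> pick(0, num_vertices - 1);
  int64_t num_edges = num_vertices * avg_degree / 2;   // undirected edge count
  std::vector<std::pair<int64_t, int64_t>> edges;
  edges.reserve(num_edges);
  for (int64_t e = 0; e < num_edges; e++)
    edges.emplace_back(pick(rng), pick(rng));
  return edges;
}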
B. Graph Processing Frameworks
For this study, we use three of the fastest available graph
codebases, which each use a different parallel runtime.
Galois [36] uses its own custom parallel runtime specifi-
cally designed to handle irregular fine-grained task parallelism.
Algorithms implemented in Galois are free to use autonomous
scheduling (no synchronization barriers), which should re-
duce the synchronization otherwise needed for high-diameter
graphs. Additionally, Galois’ scheduler takes into consideration
the platform's core and socket topology.
Ligra [40] uses the Cilk [8] parallel runtime and is built on
the abstractions of edge maps and vertex maps. When applying
these map functions, Ligra uses heuristics to determine in
which direction to apply them (push or pull) and what data
structures to use (sparse or dense). These optimizations make
Ligra especially well suited for low-diameter graphs.
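The push/pull distinction can be seen in the two loop structures below; this is a generic sketch of the idea with hypothetical array names, not Ligra's actual edgeMap implementation.

// Two directions for propagating a frontier over the same CSR graph.
// Push: each frontier vertex writes to its out-neighbors (needs atomics when parallel).
// Pull: each unvisited vertex scans its in-neighbors for a frontier member and can
//       stop at the first hit, which is cheaper once most of the graph is active.
#include <cstdint>
#include <vector>

void PushStep(const std::vector<int64_t> &offsets, const std::vector<int64_t> &neigh,
              const std::vector<bool> &in_frontier, std::vector<bool> &next_frontier,
              std::vector<int64_t> &parent) {
  for (int64_t u = 0; u < (int64_t)in_frontier.size(); u++) {
    if (!in_frontier[u]) continue;
    for (int64_t e = offsets[u]; e < offsets[u + 1]; e++) {
      int64_t v = neigh[e];
      if (parent[v] == -1) { parent[v] = u; next_frontier[v] = true; }
    }
  }
}

void PullStep(const std::vector<int64_t> &offsets, const std::vector<int64_t> &neigh,
              const std::vector<bool> &in_frontier, std::vector<bool> &next_frontier,
              std::vector<int64_t> &parent) {
  for (int64_t v = 0; v < (int64_t)parent.size(); v++) {
    if (parent[v] != -1) continue;                     // already visited
    for (int64_t e = offsets[v]; e < offsets[v + 1]; e++) {
      int64_t u = neigh[e];
      if (in_frontier[u]) { parent[v] = u; next_frontier[v] = true; break; }
    }
  }
}

Choosing between the two based on frontier size is what lets a direction-optimizing traversal avoid examining most edges on low-diameter graphs.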
GAP Benchmark Suite (GAPBS) [6], [19] is a collec-
tion of high-performance implementations written directly in
OpenMP with C++11. GAPBS is not a framework, so it does
not force common abstractions onto all implementations, but
instead frees each to do whatever is appropriate for a given
algorithm.
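As an example of this direct OpenMP/C++11 style, a single pull-based PageRank iteration might look like the sketch below; the array names, schedule, and damping constant are illustrative assumptions rather than code from the benchmark suite.

// One pull-based PageRank iteration in plain OpenMP/C++11 (illustrative sketch).
#include <cstdint>
#include <vector>

void PageRankIteration(const std::vector<int64_t> &in_offsets,
                       const std::vector<int64_t> &in_neigh,
                       const std::vector<int64_t> &out_degree,
                       const std::vector<double> &rank_old,
                       std::vector<double> &rank_new) {
  const double kDamp = 0.85;                        // assumed damping factor
  const int64_t n = (int64_t)rank_old.size();
  const double base = (1.0 - kDamp) / n;

  #pragma omp parallel for schedule(dynamic, 64)    // dynamic chunks help with skewed degrees
  for (int64_t v = 0; v < n; v++) {
    double incoming = 0.0;
    for (int64_t e = in_offsets[v]; e < in_offsets[v + 1]; e++) {
      int64_t u = in_neigh[e];
      incoming += rank_old[u] / out_degree[u];      // contribution from in-neighbor u
    }
    rank_new[v] = base + kDamp * incoming;
  }
}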