Locality Exists in Graph Processing:
Workload Characterization on an Ivy Bridge Server
Scott Beamer, Krste Asanović, David Patterson
Electrical Engineering & Computer Sciences Department
University of California
Berkeley, California
{sbeamer,krste,pattrsn}@eecs.berkeley.edu
Abstract—Graph processing is an increasingly important ap-
plication domain and is typically communication-bound. In this
work, we analyze the performance characteristics of three high-
performance graph algorithm codebases using hardware per-
formance counters on a conventional dual-socket server. Unlike
many other communication-bound workloads, graph algorithms
struggle to fully utilize the platform’s memory bandwidth and
so increasing memory bandwidth utilization could be just as
effective as decreasing communication. Based on our observations
of simultaneous low compute and bandwidth utilization, we find
there is substantial room for a different processor architecture to
improve performance without requiring a new memory system.
I. INTRODUCTION
Graph processing is experiencing a surge of interest, as
applications in social networks and their analysis have grown
in importance [25], [31], [45]. Additionally, graph algorithms
have found new applications in recognition [28], [46] and the
sciences [39].
Graph algorithms are notoriously difficult to execute ef-
ficiently, and so there has been considerable recent effort in
improving the performance of processing large graphs for these
important applications. Their inefficiency is due to the large
volume and irregular pattern of communication between com-
putations at each vertex or edge. When executed on a shared-
memory multiprocessor, this large volume of communication
is translated into loads and stores in the memory hierarchy.
When executed in parallel on a large-scale distributed cluster,
this communication is translated into messages across the
inter-processor network. Because message-passing is far less
efficient than accessing memory in contemporary systems,
distributed clusters are a poor match to graph processing. For
example, on Graph500, a world ranking of the fastest super-
computers for graph algorithms, the efficiency of each core in
a cluster is on average one to two orders-of-magnitude lower
than cores in shared-memory nodes [21]. This communication-
bound behavior has led to surprising results, where a single
Mac Mini operating on a large graph stored on an SSD is able
to outperform a medium-sized cluster [26].
Due to the inefficiency of message-passing communication,
the only reason to use a cluster for graph processing is if the
data is too large to fit on a single node [30]. However, many
interesting graph problems are not large enough to justify a
cluster. For example, the entire Facebook friend graph requires
only 1.5 TB uncompressed [3], which can reside in a single
high-end server node’s memory today.
In this paper, we focus on the performance of a shared-
memory multiprocessor node executing optimized graph algo-
rithms. We analyze the performance of three high-performance
graph processing codebases each using a different parallel
runtime, and we measure results for these graph libraries using
five different graph kernels and a variety of input graphs. We
use microbenchmarks and hardware performance counters to
analyze the bottlenecks these optimized codes experience when
executed on a modern Intel Ivy Bridge server. We derive the
following insights from our analysis:
• Memory bandwidth is not fully utilized - Surprisingly,
the other bottlenecks described below prevent the off-
chip memory system from achieving full utilization on
well-tuned parallel graph codes. In other words, there
is the potential for significant performance improve-
ment on graph codes with current off-chip memory
systems.
• Graph codes exhibit substantial locality - Optimized
graph codes experience a moderately high last-level
cache (LLC) hit rate.
• Reorder buffer size limits achievable memory throughput -
The relatively high LLC hit rate implies many instructions
are executed for each LLC miss. These instructions fill the
reorder buffer in the core, preventing future loads that will
miss in the LLC from issuing early, resulting in unused
memory bandwidth; a back-of-the-envelope bound illustrating
this appears after this list.
• Multithreading has limited potential for graph pro-
cessing - In the context of a large superscalar out-
of-order multicore, we see only modest room for
performance improvement on graph codes from mul-
tithreading and that is likely achievable with only a
modest number (2) of threads per core.
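To see how a full reorder buffer caps memory throughput, a Little's Law estimate helps: the number of LLC misses that can be outstanding at once is at most roughly the ROB size divided by the instructions executed per miss, and throughput is outstanding misses times line size divided by memory latency. The following sketch works through this bound; the ROB size, instructions per miss, and latency are assumed round numbers for an Ivy Bridge-class core, not measurements from this study.

// Back-of-the-envelope bound on per-core memory bandwidth (illustrative values).
#include <algorithm>
#include <cstdio>

int main() {
  const double rob_entries        = 168.0;  // assumed ROB size for an Ivy Bridge-class core
  const double insts_per_llc_miss = 100.0;  // assumed: follows from a moderately high LLC hit rate
  const double miss_latency_ns    = 80.0;   // assumed DRAM access latency
  const double line_bytes         = 64.0;   // cache line size

  // The ROB holds at most rob_entries instructions, so at most this many
  // LLC misses can be in flight at once (memory-level parallelism).
  double mlp = std::max(1.0, rob_entries / insts_per_llc_miss);

  // Little's Law: throughput = outstanding requests / latency.
  double bytes_per_ns = mlp * line_bytes / miss_latency_ns;   // bytes/ns == GB/s
  std::printf("~%.2f outstanding misses -> ~%.2f GB/s per core\n", mlp, bytes_per_ns);
  return 0;
}

Under these assumed numbers a core sustains only a little over 1 GB/s of demand misses, so even all cores together request far less than a dual-socket memory system can deliver, consistent with the underutilization noted above.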
We also confirm conventional wisdom that the most effi-
cient algorithms are often the hardest to parallelize, and that
these algorithms have their scaling hampered by load imbal-
ance, synchronization overheads, and non-uniform memory
access (NUMA) penalties. Additionally, we find that different
input graph sizes and topologies can lead to very different
conclusions for algorithms and architectures, so it is important
to consider a range of input graphs in any analysis.
Based on our empirical results, we make recommendations
for future work in both hardware and software to improve
graph algorithm performance.

II. GRAPH BACKGROUND
Graph applications are characterized not only by the algo-
rithms used, but also by the structure of the graphs that make
up their workload. A graph’s diameter is the largest number
of vertices that must be traversed in order to travel from one
vertex to another when paths that backtrack, detour, or loop
are excluded from consideration. The degree of a node in a
graph is the number of connections it has to other nodes, and
the degree distribution is the probability distribution of these
degrees over the whole graph.
Commonly used graphs can be divided into two broad
categories named for their most emblematic members: meshes
and social networks [7]. Meshes tend to be derived from
physically spatial sources, such as road maps or the finite-
element mesh of a simulated car body, so they can be relatively
readily partitioned along the few original spatial dimensions.
Due to their physical origin, they usually have a high diameter
and a degree distribution that is both bounded and low.
Conversely, social networks come from non-spatial sources,
and consequently are difficult to partition using any reasonable
number of dimensions. Additionally, social networks have a
low diameter (“small-world”) and a power-law degree distri-
bution (“scale-free”). In a small-world graph, most nodes are
not neighbors of one another, but most nodes can be reached
from every other by a small number of hops [42]. A scale-
free graph has a degree distribution that follows a power law,
at least asymptotically [5]. The fraction of nodes in a scale-
free graph having k connections to other nodes is P(k) ∼ k^(−γ),
where γ is a parameter typically in the range 2 < γ < 3.
Meshes are perhaps the most common mental model for
graphs, since they are typically used in textbook figures.
Unfortunately, they do not capture the challenges posed by
the small-world and scale-free properties of social network
topologies. The small-world property makes them difficult to
partition (no partition yields few cut edges relative to enclosed edges), while the
scale-free property makes it difficult to load balance a parallel
execution since there can be many orders of magnitude dif-
ference between the work for different vertices. Although the
highest degree vertices are rare, their incident edges constitute
a large fraction of the graph.
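To make this imbalance concrete, the sketch below draws an illustrative power-law degree sequence (γ = 2.5, a value in the typical range; the continuous Pareto approximation and the vertex count are assumptions, not the generators used in this study) and reports how much of the total edge work lands on the top 1% of vertices.

// Illustration of scale-free skew: the share of edge endpoints held by the top 1% of vertices.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const double gamma = 2.5;        // assumed exponent in the typical 2..3 range
  const int n = 1000000;           // assumed vertex count
  std::mt19937_64 rng(42);
  std::uniform_real_distribution<double> uni(0.0, 1.0);

  // Continuous (Pareto) approximation of P(k) ~ k^(-gamma) with minimum degree 1,
  // sampled by inverting the CDF.
  std::vector<double> degree(n);
  for (double &d : degree)
    d = std::pow(1.0 - uni(rng), -1.0 / (gamma - 1.0));

  std::sort(degree.begin(), degree.end(), std::greater<double>());
  double total = std::accumulate(degree.begin(), degree.end(), 0.0);
  double top1  = std::accumulate(degree.begin(), degree.begin() + n / 100, 0.0);
  std::printf("top 1%% of vertices hold %.0f%% of all edge endpoints\n",
              100.0 * top1 / total);
  return 0;
}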
III. METHODOLOGY
To provide a representative graph workload, we chose five
popular graph kernels and exercised them with five different
input graphs using three high-performance graph codebases
running on a modern high-end server. Unless otherwise stated,
we measure the full workload of all combinations of code-
bases, kernels, and graphs (75 data points) for each system
configuration.
A. Graph Kernels and Input Graphs
We selected five graph kernels based on their popularity in
applications as well as in other research papers:
1) Breadth-First Search (BFS) is actually only a traver-
sal order, but it is so fundamental to graph algorithms
that we include it in our suite. We convert BFS into a
kernel by tracking the parent vertex in the traversal (a sketch of such a kernel appears after this list).
2) Single-Source Shortest Paths (SSSP) computes the
distance to all reachable vertices from a start vertex.
3) PageRank (PR) is a way of determining influence
within a graph, and was initially used to rank search
results [38].
4) Connected Components (CC) attaches the same
label to all vertices in the same connected component.
5) Betweenness Centrality (BC) is commonly used
in social network analysis to measure the influence
a vertex has on a graph. A vertex’s betweenness-
centrality score is related to the fraction of shortest
paths between all vertices that pass through the ver-
tex. To keep runtimes tractable, our BC benchmark
starts from only a few root vertices instead of every
vertex.
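As a concrete illustration of the first kernel, a minimal top-down BFS that records each vertex's parent might look like the sketch below, written in the plain OpenMP/C++ style these codebases favor; the CSR layout and all names are simplified assumptions, not code taken from Galois, Ligra, or GAPBS.

// Minimal top-down BFS recording each vertex's parent (simplified CSR sketch).
#include <atomic>
#include <cstdint>
#include <vector>

struct CSRGraph {                        // compressed sparse row adjacency
  std::vector<int64_t> offsets;          // offsets[v]..offsets[v+1] bound v's neighbor range
  std::vector<int64_t> neighbors;
  int64_t num_vertices() const { return (int64_t)offsets.size() - 1; }
};

std::vector<int64_t> BFSParents(const CSRGraph &g, int64_t source) {
  std::vector<std::atomic<int64_t>> parent(g.num_vertices());
  for (auto &p : parent) p.store(-1);    // -1 marks unvisited
  parent[source] = source;

  std::vector<int64_t> frontier{source};
  while (!frontier.empty()) {
    std::vector<int64_t> next;
    #pragma omp parallel
    {
      std::vector<int64_t> local_next;   // per-thread buffer, merged below
      #pragma omp for nowait
      for (int64_t i = 0; i < (int64_t)frontier.size(); i++) {
        int64_t u = frontier[i];
        for (int64_t e = g.offsets[u]; e < g.offsets[u + 1]; e++) {
          int64_t v = g.neighbors[e];
          int64_t unvisited = -1;
          // Claim v atomically so exactly one thread records its parent.
          if (parent[v].compare_exchange_strong(unvisited, u))
            local_next.push_back(v);
        }
      }
      #pragma omp critical
      next.insert(next.end(), local_next.begin(), local_next.end());
    }
    frontier.swap(next);
  }
  return std::vector<int64_t>(parent.begin(), parent.end());
}

A tuned implementation would additionally switch traversal direction (push versus pull) and use bitmap frontiers on large frontiers, which is the kind of optimization that makes the measured codes fast on low-diameter graphs.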
We selected the input graphs used in our evaluation to be
topologically diverse and Table I lists them. Twitter, road, and
web are all from real-world data, while kron and uniform are
synthetic. Twitter, web, and kron all have the “social network”
topology, as they have low effective diameters and a power-law
degree distribution. Road is an example of a “mesh” topology,
with its high diameter, low average degree, and low maximum
degree. Even though our graph suite includes some of the
largest publicly available real-world graphs, they do not fully
use the memory capacity of our system. As is done in the
Graph500 benchmark, we generate arbitrarily large synthetic
graphs to fill our memory capacity. Our parameters for kron
are chosen to match those of Graph500 [21]. Uniform is low
diameter, like a social network, but its degree distribution is
normal rather than a power law. Hence, in our uniform graph
each vertex tends to be accessed roughly the same number
of times, unlike social networks where a few vertices are
accessed disproportionately often. Uniform represents the most
adversarial graph since by design it has no locality; however, it
is also the most unrealistic and serves as a lower bound on
performance.
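For reference, the uniform graph's construction amounts to drawing both endpoints of every edge uniformly at random, which gives every vertex the same expected degree. A minimal sketch follows; the vertex count and average degree are parameters, not the settings used in this study.

// Sketch of a uniform random edge list: both endpoints drawn uniformly at random,
// so every vertex has the same expected degree (no power-law skew).
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

std::vector<std::pair<int64_t, int64_t>> MakeUniformEdgeList(int64_t num_vertices,
                                                             int64_t avg_degree,
                                                             uint64_t seed = 1) {
  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<int64_t> pick(0, num_vertices - 1);
  int64_t num_edges = num_vertices * avg_degree / 2;   // undirected edge count
  std::vector<std::pair<int64_t, int64_t>> edges;
  edges.reserve(num_edges);
  for (int64_t e = 0; e < num_edges; e++)
    edges.emplace_back(pick(rng), pick(rng));
  return edges;
}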
B. Graph Processing Frameworks
For this study, we use three of the fastest available graph
codebases, which each use a different parallel runtime.
Galois [36] uses its own custom parallel runtime specifi-
cally designed to handle irregular fine-grained task parallelism.
Algorithms implemented in Galois are free to use autonomous
scheduling (no synchronization barriers), which should re-
duce the synchronization otherwise needed for high-diameter
graphs. Additionally, Galois’ scheduler takes into consideration
the platform's core and socket topology.
Ligra [40] uses the Cilk [8] parallel runtime and is built on
the abstractions of edge maps and vertex maps. When applying
these map functions, Ligra uses heuristics to determine in
which direction to apply them (push or pull) and what data
structures to use (sparse or dense). These optimizations make
Ligra especially well suited for low-diameter graphs.
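The push/pull distinction can be seen in the two loop structures below; this is a generic sketch of the idea with hypothetical array names, not Ligra's actual edgeMap implementation.

// Two directions for propagating a frontier over the same CSR graph.
// Push: each frontier vertex writes to its out-neighbors (needs atomics when parallel).
// Pull: each unvisited vertex scans its in-neighbors for a frontier member and can
//       stop at the first hit, which is cheaper once most of the graph is active.
#include <cstdint>
#include <vector>

void PushStep(const std::vector<int64_t> &offsets, const std::vector<int64_t> &neigh,
              const std::vector<bool> &in_frontier, std::vector<bool> &next_frontier,
              std::vector<int64_t> &parent) {
  for (int64_t u = 0; u < (int64_t)in_frontier.size(); u++) {
    if (!in_frontier[u]) continue;
    for (int64_t e = offsets[u]; e < offsets[u + 1]; e++) {
      int64_t v = neigh[e];
      if (parent[v] == -1) { parent[v] = u; next_frontier[v] = true; }
    }
  }
}

void PullStep(const std::vector<int64_t> &offsets, const std::vector<int64_t> &neigh,
              const std::vector<bool> &in_frontier, std::vector<bool> &next_frontier,
              std::vector<int64_t> &parent) {
  for (int64_t v = 0; v < (int64_t)parent.size(); v++) {
    if (parent[v] != -1) continue;                     // already visited
    for (int64_t e = offsets[v]; e < offsets[v + 1]; e++) {
      int64_t u = neigh[e];
      if (in_frontier[u]) { parent[v] = u; next_frontier[v] = true; break; }
    }
  }
}

Choosing between the two based on frontier size is what lets a direction-optimizing traversal avoid examining most edges on low-diameter graphs.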
GAP Benchmark Suite (GAPBS) [6], [19] is a collec-
tion of high-performance implementations written directly in
OpenMP with C++11. GAPBS is not a framework, so it does
not force common abstractions onto all implementations, but
instead frees each to do whatever is appropriate for a given
algorithm.
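As an example of this direct OpenMP/C++11 style, a single pull-based PageRank iteration might look like the sketch below; the array names, schedule, and damping constant are illustrative assumptions rather than code from the benchmark suite.

// One pull-based PageRank iteration in plain OpenMP/C++11 (illustrative sketch).
#include <cstdint>
#include <vector>

void PageRankIteration(const std::vector<int64_t> &in_offsets,
                       const std::vector<int64_t> &in_neigh,
                       const std::vector<int64_t> &out_degree,
                       const std::vector<double> &rank_old,
                       std::vector<double> &rank_new) {
  const double kDamp = 0.85;                        // assumed damping factor
  const int64_t n = (int64_t)rank_old.size();
  const double base = (1.0 - kDamp) / n;

  #pragma omp parallel for schedule(dynamic, 64)    // dynamic chunks help with skewed degrees
  for (int64_t v = 0; v < n; v++) {
    double incoming = 0.0;
    for (int64_t e = in_offsets[v]; e < in_offsets[v + 1]; e++) {
      int64_t u = in_neigh[e];
      incoming += rank_old[u] / out_degree[u];      // contribution from in-neighbor u
    }
    rank_new[v] = base + kDamp * incoming;
  }
}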