Graph Stream Algorithms - A Survey
Andrew McGregor†
University of Massachusetts
[email protected]
Table 1: Single-Pass, Semi-Streaming Results: Algorithms use O(n polylog n) space unless noted otherwise.
Results for approximating the frequency of subgraphs discussed in Section 2.3.
m to denote the number of nodes and edges in the graph under consideration. For any natural number $k$, we use $[k]$ to denote the set $\{1, 2, \ldots, k\}$. We write $a = b \pm c$ to denote $b - c \le a \le b + c$. Many of the algorithms are randomized and we refer to events occurring with high probability if the probability of the event is at least $1 - 1/\mathrm{poly}(n)$. We use $\tilde{O}(\cdot)$ to indicate that logarithmic factors have been omitted.

2. INSERT-ONLY STREAMS
In this section, we consider streams consisting of a sequence of unordered pairs $e = \{u, v\}$ where $u, v \in [n]$. Such a stream,

$$S = \langle e_1, e_2, \ldots, e_m \rangle$$

naturally defines an undirected graph $G = (V, E)$ where $V = [n]$ and $E = \{e_1, \ldots, e_m\}$. See Figure 1. For simplicity, we will assume that all stream elements are distinct and therefore the resulting graph is not a multigraph.¹ We will also consider weighted graphs where now each element of the stream, $(e, w(e))$, defines both an edge of the graph and its weight.

Figure 1: The graph on four nodes defined by the stream $S = \langle \{1, 2\}, \{2, 3\}, \{1, 3\}, \{3, 4\} \rangle$. It will be used to illustrate various definitions in later sections.

¹ Although many of the algorithms discussed immediately extend to the multigraph setting, other problems such as estimating the number of triangles or distinct paths of length two require new ideas when edges have multiplicity [21, 38].

2.1 Connectivity, Trees, and Spanners
One of the early motivations for considering the semi-streaming model is that $\tilde{\Theta}(n)$ space is necessary and sufficient to determine whether a graph is connected. The sufficiency follows from the following simple algorithm that constructs a spanning forest: we maintain a set of edges $H$ and add the next edge in the stream $\{u, v\}$ to $H$ if there is currently no path from $u$ to $v$ in $H$.

Spanners. A simple extension of the above algorithm also allows us to approximate the distance between any two nodes by constructing a spanner.

DEFINITION 1 (SPANNER). Given a graph $G$, we say that a subgraph $H$ is an $\alpha$-spanner for $G$ if for all $u, v \in V$,

$$d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v),$$

where $d_G(\cdot, \cdot)$ and $d_H(\cdot, \cdot)$ are lengths of the shortest paths in $G$ and $H$ respectively.

While the connectivity algorithm only added an edge if it did not complete a cycle, the algorithm for constructing a spanner will add an edge if it does not complete a short cycle.

Algorithm 1: Spanner
    H ← ∅
    for each {u, v} ∈ S do
        if d_H(u, v) > α then H ← H ∪ {{u, v}}
    return H

The fact that the resulting graph is an $\alpha$-spanner follows because for each edge $(u, v) \in G \setminus H$, there must
have already been a path of length at most $\alpha$ in $H$. Hence, for any path in $G$ of length $d$, including a shortest path, there is a corresponding path of length at most $\alpha d$ in $H$.

The algorithm needs to store at most $O(n^{1+1/t})$ edges when $\alpha = 2t - 1$ for integral $t$. This follows because the shortest cycle in $H$ has length at least $2t + 1$ and any such graph has at most $O(n^{1+1/t})$ edges [15]. A naive implementation of the above algorithm would be slow and more recent work has focused on developing faster algorithms [11, 23]. Other work [24] has considered constructing $(\alpha, \beta)$-spanners where $H$ is required to satisfy

$$d_G(u, v) \le d_H(u, v) \le \alpha \cdot d_G(u, v) + \beta.$$

Minimum Spanning Tree. Another generalization of the basic connectivity algorithm is to maintain a minimum spanning tree (or spanning forest if the graph is not connected).

Algorithm 2: Minimum Spanning Tree
    H ← ∅
    for each {u, v} ∈ S do
        H ← H ∪ {{u, v}}
        if H includes a cycle, remove the largest weight edge in the cycle from H
    return H

By using the appropriate data structures, the above algorithm can be implemented such that each update takes $O(\log n)$ time [60].

2.2 Graph Sparsification
We next consider constructing graph sparsifiers in the data stream model. Rather than just determining whether a graph is connected, these sparsifiers will allow us to estimate a richer set of connectivity properties such as the size of all cuts in the graph. We will be interested in different types of sparsifier. First, Benczúr and Karger [14] introduced the notion of cut sparsification.

DEFINITION 2 (CUT SPARSIFICATION). We say that a weighted subgraph $H$ is a $(1 + \epsilon)$ cut sparsification of a graph $G$ if

$$\lambda_A(H) = (1 \pm \epsilon)\,\lambda_A(G), \quad \forall A \subset V, \qquad (1)$$

where $\lambda_A(G)$ and $\lambda_A(H)$ are the weights of the cut $(A, V \setminus A)$ in $G$ and $H$ respectively.

Spielman and Teng [59] introduced the more general notion of spectral sparsification based on approximating the Laplacian of a graph.

DEFINITION 3 (LAPLACIAN). The Laplacian of an undirected weighted graph $H = (V, E, w)$ is a matrix $L_H \in \mathbb{R}^{n \times n}$ where

$$L_H(i, j) = \begin{cases} \sum_{\{i,k\} \in E} w(i, k) & \text{if } i = j \\ -w(i, j) & \text{otherwise} \end{cases}$$

and $w(i, j)$ is the weight of the edge between nodes $i$ and $j$. If there is no such edge, let $w(i, j) = 0$.

DEFINITION 4 (SPECTRAL SPARSIFICATION). We say that a weighted subgraph $H$ is a $(1 + \epsilon)$ spectral sparsification of a graph $G$ if,

$$x^T L_H x = (1 \pm \epsilon)\, x^T L_G x, \quad \forall x \in \mathbb{R}^n, \qquad (2)$$

where $L_G$ and $L_H$ are the Laplacians of $G$ and $H$.

Note that if we replace $\forall x \in \mathbb{R}^n$ in Equation 2 by $\forall x \in \{0, 1\}^n$ then we recover Equation 1. Hence, given a spectral sparsification of $G$, we can approximate the weight of all cuts in $G$. We can also approximate other "spectral properties" of $G$ including the eigenvalues (via the Courant-Fischer Theorem), the effective resistances in the analogous electrical network, and various properties of random walks. Obviously, any graph $G$ has a spectral sparsifier since $G$ is a spectral sparsifier of itself. What is surprising is that there exists a $(1 + \epsilon)$ spectral sparsifier with at most $O(\epsilon^{-2} n)$ edges [12].

A Simple "Merge and Reduce" Approach. Not only do small spectral sparsifiers exist but they can also be constructed in the semi-streaming model [2, 46]. In this section, we present a simple algorithm that demonstrates the "merge and reduce" framework that has been useful for other data stream problems [8].

The following algorithm uses, as a black box, any existing algorithm that returns a $(1 + \gamma)$ spectral sparsifier. Let $A$ be such an algorithm and let $\mathrm{size}(\gamma)$ be an upper bound on the number of edges in the resulting sparsifier. As mentioned above, we may assume that $\mathrm{size}(\gamma) = O(\gamma^{-2} n)$. We will also use the following easily verifiable properties of a spectral sparsifier:

• Mergeable: Suppose $H_1$ and $H_2$ are $\alpha$ spectral sparsifiers of two graphs $G_1$ and $G_2$ on the same set of nodes. Then $H_1 \cup H_2$ is an $\alpha$ spectral sparsifier of $G_1 \cup G_2$.

• Composable: If $H_3$ is an $\alpha$ spectral sparsifier for $H_2$ and $H_2$ is a $\beta$ spectral sparsifier for $H_1$ then $H_3$ is an $\alpha\beta$ spectral sparsifier for $H_1$.

The algorithm is based on a hierarchical partitioning of the stream. First we partition the input stream of edges into $t = m/\mathrm{size}(\gamma)$ segments of length $\mathrm{size}(\gamma)$. For simplicity assume that $t$ is a power of two. Let $G_i^0$ be the graph corresponding to the $i$-th segment of edges. For $j \in \{1, \ldots, \log_2 t\}$ and $i \in \{1, \ldots, t/2^j\}$, define

$$G_i^j = G_{2i-1}^{j-1} \cup G_{2i}^{j-1}.$$
For example, if $t = 4$, we have:

$$G_1^1 = G_1^0 \cup G_2^0, \qquad G_2^1 = G_3^0 \cup G_4^0,$$
$$G_1^2 = G_1^1 \cup G_2^1 = G_1^0 \cup G_2^0 \cup G_3^0 \cup G_4^0 = G.$$

For each $G_i^j$, we define a weighted subgraph $H_i^j$ using the sparsification algorithm $A$ as follows:

$$H_i^0 = G_i^0 \quad \text{and} \quad H_i^j = A(H_{2i-1}^{j-1} \cup H_{2i}^{j-1}) \text{ for } j > 0.$$

It follows from the mergeable and composable properties that $H_1^{\log_2 t}$ is a $(1 + \gamma)^{\log_2 t}$ sparsifier of $G$. If we set $\gamma = \epsilon/(2 \log_2 t)$ then this is a $(1 + \epsilon)$ sparsifier. Furthermore, it is possible to compute $H_1^{\log_2 t}$ while only storing at most

$$2\,\mathrm{size}(\gamma) \log_2 t = O(\epsilon^{-2} n \log^3 n)$$

edges at any given time. This is because, as soon as we have constructed $H_i^j$, we can forget $H_{2i-1}^{j-1}$ and $H_{2i}^{j-1}$. Hence, at any given time we will only need to store $H_i^j$ for at most two values of $i$ for each $j$.

2.3 Counting Subgraphs
Another problem that has received significant attention is counting the number of triangles, $T_3$, in a graph. This is closely related to the transitivity coefficient, the fraction of paths of length two that form a triangle, and the clustering coefficient, i.e.,

$$\frac{1}{n} \sum_v \frac{T_3(v)}{\binom{\deg(v)}{2}}$$

where $T_3(v)$ is the number of triangles that include the node $v$. Both statistics play an important role in the analysis of social networks. Unfortunately, it can be shown that determining whether a graph is triangle-free requires $\Omega(n^2)$ space even with a constant number of passes and more generally, $\Omega(m/T_3)$ space is required for any constant approximation [17]. Hence, research has focused on designing algorithms whose space will depend on a given lower bound $t \le T_3$.

Vector-Based Approach. A number of approaches for estimating the number of triangles have been based on reducing the problem to a problem about vectors. Consider a vector $x$ indexed by subsets $T$ of $[n]$ of size three. Each $T$ represents a triple of nodes and the entry corresponding to $T$ is defined to be,

$$x_T = |\{e \in S : e \subset T\}|.$$

For example, in the stream $S$ corresponding to the graph in Figure 1, the entries of the vector are:

$$x_{\{1,2,3\}} = 3, \quad x_{\{1,2,4\}} = 1, \quad x_{\{1,3,4\}} = 2, \quad x_{\{2,3,4\}} = 2.$$

Note that the number of triangles $T_3$ in $G$ equals the number of entries of $x$ that equal 3. Bar-Yossef et al. [10] presented an algorithm based on the following relationship between $T_3$ and the frequency moments of $x$, i.e., $F_k = \sum_T x_T^k$.

LEMMA 2.1. $T_3 = F_0 - 1.5 F_1 + 0.5 F_2$.

Let $\tilde{T}_3$ be the estimate of $T_3$ that results by combining $(1 + \gamma)$-approximations of the relevant frequency moments with the above lemma. Then,

$$|\tilde{T}_3 - T_3| < \gamma (F_0 + 1.5 F_1 + 0.5 F_2) \le 8 \gamma m n$$

where the last inequality follows since

$$\max(F_0, F_2/9) \le F_1 = m(n - 2).$$

It is possible to $(1 + \gamma)$-approximate each of these frequency moments in $\tilde{O}(\gamma^{-2})$ space and so, by setting $\gamma = \epsilon t/(8mn)$, this implies a $(1 + \epsilon)$-approximation algorithm using $\tilde{O}(\epsilon^{-2} (mn/t)^2)$ space.

A more space-efficient approach, proposed by Ahn et al. [6], is to use the $\ell_0$ sampling technique [40]. An algorithm for $\ell_0$ sampling uses $O(\mathrm{polylog}\, n)$ space and returns a random non-zero element from $x$. Let $X \in \{1, 2, 3\}$ be determined by picking a random non-zero element of $x$ and returning the associated value of this element. Let $Y = 1$ if $X = 3$ and $Y = 0$ otherwise. Note that $E[Y] = T_3/F_0$. By an application of the Chernoff bound, the mean of $\tilde{O}(\epsilon^{-2} (mn/t))$ independent copies of $Y$ equals $(1 \pm \epsilon) T_3/F_0$ with high probability. Multiplying this by an approximation of $F_0$ yields a good estimate of $T_3$. Note that an earlier algorithm using similar space was presented by Buriol et al. [18] but the above algorithm has the advantage that it is also applicable in the setting (discussed in a later section) where edges can be inserted and deleted.

Extensions and Other Approaches. Pavan et al. [54] developed the approach of Buriol et al. such that the dependence on $n$ in the space used became a dependence on the maximum degree in the graph, and a tighter analysis is possible. Pagh and Tsourakakis [53] presented an algorithm based on randomly coloring the nodes and counting the number of monochromatic triangles exactly. Algorithms have also been developed for the multi-pass model including a two-pass algorithm using $\tilde{O}(m/t^{1/3})$ space [17] and an $O(\log n)$-pass semi-streaming algorithm [13]. Kutzkov and Pagh [49] and Jha et al. [37] also designed algorithms for estimating the clustering and transitivity coefficients directly.

Extending an approach used by Jowhari and Ghodsi [39], another line of work [39, 41, 50] makes clever use of complex-valued hash functions for counting longer cycles and other subgraphs. Lower bounds for finding short cycles were proved by Feigenbaum et al. [28]. Other related work includes approximating the size of cliques [35], independent sets [34], and dense components [9].
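As a sanity check on the vector-based approach of this section, the following minimal Python sketch builds the vector $x$ explicitly for the graph of Figure 1 and evaluates the identity of Lemma 2.1. It is an offline illustration only: it stores all of $x$, which a streaming algorithm would not do, and the helper names (`triple_vector`, `frequency_moment`, `triangles_from_moments`) are hypothetical, not taken from the cited papers.

```python
from itertools import combinations


def triple_vector(n, edges):
    """Return {T: x_T} over all 3-subsets T of [n]; x_T counts stream edges inside T."""
    edge_set = {frozenset(e) for e in edges}
    x = {}
    for T in combinations(range(1, n + 1), 3):
        x[T] = sum(1 for pair in combinations(T, 2) if frozenset(pair) in edge_set)
    return x


def frequency_moment(x, k):
    """F_k = sum_T x_T^k, with F_0 defined as the number of non-zero entries."""
    if k == 0:
        return sum(1 for v in x.values() if v != 0)
    return sum(v ** k for v in x.values())


def triangles_from_moments(x):
    """Evaluate T_3 = F_0 - 1.5 F_1 + 0.5 F_2 (Lemma 2.1) from exact moments."""
    f0, f1, f2 = (frequency_moment(x, k) for k in range(3))
    return f0 - 1.5 * f1 + 0.5 * f2


# Stream of Figure 1: <{1,2}, {2,3}, {1,3}, {3,4}> on n = 4 nodes.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
x = triple_vector(4, edges)
t3 = triangles_from_moments(x)
```

Here $F_0 = 4$, $F_1 = 8 = m(n-2)$ and $F_2 = 18$, so the lemma gives $4 - 12 + 9 = 1$, matching the single triangle $\{1, 2, 3\}$. In the streaming setting each moment would be replaced by a small-space approximation.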
2.4 Matchings
A matching in a graph $G = (V, E)$ is a subset of edges $M \subseteq E$ such that no two edges in $M$ share an endpoint. Well-studied problems include computing the matching of maximum cardinality or maximum total weight.

Greedy Single-Pass Algorithms. A simple algorithm that returns a 2-approximation for the unweighted problem is the following greedy algorithm.

Algorithm 3: Greedy Matching
    M ← ∅
    for each e ∈ S do
        if M ∪ {e} is a matching, M ← M ∪ {e}
    return M

The fact that the algorithm returns a 2-approximation follows from the fact that for every edge $\{u, v\}$ in a maximum cardinality matching, $M$ must include an edge with at least one of $u$ or $v$ as an endpoint. At present this is the best approximation known for the problem! The strongest known lower bound is $e/(e - 1) \approx 1.58$ which also applies when edges are grouped by endpoint [30, 42]. Konrad et al. [47] considered a relaxation of the problem where the edges arrive in a random order and, in this setting, they designed an algorithm that achieved a 1.98-approximation in expectation.

The greedy algorithm can easily be generalized to the weighted case as follows [27, 51]. Rather than only adding an edge if there are no "conflicting" edges, we also add the edge if its weight is at least some factor larger than the weight of the (at most two) conflicting edges and then remove these conflicting edges.

Algorithm 4: Greedy Weighted Matching
    M ← ∅
    for each e ∈ S do
        let C = {e′ ∈ M : e′ ∩ e ≠ ∅}
        if w(e)/w(C) ≥ 1 + γ then M ← (M ∪ {e}) \ C
    return M

It would be reasonable to ask why we shouldn't add $e$ if $w(e) \ge w(C)$, i.e., set $\gamma = 0$. However, consider what would happen if the stream consisted of edges

$$\{1, 2\}, \{2, 3\}, \{3, 4\}, \ldots, \{n - 1, n\}$$

arriving in that order where the weight of edge $\{i, i + 1\}$ is $1 + i\epsilon$ for some small value $\epsilon > 0$. The above algorithm would return the last edge with a total weight of $1 + (n - 1)\epsilon$ whereas the optimal solution has weight

$$1 + (n - 1)\epsilon + 1 + (n - 3)\epsilon + 1 + (n - 5)\epsilon + \ldots > \frac{n - 1}{2},$$

and hence decreasing $\epsilon$ makes the approximation factor arbitrarily large.

Roughly speaking, the problem with setting $\gamma = 0$ is that the weight of the "trail" of edges that are inserted into $M$ but subsequently removed can be much larger than the weight of the final edges in $M$. By setting $\gamma > 0$, we ensure the weights in this trail are geometrically increasing. Specifically, let $T_e = C_1 \cup C_2 \cup \ldots$ where $C_1$ is the set of edges removed when $e$ was added to $M$ and $C_{i+1}$ is the set of edges removed when an edge in $C_i$ was added to $M$. Then, it is easy to show that for any edge $e$, the total weight of edges in $T_e$ satisfies,

$$w(T_e) \le w(e)/\gamma.$$

By a careful charging scheme [51], the weight of the optimal solution can be bounded in terms of the weight of the final edges and the trails:

$$w(\mathrm{OPT}) \le (1 + \gamma) \sum_{e \in M} \left( w(T_e) + 2 w(e) \right).$$

Substituting in the bound on $w(T_e)$ gives the factor $(1 + \gamma)(1/\gamma + 2) = 3 + 2\gamma + 1/\gamma$, which is minimized at $\gamma = 1/\sqrt{2}$ and yields a $3 + 2\sqrt{2} \approx 5.828$-approximation. The analysis can be extended to sub-modular maximization problems [20].

The above algorithm is optimal among deterministic algorithms that only remember the edges of a valid matching at any time [61]. However, after a sequence of results [25, 26, 62] it is now known how to achieve a 4.91-approximation.

Multiple-Pass Algorithms. The above algorithm can be extended to a multiple-pass algorithm that achieves a $(2 + \epsilon)$-approximation for weighted matchings. We simply set $\gamma = O(\epsilon)$ and take $O(\epsilon^{-3})$ passes over the data where, at the start of a pass, $M$ is initialized to the matching returned at the end of the previous pass.

Guruswami and Onak showed that finding the size of the maximum cardinality matching exactly given $p$ passes requires $n^{1+\Omega(1/p)}/p^{O(1)}$ space. No exact algorithm is known with these parameters but it is possible to find an arbitrarily good approximation. Ahn and Guha [3, 4] showed that a $(1 + \epsilon)$-approximation is possible using $O(n^{1+1/p})$ space and $O(p/\epsilon)$ passes. They also show a similar result for weighted matching if the graph is bipartite. Their results are based on adapting linear programming techniques and a careful analysis of the intrinsic adaptivity of primal-dual algorithms. In the node arrival setting where edges are grouped by endpoint, Kapralov presented an algorithm that achieved a $1/(1 - 1/\sqrt{2\pi p} + o(1/p))$-approximation ratio given $p$ passes. This is achieved by a fractional load balancing approach.
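The greedy weighted matching rule of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation from the cited papers: the function name `greedy_weighted_matching` is hypothetical, the stream is assumed to be a list of `((u, v), weight)` pairs with non-negative weights, and conflicts are found by a linear scan over the current matching rather than with an efficient index.

```python
def greedy_weighted_matching(stream, gamma):
    """One pass of the greedy rule: accept e if w(e) >= (1 + gamma) * w(C).

    C is the set of edges of the current matching that share an endpoint
    with e; accepted edges evict their conflicts.
    """
    matching = {}  # frozenset({u, v}) -> weight
    for (u, v), w in stream:
        conflicts = [e for e in matching if u in e or v in e]
        conflict_weight = sum(matching[e] for e in conflicts)
        # With no conflicts, conflict_weight is 0 and the edge is always added.
        if w >= (1 + gamma) * conflict_weight:
            for e in conflicts:
                del matching[e]
            matching[frozenset((u, v))] = w
    return matching


# The bad instance from the text: a path whose weights grow by eps per edge.
eps = 0.01
path_stream = [((i, i + 1), 1 + i * eps) for i in range(1, 6)]

m_zero = greedy_weighted_matching(path_stream, gamma=0.0)  # keeps only the last edge
m_half = greedy_weighted_matching(path_stream, gamma=0.5)  # keeps every other edge
```

On this six-node path, `gamma=0.0` repeatedly replaces the single matched edge and ends with only `{5, 6}`, while `gamma=0.5` refuses the marginal swaps and ends with the three edges `{1, 2}`, `{3, 4}`, `{5, 6}`.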
2.5 Random Walks
A random walk in an unweighted graph from a node $u \in V$ is a random sequence of nodes $v_0, v_1, v_2, \ldots$ where $v_0 = u$ and $v_i$ is a random node from the set $\Gamma(v_{i-1})$, the neighbors of $v_{i-1}$. For any fixed positive integer $t$, we can consider the distribution of $v_t \in V$. Call this distribution $\mu_t(u)$.

In this section, we present a semi-streaming algorithm by Das Sarma et al. [56] that returns a sample from $\mu_t(u)$. Note that it is trivial to sample from $\mu_t(u)$ with $t$ passes; in the $i$-th pass we randomly select $v_i$ from the neighbors of the node $v_{i-1}$ determined in the previous pass. Das Sarma et al. show that it is possible to reduce the number of passes to $O(\sqrt{t})$. They also present algorithms that use less space at the expense of increasing the number of passes.

Algorithm. As noted above, it is trivial to perform length $t$ walks in $t$ passes. The main idea of the algorithm is to build up a length $t$ walk by "stitching" together short walks of length $\sqrt{t}$. Each of these short walks can be constructed in parallel in $\sqrt{t}$ passes and $O(n \log n)$ space. However, we will need to be careful to ensure that all the steps of the final walk are independent. Specifically, the algorithm starts as follows:

1. Let $T(v)$ be a node sampled from $\mu_{\sqrt{t}}(v)$.

2. Let $v = T^k(u) = T(\ldots T(T(u)) \ldots)$ where $k$ is the maximal value such that the nodes in

$$U = \{u, T(u), T^2(u), \ldots, T^{k-1}(u)\}$$

are all distinct and $k \le \sqrt{t}$.

The reason that we insist that the nodes in $U$ are distinct is because otherwise the next steps of the random walk will not be independent of the previous steps. So far we have generated a sample $v$ from $\mu_\ell(u)$ where $\ell = k\sqrt{t}$. We then enter the following loop:

3. While $\ell \le t - \sqrt{t}$:

   (a) If $v \notin U$, let $v \leftarrow T(v)$, $\ell \leftarrow \sqrt{t} + \ell$, $U \leftarrow U \cup \{v\}$.

   (b) Otherwise, sample $\sqrt{t}$ edges with replacement incident on each node in $U$. Find the maximal path from $v$ such that on the $i$-th visit to node $x$, we take the $i$-th edge that was sampled for node $x$. The path terminates either when a node in $U$ is visited more than $\sqrt{t}$ times or we reach a node that is not in $U$. Reset $v$ to be the final node of this path and increase $\ell$ by the length of the path. If we complete the length $t$ random walk during this process we may terminate at this point and return the current node.

4. Perform the remaining $O(\sqrt{t})$ steps of the walk using the trivial algorithm.

Analysis. First note that $|U|$ is never larger than $\sqrt{t}$ because $|U|$ is only incremented when $\ell$ increases by at least $\sqrt{t}$ and we know that $\ell \le t$. The total space required to store the vertices $T$ is $\tilde{O}(n)$. When we sample $\sqrt{t}$ edges incident on each node in $U$, this requires $\tilde{O}(|U|\sqrt{t}) = \tilde{O}(t)$ space. Hence, the total space is $\tilde{O}(n + t)$. For the number of passes, note that when we need to take a pass to sample edges incident on $U$, we make $\Omega(\sqrt{t})$ hops of progress because either we reach a node with an unused short walk or the walk uses $\Omega(\sqrt{t})$ sampled edges. Hence, including the $O(\sqrt{t})$ passes used at the start and end of the algorithm, the total number of passes is $O(\sqrt{t})$.

3. GRAPH SKETCHES
In this section, we consider dynamic graph streams where edges can be both added and removed. The input is a sequence

$$S = \langle a_1, a_2, \ldots \rangle \quad \text{where } a_i = (e_i, \Delta_i)$$

where $e_i$ encodes an undirected edge as before and $\Delta_i \in \{-1, 1\}$. The multiplicity of an edge $e$ is defined as $f_e = \sum_{i : e_i = e} \Delta_i$. For simplicity, we restrict our attention to the case where $f_e \in \{0, 1\}$ for all edges $e$.

Linear Sketches. An important type of data stream algorithms are linear sketches. Such algorithms maintain a random linear projection, or "sketch", of the input. We want to be able to a) infer relevant properties of the input from the sketch and b) maintain the sketch in small space. The second property follows from the linearity of the sketch if the dimensionality of the projection is small. Specifically, suppose

$$f \in \{0, 1\}^{\binom{n}{2}}$$

is the vector with entries equalling the current values of $f_e$ and let $A(f) \in \mathbb{R}^d$ be the sketch of this vector where we call $d$ the dimensionality of the sketch. Then, when $(e, \Delta)$ arrives we can simply update $A(f)$ as follows:

$$A(f) \leftarrow A(f) + \Delta \cdot A(i^e)$$

where $i^e$ is the vector whose only non-zero entry is a "1" in the position corresponding to $e$. Hence, it suffices to store the current sketch and any random bits needed to compute the projection. The main challenge is therefore to design low dimensional sketches.

Homomorphic Sketches. Many of the graph sketches that have been designed so far are built up from sketches of the rows of the adjacency matrix for the graph $G$. Specifically, let $f^v \in \{0, 1\}^{n-1}$ be the vector $f$ restricted
to coordinates that involve node $v$. Then, the sketches are formed by concatenating sketches of each $f^v$, i.e.,

$$A(f) = A_1(f^{v_1}) \circ A_2(f^{v_2}) \circ \ldots \circ A_n(f^{v_n}).$$

Note that the random projections for different $A_i$ need not be independent but that these sketches can still be updated as before.

The algorithms discussed in subsequent sections all fit the following template. First, we consider a basic algorithm for the graph problem in question. Second, we design sketches $A_i$ such that it is possible to emulate the basic algorithm given only the projections $A_i(f^{v_i})$. The challenge is to ensure that the sketches are homomorphic with respect to the operations of the basic algorithm, i.e., for each operation on the original graph, there is a corresponding operation on the sketches.

3.1 Connectivity
We start with a simple algorithm for finding a spanning forest of a graph and then show how to emulate this algorithm via sketches.

Basic Non-Sketch Algorithm. The algorithm is based on the following simple $O(\log n)$ stage process. In the first stage, we find an arbitrary incident edge for each node. We then collapse each of the resulting connected components into a "supernode". In each subsequent stage, we find an edge from every supernode to another supernode (if one exists) and collapse the connected components into new supernodes. It is not hard to argue that this process terminates after $O(\log n)$ stages and that the set of edges used to connect supernodes in the different stages includes a spanning forest of the graph. From this we can obviously deduce whether the graph is connected.

Emulation via Sketches. There are two main steps to constructing the sketches for the connectivity algorithm:

1. An Appropriate Graph Representation. For each node $v_i \in V$, define a vector $a^i \in \{-1, 0, 1\}^{\binom{n}{2}}$:

$$a^i_{\{j,k\}} = \begin{cases} 1 & \text{if } i = j < k \text{ and } \{v_j, v_k\} \in E \\ -1 & \text{if } j < k = i \text{ and } \{v_j, v_k\} \in E \\ 0 & \text{otherwise} \end{cases}$$

These vectors then have the useful property that for any subset of nodes $\{v_i\}_{i \in S}$, the non-zero entries of $\sum_{i \in S} a^i$ correspond exactly to the edges across the cut $(S, V \setminus S)$.

2. $\ell_0$-Sampling via Linear Sketches: As mentioned earlier, the goal of $\ell_0$-sampling is to take a non-zero vector $x \in \mathbb{R}^d$ and return a sample $j$ where

$$\Pr_r[\text{sample equals } j] = \begin{cases} \frac{1}{|F_0(x)|} & \text{if } x_j \ne 0 \\ 0 & \text{if } x_j = 0 \end{cases}.$$

A useful feature of existing work [40] on $\ell_0$ sampling is that it can be performed via linear projections, i.e., for any string $r$ there exists $M_r \in \mathbb{R}^{k \times d}$ such that the sample can be reconstructed from $M_r x$. For the process to be successful with constant probability, $k = O(\log^2 n)$ suffices. Consequently, given $M_r x$ and $M_r y$ we have enough information to determine a random sample from the set $\{i : x_i + y_i \ne 0\}$ since

$$M_r(x + y) = M_r x + M_r y.$$

For example, for the graph in Figure 1, we have

$$a^1 = (\ 1 \ \ 1 \ \ 0 \ \ 0 \ \ 0 \ \ 0\ )$$
$$a^2 = (\ -1 \ \ 0 \ \ 0 \ \ 1 \ \ 0 \ \ 0\ )$$
$$a^3 = (\ 0 \ \ -1 \ \ 0 \ \ -1 \ \ 0 \ \ 1\ )$$
$$a^4 = (\ 0 \ \ 0 \ \ 0 \ \ 0 \ \ 0 \ \ -1\ )$$

where the entries correspond to the sets $\{1, 2\}, \{1, 3\}, \{1, 4\}, \{2, 3\}, \{2, 4\}, \{3, 4\}$ in that order. Note that the non-zero entries of

$$a^1 + a^2 = (\ 0 \ \ 1 \ \ 0 \ \ 1 \ \ 0 \ \ 0\ )$$

correspond to $\{1, 3\}$ and $\{2, 3\}$ which are exactly the edges across the cut $(\{1, 2\}, \{3, 4\})$.

The resulting algorithm for connectivity is relatively simple but makes use of linearity in an essential way:

1. In a single pass, compute the sketches: Choose $t = O(\log n)$ random strings $r_1, \ldots, r_t$ and construct the $\ell_0$-sampling projections $M_{r_j} a^i$ for $i \in [n]$, $j \in [t]$. Then,

$$A_i(f^{v_i}) = M_{r_1} a^i \circ M_{r_2} a^i \circ \ldots \circ M_{r_t} a^i.$$

2. In post-processing, emulate the original algorithm:

   (a) Let $\hat{V} = V$ be the initial set of "supernodes".

   (b) For $j = 1, \ldots, t$: for each supernode $S \in \hat{V}$, use $\sum_{i \in S} M_{r_j} a^i = M_{r_j}(\sum_{i \in S} a^i)$ to sample an edge between $S$ and another supernode. Collapse the connected supernodes to form a new set of supernodes.

Since each sketch $A_i$ has dimension $O(\mathrm{polylog}\, n)$ and there are $n$ such sketches to be computed, the final connectivity algorithm uses $O(n \cdot \mathrm{polylog}\, n)$ space.

Extensions and Further Work. Note that the above algorithm has $O(\mathrm{polylog}\, n)$ update time but a connectivity query may take $\Omega(n)$ time. This was addressed in subsequent work by Kapron et al. [44].

An easy corollary of the above result is that it is also possible to test whether a graph is bipartite. This follows by running the connectivity algorithm on both $G$ and the bipartite double cover of $G$. The bipartite double cover of a graph is formed by making two
copies $u_1, u_2$ of every node $u$ of $G$ and adding edges $\{u_1, v_2\}, \{u_2, v_1\}$ for every edge $\{u, v\}$ of $G$. It can be shown that $G$ is bipartite iff the number of connected components in the double cover is exactly twice the number of connected components in $G$.

Since computing each spanning forest sketch used $O(n \cdot \mathrm{polylog}\, n)$ space, the total space used by the algorithm for $k$-connectivity is $O(k \cdot n \cdot \mathrm{polylog}\, n)$.

3.3 Min-Cut and Sparsification
In this section we revisit graph sparsification in the context of dynamic graphs. To do this we will need to discuss the offline algorithms for sparsification in more detail.

Sparsification via Sampling. The results in this section are based on the following generic sampling algorithm:

² Note that their result is actually proved for a slightly different sampling with replacement procedure.

Minimum Cut. As a warm-up, we show how to estimate the minimum cut $\lambda$ of a dynamic graph [6]. To do this we use the algorithm for constructing $k$-skeletons described in the previous section in conjunction with Karger's sampling result. In addition to computing a
skeleton on the entire graph, we also construct skeletons for subsampled graphs. Specifically, let $G_i$ be the graph formed from $G$ by including each edge with probability $1/2^i$ and let

$$H_i = \mathrm{skeleton}_k(G_i),$$

be a $k$-skeleton of $G_i$ where $k = 3 c_1 \epsilon^{-2} \log n$. Then, for

$$j = \min\{i : \mathrm{mincut}(H_i) < k\},$$

we claim that

$$2^j\, \mathrm{mincut}(H_j) = (1 \pm \epsilon)\lambda. \qquad (4)$$

For $i \le \lfloor \log_2 1/q \rfloor$, Karger's result implies that all cuts are approximately preserved and, in particular,

$$2^i \cdot \mathrm{mincut}(H_i) = (1 \pm \epsilon)\, \mathrm{mincut}(G_i).$$

However, for $i = \lfloor \log_2 1/q \rfloor$,

$$E[\mathrm{mincut}(H_i)] \le 2^{-i} \lambda \le 2 q \lambda \le 2 c_1 \epsilon^{-2} \log n$$

and hence, by an application of the Chernoff bound, we have that $\mathrm{mincut}(H_i) < k$ with high probability. Hence, $j \le \lfloor \log_2 1/q \rfloor$ with high probability and Equation 4 follows.

Sparsification. To construct a sparsifier, the basic idea is to sample edges with probability $q_e = \min\{1, t/\lambda_e\}$ for some value of $t$. If $t = \Theta(\epsilon^{-2} \log^2 n)$ then the resulting graph is a combinatorial sparsifier by appealing to the aforementioned result of Fung et al. [29]. If $t = \Theta(\epsilon^{-2} n^{2/3} \log n)$ then the resulting graph can be shown to be a spectral sparsifier by combining Equation 3 with the aforementioned sampling result of Spielman and Srivastava [58]. In this section, we briefly outline how to perform such sampling. We refer the reader to Ahn et al. [6, 7] for details regarding independence issues and how to reweight the edges.

The challenge is that we do not know the values of $\lambda_e$ ahead of time. To get around this we take a very similar approach to that used above for estimating the minimum cut. Specifically, let $G_i$ be defined as above and let $H_i = \mathrm{skeleton}_{3t}(G_i)$. For simplicity, we assume $\lambda_e \ge t$. We claim that

4. SLIDING WINDOW
In this section, we consider processing graphs in the sliding window model. In this model we consider an infinite stream of edges $\langle e_1, e_2, \ldots \rangle$ but at time $t$ we only consider the graph whose edge set consists of the last $w$ edges,

$$W = \{e_{t-w+1}, \ldots, e_t\}.$$

We call these the active edges and we will consider the case where $w \ge n$. The results in this section were proved by Crouch et al. [22]. Note that some of the sampling-based algorithms for counting small subgraphs are also applicable in this model.

4.1 Connectivity
We first consider testing whether the graph is $k$-edge connected for a given $k \in \{1, 2, 3, \ldots\}$. Note that $k = 1$ corresponds to testing connectivity. To do this, it is sufficient to maintain a set of edges $F \subseteq \{e_1, e_2, \ldots, e_t\}$ along with the time-of-arrival $\mathrm{toa}(e)$ for each $e \in F$ such that for any cut, $F$ contains the most recent $k$ edges across the cut (or all the edges across the cut if there are fewer than $k$ of them). Then, we can easily tell whether the graph of active edges is $k$-connected by checking whether $F$ would be $k$-connected once we remove all edges $e \in F$ where $\mathrm{toa}(e) \le t - w$. This follows because if there are $k$ or more edges among the last $w$ edges across a cut, $F$ will include the $k$ most recent of these edges.

The following simple algorithm maintains the set

$$F = F_1 \cup F_2 \cup \ldots \cup F_k$$

where the $F_i$ are disjoint and each is acyclic. We add the new edge $e$ to $F_1$. If it completes a cycle, we remove the oldest edge in this cycle and add that edge to $F_2$. If we now have a cycle in $F_2$, we remove the oldest edge in this cycle and add that edge to $F_3$. And so forth.

Therefore, it is possible to test $k$-connectivity in the sliding window model using $O(kn \log n)$ space. Furthermore, by reducing other problems to $k$-connectivity, as discussed in the previous sections, this also implies the existence of algorithms for testing bipartiteness and constructing sparsifiers.