
Linear Time Construction of Suffix Tree: Presented by Dr. Shazzad Hosain Asst. Prof. EECS, NSU

The document summarizes Ukkonen's linear-time algorithm for constructing suffix trees. It explains that the algorithm constructs an implicit suffix tree for each prefix of the input string S in phases. Each phase extends the tree for the next prefix by applying extension rules. The algorithm avoids quadratic time complexity by using suffix links to directly access internal nodes representing suffixes.

Uploaded by

Alimushwan Adnan

Linear Time Construction of Suffix Tree
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU
Suffix tree
S = xabxac
Suffixes and their starting positions:
xabxac 1
abxac 2
bxac 3
xac 4
ac 5
c 6
Suffix tree
S = xabxa
Suffixes and their starting positions:
xabxa 1
abxa 2
bxa 3
xa 4
a 5
[Figure: tree for the suffixes of xabxa]
Suffix tree (Example)
Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]
Trivial algorithm to build a Suffix tree
s = abab$
Insert the suffixes one at a time, longest first:
1. Put the longest suffix, abab$, in the tree. Suffixes so far: { abab$ }
2. Put the suffix bab$ in. Suffixes so far: { abab$, bab$ }
3. Put the suffix ab$ in. It shares the prefix ab with abab$, so that edge is split. Suffixes so far: { abab$, bab$, ab$ }
4. Put the suffix b$ in. It shares the prefix b with bab$, so that edge is split. Suffixes so far: { abab$, bab$, ab$, b$ }
5. Put the suffix $ in. Suffixes so far: { abab$, bab$, ab$, b$, $ }
[Figures: the tree after each insertion]

We will also label each leaf with the starting position of the corresponding suffix.
[Figure: suffix tree of abab$ with leaves labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$), 5 ($)]
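The insertion steps for abab$ can be sketched in Python. This is only a minimal illustration of the trivial algorithm, not code from the slides: the names Node, build_naive and leaves are hypothetical, edge labels are stored as plain strings for readability, and the input is assumed to end with a unique terminator such as $.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> [edge label, child Node]
        self.leaf = None     # 1-based start position of the suffix (leaves only)


def build_naive(s):
    """Insert the suffixes of s one by one, longest first.

    Assumes s ends with a terminator that occurs nowhere else in s.
    """
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            edge = node.children.get(rest[0])
            if edge is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start + 1
                node.children[rest[0]] = [rest, leaf]
                break
            label, child = edge
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = [label[k:], child]
            node.children[rest[0]] = [label[:k], mid]
            node, rest = mid, rest[k:]
    return root


def leaves(node, path=""):
    """Map each leaf number to the suffix spelled on the path from the root."""
    if node.leaf is not None:
        return {node.leaf: path}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found


tree = build_naive("abab$")
print(leaves(tree))
```

Each insertion may compare a number of characters proportional to the length of the suffix, which is where the quadratic total cost of this approach comes from.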
Naive Construction – More Example
S = abbcbab#
[Figure: suffix tree of abbcbab#, each leaf labeled with the starting position of its suffix]
Analysis
The naive construction takes O(n^2) time.

We will see how to do it in O(n) time.


Ukkonen’s linear-time Suffix Tree Algorithm

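• Implicit Suffix Tree

The implicit suffix tree of S is obtained from the suffix tree of S$ as follows:
1. Remove every copy of the terminal symbol $ from the edge labels of the tree
2. Then remove any edge that has no label
3. Then remove any node that does not have at least two children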
Implicit Suffix Tree – More Example
[Figure: suffix tree of abab$ with leaves labeled 1 to 5, from which the implicit suffix tree of abab is derived]

1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S
2. Let I_i denote the implicit suffix tree of the string S[1…i]
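To see which suffixes lose their leaf in an implicit suffix tree, the following Python sketch may help. It does not build a tree. It relies on the observation that a suffix keeps its own leaf exactly when it is not a prefix of another suffix, that is, when its only occurrence in S is at the very end. The function name suffixes_with_leaves is hypothetical.

```python
def suffixes_with_leaves(s):
    """1-based start positions of the suffixes of s that end at a leaf
    of the implicit suffix tree of s.

    A suffix ends at a leaf exactly when its first occurrence in s is
    the suffix itself, so it is never followed by another character.
    """
    return [j + 1 for j in range(len(s)) if s.find(s[j:]) == j]


print(suffixes_with_leaves("xabxa"))   # xa and a are prefixes of xabxa and abxa
print(suffixes_with_leaves("xabxac"))  # c occurs once, so every suffix has a leaf
```

The suffixes without a leaf are still spelled out along some path from the root, which is why the implicit tree encodes all of them.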
Ukkonen’s Algorithm at a High Level

• Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S.
• The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m)
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].
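The phase and extension structure can be enumerated directly. The sketch below only lists which suffix each extension has to handle and builds no tree. The generator name phases is hypothetical.

```python
def phases(s):
    """Yield (i, suffixes of S[1..i]) for i = 1..m, one entry per phase.

    The j-th suffix in each list is the string handled by extension j.
    """
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        yield i, [prefix[j:] for j in range(i)]


for i, suffixes in phases("abab"):
    print(i, suffixes)

# Phase i has i extensions, so there are m(m+1)/2 extensions in total.
total = sum(len(suffixes) for _, suffixes in phases("abab"))
print(total)
```

With m(m+1)/2 extensions, each of which may need to locate a string of length up to m from the root, a direct implementation costs O(m^3) time.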
Naïve Algorithm of Suffix Tree
[Figure: suffix tree of abab$ built by the naive algorithm, leaves labeled 1 to 5]
High-level of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i
• Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1…i+1].

Example phases, with one extension per listed suffix:
Phase 1: I_1 for S[1…1], suffixes {a}
Phase 2: I_2 for S[1…2], suffixes {ab, b}
Phase 3: I_3 for S[1…3], suffixes {aba, ba, a}

Implemented directly, this takes O(m^3) time.
Suffix Extension Rules
Suppose I_i has already been built and we want to extend it to I_{i+1}.

Rule 1: Let β = S[j…i] be a suffix of S[1…i]. If the path β ends at a leaf, character S(i+1) is added to the end of the label of that leaf edge.

Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.

Example, where each suffix of S[1…i+1] has the form βS(i+1):
Phase 1: I_1 for S[1…1], suffixes {a}
Phase 2: I_2 for S[1…2], suffixes {ab, b}
Phase 3: I_3 for S[1…3], suffixes {aba, ba, a}
Phase 4: I_4 for S[1…4], suffixes {abab, bab, ab, b}
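Suffix Extension Rules
Suppose I_i has already been built and we want to extend it to I_{i+1}.

Positions: 123456
S =        axabxb

Suppose I_5 has been drawn for axabx. Now extend it to I_6, one extension per suffix of axabxb:

axabxb  Rule 1
xabxb   Rule 1
abxb    Rule 1
bxb     Rule 1
xb      Rule 3
b       Rule 2

Done naively, the extensions over all phases take O(m^3) time.

Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new leaf edge labeled S(i+1) at the end of β, creating a new internal node there if β ends in the middle of an edge. (Gusfield's text numbers this case Rule 2 and the do-nothing case Rule 3.)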
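The rule that applies to each extension can be checked with plain substring tests. The sketch below uses the rule numbering of these slides (Rule 2 = do nothing, Rule 3 = add a leaf). It does not manipulate a tree. It assumes that β ends at a leaf of I_i exactly when the suffix occurrence is the only occurrence of β in S[1…i]. The function name classify_extensions is hypothetical.

```python
def classify_extensions(s, i):
    """Return the rule (1, 2 or 3, as numbered on the slides) used by each
    extension j = 1..i+1 of phase i+1, where i is 1-based and i < len(s)."""
    prefix = s[:i]      # S[1..i], the string represented by I_i
    c = s[i]            # S(i+1), the character being appended
    rules = []
    for j in range(1, i + 2):
        beta = prefix[j - 1:]               # S[j..i]; empty when j = i+1
        if beta and prefix.find(beta) == i - len(beta):
            rules.append(1)                 # beta ends at a leaf: extend the leaf edge
        elif beta + c in prefix:
            rules.append(2)                 # beta followed by c is already present
        else:
            rules.append(3)                 # beta continues, but not with c: new leaf
    return rules


print(classify_extensions("axabxb", 5))
```

For S = axabxb and i = 5 this reproduces the assignment in the example: Rule 1 for the four longest suffixes, Rule 3 for xb, and Rule 2 for b.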
Implementation and Speedup, Suffix Links
Definition: Let xα denote an arbitrary string, where x is a single character and α is a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

Does the root have a suffix link? No, because it is not an internal node.
Every internal node has a suffix link.
Suffix Links – More Example
[Figure: suffix tree of abbcbab#; internal node v with path-label ab has a suffix link to node s(v) with path-label b]
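The suffix links of a finished suffix tree can be listed without building the tree, which gives a quick check of the example. The Python sketch below assumes the string ends with a unique terminator. It uses the fact that the internal nodes other than the root then correspond to the substrings that are followed by at least two distinct characters. The function names are hypothetical.

```python
def internal_path_labels(s):
    """Path-labels of the internal nodes (root excluded) of the suffix tree of s.

    Assumes s ends with a unique terminator. A substring labels an internal
    node exactly when it is followed by at least two distinct characters.
    """
    followers = {}
    for start in range(len(s)):
        for end in range(start + 1, len(s)):
            # s[start:end] occurs here and is followed by s[end]
            followers.setdefault(s[start:end], set()).add(s[end])
    return {label for label, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map the path-label of each internal node v to the path-label of s(v).

    The empty string stands for the root.
    """
    return {label: label[1:] for label in internal_path_labels(s)}


links = suffix_links("abbcbab#")
print(links)
```

For abbcbab# the only internal nodes are ab and b, and the link from ab leads to b, as in the figure. Every link target is either the root or another internal node.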

Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension.

Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
Positions: 1234567890
S =        MISSISSIPI
Phase 1: M
Phase 2: MI
Phase 3: MIS
Phase 4: MISS
Phase 5: MISSI
Phase 6: MISSIS
Phase 7: MISSISS
Phase 8: MISSISSI
Phase 9: MISSISSIP
Phase 10: MISSISSIPI
[Figure: implicit suffix tree built phase by phase for MISSISSIPI, with leaves labeled 1 to 9]
How do suffix links help?
[Figure: the same phase-by-phase tree for MISSISSIPI]
What is achieved so far?

Not much yet. Even with suffix links, the worst-case running time is O(m^2) for a phase.
Trick1: Skip/Count Trick

There must be a path labeled γ from s(v).

Walking down along γ character by character takes time proportional to |γ|.

The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path.

Example (figure): γ = zabcdefghy, |γ| = 10, spread over edge segments of lengths 2, 2, 3 and 3.

But what does it buy in terms of worst-case bounds?
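A minimal Python sketch of the skip/count walk follows. It is an illustration under assumptions rather than the slides' implementation: edge labels are stored as strings (real implementations usually store index pairs), γ is assumed to be present below the starting node, and the names Node, add_edge and skip_count as well as the small hand-built path are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (edge label, child Node)


def add_edge(parent, label):
    """Attach a new child below parent via an edge with the given label."""
    child = Node()
    parent.children[label[0]] = (label, child)
    return child


def skip_count(node, gamma):
    """Walk down from node along gamma, which is assumed to exist below node.

    Only the first character of each edge is inspected; the rest of the
    edge is skipped by comparing lengths. Returns (last node reached,
    number of characters consumed on the next edge, number of node hops).
    """
    pos = 0
    hops = 0
    while pos < len(gamma):
        label, child = node.children[gamma[pos]]
        remaining = len(gamma) - pos
        if remaining < len(label):
            return node, remaining, hops    # gamma ends inside this edge
        pos += len(label)                   # skip the whole edge in one step
        node = child
        hops += 1
    return node, 0, hops


# Hypothetical path: edges za, bc, def, then a longer edge starting with ghy.
root = Node()
n1 = add_edge(root, "za")
n2 = add_edge(n1, "bc")
n3 = add_edge(n2, "def")
add_edge(n3, "ghyqq")

end_node, offset, hops = skip_count(root, "zabcdefghy")
print(offset, hops)
```

The work done is one dictionary lookup and one length comparison per edge, independent of how many characters each edge label holds.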


Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v).

[Figure: node-depths of v and s(v) for three suffix links: 2 and 1, 3 and 3, 4 and 5]

Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time.
In a single extension, the algorithm
– Walks up at most one edge
– Finds a suffix link and traverses it
– Walks down some number of nodes
– Applies the suffix extension rules
– And may add a suffix link

All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.
Bounding the down-walk with Lemma 6.1.2, in terms of the current node-depth:
– Walking up at most one edge decreases the current node-depth by at most one
– Traversing the suffix link decreases the node-depth by at most another one
– Each step of the down-walk moves to a node of greater node-depth
– Over the entire phase, the current node-depth is decremented at most 2m times
– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase
– So the total number of edge traversals is bounded by 3m
– Since each edge traversal takes constant time, all the down-walking in a phase is O(m)
Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)

Two more tricks and we are done.


Reference
• Chapter 6 of Algorithms on Strings, Trees and Sequences (Dan Gusfield)
