2023 Bioinformatics Algorithms Day 1
Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Phone: +41 31 631 45 19
Email: [email protected]
Day 1
• Introduction and Organization
• Definition of an algorithm
• Fibonacci numbers
• Performance of algorithms
• Travelling salesman problem
Slides, Exercises, etc.
• I will provide pdfs of slides, book chapters, R-scripts etc. on ilias
• If you have questions about the course, send me an email: [email protected]
Organization of the course
• Course language: English
• Written exam at the end of the course (English or German)
• Prerequisites:
• Basic math knowledge
• Basic programming knowledge
• Basic biology knowledge
Exam
• Written exam at the end of the semester
• 2 hours
• Date: TBA
• 1990 - 2000
• Sequencing the first human genome took ~10 years and cost ~$2.7 billion
• 2018
• Today, sequencing a genome costs ~$1,000✢ and a “run” takes <3 days✢
~18 Tb per “run” at maximum capacity
✢ on an Illumina HiSeq X Ten — the machine costs ~$10M and sample prep takes a little extra
time.
• Short “reads” (75–250 characters) when the texts we’re interested
in are 1,000s to 1,000,000s of characters long.
• Imperfect “reads” — sequencing produces infrequent but considerable “errors”
that modify, insert or delete one or more characters in the “read”
• Biased “reads” — as a result of the underlying chemistry & physics,
sampling is not perfectly uniform and random. Biases are not always
known.
«An algorithm is an effective method expressed as a finite list of well-defined instructions for
calculating a function. Starting from an initial state and initial input (perhaps empty), the
instructions describe a computation that, when executed, proceeds through a finite number of well-
defined successive states, eventually producing "output" and terminating at a final ending state.
The transition from one state to the next is not necessarily deterministic; some algorithms, known
as randomized algorithms, incorporate random input.»
Recipe
1) Take ingredients
2) Mix them together in a random way
3) Bake at a random temperature
4) If result is a cupcake: end
5) Else: go to 1
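The recipe above is, in spirit, a randomized algorithm: it keeps making random attempts until it produces the desired result. A minimal R sketch of the same idea (the target value and the sampling range are made up for illustration, not from the slides):

# Illustration: a "repeat until success" randomized algorithm
# in the style of the cupcake recipe.
random_search = function(target = 7)
{
  repeat
  {
    candidate = sample(1:10, 1)   # steps 1-3: produce a random attempt
    if(candidate == target)       # step 4: is the result a "cupcake"?
      return(candidate)           # yes: end
    # step 5: otherwise, go back to step 1 and try again
  }
}
random_search()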
Formal description: Mathematics/Statistics → Algorithm
[Slide background: excerpt from a paper appendix (probability generating functions of a branching process, Equations A1-A11 and B1, derivations of Equations 3 and 5), shown as an example of a formal mathematical description that has to be translated into an algorithm.]
Fibonacci numbers:
F0 = 0, F1 = 1
Fn = Fn-1 + Fn-2, for n ≥ 2
0,1,1,2,3,5,8,13,21,34,....
fib1 = function(n)
{
  # naive recursive implementation of the Fibonacci recurrence
  if(n < 2)
    return(n)                      # base cases: F0 = 0, F1 = 1
  return(fib1(n-2) + fib1(n-1))    # two recursive calls per invocation
}
> fib1(10)
[1] 55
1. Is it correct?
Answer: Yes
2. How much time does it take, as a function of n?
Answer: ?
3. Can we do better?
Answer: ?
Can we do better?
Why is fib1 so bad?
Homework exercise:
Can you find a more efficient way to calculate Fn?
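One way to see empirically why fib1 is so bad (this snippet is not from the slides) is to count how many times the function calls itself:

calls = 0
fib1_counted = function(n)
{
  calls <<- calls + 1                  # count every invocation
  if(n < 2)
    return(n)
  return(fib1_counted(n-2) + fib1_counted(n-1))
}
calls = 0; fib1_counted(20); calls     # 21891 calls for n = 20
calls = 0; fib1_counted(25); calls     # 242785 calls for n = 25

The number of calls grows like Fn itself, i.e., exponentially in n, because the same subproblems are recomputed over and over.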
Fibonacci II
To avoid computing the same number several times, we store previous calculations in a
vector:
fib2 = function(n)
{
  if(n < 2) return(n)              # base cases: F0 = 0, F1 = 1
  f = 1:n                          # vector storing F1, ..., Fn
  f[1] = 1; f[2] = 1
  if(n > 2)
    for(i in 3:n)
      f[i] = f[i-1] + f[i-2]       # each Fi is computed exactly once
  return(f[n])
}
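As a further sketch (not on the slides), the same O(n) idea works with just two variables, so we do not even need the full vector:

fib3 = function(n)
{
  if(n < 2) return(n)
  a = 0; b = 1                     # F0 and F1
  for(i in 2:n)
  {
    nxt = a + b                    # Fi = F(i-1) + F(i-2)
    a = b
    b = nxt
  }
  return(b)
}
> fib3(10)
[1] 55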
These calculations are still a bit vague, so let’s try to formalize this as a
mathematical concept
https://imgur.com/gallery/voutF
• E.g.: O(n^2) means that the run-time grows with the square of n;
this can mean that
T(n) = 0.0005 n^2 or
T(n) = 10000000 n^2 + 3n + 2
Example:
f(n) = 6n^4 - 2n^3 + 5
f(n) = O(n^4), because for n ≥ 1:
6n^4 - 2n^3 + 5 ≤ 6n^4 + 2n^3 + 5
               ≤ 6n^4 + 2n^4 + 5n^4
               ≤ 13n^4
Notation        Name
O(1)            Constant
O(log(n))       Logarithmic
O((log(n))^k)   Polylogarithmic
O(n)            Linear
O(n^2)          Quadratic
O(n^k)          Polynomial
O(k^n)          Exponential
3. This means that only the fastest growing term of a function is important!
4. We can ignore constants, i.e., everything that does not depend on n!
exp(n) + n^4 = O(exp(n))
n + log(n) = O(n)
log(n^k) = k*log(n) = O(log(n))
log(n)^k = O(log(n)^k)
log(k^n) = n*log(k) = O(n)
n*log(n) + n^2 = O(n^2)
sqrt(n)*n^3 = O(n^3.5)
10*log(n) + (log(n))^3 + 6*n^3 = O(n^3)
Sequence of statements
• statement 1;
• statement 2;
• ...
• statement k;
The total time is found by adding the times for all statements.
If one of the statements requires n basic operations and the others are simple, the complexity is O(n).
If one statement requires exp(n) operations and all others are simple, we get O(exp(n)).
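As a small illustration of these rules (the function and variable names are made up, not from the slides):

f = function(v)
{
  n = length(v)
  total = sum(v)        # statement 1: O(n), touches every element of v
  label = "done"        # statement 2: O(1)
  return(total)         # total complexity: O(n) + O(1) = O(n)
}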
Some examples

What is the complexity of this loop?
for (i in 3:n)
  v[i] = v[i-2] + v[i-1]
The loop is repeated n-2 times, and we perform two operations at each iteration, one addition and one assignment.
Solution: O(n)

Further examples (the code for these was shown as figures on the slides):
• The loop is executed n times and the function inside the loop is O(n); the operation is performed n*n times. Solution: O(n^2)
• T(n) = 1 + 2 + 3 + ... + n = 1/2 n (n+1) = O(n^2)
• function1 is O(log(n)), function2 is O(n^2). Solution: O(n^2)
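Since the original code for these examples is not reproduced here, a plausible sketch matching the T(n) = 1 + 2 + ... + n case (the names are illustrative, not from the slides):

g = function(n)
{
  v = 1:n
  total = 0
  for(i in 1:n)
    total = total + sum(v[1:i])   # O(i) work at iteration i
  return(total)
}
# T(n) = 1 + 2 + ... + n = 1/2 n (n+1) = O(n^2)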
Travelling salesman problem
The time needed to find the optimal solution grows faster than exponentially with the number of cities.
The solution is obviously not optimal! The algorithm yields different solutions depending on the starting point.
Local optimization algorithm
Pseudocode:
pick starting city
for (i in 2:n)
  find nearest city
return result

Finding the nearest city takes (n - k) calculations, where k is the number of already visited cities. All other operations are O(1). Thus, we get:
T(n) = (n-1)C + (n-2)C + ... + (n-(n-3))C + (n-(n-2))C = O(n^2)
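A minimal R sketch of this nearest-neighbour heuristic (the function name and the coordinate-matrix input are assumptions for illustration, not taken from the slides):

nearest_neighbour_tour = function(coords, start = 1)
{
  n = nrow(coords)
  visited = rep(FALSE, n)
  tour = integer(n)
  tour[1] = start
  visited[start] = TRUE
  for(i in 2:n)
  {
    current = tour[i-1]
    # squared distances from the current city to all cities
    d = (coords[,1] - coords[current,1])^2 + (coords[,2] - coords[current,2])^2
    d[visited] = Inf              # never revisit a city
    tour[i] = which.min(d)        # nearest unvisited city: n - k comparisons
    visited[tour[i]] = TRUE
  }
  return(tour)
}

set.seed(1)
cities = matrix(runif(20), ncol = 2)   # 10 random cities in the unit square
nearest_neighbour_tour(cities)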
"AntColony" by Saurabh.harsh - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:AntColony.gif#mediaviewer/File:AntColony.gif