2023 Bioinformatics Algorithms Day 1
Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Phone: +41 31 631 45 19
Email: [email protected]
Day 1
• Introduction and Organization
• Definition of an algorithm
• Fibonacci numbers
• Performance of algorithms
• Travelling salesman problem
Slides, Exercises, etc.
• I will provide pdfs of slides, book chapters, R-scripts etc. on ilias
• If you have questions about the course, send me an email: [email protected]
Organization of the course
• Course language: English
• Written exam at the end of the course (English or German)
• Prerequisites:
• Basic math knowledge
• Basic programming knowledge
• Basic biology knowledge
Exam
• Written exam at the end of the semester
• 2 hours
• Date: TBA
• 1990 - 2000
• Sequencing the first human genome took ~10 years and cost ~$2.7 billion
• 2018
• Today, sequencing a genome costs ~$1,000✢ and a “run” takes <3 days✢
~18 Tb per “run” at maximum capacity
✢ on an Illumina HiSeq X Ten — the machine costs ~$10M and sample prep takes a little extra
time.
• Short “reads” (75–250 characters) when the texts we’re interested
in are 1,000s to 1,000,000s of characters long.
• Imperfect “reads” — sequencing produces infrequent but considerable “errors”
that modify, insert or delete one or more characters in the “read”
• Biased “reads” — as a result of the underlying chemistry & physics,
sampling is not perfectly uniform and random. Biases are not always
known.
«An algorithm is an effective method expressed as a finite list of well-defined instructions for
calculating a function. Starting from an initial state and initial input (perhaps empty), the
instructions describe a computation that, when executed, proceeds through a finite number of well-
defined successive states, eventually producing "output" and terminating at a final ending state.
The transition from one state to the next is not necessarily deterministic; some algorithms, known
as randomized algorithms, incorporate random input.»
Recipe
1) Take ingredients
2) Mix them together in a random way
3) Bake at a random temperature
4) If result is a cupcake: end
5) Else: go to 1
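The recipe above is, in spirit, a randomized algorithm: it keeps making random attempts until it produces the desired result. A minimal R sketch of the same idea (the target value and the sampling range are made up for illustration, not from the slides):

# Illustration: a "repeat until success" randomized algorithm
# in the style of the cupcake recipe.
random_search = function(target = 7)
{
  repeat
  {
    candidate = sample(1:10, 1)   # steps 1-3: produce a random attempt
    if(candidate == target)       # step 4: is the result a "cupcake"?
      return(candidate)           # yes: end
    # step 5: otherwise, go back to step 1 and try again
  }
}
random_search()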
Formal description: Mathematics/Statistics → Algorithm
[Slide background: excerpt from a paper appendix (probability generating functions of a branching process, Equations A1-A11 and B1, derivations of Equations 3 and 5), shown as an example of a formal mathematical description that has to be translated into an algorithm.]
Fibonacci numbers:
F0 = 0, F1 = 1
Fn = Fn-1 + Fn-2, for n ≥ 2
0,1,1,2,3,5,8,13,21,34,....
fib1 = function(n)
{
  # naive recursive implementation of the Fibonacci recurrence
  if(n < 2)
    return(n)                      # base cases: F0 = 0, F1 = 1
  return(fib1(n-2) + fib1(n-1))    # two recursive calls per invocation
}
> fib1(10)
[1] 55
1. Is it correct?
Answer: Yes
2. How much time does it take, as a function of n?
Answer: ?
3. Can we do better?
Answer: ?
Can we do better?
Why is fib1 so bad?
Homework exercise:
Can you find a more efficient way to calculate Fn?
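One way to see empirically why fib1 is so bad (this snippet is not from the slides) is to count how many times the function calls itself:

calls = 0
fib1_counted = function(n)
{
  calls <<- calls + 1                  # count every invocation
  if(n < 2)
    return(n)
  return(fib1_counted(n-2) + fib1_counted(n-1))
}
calls = 0; fib1_counted(20); calls     # 21891 calls for n = 20
calls = 0; fib1_counted(25); calls     # 242785 calls for n = 25

The number of calls grows like Fn itself, i.e., exponentially in n, because the same subproblems are recomputed over and over.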
Fibonacci II
To avoid computing the same number several times, we store previous calculations in a
vector:
fib2 = function(n)
{
  if(n < 2) return(n)              # base cases: F0 = 0, F1 = 1
  f = 1:n                          # vector storing F1, ..., Fn
  f[1] = 1; f[2] = 1
  if(n > 2)
    for(i in 3:n)
      f[i] = f[i-1] + f[i-2]       # each Fi is computed exactly once
  return(f[n])
}
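As a further sketch (not on the slides), the same O(n) idea works with just two variables, so we do not even need the full vector:

fib3 = function(n)
{
  if(n < 2) return(n)
  a = 0; b = 1                     # F0 and F1
  for(i in 2:n)
  {
    nxt = a + b                    # Fi = F(i-1) + F(i-2)
    a = b
    b = nxt
  }
  return(b)
}
> fib3(10)
[1] 55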
These calculations are still a bit vague, so let’s try to formalize this as a
mathematical concept
https://imgur.com/gallery/voutF
• E.g.: O(n^2) means that the run-time grows with the square of n;
this can mean that
T(n) = 0.0005 n^2 or
T(n) = 10000000 n^2 + 3n + 2
Example:
f(n) = 6n^4 - 2n^3 + 5
f(n) = O(n^4), because for n ≥ 1:
6n^4 - 2n^3 + 5 ≤ 6n^4 + 2n^3 + 5
               ≤ 6n^4 + 2n^4 + 5n^4
               ≤ 13n^4
Notation        Name
O(1)            Constant
O(log(n))       Logarithmic
O((log(n))^k)   Polylogarithmic
O(n)            Linear
O(n^2)          Quadratic
O(n^k)          Polynomial
O(k^n)          Exponential
3. This means that only the fastest growing term of a function is important!
4. We can ignore constants, i.e., everything that does not depend on n!
exp(n) + n^4 = O(exp(n))
n + log(n) = O(n)
log(n^k) = k*log(n) = O(log(n))
log(n)^k = O(log(n)^k)
log(k^n) = n*log(k) = O(n)
n*log(n) + n^2 = O(n^2)
sqrt(n)*n^3 = O(n^3.5)
10*log(n) + (log(n))^3 + 6*n^3 = O(n^3)
Sequence of statements
• statement 1;
• statement 2;
• ...
• statement k;
The total time is found by adding the times for all statements.
If one of the statements requires n basic operations and the others are simple, the complexity is O(n).
If one statement requires exp(n) operations and all others are simple, we get O(exp(n)).
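As a small illustration of these rules (the function and variable names are made up, not from the slides):

f = function(v)
{
  n = length(v)
  total = sum(v)        # statement 1: O(n), touches every element of v
  label = "done"        # statement 2: O(1)
  return(total)         # total complexity: O(n) + O(1) = O(n)
}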
Some examples

What is the complexity of this loop?
for (i in 3:n)
  v[i] = v[i-2] + v[i-1]
The loop is repeated n-2 times, and we perform two operations at each iteration, one addition and one assignment.
Solution: O(n)

Further examples (the code for these was shown as figures on the slides):
• The loop is executed n times and the function inside the loop is O(n); the operation is performed n*n times. Solution: O(n^2)
• T(n) = 1 + 2 + 3 + ... + n = 1/2 n (n+1) = O(n^2)
• function1 is O(log(n)), function2 is O(n^2). Solution: O(n^2)
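Since the original code for these examples is not reproduced here, a plausible sketch matching the T(n) = 1 + 2 + ... + n case (the names are illustrative, not from the slides):

g = function(n)
{
  v = 1:n
  total = 0
  for(i in 1:n)
    total = total + sum(v[1:i])   # O(i) work at iteration i
  return(total)
}
# T(n) = 1 + 2 + ... + n = 1/2 n (n+1) = O(n^2)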
Travelling salesman problem
The time needed to find the optimal solution grows faster than exponentially with the number of cities.
The solution is obviously not optimal! The algorithm yields different solutions depending on the starting point.
Local optimization algorithm
Pseudocode:
pick starting city
for (i in 2:n)
  find nearest city
return result

Finding the nearest city takes (n - k) calculations, where k is the number of already visited cities. All other operations are O(1). Thus, we get:
T(n) = (n-1)C + (n-2)C + ... + (n-(n-3))C + (n-(n-2))C = O(n^2)
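A minimal R sketch of this nearest-neighbour heuristic (the function name and the coordinate-matrix input are assumptions for illustration, not taken from the slides):

nearest_neighbour_tour = function(coords, start = 1)
{
  n = nrow(coords)
  visited = rep(FALSE, n)
  tour = integer(n)
  tour[1] = start
  visited[start] = TRUE
  for(i in 2:n)
  {
    current = tour[i-1]
    # squared distances from the current city to all cities
    d = (coords[,1] - coords[current,1])^2 + (coords[,2] - coords[current,2])^2
    d[visited] = Inf              # never revisit a city
    tour[i] = which.min(d)        # nearest unvisited city: n - k comparisons
    visited[tour[i]] = TRUE
  }
  return(tour)
}

set.seed(1)
cities = matrix(runif(20), ncol = 2)   # 10 random cities in the unit square
nearest_neighbour_tour(cities)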
"AntColony" by Saurabh.harsh - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:AntColony.gif#mediaviewer/File:AntColony.gif