Sequence Homology Searching — An Introduction to Applied Bioinformatics
Imagine you've been handed the following mystery sequence and want to know what it is. Finding similar, and presumably homologous, sequences in a reference database is exactly the kind of question that sequence homology searching is designed to answer.
>mystery-sequence1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT
In the context of database searching, a query sequence and a reference sequence that we hypothesize to be homologous can be identical to one another, or they can differ
as a result of mutation events. When sequences differ, we’re often then interested in how much they differ, or their pairwise similarity, which can help us identify the most
closely related of several homologs in the reference database. There is an important distinction in the terms homology and similarity: homology is a discrete variable, and
similarity is a continuous variable. A pair of biological sequences either are or are not derived from a common ancestor, but they can be more or less similar to each other.
Saying that two sequences are 80% homologous doesn’t make sense. What people generally mean when they say this is that two sequences are 80% similar, and as a result
they are hypothesizing homology between the sequences.
Similarity between a pair of sequences can be computed in a few different ways. In this text, unless otherwise noted, we'll compute similarity as:
Similarity = 1 − Hamming distance. Recall that Hamming distance is the fraction of positions that differ between aligned sequences. Similarity is therefore the
complement of that: the fraction of positions that do not differ between aligned sequences.
Sequence homology searching can be implemented in a few ways. In this chapter, we’ll use the local alignment function that we worked with in the Pairwise Alignment
chapter, local_pairwise_align_ssw, run it many times to search one query sequence against many reference sequences, and investigate the highest scoring alignment(s) to
identify the best database match. Remember that you can always get help with a function by passing it as an argument to help:
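For example, the docstring excerpt that follows could be produced with a call like this:

from skbio.alignment import local_pairwise_align_ssw

help(local_pairwise_align_ssw)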
Parameters
----------
sequence1 : DNA, RNA, or Protein
The first unaligned sequence
sequence2 : DNA, RNA, or Protein
The second unaligned sequence
Returns
-------
tuple
``TabularMSA`` object containing the aligned sequences, alignment score
(float), and start/end positions of each input sequence (iterable
of two-item tuples). Note that start/end positions are indexes into the
unaligned sequences.
Notes
-----
This is a wrapper for the SSW package [1]_.
References
----------
.. [1] Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, & Gabor T.
Marth. "SSW Library: An SIMD Smith-Waterman C/C++ Library for
Applications". PLOS ONE (2013). Web. 11 July 2014.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082138
See Also
--------
skbio.alignment.StripedSmithWaterman
When our reference database starts getting hundreds of millions of bases long (as would be the case if we were searching against 97% OTUs from the Greengenes small-
subunit ribosomal RNA (SSU rRNA) reference database), billions of bases long (as would be the case if we were searching against the human genome) or trillions of bases
long (as would be the case if we were searching against the NCBI non-redundant nucleotide database), runtime becomes an important consideration. For that reason,
learning about heuristic algorithms is an essential part of learning about sequence homology searching. Heuristic algorithms apply some rules (i.e., heuristics) to
approximate the correct solution to a problem in a fraction of the runtime that would be required if we wanted to be guaranteed to find the correct solution. Heuristic
algorithms are very common in bioinformatics, and we’ll use them in several other places in this book.
While we’ll be aligning nucleotide sequences in this chapter, the same concepts apply to protein homology searching.
First, let’s load Greengenes into a list of skbio.DNA sequence objects, and associate the taxonomy of each sequence as sequence metadata.
Next, we’ll just inspect a couple of the sequences we loaded. Notice how the specificity of our taxonomic annotations (i.e., how many taxonomic levels are annotated and
unknown) differs for different sequences.
reference_db[0]
DNA
-----------------------------------------------------------------------
Metadata:
'description': ''
'id': '1111883'
'taxonomy': 'k__Bacteria; p__Gemmatimonadetes; c__Gemm-1; o__; f__;
g__; s__'
Stats:
length: 1428
has gaps: False
has degenerates: False
has definites: True
GC-content: 61.90%
-----------------------------------------------------------------------
0 GCTGGCGGCG TGCCTAACAC ATGTAAGTCG AACGGGACTG GGGGCAACTC CAGTTCAGTG
60 GCAGACGGGT GCGTAACACG TGAGCAACTT GTCCGACGGC GGGGGATAGC CGGCCCAACG
...
1320 GCCGCGGTGA ATACGTTCCC GGGCCTTGTA CACACCGCCC GTCACGCCAT GGAAGCCGGA
1380 GGGACCCGAA ACCGGTGGGC CAACCGCAAG GGGGCAGCCG TCTAAGGT
reference_db[-1]
DNA
----------------------------------------------------------------------
Metadata:
'description': ''
'id': '4483258'
'taxonomy': 'k__Archaea; p__Crenarchaeota; c__Thermoprotei;
o__Thermoproteales; f__Thermoproteaceae; g__; s__'
Stats:
length: 2123
has gaps: False
has degenerates: False
has definites: True
GC-content: 58.36%
----------------------------------------------------------------------
0 CTGGTTGATC CTGCCGGACC CGACCGCTAT CGGGGTGGGG CTTAGCCATG CGAGTCAAGC
60 GCCCCAGGGA CCCGCTGGGG TGCGGCGCAC GGCTCAGTAA CACGTGGCCA ACCTACCCTC
...
2040 ATAATCTCCT TATTGTCTGA TCCTTATGCA TTTTCCTTTG GCCCATCCCG TGAATACGCG
2100 CGGTGAATAC GTCCCTGCCC CTT
For the sake of runtime, we’re going to work through this chapter using a random sample of sequences from this database. Here we’ll use Python’s random module to
select sequences at random.
import random
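A minimal sketch of the subsampling step; the subset size of 5000 sequences is a placeholder chosen only to keep runtimes manageable.

# randomly subsample the reference database to keep runtimes manageable
reference_db = random.sample(reference_db, k=5000)
print("%s sequences in our subsampled reference database." % len(reference_db))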
We’ll also extract some sequences from Greengenes to use as query sequences in our database searches. This time we won’t annotate them (to simulate not knowing what
organisms they’re from). We’ll also trim these sequences so they’re shorter than the full length references. This will simulate obtaining a partial gene sequence, as is most
common with the current sequencing technologies (as of this writing), but will also help to make the examples run faster.
Note that some of our query sequences may also be in our subsampled reference database and some won’t. This is realistic: sometimes we’re working with sequences that
are exact matches to known sequences, and sometimes we’re working with sequences that don’t match any known sequences (or at least any in the reference database
that we’re working with).
queries = load_taxonomy_query_sequences()
queries = random.sample(queries, k=50)
Let’s inspect a couple of the query sequences that we’ll work with.
queries[0]
DNA
---------------------------------------------------------------------
Metadata:
'description': ''
'id': '1110988'
Stats:
length: 200
has gaps: False
has degenerates: False
has definites: True
GC-content: 46.50%
---------------------------------------------------------------------
0 GGTGAGTAAC ACGTGAGCAA CCTGCCTTTG AGAGAGGGAT AGCTTCGGGA AACTGATGGT
60 AATACCTCAT AACATATATT TAAGGCATCT TAGATATATC AAAGATTTAT CACTCAAAGA
120 TGGGCTCGCG TCTGATTAGC TGGTTGGTGA GGTAGCGGCT CACCAAGGCG ACGATCAGTA
180 GCCGGACTGA GAGGTTGAAC
queries[-1]
DNA
---------------------------------------------------------------------
Metadata:
'description': ''
'id': '547906'
Stats:
length: 200
has gaps: False
has degenerates: False
has definites: True
GC-content: 51.50%
---------------------------------------------------------------------
0 GTGGGCAACC TGCCCTATAC TGGGGAATAA TCACTGGAAA CGGTGACTAA TACCGCATGT
60 CATTGCGAGG GGGCATCCTC TTGCAAGAAA AGGAATTTAT TTCGGTATAG GATGGGCCCG
120 CATCTGATTA GCTAGTTGGT GAGATAACAG CCCACCAAGG CGACGATCAG TAGCCGACCT
180 GAGAGGGTGA TCGGCCACAT
There are a few realistic features of the situation that we’ve set up here that I want you to be aware of.
1. All of the query and reference sequences are homologous. In this case, they are all sequences of the 16S rRNA gene from archaea and bacteria. This may or may not
be the case in real-world applications. Sometimes you’ll work with gene-specific databases such as Greengenes, and sometimes you’ll work with non-specific
databases such as the NCBI nucleotide database (nr). Regardless, the search process is similar.
2. The distance between each query sequence q and its most closely related sequences in the reference database R will vary widely. Sometimes q will be an exact match to a reference sequence
r_i, and sometimes we may have as little as 50% similarity.
As we work through the next sections, imagine that we’re exploring scaling this system up, so that instead of searching just one or a few query sequences against the
reference database, we ultimately want to apply this to search millions of sequences against the database. This would be the real-world problem we faced if we had
collected 16S rRNA sequences from the environment (which would of course be unannotated) using high-throughput DNA sequencing.
Spend a minute looking at this function and try to understand what it’s doing.
import pandas as pd
from skbio.alignment import local_pairwise_align_ssw
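A minimal sketch of what a local_alignment_search function along these lines might look like; the column names and the keep-top-n behavior are assumptions based on the results table shown below.

def local_alignment_search(queries, reference_db, n=5,
                           aligner=local_pairwise_align_ssw):
    results = []
    indices = []
    for q in queries:
        # align the query against every sequence in the reference database
        for r in reference_db:
            alignment, score, _ = aligner(q, r)
            # percent similarity over the aligned region (1 - Hamming distance)
            percent_similarity = 100 * (1. - alignment[0].distance(alignment[1]))
            results.append([r.metadata['taxonomy'], percent_similarity,
                            alignment.shape.position, score])
            indices.append((q.metadata['id'], r.metadata['id']))
    index = pd.MultiIndex.from_tuples(indices, names=['query', 'reference'])
    results = pd.DataFrame(results, index=index,
                           columns=['reference taxonomy', 'percent similarity',
                                    'alignment length', 'score'])
    # keep only the n highest-scoring reference hits for each query
    return results.groupby(level='query', group_keys=False).apply(
        lambda hits: hits.nlargest(n, 'score'))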
Now let’s perform some database searches. You can run the remaining code cells in this section a few times to experiment with searching different query sequences against
the same reference database.
This next cell, which is the one that actually performs the database searches, will take a little bit of time to run (maybe up to a minute or two). There is some code in this
cell that will track the runtime. As it’s running, think about how many query sequences we’re searching against how many reference sequences, and refer back to the
number of sequences in the full reference database. Does this strategy seem scalable to millions of sequences, which as mentioned above might be our ultimate goal?
When you know the per-sequence runtime of this search, estimate how long it would take to do this in seconds for one million sequences. Convert the time in seconds to a
unit that will be more meaningful to you.
import time
start_time = time.time()
current_queries = random.sample(queries, k=4)
results = local_alignment_search(current_queries, reference_db)
stop_time = time.time()
print("Runtime: %1.4f sec per query" % ((stop_time - start_time) / len(current_queries)))
results
                 reference taxonomy                                 percent similarity  alignment length  score
query reference
      349639     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.500000               200    325
      616682     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.500000               200    325
      766020     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.000000               200    320
      4714       k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.000000               200    320
      4452903    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.124402               209    276
      212819     k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           87.254902               204    269
      1148064    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.893204               206    266
      1813677    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.274510               204    264
      197286     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.500000               200    375
      540402     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.000000               200    370
      4217232    k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.000000               200    370
      736735     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           96.500000               200    365
Now, let’s try to answer our initial question: what is the most likely taxonomic annotation for each of our query sequences? Spend a few minutes reviewing this information,
and write down what you think the most likely taxonomic annotation is for each of the query sequences. Here are some hints to help you out:
The k, p, c, o, f, g, and s refer to kingdom, phylum, class, order, family, genus, and species, respectively. If you see an annotation for a reference sequence that looks like
g__, that means that the genus is unknown for that sequence.
Just as the reference taxonomy annotations don’t always go down to the species level, your taxonomic annotations don’t have to either. Not assigning at a given
level implies that you’re uncertain about what the annotation should be at that level, and it’s usually better just to indicate that you’re uncertain rather than make a
bad guess. If you’re uncertain of what the species is, assign the query s__ and try to decide what the most likely genus is. If you’re uncertain of the genus, assign g__,
and try to decide what the most likely family is…
As you look at each of the reference taxonomy annotations below, refer back to the table above to look at the percent similarity between each query and reference,
and maybe the length of the alignments and their scores. These values give you an idea of how confident you should be in each of your taxonomic annotations.
for q in current_queries:
q_id = q.metadata['id']
print('Closest taxonomies for query %s (in order):' % q_id)
for e in results['reference taxonomy'][q_id]:
print(' ', e)
print()
Because we have taxonomic annotations for all of the Greengenes sequences (though as you probably have noticed by now, they differ in their specificity), we can next
look at taxonomy associated with each of our queries in Greengenes. How do your annotations compare to those from Greengenes, which we’ll print out in the next cell?
for q in current_queries:
q_id = q.metadata['id']
print('Known taxonomy for query %s:\n %s' % (q_id, reference_taxonomy[q_id]))
print()
As we discussed in the previous chapter, the run time of pairwise alignment scales quadratically with sequence length. Database searching, at least in the example we’re
exploring in this chapter, is a bit of a different problem however. Our sequence lengths aren’t changing, but rather it takes a long time because we’re performing a
computationally expensive step, pairwise alignment, many times. Our database is fixed in that the number of sequences in it doesn’t change and the sequences themselves
don’t change. Our query sequences are all exactly the same length in this example (remember that we set that above, when we sliced a single region from reference
database sequences to create our query sequences). Let’s explore how the runtime of this database search scales under these constraints.
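A minimal sketch of what tabulate_local_alignment_search_runtime and the n_query_sequences / n_reference_sequences inputs used below might look like; the specific counts are placeholders, not the chapter's actual values.

import itertools

# placeholder values: the numbers of query and reference sequences to test
n_query_sequences = [1, 5, 10]
n_reference_sequences = [100]

def tabulate_local_alignment_search_runtime(queries, reference_db,
                                            n_query_sequences,
                                            n_reference_sequences,
                                            search_function):
    results = []
    # time the search for every combination of query count and reference count
    for nq, nr in itertools.product(n_query_sequences, n_reference_sequences):
        current_queries = random.sample(queries, k=nq)
        current_references = random.sample(reference_db, k=nr)
        start_time = time.time()
        _ = search_function(current_queries, current_references)
        stop_time = time.time()
        results.append((nq, nr, stop_time - start_time))
    return pd.DataFrame(results, columns=['Number of query seqs',
                                          'Number of reference seqs',
                                          'Runtime (s)'])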
import pandas as pd
import itertools
local_alignment_search_runtimes = tabulate_local_alignment_search_runtime(queries,
reference_db,
n_query_sequences,
n_reference_sequences,
local_alignment_search)
local_alignment_search_runtimes
This table shows that we’ve tried a few variations on number of query sequences but kept the number of reference sequences constant. There is no variance in the query
sequence length, and there is a relatively small amount of variance in reference sequence length (they’re all of the same order of magnitude). There is also relatively little
variance in runtime for fixed numbers of query and reference sequences.
This table clearly shows that there is an increase in runtime with an increasing number of query sequences, which we’d of course expect. What we care about is how
runtime is increasing as a function of number of query sequences. Let’s plot runtime versus the number of query sequences to help us understand that relationship.
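A sketch of one way to draw that plot with seaborn, assuming the column names from the runtime-table sketch above.

import seaborn as sns

# plot per-search runtime as a function of the number of query sequences
ax = sns.regplot(x='Number of query seqs', y='Runtime (s)',
                 data=local_alignment_search_runtimes)
ax.set_xlim(0)
ax.set_ylim(0)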
What we see here is pretty clearly a linear relationship: runtime ≈ constant × number of query sequences. This is because as we increase the number of query
sequences, we’re increasing the number of pairwise alignments that we need to perform. If we have 5 queries and 10 reference sequences, we compute 5 × 10 = 50
pairwise alignments. If we have 10 queries and 100 reference sequences, we compute 10 × 100 = 1000 pairwise alignments. There are a few practical ways to reduce the
runtime of a process like this.
The first seems obvious, and even silly at first: perform fewer alignments. This could be achieved in a few ways. You could reduce the number of query sequences, though
this might be something a researcher is resistant to: they have some collection of unknown sequences, and they want to know what they all are. You could alternatively
reduce the number of reference sequences, but you might run into the same issues there: we wouldn’t want to exclude reference sequences that might provide us with
useful information about our query sequences. Finally, we might be able to figure out some ways to perform fewer alignments by not searching all of the query sequences
against all of the reference sequences. If we could come up with some procedure to approximate which pairwise alignments were likely to be good (i.e., high scoring) and
which were likely to be bad (i.e., low scoring) that is faster than performing the pairwise alignments, we could apply that procedure and only align a pair of sequences when
we expect to get a high score. That could potentially allow us to reduce the number of alignments we need to perform, and therefore the runtime of the algorithm.
Another approach to reducing the runtime of this process would be to create a faster implementation of the algorithm (though at some point that won’t be possible
anymore), use a faster computer, or run the process in parallel on multiple processors. All of these would be ways to reduce the runtime of the search by some factor f ,
where new runtime ≈ runtime
f
.
In practice, for a production-scale sequence database search application like BLAST, we’d combine these approaches. In the next section we’ll explore ways to reduce the
runtime of database searching for a fixed number of query sequences and a fixed number of reference sequences by reducing the number of pairwise alignments that the
search function will perform.
Heuristic algorithms
As mentioned above, it simply takes too long to search individual query sequences against a large database, and this problem isn't going away anytime soon. While computers are getting faster (or cheaper), the sizes of our sequence collections are growing because sequencing is getting cheaper. In fact, many people think that the cost of obtaining DNA sequences is dropping faster than the cost of computing. As the number of query sequences we can obtain for the same amount of money increases, and as the size of our reference databases increases (because we're continuously obtaining more sequence data), this problem will only get bigger. Figures 1 and 2, respectively, illustrate that these are both real-world issues. Notice that the axes are on a log scale in both cases.
import IPython.display
IPython.display.IFrame(width="600", height="394",
                       src="https://docs.google.com/spreadsheets/d/1vUkUuZsRlLW5U05rXXUn8B2sDYwShkClRMGa8Wiu6bc/pubchart?oid=1844125885&format=interactive")
import IPython.display
IPython.display.IFrame(width="763", height="371",
                       src="https://docs.google.com/spreadsheets/d/1vUkUuZsRlLW5U05rXXUn8B2sDYwShkClRMGa8Wiu6bc/pubchart?oid=2103353397&format=interactive")
One way that we can deal with this problem is by recognizing that most of the alignments that are performed in a database search are unlikely to be very good alignments.
An algorithm developer could therefore improve runtime by defining a heuristic (or a rule) that is applied to determine which reference sequences are likely to result in
good alignments, and only aligning the query against those. For it to be useful, making the decision to align or not (i.e., applying the heuristic) must be much faster than
actually performing the pairwise alignment. The heuristic also needs to make good choices about which reference sequences to align the query against. If the algorithm
chooses to not align against a specific reference, that reference is ruled out as a possible result of the database search. A good heuristic for sequence homology searching
would therefore be very unlikely to exclude the best alignment(s). When thinking about heuristic algorithms in general, there are some important considerations:
1. How often does the heuristic algorithm fail to get the right answer (in our case, does it make good choices about which reference sequences to align against)?
2. How much faster is the heuristic than the “complete” approach, and is that reduction in runtime enough to justify not being guaranteed to get the best answer?
Here’s the source code for this. You can see that we’re just wrapping our local_alignment_search function in a function that samples down to p% of the reference
sequences.
import random
def heuristic_local_alignment_search_random(
queries, reference_db, p, n=5, aligner=local_pairwise_align_ssw):
k = int(p * len(reference_db))
database_subset = random.sample(reference_db, k)
return local_alignment_search(queries, database_subset, n=n, aligner=aligner)
Let’s select some new queries and see how the results compare to our known taxonomies.
for q in current_queries:
q_id = q.metadata['id']
print('Closest taxonomies for query %s (in order):' % q_id)
for e in results['reference taxonomy'][q_id]:
print(' ', e)
print()
for q in current_queries:
q_id = q.metadata['id']
print('Known taxonomy for query %s:\n %s' % (q_id, reference_taxonomy[q_id]))
What we need now is a way to know how often we get the “right answer”, and how long this heuristic algorithm takes relative to the complete algorithm. We therefore first
need to define what the “right answer” is. How about this: if the most common taxonomy assignment resulting from the database search at taxonomy_levels levels of
taxonomy (i.e., how deep or specific our assignment is) matches the known taxonomy, then our algorithm has achieved the right answer. We can vary taxonomy_levels to
see how the different heuristics perform at different levels.
import collections
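A minimal sketch of what such an evaluation function might look like, given the definition of the "right answer" above; the name evaluate_search and the exact reporting format are assumptions.

def taxonomy_at_levels(taxonomy, levels):
    # split a Greengenes-style taxonomy string (semicolon-separated) and keep
    # only the first `levels` levels
    return tuple(e.strip() for e in taxonomy.split(';'))[:levels]

def evaluate_search(queries, reference_db, reference_taxonomy,
                    search_function, taxonomy_levels, n=5):
    start_time = time.time()
    search_results = search_function(queries, reference_db, n=n)
    stop_time = time.time()
    runtime = (stop_time - start_time) / len(queries)
    num_correct = 0
    for q in queries:
        q_id = q.metadata['id']
        # the known taxonomy for this query, truncated to taxonomy_levels levels
        known = taxonomy_at_levels(reference_taxonomy[q_id], taxonomy_levels)
        # the most common taxonomy among this query's top database hits
        counts = collections.Counter(
            taxonomy_at_levels(t, taxonomy_levels)
            for t in search_results['reference taxonomy'][q_id])
        predicted = counts.most_common(1)[0][0]
        print(q_id)
        print(known)
        print(predicted)
        if known == predicted:
            num_correct += 1
    return runtime, num_correct / len(queries)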
First let’s see how this works for our full database search algorithm. What’s the runtime, and how often do we get the correct answer? We’ll start with five levels of
taxonomy (which corresponds to the family level). This step will take a couple of minutes to run, because it’s doing the full database search.
taxonomy_levels = 5
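A sketch of how such a function might be invoked for the full database search (evaluate_search is the hypothetical helper sketched above); swapping search_function for each heuristic's functools.partial would presumably produce the comparisons that follow later in this section.

runtime, fraction_correct = evaluate_search(
    queries, reference_db, reference_taxonomy,
    search_function=local_alignment_search,
    taxonomy_levels=taxonomy_levels)
print('%1.2f seconds per query sequence' % runtime)
print('%1.2f fraction of queries correctly assigned' % fraction_correct)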
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Next let’s see how this compares to our random heuristic search algorithm. Try running this a few times, as you might get different answers due to different random
selections of the database.
import functools
heuristic_local_alignment_search_random_10 = functools.partial(
    heuristic_local_alignment_search_random, p=0.10)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Verrucomicrobia', 'c__Verrucomicrobiae', 'o__Verrucomicrobiales',
'f__Verrucomicrobiaceae')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales', 'f__')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__', 'o__', 'f__')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Again, what’s the runtime, and how often do we get the correct answer? Based on comparison to the full search, what do you think: is this a good heuristic?
After performing many trials of the above searches, I get the correct genus-level assignment about half as often with the random reference database heuristic relative to
the full database search. Your results might differ from that due to differences in the random selection of query and reference sequences. Try running all the cells in this
section a few times.
Go back to the beginning of this section and try running this check based on fewer levels of taxonomy (i.e., decreased taxonomic specificity, such as the phylum) and on
more levels of taxonomy (i.e., increased taxonomic specificity, such as the species level). How does that impact how often we get the right answer?
GC content
One metric of sequence composition that we can compute quickly (because remember, this has to be a lot faster than computing the alignment for it to be worth it) is GC
content. Let’s define a heuristic that only performs a pairwise alignment for the reference sequences that have the most similar GC content to the query sequence. The
number of alignments that we’ll perform will be defined as database_subset_size.
database_subset_size = 500
def heuristic_local_alignment_search_gc(
queries, reference_db, database_subset_size, n=5,
reference_db_gc_contents=None,
aligner=local_pairwise_align_ssw):
results = []
if reference_db_gc_contents is None:
reference_db_gc_contents = \
{r.metadata['id'] : r.gc_content() for r in reference_db}
for q in queries:
query_gc_content = q.gc_content()
database_subset = []
for r in reference_db:
ref_gc_content = reference_db_gc_contents[r.metadata['id']]
# find the difference in GC content between the reference and
# query. we'll sort and select our reference sequences by this
# value
database_subset.append((abs(ref_gc_content - query_gc_content), r))
database_subset.sort(key=lambda x: x[0])
database_subset = [e[1] for e in database_subset[:database_subset_size]]
results.append(local_alignment_search(
[q], database_subset, n=n, aligner=aligner))
return pd.concat(results)
If we run our queries again, how often do we get the right answer? How much did we reduce runtime? Do you think this is a better or worse heuristic than what we
implemented above?
heuristic_local_alignment_search_gc_2 = functools.partial(heuristic_local_alignment_search_gc,
database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Comamonadaceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Pseudomonadales',
'f__Moraxellaceae')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Pedosphaerae]', 'o__[Pedosphaerales]', 'f__')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micromonosporaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodospirillales',
'f__Rhodospirillaceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__', 'f__')
Try increasing and decreasing the number of sequences we’ll align by increasing or decreasing database_subset_size. How does this impact the runtime and fraction of
time that we get the correct answer?
kmer content
Another metric of sequence composition is kmer composition. A kmer is simply a word (or list of adjacent characters) of length k found within a sequence. Here are the
kmer frequencies in a short DNA sequence. The overlap=True parameter here means that our kmers can overlap one another.
import skbio
skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
{'ACCGT': 1,
'CCGTG': 1,
'CGTGA': 1,
'GTGAC': 1,
'TGACC': 2,
'GACCA': 2,
'ACCAG': 2,
'CCAGT': 2,
'CAGTT': 2,
'AGTTA': 1,
'GTTAC': 1,
'TTACC': 1,
'TACCA': 1,
'AGTTT': 1,
'GTTTG': 1,
'TTTGA': 1,
'TTGAC': 1,
'ACCAA': 1}
In our next heuristic, we'll only align our query against the reference sequences that contain the largest fraction of the kmers observed in the query sequence. This makes a lot of sense to use as an alignment heuristic: we're only aligning sequences when it looks like they'll have multiple length-k stretches of nucleotides that are not interrupted by substitutions or insertion/deletion mutations.
## TODO: Update this to be Jaccard index (or describe it that way, if it already is)
def fraction_shared_kmers(kmer_freqs1, kmer_freqs2):
"""Compute the fraction of kmers in kmer_freqs1 that are also in kmer_freqs2
Parameters
----------
kmer_freqs1, kmer_freqs2
Returns
-------
float
Notes
-----
k-mer counts are not incorporated in this distance metric.
"""
sequence1_kmers = set(kmer_freqs1)
num_sequence1_kmers = len(sequence1_kmers)
sequence2_kmers = set(kmer_freqs2)
shared_kmers = sequence1_kmers & sequence2_kmers
return len(shared_kmers) / num_sequence1_kmers
def heuristic_local_alignment_search_kmers(
queries, reference_db, database_subset_size, k, n=5,
reference_db_kmer_frequencies=None,
aligner=local_pairwise_align_ssw):
results = []
if reference_db_kmer_frequencies is None:
reference_db_kmer_frequencies = \
{r.metadata['id'] : r.kmer_frequencies(k=k, overlap=True) for r in reference_db}
for q in queries:
query_kmer_frequency = q.kmer_frequencies(k=k, overlap=True)
database_subset = []
for r in reference_db:
ref_kmer_frequency = reference_db_kmer_frequencies[r.metadata['id']]
s = fraction_shared_kmers(query_kmer_frequency, ref_kmer_frequency)
database_subset.append((s, r))
database_subset.sort(key=lambda x: x[0], reverse=True)
database_subset = [e[1] for e in database_subset[:database_subset_size]]
results.append(local_alignment_search(
[q], database_subset, n=n, aligner=aligner))
return pd.concat(results)
k = 7
Let’s apply this and see how it does. How does the runtime and fraction of correct assignments compare to our GC content-based search and our full database search?
heuristic_local_alignment_search_kmers_50 = \
functools.partial(heuristic_local_alignment_search_kmers, k=k,
database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Here we’ll compute all of the reference database kmer frequencies. Notice that this step takes about a minute to complete. This is a minute of compute time that we can
save on every database search!
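This presumably mirrors the dictionary comprehension inside heuristic_local_alignment_search_kmers, along the lines of:

reference_db_kmer_frequencies = {
    r.metadata['id']: r.kmer_frequencies(k=k, overlap=True)
    for r in reference_db}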
We’ll now pass our pre-computed kmer frequencies into our search function. How does the runtime and accuracy of this search compare to the searches above? This last
database search that we’ve implemented here is very similar to how BLAST works.
heuristic_local_alignment_search_kmers_50 = \
functools.partial(heuristic_local_alignment_search_kmers,
reference_db_kmer_frequencies=reference_db_kmer_frequencies,
k=k, database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Determining the statistical significance of a pairwise alignment
How do we decide whether an alignment score is good enough to support a hypothesis of homology? Consider the following pair of short sequences that we'd like to align:
GAAGCAGCAC
GAACAGAAC
If we tell our search algorithm that we’re interested in the alignment with the fewest number of substitutions, the following alignment would get us zero substitutions, but
there are a lot of bases that look homologous which are not aligned.
GAAGCAGCAC-----
GAA------CAGAAC
On the other hand, if we want to find the alignment with the fewest number of gaps, this one would get us that result, but we now have a lot of substitution events, and
some regions that clearly look misaligned (such as the CAG sequence in the middle of both).
GAAGCAGCAC
GAACAGA-AC
The alignment score that has been reported by our pairwise aligners helps us to balance these different features, and we can adjust the scoring scheme to weight things differently (e.g., so that gaps are penalized more or less than certain substitutions). The problem is that the scores are hard to interpret, particularly when we have only one or a few of them. One way to put a score in context is to compute a p-value: the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous.
Our threshold for defining what we consider to be a high versus low p-value is dependent on how often we’re willing to be wrong. We would set this value, which is usually
referred to as α, to some fraction, and if our p-value is less than α, we say that the alignment is statistically significant. If our p-value is greater than α, we say that our
alignment is not statistically significant.
There are a couple of ways that we could be wrong when we do sequence homology searching, and we need to consider these when we determine what value we want to
define as α. First, we could say a pair of sequences are homologous when they’re not, which would be a false positive or a type 1 error. Or, we could say that a pair of
sequences are not homologous when they are, which would be a false negative, or a type 2 error.
If incurring a false positive about 5% of the time is acceptable (i.e., you’re ok with calling a pair of sequences homologous when they actually are not about one in twenty
times) then you’d set your α to 0.05. Setting α to a value this high likely means that the method will err on the side of false positives, and only infrequently will it say that a
pair of sequences are not homologous when they actually are (i.e., achieve a false negative). If α were set to be very low on the other hand (say, 1 × 10⁻⁵⁰), then you will
err on the side of false negatives. Only infrequently will you say that a pair of non-homologous sequences are homologous, but you might call many pairs of homologous
sequences non-homologous. You should think of α as a dial. If you turn the dial toward higher values, you’ll increase your false positive rate and decrease your false
negative rate. If you turn the dial toward lower values, you’ll decrease your false positive rate and increase your false negative rate.
There is not a hard-and-fast rule for whether false positives or false negatives are better, which makes choosing α hard. It’s application specific, so you need to understand
the biological question you're asking when making this decision, and the ramifications of false positives versus false negatives. In general, when might you prefer to have
false positives? When might you prefer to have false negatives?
First, we’ll define a function that can generate random sequences for us. This will take a scikit-bio sequence object (either skbio.DNA, skbio.RNA, or skbio.Protein) and a
length, and it will randomly generate a sequence of that type and length for us.
import random
def random_sequence(moltype, length):
result = []
alphabet = list(moltype.nondegenerate_chars)
for e in range(length):
result.append(random.choice(alphabet))
return moltype(''.join(result))
We can now run this a few times to generate some random sequences:
random_sequence(skbio.DNA, 50)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 46.00%
--------------------------------------------------------
0 ATCGTTCCTA TTCCCCGAGG CCATAGCTTG TTAAGTAAGT CCATCAGGAT
random_sequence(skbio.DNA, 50)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 60.00%
--------------------------------------------------------
0 GCTAGATGGT GGCGACGCTG GTTCCATCAA ATTGCGCCGC TCGTGCTACC
Next, we need a function that will shuffle the characters in a sequence and give us a new sequence back. We'll use this to generate a sequence that is similar (in length and composition) to our input sequence, but which we know is not homologous. We'll use Python's random.shuffle function, which randomly re-orders the elements in a sequence but keeps the composition and length of the sequence the same.
def shuffle_sequence(sequence):
# generate a list of the position indices (numbers) in sequence
randomized_order = list(range(len(sequence)))
# randomly rearrange the order of that list
random.shuffle(randomized_order)
# return a new sequence, where the positions are shuffled
return sequence[randomized_order]
Now we can define a random sequence and shuffle it. Notice how the sequences are different (in their order), but their compositions (e.g., length and GC content) are the
same. Shuffling will change the order of the bases, but it won’t change the frequency at which each base is present - it’s exactly analogous to shuffling a deck of cards.
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 54.00%
--------------------------------------------------------
0 GATAAGAAAT TAACCGCGTG GTGGGGAGAC GGCCCTTGTG TGGCTCAATC
shuffle_sequence(seq)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 54.00%
--------------------------------------------------------
0 AGGACGGACA GAGTGCTCGG GTCGTTAGAC CACGAGTGAT TCCGATTTGA
Let’s generate a random query sequence and align it against itself to see what that score would be.
Score: 100.00
Next let’s generate 99 random variants of that sequence with shuffle_sequence and compute the pairwise alignment for each of those variants against the query sequence.
We’ll then look at the distribution of those scores.
def generate_random_score_distribution(sequence1,
sequence2,
n=99,
aligner=local_pairwise_align_ssw):
scores = []
# iterate n times
for i in range(n):
# generate a randomized version of the first sequence
random_sequence = shuffle_sequence(sequence1)
# align that randomized sequence against the second sequence
# and save its score
_, score, _ = aligner(random_sequence, sequence2)
scores.append(score)
# return the n randomized alignment scores
return scores
[10, 11, 10, 10, 13, 8, 13, 10, 10, 11, 12, 12, 14, 12, 9, 9, 11, 11, 11, 13, 12, 11, 12, 11,
12, 11, 15, 9, 12, 14, 16, 8, 13, 10, 10, 8, 11, 12, 16, 8, 12, 10, 12, 21, 10, 10, 17, 12,
13, 11, 10, 8, 12, 10, 12, 16, 15, 14, 12, 11, 12, 11, 10, 11, 14, 10, 10, 12, 9, 10, 16, 10,
10, 8, 12, 13, 10, 13, 14, 11, 10, 11, 14, 15, 11, 9, 10, 10, 11, 11, 12, 8, 13, 12, 11, 12,
10, 11, 12]
How does the actual score of aligning the sequence to itself compare to the score of aligning it to many similar but non-homologous sequences? Let’s plot these to get a
better idea.
plot_score_distribution(actual_score, random_scores)
/usr/share/miniconda/envs/iab2/lib/python3.8/site-packages/seaborn/distributions.py:2557:
FutureWarning: `distplot` is a deprecated function and will be removed in a future version.
Please adapt your code to use either `displot` (a figure-level function with similar
flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:>
What does this tell us about our alignment score and therefore about our alignment? Is it good or bad?
We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let’s use this information to quantify the
quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the
sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at
least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value.
To determine if our alignment is statistically significant, we need to define α before computing the p-value so the p-value does not impact our choice of α. Let’s define α
as 0.05. This choice means if we obtain a p-value less than 0.05 we will consider the alignment statistically significant and accept the hypothesis that the sequences are
homologous.
def generate_random_score_distribution(sequence1,
sequence2,
n=99,
aligner=local_pairwise_align_ssw):
scores = []
# iterate n times
for i in range(n):
# generate a randomized version of the first sequence
random_sequence = shuffle_sequence(sequence1)
# align that randomized sequence against the second sequence
# and save its score
_, score, _ = aligner(random_sequence, sequence2)
scores.append(score)
# return the n randomized alignment scores
return scores
def fraction_better_or_equivalent_alignments(sequence1,
sequence2,
n = 99,
aligner=local_pairwise_align_ssw):
# align sequence1 and sequence2 and store the score of the alignment
_, actual_score, _ = aligner(sequence1, sequence2)
# compute the distribution of randomized scores
random_scores = generate_random_score_distribution(sequence1,
sequence2,
n,
aligner=aligner)
# count the number of random scores that are at least as good as our
# actual score
count_better = 0
for s in random_scores:
if s >= actual_score:
count_better += 1
# return the number of times we observed a random score at least as good as
# the actual score, divided by the number of scores we computed. we add one
# to the numerator and denominator to account for our actual_score
return (count_better + 1) / (n + 1)
The fraction that we get back here is 0.01, which is lower than α, so we would accept the hypothesis that our sequences are homologous.
A few notes on these empirically defined p-values. First, here's what the formula for computing this looks like:
p-value = (number of computed alignment scores greater than or equal to the actual alignment score) / (number of alignment scores computed)
The numerator and the denominator both include the actual alignment score, so the lowest p-value that can be achieved here is 1/(99+1): the 1 in the numerator corresponds to our actual alignment score (which is of course equal to itself), the 99 in the denominator is the number of permutations, and the 1 in the denominator is a constant which corresponds to the computation of the actual score. If we increase the number of permutations, say to 999, we could achieve greater precision (more significant digits) in our p-value.
When we achieve the lowest possible value for a given test, as is the case here, we report the p-value as being less than that value, since we've yet to observe a random alignment score at least as high as the actual score. For example, here we would report something like:
The alignment of our query and reference sequence was statistically significant, as determined by comparing our actual alignment score to the scores of randomized variants of the sequences (p < 0.01).
Let’s now try this for some harder cases, where the query and subject sequences are not identical. First, let’s generate a longer subject sequence at random. Then, we’ll
create a random query sequence and compare it. Since we’re doing this in two random steps, we know that these sequences are not homologous. Does the resulting p-
value reflect that?
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 45.20%
---------------------------------------------------------------------
0 TATCCGTACG TAGCCCACTT AATGTAGATA TAGTTCTATA AGTCAACAGA TTACACCGCG
60 GAACCGTCAC TAGTTATGTA GCTTCATGAG AATTGCAGGT CGATTATCTG CCCCCACGCT
120 AATAACTAGT GGAATGGCAG CAAGTGGCTT CTACCTGCCT TTATAGAGAG TGACTTGAAT
180 AACATACAAG CGGGCCAAAG CAGGAGGCGA ATTAACTTAG GGCGTAGACA AGTGTCGGGT
240 GTTTGTAGAC
We’ve now looked at two extremes: where sequences are obviously homologous (because they were the same), and where sequences are obviously not homologous
(because they were both independently randomly generated). Next, we’ll explore the region between these, where this gets interesting. We’ll now create a partially
randomized sequence to create a pair of sequences where the homology is more obscure. We’ll do this again using the Python random module, but this time we’ll introduce
mutations only at some positions to create a pair of sequences that are approximately percent_id identical.
Let’s define a function to do this, and then compute a sequence that is 95% identical to our sequence1.
sequence1
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
sequence1_95
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 52.00%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTTCGCAG CTTCATACGG GTAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGC GAAAACGAGG
120 AGGTGCCGCG ACTTGTGCTT CGTTTACGCT GGGCCCGAAT GCTACACTTC GTTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA ACACCAACTC TTGGTAACGG TAAGAAATAG
240 TGGTGCGAGA
Notice how these sequences are almost identical, but have some differences. Let’s apply our approach to determine if it would identify these sequences as being
homologous based on α = 0.05 .
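A sketch of that check, assuming the fraction_better_or_equivalent_alignments function and the sequence1 / sequence1_95 pair from above:

p_value = fraction_better_or_equivalent_alignments(sequence1, sequence1_95)
print(p_value)
print(p_value < 0.05)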
You likely got a significant p-value there, telling you that the sequences are homologous.
Now let’s simulate much more distantly related sequences by introducing substitutions at many more sites.
sequence1
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
sequence1_25
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 48.80%
---------------------------------------------------------------------
0 GTGGATAATA CGTGATAGCA TGGTGAGGAT GAGTTCACCA ATTGACGCTT TTTCTCAGCG
60 GCCTGGGTCA GGGCACCGAT TAGGGGTGAA TCCTAACTCA TTAGCGTGTC ATGACGGAGC
120 TGTAGCCGCG ACATGTCACT AGGGCTGACT AAGGCGAGAC AGGAGCCTGC CCGTAGATGA
180 GGGAAATAAA AGTTTACCAG ATTATAATTA AGACATCCTC TGGGATACGT TAGGATATGG
240 CGGTCTTTGA
Let's run a simulation to gain some more insight into the limit of detection of this method. We'll run this approach for pairs of sequences where we vary the percent_id parameter, and identify when our approach stops identifying sequence pairs as being homologous. This is important to know as a bioinformatician, because it tells us the approximate pairwise similarity below which we will no longer be able to identify homology using this approach.
import numpy as np

# First, let's define the range of percent identities that we'll test
percent_ids = np.arange(0.0, 1.0, 0.05)
# Then, we'll define the number of random sequences we'll test at each percent identity
num_trials = 20
# Then, we'll define the sequence length that we want to work with, and num_trials random sequences
sequence_length = 150
random_sequences = [random_sequence(skbio.DNA, sequence_length) for i in range(num_trials)]
results = []
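A minimal sketch of the simulation loop, assuming the hypothetical mutate_sequence helper sketched earlier and the fraction_better_or_equivalent_alignments function defined above; n=19 permutations per trial is an arbitrary choice to keep runtime low.

for percent_id in percent_ids:
    for seq in random_sequences:
        # create a partially randomized partner sequence at this percent identity
        seq_mutated = mutate_sequence(seq, percent_id)
        p = fraction_better_or_equivalent_alignments(seq, seq_mutated, n=19)
        results.append((percent_id, p, p < 0.05))

results = pd.DataFrame(results,
                       columns=['Percent id between sequences', 'p-value',
                                'p-value < 0.05 (significant)'])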
What does this simulation tell us about our limit of detection for homology (i.e., how similar must a pair of sequences be for us to reliably be able to identify homology
between them)? Is this higher or lower than you expected?
With respect to our simulation, I took a few shortcuts here to keep the runtime low. What are some things that could be improved to make this simulation more robust, if
we weren’t as concerned about runtime?
By J Gregory Caporaso
© Copyright 2014-2021.