Sequence Homology Searching — An Introduction to Applied Bioinformatics
Imagine you've been handed the following mystery sequence and want to know what it is. Finding similar, and presumably homologous, sequences in a reference database is exactly the kind of question that sequence homology searching is designed to answer.
>mystery-sequence1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT
In the context of database searching, a query sequence and a reference sequence that we hypothesize to be homologous can be identical to one another, or they can differ
as a result of mutation events. When sequences differ, we’re often then interested in how much they differ, or their pairwise similarity, which can help us identify the most
closely related of several homologs in the reference database. There is an important distinction in the terms homology and similarity: homology is a discrete variable, and
similarity is a continuous variable. A pair of biological sequences either are or are not derived from a common ancestor, but they can be more or less similar to each other.
Saying that two sequences are 80% homologous doesn’t make sense. What people generally mean when they say this is that two sequences are 80% similar, and as a result
they are hypothesizing homology between the sequences.
Similarity between a pair of sequences can be computed in a few different ways. In this text, unless otherwise noted, we'll compute similarity as:
Similarity = 1 − Hamming distance. Recall that Hamming distance is the fraction of positions that differ between aligned sequences. Similarity is therefore the
complement of that: the fraction of positions that do not differ between aligned sequences.
Sequence homology searching can be implemented in a few ways. In this chapter, we’ll use the local alignment function that we worked with in the Pairwise Alignment
chapter, local_pairwise_align_ssw, run it many times to search one query sequence against many reference sequences, and investigate the highest scoring alignment(s) to
identify the best database match. Remember that you can always get help with a function by passing it as an argument to help:
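For example, the docstring excerpt that follows could be produced with a call like this:

from skbio.alignment import local_pairwise_align_ssw

help(local_pairwise_align_ssw)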
Parameters
----------
sequence1 : DNA, RNA, or Protein
The first unaligned sequence
sequence2 : DNA, RNA, or Protein
The second unaligned sequence
Returns
-------
tuple
``TabularMSA`` object containing the aligned sequences, alignment score
(float), and start/end positions of each input sequence (iterable
of two-item tuples). Note that start/end positions are indexes into the
unaligned sequences.
Notes
-----
This is a wrapper for the SSW package [1]_.
References
----------
.. [1] Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, & Gabor T.
Marth. "SSW Library: An SIMD Smith-Waterman C/C++ Library for
Applications". PLOS ONE (2013). Web. 11 July 2014.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082138
See Also
--------
skbio.alignment.StripedSmithWaterman
When our reference database starts getting hundreds of millions of bases long (as would be the case if we were searching against 97% OTUs from the Greengenes small-
subunit ribosomal RNA (SSU rRNA) reference database), billions of bases long (as would be the case if we were searching against the human genome) or trillions of bases
long (as would be the case if we were searching against the NCBI non-redundant nucleotide database), runtime becomes an important consideration. For that reason,
learning about heuristic algorithms is an essential part of learning about sequence homology searching. Heuristic algorithms apply some rules (i.e., heuristics) to
approximate the correct solution to a problem in a fraction of the runtime that would be required if we wanted to be guaranteed to find the correct solution. Heuristic
algorithms are very common in bioinformatics, and we’ll use them in several other places in this book.
While we’ll be aligning nucleotide sequences in this chapter, the same concepts apply to protein homology searching.
First, let’s load Greengenes into a list of skbio.DNA sequence objects, and associate the taxonomy of each sequence as sequence metadata.
Next, we’ll just inspect a couple of the sequences we loaded. Notice how the specificity of our taxonomic annotations (i.e., how many taxonomic levels are annotated and
unknown) differs for different sequences.
reference_db[0]
DNA
-----------------------------------------------------------------------
Metadata:
'description': ''
'id': '1111883'
'taxonomy': 'k__Bacteria; p__Gemmatimonadetes; c__Gemm-1; o__; f__;
g__; s__'
Stats:
length: 1428
has gaps: False
has degenerates: False
has definites: True
GC-content: 61.90%
-----------------------------------------------------------------------
0 GCTGGCGGCG TGCCTAACAC ATGTAAGTCG AACGGGACTG GGGGCAACTC CAGTTCAGTG
60 GCAGACGGGT GCGTAACACG TGAGCAACTT GTCCGACGGC GGGGGATAGC CGGCCCAACG
...
1320 GCCGCGGTGA ATACGTTCCC GGGCCTTGTA CACACCGCCC GTCACGCCAT GGAAGCCGGA
1380 GGGACCCGAA ACCGGTGGGC CAACCGCAAG GGGGCAGCCG TCTAAGGT
reference_db[-1]
DNA
----------------------------------------------------------------------
Metadata:
'description': ''
'id': '4483258'
'taxonomy': 'k__Archaea; p__Crenarchaeota; c__Thermoprotei;
o__Thermoproteales; f__Thermoproteaceae; g__; s__'
Stats:
length: 2123
has gaps: False
has degenerates: False
has definites: True
GC-content: 58.36%
----------------------------------------------------------------------
0 CTGGTTGATC CTGCCGGACC CGACCGCTAT CGGGGTGGGG CTTAGCCATG CGAGTCAAGC
60 GCCCCAGGGA CCCGCTGGGG TGCGGCGCAC GGCTCAGTAA CACGTGGCCA ACCTACCCTC
...
2040 ATAATCTCCT TATTGTCTGA TCCTTATGCA TTTTCCTTTG GCCCATCCCG TGAATACGCG
2100 CGGTGAATAC GTCCCTGCCC CTT
For the sake of runtime, we’re going to work through this chapter using a random sample of sequences from this database. Here we’ll use Python’s random module to
select sequences at random.
import random
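A minimal sketch of the subsampling step; the subset size of 5000 sequences is a placeholder chosen only to keep runtimes manageable.

# randomly subsample the reference database to keep runtimes manageable
reference_db = random.sample(reference_db, k=5000)
print("%s sequences in our subsampled reference database." % len(reference_db))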
We’ll also extract some sequences from Greengenes to use as query sequences in our database searches. This time we won’t annotate them (to simulate not knowing what
organisms they’re from). We’ll also trim these sequences so they’re shorter than the full length references. This will simulate obtaining a partial gene sequence, as is most
common with the current sequencing technologies (as of this writing), but will also help to make the examples run faster.
Note that some of our query sequences may also be in our subsampled reference database and some won’t. This is realistic: sometimes we’re working with sequences that
are exact matches to known sequences, and sometimes we’re working with sequences that don’t match any known sequences (or at least any in the reference database
that we’re working with).
queries = load_taxonomy_query_sequences()
queries = random.sample(queries, k=50)
Let’s inspect a couple of the query sequences that we’ll work with.
queries[0]
DNA
---------------------------------------------------------------------
Metadata:
'description': ''
'id': '1110988'
Stats:
length: 200
has gaps: False
has degenerates: False
has definites: True
GC-content: 46.50%
---------------------------------------------------------------------
0 GGTGAGTAAC ACGTGAGCAA CCTGCCTTTG AGAGAGGGAT AGCTTCGGGA AACTGATGGT
60 AATACCTCAT AACATATATT TAAGGCATCT TAGATATATC AAAGATTTAT CACTCAAAGA
120 TGGGCTCGCG TCTGATTAGC TGGTTGGTGA GGTAGCGGCT CACCAAGGCG ACGATCAGTA
180 GCCGGACTGA GAGGTTGAAC
queries[-1]
DNA
---------------------------------------------------------------------
Metadata:
'description': ''
'id': '547906'
Stats:
length: 200
has gaps: False
has degenerates: False
has definites: True
GC-content: 51.50%
---------------------------------------------------------------------
0 GTGGGCAACC TGCCCTATAC TGGGGAATAA TCACTGGAAA CGGTGACTAA TACCGCATGT
60 CATTGCGAGG GGGCATCCTC TTGCAAGAAA AGGAATTTAT TTCGGTATAG GATGGGCCCG
120 CATCTGATTA GCTAGTTGGT GAGATAACAG CCCACCAAGG CGACGATCAG TAGCCGACCT
180 GAGAGGGTGA TCGGCCACAT
There are a few realistic features of the situation that we’ve set up here that I want you to be aware of.
1. All of the query and reference sequences are homologous. In this case, they are all sequences of the 16S rRNA gene from archaea and bacteria. This may or may not
be the case in real-world applications. Sometimes you’ll work with gene-specific databases such as Greengenes, and sometimes you’ll work with non-specific
databases such as the NCBI nucleotide database (nr). Regardless, the search process is similar.
2. The distance between each query sequence q and its most closely related sequences in the reference database R will vary widely. Sometimes q will be an exact match to a reference sequence
r_i, and sometimes we may have as little as 50% similarity.
As we work through the next sections, imagine that we’re exploring scaling this system up, so that instead of searching just one or a few query sequences against the
reference database, we ultimately want to apply this to search millions of sequences against the database. This would be the real-world problem we faced if we had
collected 16S rRNA sequences from the environment (which would of course be unannotated) using high-throughput DNA sequencing.
Spend a minute looking at this function and try to understand what it’s doing.
import pandas as pd
from skbio.alignment import local_pairwise_align_ssw
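A minimal sketch of what a local_alignment_search function along these lines might look like; the column names and the keep-top-n behavior are assumptions based on the results table shown below.

def local_alignment_search(queries, reference_db, n=5,
                           aligner=local_pairwise_align_ssw):
    results = []
    indices = []
    for q in queries:
        # align the query against every sequence in the reference database
        for r in reference_db:
            alignment, score, _ = aligner(q, r)
            # percent similarity over the aligned region (1 - Hamming distance)
            percent_similarity = 100 * (1. - alignment[0].distance(alignment[1]))
            results.append([r.metadata['taxonomy'], percent_similarity,
                            alignment.shape.position, score])
            indices.append((q.metadata['id'], r.metadata['id']))
    index = pd.MultiIndex.from_tuples(indices, names=['query', 'reference'])
    results = pd.DataFrame(results, index=index,
                           columns=['reference taxonomy', 'percent similarity',
                                    'alignment length', 'score'])
    # keep only the n highest-scoring reference hits for each query
    return results.groupby(level='query', group_keys=False).apply(
        lambda hits: hits.nlargest(n, 'score'))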
Now let’s perform some database searches. You can run the remaining code cells in this section a few times to experiment with searching different query sequences against
the same reference database.
This next cell, which is the one that actually performs the database searches, will take a little bit of time to run (maybe up to a minute or two). There is some code in this
cell that will track the runtime. As it’s running, think about how many query sequences we’re searching against how many reference sequences, and refer back to the
number of sequences in the full reference database. Does this strategy seem scalable to millions of sequences, which as mentioned above might be our ultimate goal?
When you know the per-sequence runtime of this search, estimate how long it would take to do this in seconds for one million sequences. Convert the time in seconds to a
unit that will be more meaningful to you.
import time
start_time = time.time()
current_queries = random.sample(queries, k=4)
results = local_alignment_search(current_queries, reference_db)
stop_time = time.time()
print("Runtime: %1.4f sec per query" % ((stop_time - start_time) / len(current_queries)))
results
                 reference taxonomy                                 percent similarity  alignment length  score
query reference
      349639     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.500000               200    325
      616682     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.500000               200    325
      766020     k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.000000               200    320
      4714       k__Bacteria; p__Proteobacteria; c__Alphaproteo...           92.000000               200    320
      4452903    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.124402               209    276
      212819     k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           87.254902               204    269
      1148064    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.893204               206    266
      1813677    k__Bacteria; p__Gemmatimonadetes; c__Gemmatimo...           86.274510               204    264
      197286     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.500000               200    375
      540402     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.000000               200    370
      4217232    k__Bacteria; p__Proteobacteria; c__Gammaproteo...           97.000000               200    370
      736735     k__Bacteria; p__Proteobacteria; c__Gammaproteo...           96.500000               200    365
Now, let’s try to answer our initial question: what is the most likely taxonomic annotation for each of our query sequences? Spend a few minutes reviewing this information,
and write down what you think the most likely taxonomic annotation is for each of the query sequences. Here are some hints to help you out:
The k, p, c, o, f, g, and s refer to kingdom, phylum, class, order, family, genus, and species, respectively. If you see an annotation for a reference sequence that looks like
g__, that means that the genus is unknown for that sequence.
Just as the reference taxonomy annotations don’t always go down to the species level, your taxonomic annotations don’t have to either. Not assigning at a given
level implies that you’re uncertain about what the annotation should be at that level, and it’s usually better just to indicate that you’re uncertain rather than make a
bad guess. If you’re uncertain of what the species is, assign the query s__ and try to decide what the most likely genus is. If you’re uncertain of the genus, assign g__,
and try to decide what the most likely family is…
As you look at each of the reference taxonomy annotations below, refer back to the table above to look at the percent similarity between each query and reference,
and maybe the length of the alignments and their scores. These values give you an idea of how confident you should be in each of your taxonomic annotations.
for q in current_queries:
q_id = q.metadata['id']
print('Closest taxonomies for query %s (in order):' % q_id)
for e in results['reference taxonomy'][q_id]:
print(' ', e)
print()
Because we have taxonomic annotations for all of the Greengenes sequences (though as you probably have noticed by now, they differ in their specificity), we can next
look at taxonomy associated with each of our queries in Greengenes. How do your annotations compare to those from Greengenes, which we’ll print out in the next cell?
for q in current_queries:
q_id = q.metadata['id']
print('Known taxonomy for query %s:\n %s' % (q_id, reference_taxonomy[q_id]))
print()
As we discussed in the previous chapter, the run time of pairwise alignment scales quadratically with sequence length. Database searching, at least in the example we’re
exploring in this chapter, is a bit of a different problem however. Our sequence lengths aren’t changing, but rather it takes a long time because we’re performing a
computationally expensive step, pairwise alignment, many times. Our database is fixed in that the number of sequences in it doesn’t change and the sequences themselves
don’t change. Our query sequences are all exactly the same length in this example (remember that we set that above, when we sliced a single region from reference
database sequences to create our query sequences). Let’s explore how the runtime of this database search scales under these constraints.
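A minimal sketch of what tabulate_local_alignment_search_runtime and the n_query_sequences / n_reference_sequences inputs used below might look like; the specific counts are placeholders, not the chapter's actual values.

import itertools

# placeholder values: the numbers of query and reference sequences to test
n_query_sequences = [1, 5, 10]
n_reference_sequences = [100]

def tabulate_local_alignment_search_runtime(queries, reference_db,
                                            n_query_sequences,
                                            n_reference_sequences,
                                            search_function):
    results = []
    # time the search for every combination of query count and reference count
    for nq, nr in itertools.product(n_query_sequences, n_reference_sequences):
        current_queries = random.sample(queries, k=nq)
        current_references = random.sample(reference_db, k=nr)
        start_time = time.time()
        _ = search_function(current_queries, current_references)
        stop_time = time.time()
        results.append((nq, nr, stop_time - start_time))
    return pd.DataFrame(results, columns=['Number of query seqs',
                                          'Number of reference seqs',
                                          'Runtime (s)'])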
import pandas as pd
import itertools
local_alignment_search_runtimes = tabulate_local_alignment_search_runtime(queries,
reference_db,
n_query_sequences,
n_reference_sequences,
local_alignment_search)
local_alignment_search_runtimes
This table shows that we’ve tried a few variations on number of query sequences but kept the number of reference sequences constant. There is no variance in the query
sequence length, and there is a relatively small amount of variance in reference sequence length (they’re all of the same order of magnitude). There is also relatively little
variance in runtime for fixed numbers of query and reference sequences.
This table clearly shows that there is an increase in runtime with an increasing number of query sequences, which we’d of course expect. What we care about is how
runtime is increasing as a function of number of query sequences. Let’s plot runtime versus the number of query sequences to help us understand that relationship.
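A sketch of one way to draw that plot with seaborn, assuming the column names from the runtime-table sketch above.

import seaborn as sns

# plot per-search runtime as a function of the number of query sequences
ax = sns.regplot(x='Number of query seqs', y='Runtime (s)',
                 data=local_alignment_search_runtimes)
ax.set_xlim(0)
ax.set_ylim(0)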
What we see here is pretty clearly a linear relationship: runtime ≈ constant × number of query sequences. This is because as we increase the number of query
sequences, we’re increasing the number of pairwise alignments that we need to perform. If we have 5 queries and 10 reference sequences, we compute 5 × 10 = 50
pairwise alignments. If we have 10 queries and 100 reference sequences, we compute 10 × 100 = 1000 pairwise alignments. There are a few practical ways to reduce the
runtime of a process like this.
The first seems obvious, and even silly at first: perform fewer alignments. This could be achieved in a few ways. You could reduce the number of query sequences, though
this might be something a researcher is resistant to: they have some collection of unknown sequences, and they want to know what they all are. You could alternatively
reduce the number of reference sequences, but you might run into the same issues there: we wouldn’t want to exclude reference sequences that might provide us with
useful information about our query sequences. Finally, we might be able to figure out some ways to perform fewer alignments by not searching all of the query sequences
against all of the reference sequences. If we could come up with some procedure to approximate which pairwise alignments were likely to be good (i.e., high scoring) and
which were likely to be bad (i.e., low scoring) that is faster than performing the pairwise alignments, we could apply that procedure and only align a pair of sequences when
we expect to get a high score. That could potentially allow us to reduce the number of alignments we need to perform, and therefore the runtime of the algorithm.
Another approach to reducing the runtime of this process would be to create a faster implementation of the algorithm (though at some point that won’t be possible
anymore), use a faster computer, or run the process in parallel on multiple processors. All of these would be ways to reduce the runtime of the search by some factor f ,
where new runtime ≈ runtime
f
.
In practice, for a production-scale sequence database search application like BLAST, we’d combine these approaches. In the next section we’ll explore ways to reduce the
runtime of database searching for a fixed number of query sequences and a fixed number of reference sequences by reducing the number of pairwise alignments that the
search function will perform.
Heuristic algorithms
As mentioned above, it simply takes too long to search individual query sequences against a large database, and this problem isn't going away anytime soon. While computers are getting faster (or cheaper), the sizes of our sequence collections are growing because sequencing is getting cheaper. In fact, many people think that the cost of obtaining DNA sequences is dropping faster than the cost of computing. As the number of query sequences we can obtain for the same amount of money increases, and as the size of our reference databases increases (because we're continuously obtaining more sequence data), this problem will only get bigger. Figures 1 and 2, respectively, illustrate that these are both real-world issues. Notice that the axes are on a log scale in both cases.
import IPython.display
IPython.display.IFrame(width="600", height="394",
                       src="https://docs.google.com/spreadsheets/d/1vUkUuZsRlLW5U05rXXUn8B2sDYwShkClRMGa8Wiu6bc/pubchart?oid=1844125885&format=interactive")
import IPython.display
IPython.display.IFrame(width="763", height="371",
                       src="https://docs.google.com/spreadsheets/d/1vUkUuZsRlLW5U05rXXUn8B2sDYwShkClRMGa8Wiu6bc/pubchart?oid=2103353397&format=interactive")
One way that we can deal with this problem is by recognizing that most of the alignments that are performed in a database search are unlikely to be very good alignments.
An algorithm developer could therefore improve runtime by defining a heuristic (or a rule) that is applied to determine which reference sequences are likely to result in
good alignments, and only aligning the query against those. For it to be useful, making the decision to align or not (i.e., applying the heuristic) must be much faster than
actually performing the pairwise alignment. The heuristic also needs to make good choices about which reference sequences to align the query against. If the algorithm
chooses to not align against a specific reference, that reference is ruled out as a possible result of the database search. A good heuristic for sequence homology searching
would therefore be very unlikely to exclude the best alignment(s). When thinking about heuristic algorithms in general, there are some important considerations:
1. How often does the heuristic algorithm fail to get the right answer (in our case, does it make good choices about which reference sequences to align against)?
2. How much faster is the heuristic than the “complete” approach, and is that reduction in runtime enough to justify not being guaranteed to get the best answer?
Here’s the source code for this. You can see that we’re just wrapping our local_alignment_search function in a function that samples down to p% of the reference
sequences.
import random
def heuristic_local_alignment_search_random(
queries, reference_db, p, n=5, aligner=local_pairwise_align_ssw):
k = int(p * len(reference_db))
database_subset = random.sample(reference_db, k)
return local_alignment_search(queries, database_subset, n=n, aligner=aligner)
Let’s select some new queries and see how the results compare to our known taxonomies.
for q in current_queries:
q_id = q.metadata['id']
print('Closest taxonomies for query %s (in order):' % q_id)
for e in results['reference taxonomy'][q_id]:
print(' ', e)
print()
for q in current_queries:
q_id = q.metadata['id']
print('Known taxonomy for query %s:\n %s' % (q_id, reference_taxonomy[q_id]))
What we need now is a way to know how often we get the “right answer”, and how long this heuristic algorithm takes relative to the complete algorithm. We therefore first
need to define what the “right answer” is. How about this: if the most common taxonomy assignment resulting from the database search at taxonomy_levels levels of
taxonomy (i.e., how deep or specific our assignment is) matches the known taxonomy, then our algorithm has achieved the right answer. We can vary taxonomy_levels to
see how the different heuristics perform at different levels.
import collections
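A minimal sketch of what such an evaluation function might look like, given the definition of the "right answer" above; the name evaluate_search and the exact reporting format are assumptions.

def taxonomy_at_levels(taxonomy, levels):
    # split a Greengenes-style taxonomy string (semicolon-separated) and keep
    # only the first `levels` levels
    return tuple(e.strip() for e in taxonomy.split(';'))[:levels]

def evaluate_search(queries, reference_db, reference_taxonomy,
                    search_function, taxonomy_levels, n=5):
    start_time = time.time()
    search_results = search_function(queries, reference_db, n=n)
    stop_time = time.time()
    runtime = (stop_time - start_time) / len(queries)
    num_correct = 0
    for q in queries:
        q_id = q.metadata['id']
        # the known taxonomy for this query, truncated to taxonomy_levels levels
        known = taxonomy_at_levels(reference_taxonomy[q_id], taxonomy_levels)
        # the most common taxonomy among this query's top database hits
        counts = collections.Counter(
            taxonomy_at_levels(t, taxonomy_levels)
            for t in search_results['reference taxonomy'][q_id])
        predicted = counts.most_common(1)[0][0]
        print(q_id)
        print(known)
        print(predicted)
        if known == predicted:
            num_correct += 1
    return runtime, num_correct / len(queries)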
First let’s see how this works for our full database search algorithm. What’s the runtime, and how often do we get the correct answer? We’ll start with five levels of
taxonomy (which corresponds to the family level). This step will take a couple of minutes to run, because it’s doing the full database search.
taxonomy_levels = 5
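A sketch of how such a function might be invoked for the full database search (evaluate_search is the hypothetical helper sketched above); swapping search_function for each heuristic's functools.partial would presumably produce the comparisons that follow later in this section.

runtime, fraction_correct = evaluate_search(
    queries, reference_db, reference_taxonomy,
    search_function=local_alignment_search,
    taxonomy_levels=taxonomy_levels)
print('%1.2f seconds per query sequence' % runtime)
print('%1.2f fraction of queries correctly assigned' % fraction_correct)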
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Next let’s see how this compares to our random heuristic search algorithm. Try running this a few times, as you might get different answers due to different random
selections of the database.
import functools
heuristic_local_alignment_search_random_10 = functools.partial(
    heuristic_local_alignment_search_random, p=0.10)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Verrucomicrobia', 'c__Verrucomicrobiae', 'o__Verrucomicrobiales',
'f__Verrucomicrobiaceae')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales', 'f__')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__', 'o__', 'f__')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Again, what’s the runtime, and how often do we get the correct answer? Based on comparison to the full search, what do you think: is this a good heuristic?
After performing many trials of the above searches, I get the correct genus-level assignment about half as often with the random reference database heuristic relative to
the full database search. Your results might differ from that due to differences in the random selection of query and reference sequences. Try running all the cells in this
section a few times.
Go back to the beginning of this section and try running this check based on fewer levels of taxonomy (i.e., decreased taxonomic specificity, such as the phylum) and on
more levels of taxonomy (i.e., increased taxonomic specificity, such as the species level). How does that impact how often we get the right answer?
GC content
One metric of sequence composition that we can compute quickly (because remember, this has to be a lot faster than computing the alignment for it to be worth it) is GC
content. Let’s define a heuristic that only performs a pairwise alignment for the reference sequences that have the most similar GC content to the query sequence. The
number of alignments that we’ll perform will be defined as database_subset_size.
database_subset_size = 500
def heuristic_local_alignment_search_gc(
queries, reference_db, database_subset_size, n=5,
reference_db_gc_contents=None,
aligner=local_pairwise_align_ssw):
results = []
if reference_db_gc_contents is None:
reference_db_gc_contents = \
{r.metadata['id'] : r.gc_content() for r in reference_db}
for q in queries:
query_gc_content = q.gc_content()
database_subset = []
for r in reference_db:
ref_gc_content = reference_db_gc_contents[r.metadata['id']]
# find the difference in GC content between the reference and
# query. we'll sort and select our reference sequences by this
# value
database_subset.append((abs(ref_gc_content - query_gc_content), r))
database_subset.sort(key=lambda x: x[0])
database_subset = [e[1] for e in database_subset[:database_subset_size]]
results.append(local_alignment_search(
[q], database_subset, n=n, aligner=aligner))
return pd.concat(results)
If we run our queries again, how often do we get the right answer? How much did we reduce runtime? Do you think this is a better or worse heuristic than what we
implemented above?
heuristic_local_alignment_search_gc_2 = functools.partial(heuristic_local_alignment_search_gc,
database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Comamonadaceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Pseudomonadales',
'f__Moraxellaceae')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Pedosphaerae]', 'o__[Pedosphaerales]', 'f__')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micromonosporaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodospirillales',
'f__Rhodospirillaceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__', 'f__')
Try increasing and decreasing the number of sequences we’ll align by increasing or decreasing database_subset_size. How does this impact the runtime and fraction of
time that we get the correct answer?
kmer content
Another metric of sequence composition is kmer composition. A kmer is simply a word (or list of adjacent characters) of length k found within a sequence. Here are the
kmer frequencies in a short DNA sequence. The overlap=True parameter here means that our kmers can overlap one another.
import skbio
skbio.DNA('ACCGTGACCAGTTACCAGTTTGACCAA').kmer_frequencies(k=5, overlap=True)
{'ACCGT': 1,
'CCGTG': 1,
'CGTGA': 1,
'GTGAC': 1,
'TGACC': 2,
'GACCA': 2,
'ACCAG': 2,
'CCAGT': 2,
'CAGTT': 2,
'AGTTA': 1,
'GTTAC': 1,
'TTACC': 1,
'TACCA': 1,
'AGTTT': 1,
'GTTTG': 1,
'TTTGA': 1,
'TTGAC': 1,
'ACCAA': 1}
In our next heuristic, we'll only align our query against the reference sequences that contain the largest fraction of the kmers observed in the query sequence. This makes a lot of sense to use as an alignment heuristic: we're only aligning sequences when it looks like they'll have multiple length-k stretches of nucleotides that are not interrupted by substitutions or insertion/deletion mutations.
## TODO: Update this to be Jaccard index (or describe it that way, if it already is)
def fraction_shared_kmers(kmer_freqs1, kmer_freqs2):
"""Compute the fraction of kmers in kmer_freqs1 that are also in kmer_freqs2
Parameters
----------
kmer_freqs1, kmer_freqs2
Returns
-------
float
Notes
-----
k-mer counts are not incorporated in this distance metric.
"""
sequence1_kmers = set(kmer_freqs1)
num_sequence1_kmers = len(sequence1_kmers)
sequence2_kmers = set(kmer_freqs2)
shared_kmers = sequence1_kmers & sequence2_kmers
return len(shared_kmers) / num_sequence1_kmers
def heuristic_local_alignment_search_kmers(
queries, reference_db, database_subset_size, k, n=5,
reference_db_kmer_frequencies=None,
aligner=local_pairwise_align_ssw):
results = []
if reference_db_kmer_frequencies is None:
reference_db_kmer_frequencies = \
{r.metadata['id'] : r.kmer_frequencies(k=k, overlap=True) for r in reference_db}
for q in queries:
query_kmer_frequency = q.kmer_frequencies(k=k, overlap=True)
database_subset = []
for r in reference_db:
ref_kmer_frequency = reference_db_kmer_frequencies[r.metadata['id']]
s = fraction_shared_kmers(query_kmer_frequency, ref_kmer_frequency)
database_subset.append((s, r))
database_subset.sort(key=lambda x: x[0], reverse=True)
database_subset = [e[1] for e in database_subset[:database_subset_size]]
results.append(local_alignment_search(
[q], database_subset, n=n, aligner=aligner))
return pd.concat(results)
k = 7
Let’s apply this and see how it does. How does the runtime and fraction of correct assignments compare to our GC content-based search and our full database search?
heuristic_local_alignment_search_kmers_50 = \
functools.partial(heuristic_local_alignment_search_kmers, k=k,
database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Here we’ll compute all of the reference database kmer frequencies. Notice that this step takes about a minute to complete. This is a minute of compute time that we can
save on every database search!
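This presumably mirrors the dictionary comprehension inside heuristic_local_alignment_search_kmers, along the lines of:

reference_db_kmer_frequencies = {
    r.metadata['id']: r.kmer_frequencies(k=k, overlap=True)
    for r in reference_db}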
We’ll now pass our pre-computed kmer frequencies into our search function. How does the runtime and accuracy of this search compare to the searches above? This last
database search that we’ve implemented here is very similar to how BLAST works.
heuristic_local_alignment_search_kmers_50 = \
functools.partial(heuristic_local_alignment_search_kmers,
reference_db_kmer_frequencies=reference_db_kmer_frequencies,
k=k, database_subset_size=database_subset_size)
1111024
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__Gemmatimonadales', 'f__')
('k__Bacteria', 'p__Gemmatimonadetes', 'c__Gemmatimonadetes', 'o__', 'f__')
2729797
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Betaproteobacteria', 'o__Burkholderiales',
'f__Oxalobacteraceae')
864506
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__Chromatiales', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Gammaproteobacteria', 'o__[Marinicellales]',
'f__[Marinicellaceae]')
937848
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
('k__Bacteria', 'p__Verrucomicrobia', 'c__[Spartobacteria]', 'o__[Chthoniobacterales]',
'f__[Chthoniobacteraceae]')
4336536
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
('k__Bacteria', 'p__Acidobacteria', 'c__Acidobacteria-6', 'o__iii1-15', 'f__RB40')
774561
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
('k__Bacteria', 'p__Actinobacteria', 'c__Actinobacteria', 'o__Actinomycetales',
'f__Micrococcaceae')
4353245
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Rhodobacterales',
'f__Rhodobacteraceae')
92144
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
('k__Bacteria', 'p__Firmicutes', 'c__Clostridia', 'o__Clostridiales', 'f__Lachnospiraceae')
3394691
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__BD7-3', 'f__')
('k__Bacteria', 'p__Proteobacteria', 'c__Alphaproteobacteria', 'o__Sphingomonadales',
'f__Sphingomonadaceae')
Determining the statistical significance of a pairwise alignment
How do we decide whether an alignment score is good enough to support a hypothesis of homology? Consider the following pair of short sequences that we'd like to align:
GAAGCAGCAC
GAACAGAAC
If we tell our search algorithm that we’re interested in the alignment with the fewest number of substitutions, the following alignment would get us zero substitutions, but
there are a lot of bases that look homologous which are not aligned.
GAAGCAGCAC-----
GAA------CAGAAC
On the other hand, if we want to find the alignment with the fewest number of gaps, this one would get us that result, but we now have a lot of substitution events, and
some regions that clearly look misaligned (such as the CAG sequence in the middle of both).
GAAGCAGCAC
GAACAGA-AC
The alignment score that has been reported by our pairwise aligners helps us to balance these different features, and we can adjust the scoring scheme to weight things differently (e.g., so that gaps are penalized more or less than certain substitutions). The problem is that the scores are hard to interpret, particularly when we have only one or a few of them. One way to put a score in context is to compute a p-value: the probability that we would obtain an alignment score at least this good if the sequences being aligned are not homologous.
Our threshold for defining what we consider to be a high versus low p-value is dependent on how often we’re willing to be wrong. We would set this value, which is usually
referred to as α, to some fraction, and if our p-value is less than α, we say that the alignment is statistically significant. If our p-value is greater than α, we say that our
alignment is not statistically significant.
There are a couple of ways that we could be wrong when we do sequence homology searching, and we need to consider these when we determine what value we want to
define as α. First, we could say a pair of sequences are homologous when they’re not, which would be a false positive or a type 1 error. Or, we could say that a pair of
sequences are not homologous when they are, which would be a false negative, or a type 2 error.
If incurring a false positive about 5% of the time is acceptable (i.e., you’re ok with calling a pair of sequences homologous when they actually are not about one in twenty
times) then you’d set your α to 0.05. Setting α to a value this high likely means that the method will err on the side of false positives, and only infrequently will it say that a
pair of sequences are not homologous when they actually are (i.e., achieve a false negative). If α were set to be very low on the other hand (say, 1 × 10⁻⁵⁰), then you will
err on the side of false negatives. Only infrequently will you say that a pair of non-homologous sequences are homologous, but you might call many pairs of homologous
sequences non-homologous. You should think of α as a dial. If you turn the dial toward higher values, you’ll increase your false positive rate and decrease your false
negative rate. If you turn the dial toward lower values, you’ll decrease your false positive rate and increase your false negative rate.
There is not a hard-and-fast rule for whether false positives or false negatives are better, which makes choosing α hard. It’s application specific, so you need to understand
the biological question you're asking when making this decision, and the ramifications of false positives versus false negatives. In general, when might you prefer to have
false positives? When might you prefer to have false negatives?
First, we’ll define a function that can generate random sequences for us. This will take a scikit-bio sequence object (either skbio.DNA, skbio.RNA, or skbio.Protein) and a
length, and it will randomly generate a sequence of that type and length for us.
import random
def random_sequence(moltype, length):
result = []
alphabet = list(moltype.nondegenerate_chars)
for e in range(length):
result.append(random.choice(alphabet))
return moltype(''.join(result))
We can now run this a few times to generate some random sequences:
random_sequence(skbio.DNA, 50)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 46.00%
--------------------------------------------------------
0 ATCGTTCCTA TTCCCCGAGG CCATAGCTTG TTAAGTAAGT CCATCAGGAT
random_sequence(skbio.DNA, 50)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 60.00%
--------------------------------------------------------
0 GCTAGATGGT GGCGACGCTG GTTCCATCAA ATTGCGCCGC TCGTGCTACC
Next, we need a function that will shuffle the characters in a sequence and give us a new sequence back. We'll use this to generate a sequence that is similar (in length and composition) to our input sequence, but which we know is not homologous. We'll use Python's random.shuffle function, which randomly re-orders the elements in a sequence but keeps the composition and length of the sequence the same.
def shuffle_sequence(sequence):
# generate a list of the position indices (numbers) in sequence
randomized_order = list(range(len(sequence)))
# randomly rearrange the order of that list
random.shuffle(randomized_order)
# return a new sequence, where the positions are shuffled
return sequence[randomized_order]
Now we can define a random sequence and shuffle it. Notice how the sequences are different (in their order), but their compositions (e.g., length and GC content) are the
same. Shuffling will change the order of the bases, but it won’t change the frequency at which each base is present - it’s exactly analogous to shuffling a deck of cards.
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 54.00%
--------------------------------------------------------
0 GATAAGAAAT TAACCGCGTG GTGGGGAGAC GGCCCTTGTG TGGCTCAATC
shuffle_sequence(seq)
DNA
--------------------------------------------------------
Stats:
length: 50
has gaps: False
has degenerates: False
has definites: True
GC-content: 54.00%
--------------------------------------------------------
0 AGGACGGACA GAGTGCTCGG GTCGTTAGAC CACGAGTGAT TCCGATTTGA
Let’s generate a random query sequence and align it against itself to see what that score would be.
Score: 100.00
Next let’s generate 99 random variants of that sequence with shuffle_sequence and compute the pairwise alignment for each of those variants against the query sequence.
We’ll then look at the distribution of those scores.
def generate_random_score_distribution(sequence1,
sequence2,
n=99,
aligner=local_pairwise_align_ssw):
scores = []
# iterate n times
for i in range(n):
# generate a randomized version of the first sequence
random_sequence = shuffle_sequence(sequence1)
# align that randomized sequence against the second sequence
# and save its score
_, score, _ = aligner(random_sequence, sequence2)
scores.append(score)
# return the n randomized alignment scores
return scores
[10, 11, 10, 10, 13, 8, 13, 10, 10, 11, 12, 12, 14, 12, 9, 9, 11, 11, 11, 13, 12, 11, 12, 11,
12, 11, 15, 9, 12, 14, 16, 8, 13, 10, 10, 8, 11, 12, 16, 8, 12, 10, 12, 21, 10, 10, 17, 12,
13, 11, 10, 8, 12, 10, 12, 16, 15, 14, 12, 11, 12, 11, 10, 11, 14, 10, 10, 12, 9, 10, 16, 10,
10, 8, 12, 13, 10, 13, 14, 11, 10, 11, 14, 15, 11, 9, 10, 10, 11, 11, 12, 8, 13, 12, 11, 12,
10, 11, 12]
How does the actual score of aligning the sequence to itself compare to the score of aligning it to many similar but non-homologous sequences? Let’s plot these to get a
better idea.
plot_score_distribution(actual_score, random_scores)
/usr/share/miniconda/envs/iab2/lib/python3.8/site-packages/seaborn/distributions.py:2557:
FutureWarning: `distplot` is a deprecated function and will be removed in a future version.
Please adapt your code to use either `displot` (a figure-level function with similar
flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
<AxesSubplot:>
What does this tell us about our alignment score and therefore about our alignment? Is it good or bad?
We finally have information that we can use to evaluate an alignment score, and therefore to evaluate the quality of an alignment. Let’s use this information to quantify the
quality of the alignment by computing a p-value. As we described above, this is simply the probability that we would obtain an alignment score at least this good if the
sequences being aligned are not homologous. Since we have a lot of scores now from sequences that are similar but not homologous, if we just count how many are at
least as high as our actual score and divide by the number of scores we compute, that is an empirical (data-driven) way of determining our p-value.
To determine if our alignment is statistically significant, we need to define α before computing the p-value so the p-value does not impact our choice of α. Let’s define α
as 0.05. This choice means if we obtain a p-value less than 0.05 we will consider the alignment statistically significant and accept the hypothesis that the sequences are
homologous.
def generate_random_score_distribution(sequence1,
sequence2,
n=99,
aligner=local_pairwise_align_ssw):
scores = []
# iterate n times
for i in range(n):
# generate a randomized version of the first sequence
random_sequence = shuffle_sequence(sequence1)
# align that randomized sequence against the second sequence
# and save its score
_, score, _ = aligner(random_sequence, sequence2)
scores.append(score)
# return the n randomized alignment scores
return scores
def fraction_better_or_equivalent_alignments(sequence1,
sequence2,
n = 99,
aligner=local_pairwise_align_ssw):
# align sequence1 and sequence2 and store the score of the alignment
_, actual_score, _ = aligner(sequence1, sequence2)
# compute the distribution of randomized scores
random_scores = generate_random_score_distribution(sequence1,
sequence2,
n,
aligner=aligner)
# count the number of random scores that are at least as good as our
# actual score
count_better = 0
for s in random_scores:
if s >= actual_score:
count_better += 1
# return the number of times we observed a random score at least as good as
# the actual score, divided by the number of scores we computed. we add one
# to the numerator and denominator to account for our actual_score
return (count_better + 1) / (n + 1)
The fraction that we get back here is 0.01, which is lower than α, so we would accept the hypothesis that our sequences are homologous.
A few notes on these empirically defined p-values. First, here's what the formula for computing this looks like:
p-value = (number of computed alignment scores greater than or equal to the actual alignment score) / (number of alignment scores computed)
The numerator and the denominator both include the actual alignment score, so the lowest p-value that can be achieved here is 1/(99+1): the 1 in the numerator corresponds to our actual alignment score (which is of course equal to itself), the 99 in the denominator is the number of permutations, and the 1 in the denominator is a constant which corresponds to the computation of the actual score. If we increase the number of permutations, say to 999, we could achieve greater precision (more significant digits) in our p-value.
When we achieve the lowest possible value for a given test, as is the case here, we report the p-value as being less than that value, since we've yet to observe a random alignment score at least as high as the actual score. For example, here we would report something like:
The alignment of our query and reference sequence was statistically significant, as determined by comparing our actual alignment score to the scores of randomized variants of the sequences (p < 0.01).
Let’s now try this for some harder cases, where the query and subject sequences are not identical. First, let’s generate a longer subject sequence at random. Then, we’ll
create a random query sequence and compare it. Since we’re doing this in two random steps, we know that these sequences are not homologous. Does the resulting p-
value reflect that?
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 45.20%
---------------------------------------------------------------------
0 TATCCGTACG TAGCCCACTT AATGTAGATA TAGTTCTATA AGTCAACAGA TTACACCGCG
60 GAACCGTCAC TAGTTATGTA GCTTCATGAG AATTGCAGGT CGATTATCTG CCCCCACGCT
120 AATAACTAGT GGAATGGCAG CAAGTGGCTT CTACCTGCCT TTATAGAGAG TGACTTGAAT
180 AACATACAAG CGGGCCAAAG CAGGAGGCGA ATTAACTTAG GGCGTAGACA AGTGTCGGGT
240 GTTTGTAGAC
We’ve now looked at two extremes: where sequences are obviously homologous (because they were the same), and where sequences are obviously not homologous
(because they were both independently randomly generated). Next, we’ll explore the region between these, where this gets interesting. We’ll now create a partially
randomized sequence to create a pair of sequences where the homology is more obscure. We’ll do this again using the Python random module, but this time we’ll introduce
mutations only at some positions to create a pair of sequences that are approximately percent_id identical.
Let’s define a function to do this, and then compute a sequence that is 95% identical to our sequence1.
sequence1
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
sequence1_95
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 52.00%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTTCGCAG CTTCATACGG GTAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGC GAAAACGAGG
120 AGGTGCCGCG ACTTGTGCTT CGTTTACGCT GGGCCCGAAT GCTACACTTC GTTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA ACACCAACTC TTGGTAACGG TAAGAAATAG
240 TGGTGCGAGA
Notice how these sequences are almost identical, but have some differences. Let’s apply our approach to determine if it would identify these sequences as being
homologous based on α = 0.05 .
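A sketch of that check, assuming the fraction_better_or_equivalent_alignments function and the sequence1 / sequence1_95 pair from above:

p_value = fraction_better_or_equivalent_alignments(sequence1, sequence1_95)
print(p_value)
print(p_value < 0.05)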
You likely got a significant p-value there, telling you that the sequences are homologous.
Now let’s simulate much more distantly related sequences by introducing substitutions at many more sites.
sequence1
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 53.20%
---------------------------------------------------------------------
0 GTGCCTACTG AGGGAGCTCA TCGTGCGCAG CATCATACGC ATAACCGAAT TCTATTAGCC
60 GCCCCGCTCA TGTAGCCTGG CCTTACTTCA AAGAAAGTCC GTAACGTGGA GAAACCGAGG
120 CGGTGCCGCG ACTTGTGCTT CGTTTACGCG GGGCCCGAAT GCTACACTTC GCTCGTCGGA
180 CGTCGATCAT AGTGACTGCG ATGAAAATGA AGACCAACTC TTGGTAACGG TAAGAAAAAG
240 TGGTGCGAGA
sequence1_25
DNA
---------------------------------------------------------------------
Stats:
length: 250
has gaps: False
has degenerates: False
has definites: True
GC-content: 48.80%
---------------------------------------------------------------------
0 GTGGATAATA CGTGATAGCA TGGTGAGGAT GAGTTCACCA ATTGACGCTT TTTCTCAGCG
60 GCCTGGGTCA GGGCACCGAT TAGGGGTGAA TCCTAACTCA TTAGCGTGTC ATGACGGAGC
120 TGTAGCCGCG ACATGTCACT AGGGCTGACT AAGGCGAGAC AGGAGCCTGC CCGTAGATGA
180 GGGAAATAAA AGTTTACCAG ATTATAATTA AGACATCCTC TGGGATACGT TAGGATATGG
240 CGGTCTTTGA
Let's run a simulation to gain some more insight into the limit of detection of this method. We'll run this approach for pairs of sequences where we vary the percent_id parameter, and identify when our approach stops identifying sequence pairs as being homologous. This is important to know as a bioinformatician, because it tells us the approximate pairwise similarity below which we will no longer be able to identify homology using this approach.
import numpy as np

# First, let's define the range of percent identities that we'll test
percent_ids = np.arange(0.0, 1.0, 0.05)
# Then, we'll define the number of random sequences we'll test at each percent identity
num_trials = 20
# Then, we'll define the sequence length that we want to work with, and num_trials random sequences
sequence_length = 150
random_sequences = [random_sequence(skbio.DNA, sequence_length) for i in range(num_trials)]
results = []
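A minimal sketch of the simulation loop, assuming the hypothetical mutate_sequence helper sketched earlier and the fraction_better_or_equivalent_alignments function defined above; n=19 permutations per trial is an arbitrary choice to keep runtime low.

for percent_id in percent_ids:
    for seq in random_sequences:
        # create a partially randomized partner sequence at this percent identity
        seq_mutated = mutate_sequence(seq, percent_id)
        p = fraction_better_or_equivalent_alignments(seq, seq_mutated, n=19)
        results.append((percent_id, p, p < 0.05))

results = pd.DataFrame(results,
                       columns=['Percent id between sequences', 'p-value',
                                'p-value < 0.05 (significant)'])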
What does this simulation tell us about our limit of detection for homology (i.e., how similar must a pair of sequences be for us to reliably be able to identify homology
between them)? Is this higher or lower than you expected?
With respect to our simulation, I took a few shortcuts here to keep the runtime low. What are some things that could be improved to make this simulation more robust, if
we weren’t as concerned about runtime?
By J Gregory Caporaso
© Copyright 2014-2021.