Qualitative Analysis of Biomolecules: 1. The Human Genome

Uploaded by

aysepolat7000
Copyright © All Rights Reserved
Qualitative analysis of biomolecules

In this exercise, you will answer questions and do an exercise in R to learn a few examples of
different methods of qualitative analysis of biomolecules.
Objective: The goal of these questions and the exercise is to familiarize you with DNA sequencing
methodology, as well as the basic principles underlying sequence alignment and assembly.

1. The human genome


Answer the following questions:
Question 1.1
What is genomics and how does it relate to genetics?
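- Genetics is the study of individual genes and how traits are inherited, while genomics is
the study of the whole genome, including all genes and the sequences between them. The two
fields are closely related: genomics extends genetics to the genome-wide scale.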

Question 1.2
What is coding and non-coding DNA? Which major classes of elements are found in each type of
DNA?
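- Coding DNA consists of the exon sequences that are translated into protein. Non-coding
DNA does not code for protein. It includes introns, which are transcribed to RNA but not
translated, since they are removed by the spliceosome, as well as other classes such as
regulatory elements, non-coding RNA genes and repetitive sequences.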

Question 1.3
What are the major types of variation in the genome between individuals?
- Single nucleotide polymorphisms (SNPs) are the most common type of variation between
the genomes of individuals. A SNP is a position where one nucleotide is exchanged for
another, occurring on average about once every 300 nucleotides in the genome.

2. DNA sequencing
Answer the following questions:
Question 2.1
Briefly describe the principles of reversible-terminated sequencing. Include in your answer, what a
Phred score represents and define the following terms: Fragment, read length, single-end reads, and
paired-end reads.
- Reversible-terminated sequencing is a next generation sequencing (NGS) method in which
the DNA is not sequenced as a whole, but as many small fragments. In each cycle, a
fluorescently labelled nucleotide with a reversible terminator is incorporated, the base
is read, and the terminator is removed so that the next base can be added.
A fragment is one of the pieces the DNA is broken down to. It consists of 400-1000
basepairs and produces a read. The read length is the number of bases that are actually
sequenced from a fragment and is between 50-200 basepairs. Reads can be separated into
single-end and paired-end reads, where a single-end read is obtained by sequencing just
one end of the fragment and paired-end reads are obtained by sequencing the fragment
from both ends.
A Phred score is a measure of base calling accuracy in Illumina sequencing (Illumina). It
is defined as Q = -10 * log10(P), where P is the probability that the base call is wrong.
The larger the Phred score, the better the quality of the sequence. An acceptable Phred
score is above 20, which corresponds to an error probability of less than 1 in 100.
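
The relationship between a Phred score and the error probability can be sketched in a few
lines of Python. This is only an illustration of the formula Q = -10 * log10(P); the
function names are made up for this example and are not part of the R exercise.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q into the base-calling error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


# Print the error probability for a few common Phred scores.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

A score of 20 therefore means that about 1 in 100 base calls is expected to be wrong, and a
score of 30 about 1 in 1000.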

Question 2.2
What is the name of the standard data format for DNA sequences? Provide an example of a
10-basepair long sequence in this format.
- The standard format for DNA sequences is the FASTA format, which is used to describe
nucleotide and peptide sequences. Each record consists of a header line starting with ">"
followed by the sequence itself.
>example_sequence_of_10_basepairs
ATGGCGTCCT
Read as codons, this sequence encodes Met (M), Ala (A) and Ser (S), followed by a single
leftover thymine nucleotide.
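
As a rough check of the example above, the FASTA record can be parsed and translated with
a short Python sketch. The helper names parse_fasta and translate are hypothetical and
written only for this illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def translate(dna):
    """Translate complete codons; return (protein, leftover nucleotides)."""
    usable = len(dna) - len(dna) % 3
    protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
    return protein, dna[usable:]


records = parse_fasta(">example_sequence_of_10_basepairs\nATGGCGTCCT\n")
for name, sequence in records.items():
    print(name, len(sequence), translate(sequence))
```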

Perform the exercise called “Sequence alignment and assembly in R” and answer the
following questions:

In the section: Performing automated global alignments using various gap penalty schemes
Question 3.1: Which gap penalty strategy do you think is most suitable, if we assume that the two
DNA sequences are from the open reading frame of the same gene, but in two evolutionarily distant
species?
- The first alignment in R uses a gap opening penalty of 0 and a gap extension penalty of
3, which makes the program more prone to insert gaps than to accept mismatches. This
gives the highest score, -46, but it is not biologically favorable, since gaps can
shift the reading frame.

- The second pairwise alignment in R uses a gap opening penalty of 0 and a gap extension
penalty of 16, which results in a few small, individual gaps and a score of -148.
- The third alignment has a gap opening penalty of 8 and a gap extension penalty of 3,
making it more prone to extend an existing gap than to open a new one, and leading to a
score of -94. Evolutionarily, this alignment makes the most sense, since mismatches are
more likely than gaps within an open reading frame. It is a global alignment with an
affine gap penalty.

- Normally the highest score would be seen as the best answer, but here we must also
consider what is evolutionarily favorable. Therefore, the third alignment is the best
fit biologically.
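
The effect of the three gap penalty schemes can be illustrated with a small Python sketch.
It assumes that a gap of length n costs opening + n * extension, which is how the gap
parameters of pairwiseAlignment are understood here; this convention should be checked
against the Biostrings documentation. The function names are hypothetical.

```python
def gap_cost(length, opening, extension):
    """Cost of one gap, assuming cost = opening + length * extension."""
    return opening + length * extension


def compare(opening, extension, total=3):
    """Cost of one gap of `total` positions versus `total` separate gaps of length 1."""
    one_long_gap = gap_cost(total, opening, extension)
    many_short_gaps = total * gap_cost(1, opening, extension)
    return one_long_gap, many_short_gaps


# (opening, extension) pairs used in the three alignments of the exercise.
SCHEMES = {"alignment 1": (0, 3), "alignment 2": (0, 16), "alignment 3": (8, 3)}

for name, (opening, extension) in SCHEMES.items():
    print(name, compare(opening, extension))
```

With an opening penalty of 0, one long gap and several short gaps cost the same. Only the
affine scheme of the third alignment makes a single, longer gap cheaper than several
scattered ones.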

Question 3.2: Use the pairwiseAlignment function to perform local alignment using affine gaps
with opening penalty of 8 and extension penalty of 3. What is the alignment found by the local
alignment approach?
- Compared to the third alignment, which has a score of -94 with the same gap opening
and extension penalties, the local alignment score of 14 is much better. This is not
surprising: because the alignment is performed locally, the program only aligns the
region where the two sequences are most similar, and is therefore not forced to
introduce penalized gaps and mismatches to align the sequences over their full length.
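
The difference between global and local alignment can be sketched with a minimal dynamic
programming example in Python. To keep it short, this sketch uses a simple linear gap
penalty instead of the affine penalty of the exercise, and the scoring values (match 1,
mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style), never below 0.
    """
    # First row: leading gaps are penalized only in the global case.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores are floored at 0.
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


s1, s2 = "AAAACGT", "CGTTTTT"
print("global:", align_score(s1, s2))
print("local: ", align_score(s1, s2, local=True))
```

The local score only counts the shared region "CGT", while the global score also has to
pay for the unrelated ends of both sequences.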

In the section: Evaluating the significance of pairwise overlaps through shuffling the subject string
Question 3.3: Based on the histogram, do you think that there is more overlap between s1 and s2
than expected by random chance?
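- The histogram shows the alignment scores obtained with randomly shuffled sequences,
which lie at about -150 to -160, whereas the real alignment has a score of -94. The real
score is much higher than the randomized scores, so there is more overlap between s1
and s2 than expected by random chance.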
Question 3.4: Do minor changes to the gap scoring strategy affect this conclusion?
- When the gap opening penalty is changed to 6 and the extension penalty is kept at 3, the
score is -86, which is not significantly different from the original alignment score of
-94. This suggests that the conclusion is not affected.

Question 3.5: What is the Z-score of the alignment and what does the Z-score indicate?
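- The Z-score of the real alignment, which has a score of -94, is calculated to be 4.93.
The Z-score indicates how many standard deviations the real score lies above the mean of
the scores from the shuffled sequences. Since the Z-score is higher than 3, the alignment
is unlikely to have arisen by chance and reflects genuine similarity between the
sequences. A Z-score lower than 3 could be explained by chance.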
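
The shuffling approach and the Z-score calculation can be sketched in Python as follows.
The null scores in the example are made-up values for illustration only and are not the
results of the exercise; the function names are hypothetical as well.

```python
import random
import statistics


def shuffle_sequence(sequence, rng):
    """Return a random permutation of the sequence (same base composition)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def z_score(real_score, null_scores):
    """Number of standard deviations the real score lies above the null mean."""
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    return (real_score - mean) / sd


rng = random.Random(0)
print(shuffle_sequence("ATGGCGTCCT", rng))

# Made-up scores from alignments against shuffled sequences (illustration only).
null_scores = [-150, -160, -155, -145, -165]
print(round(z_score(-94, null_scores), 2))
```

In the exercise, the null scores would come from aligning s1 against many shuffled
versions of s2 with the same scoring scheme as the real alignment.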

In the section: Genome assembly by finding the Eulerian path through a De Bruijn graph
Question 3.6: How many possible k-mers are there if you set k = 8 instead of k = 7?
- When k is set to 8, 41 k-mers are formed instead of 42. This means that we have 41
fragments of length 8 instead of 42 fragments of length 7.
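
A short Python sketch shows why the number of k-mers drops by one. It assumes that the
k-mers are all overlapping substrings of a single sequence of length L, in which case there
are L - k + 1 of them; this is consistent with the drop from 42 to 41. The toy sequence
below is not the one from the exercise.

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy = "ATGGCGTCCT"  # toy sequence of length 10, for illustration only
for k in (7, 8):
    print(k, len(kmers(toy, k)), kmers(toy, k))
```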

Question 3.7: Find the unique Eulerian path through the graph. What is the sequence of the original
DNA fragment?
- The DNA fragment reads:
CTCAGATCCAATGATTATTCTCCATTGTGCAAGATTTCTTATGGGCTTCCTACTTCCCCTGAAAG
AAGATCAGCATTCTTATCATGGTGGAG
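
The idea behind the assembly can be sketched in Python: each k-mer becomes an edge from its
prefix to its suffix (both of length k - 1), and an Eulerian path through this De Bruijn
graph spells out the sequence. This is a minimal sketch using Hierholzer's algorithm on a
toy example; it is not the code of the R exercise, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Build a De Bruijn graph: one edge prefix -> suffix for every k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Find an Eulerian path with Hierholzer's algorithm (assumes one exists)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_degree[node] == 1:
            start = node
            break
    edges = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(kmers):
    """Reconstruct a sequence by walking the Eulerian path through the graph."""
    path = eulerian_path(de_bruijn(kmers))
    return path[0] + "".join(node[-1] for node in path[1:])


toy = "ATGGCGTCCT"  # toy sequence, for illustration only
toy_kmers = sorted(toy[i:i + 4] for i in range(len(toy) - 3))
print(assemble(toy_kmers))
```

The path is only unique if no (k - 1)-mer is repeated in a way that creates alternative
routes, which is why the choice of k matters for the assembly.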

Question 3.8: Use BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) to
identify the species the DNA fragment is derived from.
- Using BLAST, the DNA fragment is found to be part of chromosome 1 of Homo sapiens.
References:
Illumina. Quality Scores for Next-Generation Sequencing. Retrieved from
https://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf