
DEPARTMENT OF ANATOMY

Elective topic for 1st Year MBBS

Name of Elective: Bioinformatics and Next Generation Sequencing (NGS)

Location: Computer Lab and Genetics Lab

Name of Internal preceptor(s): Dr. Balarama Kaimal, Dr. Sunil Mathew

Learning objectives of Elective:
● Next generation sequencing (NGS) experiments
● Concepts and tools for NGS
o Understand, analyze, and interpret data
● Genomic technologies used in genomic data science
o Common tools
▪ command line
▪ software implementation
● Introduction to Python and R
● DNA, RNA, and epigenetic patterns
● Genome analysis

Duration of course: 60 hours (4 credits)

Number of students that can be accommodated in this elective: 25–30 students

Prerequisites for elective: Python basics and basics of genetics

Learning resources for students: Presentations, links to tools, datasets for analysis

List of activities of student participation:
● Genomics: explanation
● What Is Computational Genomics: explanation
● What Is Genomic Data Science: explanation
● Just Enough Molecular Biology: explanation
● Importance of Molecules in Molecular Biology in Genomics
● The Human Genome Project: brief overview
● Molecular Biology Structure: overview
● From Genes to Phenotypes: overview
● Polymerase Chain Reaction: overview
● Next Generation Sequencing: overview
● Application of Sequencing: overview
● Computer Science: overview
● Algorithms: overview
● Memory and Data Structures: overview
● Efficiency: brief idea
● Software Engineering: overview
● Computational Biology Software: overview
● Statistics: importance
● Common Mistakes: overview
● The Central Dogma of Statistics: overview
● Data Sharing Plans: overview
● Getting Help with Statistics: guide
● Plotting Your Data: practice
● Sample Size and Variability: understanding
● Statistical Significance: understanding
● Multiple Testing: overview
● Study Design, Batch Effects and Confounding: understanding

Portfolio entries required:
1. Documentation of presentation notes
2. Documentation of study

Logbook entry required: Completion of posting signed by faculty in-charge and HOD.

Assessment: Attendance, presentation of worked-up case, day-to-day activity in computer lab and genetics lab, MCQs

Other comments: For doctors in these fields seeking to gain familiarity with data science and statistical tools, to interact better with the data in their genetic datasets.
Bioinformatics 

Course Introduction

This course introduces you to the basic biology of modern genomics and the experimental tools used to measure this biology. It covers the technology used to "read" DNA or RNA, which in turn creates the data that provides the raw material for genome analysis. The course also gives a brief introduction to key concepts in computing and statistics that you will need to understand how next generation sequencing data are analyzed.
Content Overview and Time Estimate

TOPIC – HOURS

WHY GENOMICS? – 03:30
WHAT IS GENOMICS? – 04:30
WHAT IS COMPUTATIONAL GENOMICS? – 02:00
WHAT IS GENOMIC DATA SCIENCE? – 02:30
JUST ENOUGH MOLECULAR BIOLOGY – 02:00
IMPORTANCE OF MOLECULES IN MOLECULAR BIOLOGY – 02:00
THE HUMAN GENOME PROJECT – 05:30
MOLECULAR BIOLOGY STRUCTURE – 03:00
FROM GENES TO PHENOTYPES – 03:00
POLYMERASE CHAIN REACTION – 02:00
NEXT GENERATION SEQUENCING – 02:00
APPLICATION OF SEQUENCING – 02:00
WHAT IS COMPUTER SCIENCE? – 01:00
ALGORITHMS – 01:00
MEMORY AND DATA STRUCTURES – 02:30
EFFICIENCY – 01:00
SOFTWARE ENGINEERING – 03:00
WHAT IS COMPUTATIONAL BIOLOGY SOFTWARE? – 03:30
WHY CARE ABOUT STATISTICS? – 01:00
WHAT WENT WRONG? – 01:00
THE CENTRAL DOGMA OF STATISTICS – 01:00
DATA SHARING PLANS – 01:00
GETTING HELP WITH STATISTICS – 01:00
PLOTTING YOUR DATA – 01:00
SAMPLE SIZE AND VARIABILITY – 02:00
STATISTICAL SIGNIFICANCE – 02:00
MULTIPLE TESTING – 02:00
STUDY DESIGN, BATCH EFFECTS AND CONFOUNDING – 02:00

TOTAL COURSE HOURS – 60 Hours

WHY GENOMICS?

● Genomics is the study of the genomes inside of us.
● Genome
- Full set of genetic information
- Required for the functioning of an organism
● Genomics: the study of genomes
▪ Structure: physical appearance, biochemical sequence
▪ Function: what the DNA does
▪ Evolution: how sequence changes over time
▪ Mapping: where are the genes (the interesting bits)?
● Our genomes are 99.9% identical, or even more than that.

Common questions genomics hopes to answer:

● What is driving all these differences between people?
● Why is one person tall and another person short?
● Why does one person live to be 100 and another not?
● Why does one person get cancer and another not?
● Many of these things, we suspect, are driven by our genomes, and we want to understand that.

We all start from the same genetic material.

● Each of us develops: single cell > a few apparently identical cells > embryo > whole person.
● The entire program of development is encoded in our genome, and we don't yet understand it well.

The genetic code determines us.

● The code in our cells determines all the different cell types.
● Example: neuron vs skin cell.
- The genome inside a neuron = the genome inside your skin cells.
- Same code > executing a different program > neuron vs skin cell.

The genetic code determines our problems.

● What does genomics have to do with cancer?
● Cancer is, roughly, a genetic disease.
● A normal cell's genetic code is nearly identical to a cancer cell's genetic code.
● Cancer = cells dividing without any check on their division.
● We define cancers by the type of cell that started the cancer:
- Skin cancer: melanoma.
- Blood cancers: leukaemia.

The genetic code determines variations in our problems.

● The consequences of different cancers are very different.
● What do our genes have to do with any of this?
o The mutations in different cancers are also different.
o Mutations arise from DNA damage or from accidents in replication: each time a cell divides, the entire genome is copied, and in cancer that division is uncontrolled.
● Central Dogma
o Francis Crick, co-discoverer of the structure of DNA over 50 years ago.
o It is not an absolute dogma, but it is the central dogma of biology.
o Information generally flows in a single direction from your genome.
▪ Producing proteins: DNA > RNA > Proteins.
▪ Step 1, copying (transcription): DNA is turned into RNA.
● Exons are transcribed.
● RNA is a copy of the DNA.
▪ Step 2, conversion (translation): RNA > protein.
● The nucleic acid alphabet has 4 letters.
● There are 20 amino acids (9 essential).
● Amino acids mix and match to make up the cell's different proteins; a protein might be 300 or 400 amino acids long.
● 3 stop codons: UAG, UAA, UGA.
o Proteins in your body do
▪ most of the functional work (metabolism etc.), and
▪ move things around in the cells.
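
As a toy illustration of these two steps, here is a minimal Python sketch (the course assumes Python basics) that transcribes a DNA fragment into RNA and then translates it codon by codon. Only a handful of the 64 codons are included in the table, so this is a sketch of the idea, not a complete genetic code.

# Toy central dogma: DNA -> RNA (transcription), then RNA -> protein
# (translation). Only a few of the 64 codons are listed here.
CODONS = {
    "AUG": "M", "GCC": "A", "UGG": "W",      # a tiny sample of codons
    "UAG": None, "UAA": None, "UGA": None,   # the three stop codons
}

def transcribe(dna):
    return dna.replace("T", "U")             # copy DNA into RNA

def translate(rna):
    protein = []
    for i in range(0, len(rna) - 2, 3):      # read three letters at a time
        amino = CODONS[rna[i:i + 3]]
        if amino is None:                    # stop codon: end of protein
            break
        protein.append(amino)
    return "".join(protein)

rna = transcribe("ATGGCCTGGTAG")             # -> "AUGGCCUGGUAG"
print(translate(rna))                        # prints "MAW"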

However, this is not the whole picture (it is not one-way).

o Information can also flow the other way.
o Thought experiment: if everything just flowed from the DNA to the proteins, it would be impossible for cells to behave differently. So what is going on?
o Proteins themselves go back, bind to the DNA, and modify it.
o These DNA modifications change which genes get turned on and off.
o Two kinds of backflow change the DNA:
o proteins can self-regulate in this way > change in DNA;
o modifiers / methylation marks > change in DNA.
o Measuring backflow in genomics: sequencing.
o Use case: to understand cancer, we must go and get some cancer cells, sequence them, and figure out what is going on.
Sequencing

o Sequencing is at the heart of genomics.
o A major reason for the field's rapid improvement is that genome technology has gotten
o faster, and
o more efficient.
o A sequencer today: the highest-throughput sequencers can produce on the order of a trillion nucleotides of DNA in a single run.

Human Genome Project

o Started in 1989 with the goal of sequencing one human genome in 15 years.
o Published the human genome in 2001, after 12 years (3 years early).
o Today a single investigator with a single sequencer in a single lab can do it in a few days: roughly 100 times more efficient.

Future feasibility

o More data, and faster data.
o Faster technology and computing.
o Ever cheaper data, technology, and computing.
o Full human genome sequencing has dropped from $25–30 million to about $1,000 today.

https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Where is all this data?

o Trillions of bases of data have been generated.
o We can download this data and study it ourselves.
o Even though this data has been published,
o it is always a learning tool, and
o there is always something new to find.
o The biggest data repositories:
o the National Center for Biotechnology Information (NCBI), and
o the Sequence Read Archive (SRA).
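
As a pointer to how you might query these resources programmatically, here is a sketch using Biopython's Entrez module to search the SRA at NCBI. The search term and email address are placeholders to adapt, and this only fetches record IDs, not the sequence data itself.

# Sketch: query NCBI's Sequence Read Archive (SRA) via Biopython's
# Entrez interface. The search term below is only an example.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a real contact address

handle = Entrez.esearch(db="sra", term="Homo sapiens[Organism]", retmax=5)
record = Entrez.read(handle)
handle.close()
print(record["IdList"])  # UIDs of the first five matching SRA records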

WHAT IS GENOMICS? 04:30 Hours

● Genomics is the study of genomes.
● The genome is all the molecular material inside your cells that defines how your body works.
● Genome structure (DNA structure)
o Nucleotides > DNA > chromosomes (folded by histones).
o Nucleotides: As, Cs, Gs, and Ts.
● Humans: approximately 3 billion nucleotides.
● Humans have 23 chromosome pairs:
o 22 pairs (the autosomes), each consisting of two nearly identical copies, and
o 1 pair of sex chromosomes:
▪ man: X and Y
▪ woman: X and X
o Middle portion of a chromosome: centromere.
o End portions: telomeres (special sequence).

So that is the structure of the genome. The function of a genome, though, involves many, many other things, and when we sequence a genome, what we are really trying to get at is the function. The function is all the things the genome does. In that long string of letters is encoded everything that your body can do: the instructions for how you make all the organs in your body, for how your body develops from a single embryo into all these complicated tissues, for how your body does things like respiration and metabolism, and even for such complicated things as building a brain.

Another aspect of genomics is evolution. Evolution is a very broad topic; when we talk about the evolution of genomes, we talk about how genomes themselves change over time, and we usually mean over evolutionary time, which is a very, very long time period. We know from having sequenced the human genome, and now having sequenced many individual human genomes, that we are all nearly identical to one another. Our genomes barely change from generation to generation, and most of those changes are small, random changes which don't have any effect. But we can compare our genomes to the genomes of other creatures, such as chimpanzees, our closest relatives, which diverged from us approximately 6 million years ago. And we can compare our genomes to the genomes of much more distant things like fruit flies, nematodes, or even bacteria. When we do that, we discover that we share a surprising amount of sequence, even with things as remote from us as bacteria. When you think about it, that makes sense: even though we are obviously very different from bacteria, every living thing on the planet uses DNA as its basic code, and every living thing has to do certain things in common because of that. For example, every living thing has to copy its DNA in order to make copies of its cells. Bacteria use a very similar mechanism to copy DNA as humans do, so in those cases we find that genes in bacteria are similar to genes in humans.

Finally, when we talk about mapping genomes: mapping sometimes refers to just sequencing itself, capturing the sequence, but once we capture the sequence, the first thing we do after that is try to figure out where the genes are, and we'll talk a lot more about that later in the course. The word "gene" has changed a lot over the past 50 to 100 years. One way to think about it is as an inheritable unit, something that you can inherit from your parents. Another way to think about it is as a small section of the genome that encodes a protein, which in turn has some function. That is usually what we are talking about in genomics: the parts of the genome that encode proteins, or that encode little bits of sequence that can turn into functional elements.

Here is another way of defining genomics that emphasizes not just the study of genomes themselves but also their applications. Why are we interested in genomes at all? There are many, many applications of genomics today, and the list is growing rapidly as we get better and better at sequencing and as we discover more things that we can do with genomes. It certainly includes medicine, many applications in pharmacy, and, if you're not looking at human genomes, agriculture and other areas of science.

Genomics is a relatively new field of biology, and in many people's minds it's part of biology. But let me emphasize a few of the differences between genomics and more traditional biology and genetics, which have a big overlap. One way you can distinguish genomics from more traditional biology and genetics is this: traditionally, when we studied genes, we studied them one at a time. Not that long ago, back in maybe the 1980s, a common experiment was to study one gene and spend months or years studying that one gene; some people would spend their whole careers studying one gene, writing about it, and trying to figure out its function in people or in a model organism like the mouse. Today, because we can capture the whole genome, we often look at all the genes at once, or at large collections of genes all at once; that is the more genomic way of studying. Technology has been the real driver of this; that is a very big difference, and it is really what has allowed genomics to grow the way it has over the past 20 years. In the past, we could only do targeted, low-throughput experiments, studying one gene at a time. Today, we can do experiments in which we simultaneously measure the activity of thousands of genes at once using new genomics technology. So that's a big difference between genomics and earlier ways of doing biology.

And then there is the hard part, where things differ between traditional science and genomic science. In the past, you still had to be clever; you had to understand biology and genetics well, and experiments were not necessarily easy to do; they might take years. Today, you can still do the same kinds of experiments, and some of them are easier to do, but we generate so much data that the data itself is overwhelming. Because the technology has become very clever and very efficient, we are now doing experiments where we measure thousands or tens of thousands of genes all at once, in multiple tissues, in multiple samples. We know going into such experiments that there are probably a lot of really interesting results that are going to come out. But when we get the data, we discover that there is so much of it that it is hard to figure out where the interesting results are. So we have to deal with big data problems that we didn't have to deal with before, with statistical uncertainty in ways that we didn't have to deal with before, and with large-scale computation in ways that were never necessary for earlier types of biological studies.

WHAT IS COMPUTATIONAL GENOMICS? 02:00 Hours
Contributions of computational genomics research to biology

● proposing cellular signalling networks


● proposing mechanisms of genome evolution
● predict precise locations of all human genes using comparative genomics techniques with
several mammalian and vertebrate species
● predict conserved genomic regions that are related to early embryonic development
● discover potential links between repeated sequence motifs and tissue-specific gene
expression
● measure regions of genomes that have undergone unusually rapid evolution

WHAT IS GENOMIC DATA SCIENCE? 02:30 Hours


This lecture is about what genomic data science is. I'm going to quickly introduce the major disciplines that comprise genomic data science and tell you what we consider to be part of it. Genomic data science sits at the intersection of biology, statistics, and computer science. We're calling it genomic data science, but it goes by many other names, some of which you might have heard, such as computational genomics, computational biology, bioinformatics, or statistical genomics. Regardless of which name you use, the activities of genomic data scientists all fall into the same or similar categories.

So what do we do in genomic data science? Let's talk about humans: we start with some subjects. (It doesn't have to be humans; it could be mice or other model organisms.) We collect some samples from those subjects; if we're looking at normal development, we might just collect some skin cells from normal people. We then prepare those samples in the lab and send them for sequencing. The sequencer generates enormous amounts of data that we have to do something with. The sequences that come off the machine are very short fragments; we call them reads, because you're reading the DNA, and I and others will use that term throughout this course to mean a little bit of sequence that comes off a sequencer and is part of someone's genome. We take those reads and align them to the reference human genome. There is one reference genome, for now, which represents, roughly, an average Northern European male; in the near future we're probably going to have many other reference genomes as well. In any case, we align these reads to the reference genome, and that tells us how they differ from the reference. It also lets us pile the reads up on one another and see what sort of differences there are even within the genome.
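
As a toy illustration of what an aligner does, here is a minimal Python sketch that reports every position where a short read matches a reference string exactly. Production aligners such as Bowtie or BWA use genome indexes and tolerate mismatches, so this is only a conceptual stand-in.

# Naive read alignment: report every position where the read matches
# the reference exactly. Real aligners use indexes and allow mismatches.
def align(read, reference):
    """Return the 0-based positions where read occurs in reference."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        if reference[i:i + len(read)] == read:
            hits.append(i)
    return hits

reference = "ACGTACGTGACCA"
print(align("ACGT", reference))  # [0, 4] -- the read maps to two places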

Now, remember: for each person's genome, you actually have two copies of every chromosome, because you got one copy from mom and one from dad. So one of the things we study at this phase is how your two copies of each gene differ from one another. Once we've done some kind of analysis, we may deposit the data in a public database, such as one of the databases at NCBI. And we apply many other types of analyses to these data sets to draw further biological conclusions about what's going on. That, very broadly speaking, is what genomic data science is.

We typically start with experimental design. If you're going to do genomics, you first have to think: I have a scientific question I want to answer. I have to decide how much data I need, how many subjects I need, and what kind of data to collect. I've been talking about sequence data so far, but there are various modifications of sequencing; some newer technologies go by names like ChIP-seq and methyl-seq, and they let you study other things about the genome as well. So the first thing you do is come up with an experimental design which, if everything goes well, you hope will allow you to answer those questions. And that's really critical: even though we're generating data far more cheaply than before, these experiments are still expensive and time-consuming, and you don't want to reach the end of your experiment and discover that you simply don't have the right data to answer your question.
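
One concrete piece of experimental design is a sample-size calculation. The sketch below uses the statsmodels Python package to ask how many subjects per group a two-sample t-test would need to detect an assumed effect; the effect size, alpha, and power values here are illustrative assumptions, not recommendations.

# Hypothetical sample-size calculation for a two-group comparison.
# The effect size, alpha, and power below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumed standardized difference between groups
    alpha=0.05,       # significance level
    power=0.8,        # desired probability of detecting the effect
)
print(f"~{n_per_group:.0f} subjects per group")  # ~64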

Once you've decided on your experimental design, you generate your data. You take those big data sets and, as I was saying a few minutes ago, the first thing you typically do is align the reads to the reference genome, find out how they differ from that genome, and assemble them together in some form. For example, if you've captured RNA from the cells, when you align the reads to the genome you can see, for every gene in the genome, how much of that gene was present. That's the kind of thing you'd be doing at the aligning and assembling stage: you might be assembling the genes that were expressed in the set of cells or tissue you were studying.
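
To make the counting step concrete, here is a minimal sketch that tallies how many aligned reads landed in each gene. In real pipelines the gene label for each read is derived from its alignment coordinates and a gene annotation, with tools such as featureCounts or HTSeq; the labels below are invented for illustration.

# Toy gene-level read counting: tally aligned reads per gene.
# In practice the gene for each read comes from its alignment
# coordinates plus a gene annotation; these labels are made up.
from collections import Counter

aligned_reads = ["TP53", "TP53", "BRCA1", "TP53", "GAPDH", "BRCA1"]
counts = Counter(aligned_reads)
print(counts)  # Counter({'TP53': 3, 'BRCA1': 2, 'GAPDH': 1})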
Another big, important step in this process is preprocessing and normalization of the data. Because these data sets are very big, various kinds of bias can creep in, and the law of large numbers comes into play a lot here. All of the technology we're talking about, even though I've been describing it as if it were very nice, clean technology, is not perfect. The sequencing machine itself makes mistakes. The process of collecting the data introduces biases: we might collect more of some tissues than others, and some genes might appear to be present at higher levels because of biases, not because of true biological differences. The sequencing machines themselves sometimes have errors that are not random and are hard to identify, and sometimes they have other errors that are random, which we can identify and remove. So we want to take these big data sets and apply computational and statistical methods to correct, as best we can, any kind of systematic error or bias in the data before we move on to the next steps.
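
One of the simplest normalization steps corrects for the fact that some samples are simply sequenced more deeply than others, by rescaling raw gene counts to counts per million (CPM). The numbers below are invented, and real workflows usually use more sophisticated methods from packages such as edgeR or DESeq2; this is just a sketch of the idea.

# Counts-per-million (CPM): rescale counts by sequencing depth so
# samples become comparable. The counts here are invented.
def cpm(counts):
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}

sample_a = {"TP53": 300, "BRCA1": 200, "GAPDH": 500}     # 1,000 reads
sample_b = {"TP53": 3000, "BRCA1": 2000, "GAPDH": 5000}  # 10x deeper
print(cpm(sample_a) == cpm(sample_b))  # True: depth difference removed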

Once we've got the data preprocessed and normalized, we then apply a variety of techniques that come from fields within statistics, computer science, and biology to make our conclusions. For example, statisticians and machine learning scientists have developed a wide variety of techniques, some of which we'll talk about later in this course, for going from preprocessed, normalized data to scientific conclusions.

In order to take that normalized data and make these conclusions, one of the things people within the field of genomic data science do is develop software. Take one experiment as an example, one that I'm very familiar with: RNA-seq. RNA-seq, which we'll talk about time and again in this course, is an experimental protocol where you can capture just the genes that are being turned on in a set of cells, and you measure those by sequencing. It's a very popular experimental paradigm: you can use it to study cancer, development, evolution, and hundreds or thousands of different kinds of conditions. Because of that, the experimental protocol for the RNA-seq part is more or less standardized. That means we can design software that will, in a more or less standardized way, go from raw sequence data to something more amenable to making biological conclusions. What you really want when you're doing RNA sequencing is to know, for each sample, which genes were turned on and at what levels those genes were present. There are software packages to do that, developed by people in genomic data science. And software development, of course, requires not just getting the code to work one time, but also documenting it and making sure it works for lots of different cases; a lot of software engineering and algorithmic principles are important in doing that sort of development.

Another broad category of questions we study in genomic data science is population genomics. I've mostly been talking about individual sorts of questions, like what causes a cancer to be a cancer. But we can study other things too. We can look at a population and ask: what makes a particular group of people more susceptible to a particular type of disease? Why do some people have resistance to a particular type of disease? Why do some people in a population have certain traits? Population genomics refers to looking at whole populations, rather than one person at a time or small groups, and studying their genomes to see how they differ, and how those differences lead to recognizable differences in the people in those populations.

Another way to characterize some work in genomic data science is when we collect experiments of different types and pull them all together. You might call that integrative genomics; another popular term is systems biology. That refers to computational and statistical activities where we take data that comes from sequencing experiments of different types, as well as other kinds of measurements (they might be measurements on what we call the proteome, measurements separate from the DNA), and integrate it all together to make biological conclusions. We would call that integrative genomics.

JUST ENOUGH MOLECULAR BIOLOGY 02:00 Hours

This lecture is about cell biology: just enough cell biology to familiarize you with some of the terms used in the rest of this course. We're not trying to give you a detailed background in cell biology, which is a very dense and complex topic; you can spend years studying it, and in fact people spend their whole careers studying it. But cell biology and molecular biology include many complicated-sounding terms, and it's important that you understand what these words mean, because we're going to be using them throughout the course.

At the most basic level, we divide cellular organisms into three domains: eukaryotes, archaea, and bacteria. Archaea and bacteria are sometimes grouped together and called prokaryotes. The distinction we make between these two fundamental groups of living things is that eukaryotes have cell nuclei and prokaryotes do not. Evolutionarily, these three domains of life split off from one another very long ago, near the beginning of the evolution of life on Earth: eukaryotes split off first, and bacteria and archaea separated from each other a little later. Bacteria and archaea, in many ways, seem very similar at the cellular level.

The big difference between eukaryotes and prokaryotes is that prokaryotes do not have a cell nucleus and eukaryotes do. The cell nucleus evolved a long, long time ago as a way to sequester the DNA from the rest of the cell. So at the cellular level, eukaryotic cells and prokaryotic cells look rather different, and the biggest difference is that a eukaryotic cell has a nucleus. Even single-celled eukaryotes like yeast, a very common single-celled eukaryote that you probably eat every day if you eat bread, have nuclei, just like our cells do. Humans, yeast, and everything in between are eukaryotes: our cells have nuclei in them, plus other organelles, which are small cellular structures. Prokaryotes don't have that. In particular, you can think of the DNA in a prokaryotic cell as just loosely organized inside the cell (it's not really floating around freely, but it is far less organized). In eukaryotic cells there's a bit more organization: the DNA is sequestered deep inside. Inside the eukaryotic cell, which is surrounded by its own membrane, there is a nucleus with its own membrane surrounding it, and inside that nucleus is where the DNA lives. The DNA is organized, of course, into chromosomes, which are very, very long molecules of DNA, and all your chromosomes are inside each cell, within the nucleus of that cell.
the nucleus of each cell. And there's one slight exception

about your DNA, which is that one of the organelle's inside of the

eukaryotic cell is called a mitochondrion. Mitochandrion, and actually there


are multiple mitochondria in each cell, so the mitochondrion has its own DNA. In the human, in the
human genome,

mitochondrion genome is very, very small, it's a tiny fraction of one

percent of our DNA But, you might have heard mitochondria called the

powerhouses of the cell because the genes inside your mitochondria are responsible

for a good bit of energy metabolism. So even though it's very small,

doesn't have that many genes, it's absolutely critical to life. And, as an interesting sort of side note,

we believe that the mitochondrion, a very long time ago, billions of years ago,

was originally an independent prokaryote that was absorbed by the ancestor of all

eukaryotic cells into the cell itself and then it's become part of every

eukaryote now on the, on the planet. So all eukaryotes have mitochondria and,

they've diverged since, since the eu, the various eukaryote

species starting, started to diverge, and most of them are very small. Back to the cell itself. So cells
Back to the cell itself. Cells undergo a characteristic cell cycle that many people study. It's not important that you know all the names of the phases of the cycle, like metaphase and anaphase. But during the course of their life, cells follow a fairly well-defined cycle in which they undergo different processes, the most critical of which is division. The way you go from a single progenitor cell, a fertilized egg, to a whole organism is that the cell divides many times; and throughout your life, your cells are constantly dying and have to be replaced, which means cells have to divide to replace the cells that are damaged. That part of the process is called mitosis: a cell separates into two daughter cells, producing two essentially identical cells. For the purposes of our discussion, it's important that you understand just a couple of terms here. When a cell undergoes mitosis, the DNA in the cell has to replicate; that is, it has to make two copies where there was one before. Inside the cell, before dividing, the DNA replicates. You actually start off with two copies of every chromosome, so after replication you have four copies, and the cell then has to figure out, through a complicated procedure during mitosis, how to separate those copies. The copies are separated very reliably into different physical compartments in the cell as it divides, and as a result you get two daughter cells, as we call them, that are identical to the original cell. Both daughter cells are also diploid. That's another term I haven't used yet: not all eukaryotes are diploid, but we are, and diploid means having two copies of every chromosome. Our DNA exists in two copies, one copy from mom and one copy from dad, and in diploid organisms it's important that every time a cell divides it maintains these same two copies of every chromosome.

Another important aspect of cell biology is that cells don't always divide to produce two identical copies. If you are replacing existing cells that have been damaged, say you've got a cut in your skin and new skin cells have to be grown, you want the new skin cells to be just like the ones that were there before. But during the course of development, cells need to develop into different types of cells. This is a much more complicated process than we have time to go through, but all the cell types in our bodies start as more basic cells that we call stem cells, which have the capability of dividing and differentiating into different types of cells. This process of cell differentiation involves a slightly different cell division process, in which the two daughter cells are not identical to each other; they go down what we call developmental paths that allow them to differentiate into mature cells of different types. For example, all the cells in your blood start as what we call a multipotent hematopoietic stem cell, which can divide and produce many different types of blood cells; all the cells circulating in your blood came from these stem cells. And the stem cell populations in your body are maintained over time as well.

division I also want to mention in this brief introduction, and

that's called meiosis. So, during sexual reproduction, something

different happens in cell division, and that's, that's the main point

we want to get across here. When you're producing egg cells or sperm

cells for the next generation of your species, you start off with copies

of chromosomes from the mother and the father, and those are all put together

before meiosis in a single compartment, and then there's a special process that,

that, that happens, it's not by design it just happens

by accident, called recombination. Where the cells can, the chromosomes,

because they are very, very similar they can kind of stick to each and we have

this process called crossing over that occurs where part of, let's say chromosome

one from mom, can cross over with a part of chromosome one from dad and you get

resulting in the, in the daughter cell, you can have a copy of chromosome one

which is partly mom and partly dad. Now these recombination events

are relatively rare, but not that rare. In, in humans, in fact,

we estimate that in each generation, there's approximately one

crossover read per chromosome. So, most of your chromosomes, although you

got every bit of your chromosomal DNA from your parents, your actual chromosomes end

to end are not likely to be identical to any of,

to either of your parents' chromosomes. So recombination is one of the big reasons

that family- besides mutation- is one of the other reasons why family

members are not alike. So, I'll just conclude by mentioning, by,

by saying that during, of cell division to produce children,

every child gets different, has different recombination’s. So even though every child

of the same set of parents has essentially the same DNA as the two

parents, there are an almost in, almost infinite number of

possible recombination’s. They're not quite infinite, but a, a huge

number of recombination’s that occur. And so every child actually has


a different set of chromosomes, although you can clearly trace

them back to both parents. And that's one of the big sources

of diversity in our population.

IMPORTANCE OF MOLECULES IN MOLECULAR BIOLOGY 02:00 Hours
This lecture is about the important molecules of molecular biology. I'm only going to talk about a few key molecules, because they're the ones we talk about throughout the course, and it's important that you know their names and what they are. There are thousands of molecules we could talk about, and they're all important, so I don't mean to imply that these are the only important molecules in molecular biology. But they're among the most critically important molecules in determining how your genome functions. And if you wonder whether this is really that important, just remember that every one of these molecules is in every cell in your body; if you had a powerful enough microscope, you could point it at your skin right now, zoom in, and see all the molecules I'm going to talk about.

DNA is the molecule that comprises all of our genetic material, and it is built from four different nucleotides. We write them A, G, C, and T, but they're actually named adenine, guanine, cytosine, and thymine, and biochemically they have slightly different structures. Adenine and guanine are called purines, and they have a two-ring structure. Thymine and cytosine are pyrimidines, and they have a one-ring structure. You don't need to memorize the structures; just know that C and T are kind of similar to one another and a little smaller, and A and G are similar to one another and a little bigger.

The way DNA is constructed, these molecules bind together in a very specific way: A always binds to T, and G always binds to C. This is true across all living things; we all have the same structure of DNA. This rule is very useful. When Watson and Crick discovered the structure of DNA back in the 1950s, they immediately realized that this binding property means that when you have one of the strands, you also know what the other is. If I give you one strand of DNA made of A, C, G, and T, you know the other strand, because everywhere there's an A on one strand there must be a T on the other, and everywhere there's a G on one strand there must be a C on the other. As Watson famously observed, this provides a mechanism for how DNA copies itself and passes itself on from one generation to another.

The DNA, in these long strings, is put together in the famous double helix structure: the As bind to Ts and the Gs bind to Cs, building long ladders, and the ladders are twisted into a long helix. Every one of your cells has all of your DNA in these structures. As we've said before, the DNA in our genome is organized into 23 chromosome pairs, and each of these chromosomes is a very, very long string; the longest chromosomes in the human genome are on the order of 250 million nucleotides long. So they're extremely long molecules, very tightly coiled up and packed together inside the nucleus of every cell in your body.



The way we write the DNA sequence itself looks like this: we write As, Cs, Gs, and Ts; we don't draw the chemical structures, we abbreviate with these four letters. DNA actually has a direction, a strandedness. Based on its biochemical properties, one end of a DNA strand is called the five-prime (5') end and the other is the three-prime (3') end. That has to do with the structures of those biochemical molecules, but you don't really need to remember why; just remember that we always write DNA in the same direction, with five-prime first and three-prime second, and we write the opposite strand the other way around. The strand that we write five-prime to three-prime we call the positive or plus strand, and the other strand, the reverse complement, is the negative strand.
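
Because the two strands carry the same information, one of the most common small utilities in genomics code is computing the reverse complement: complement each base (A with T, G with C) and reverse the result so it also reads 5' to 3'. A minimal sketch:

# Reverse complement: complement each base (A<->T, G<->C), then
# reverse the string so the result also reads 5' -> 3'.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ACGTT"))  # prints "AACGT"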


Another critical molecule for how our bodies and genomes work is RNA. RNA is almost exactly like DNA except for a couple of important differences. The most obvious difference is that there is no T: instead of thymine, RNA has uracil. When DNA gets copied, or transcribed, into RNA, the As are copied as As, Gs as Gs, Cs as Cs, but Ts are replaced by Us (uracil). The resulting RNA molecule, unlike DNA, is single-stranded; RNA can form double-stranded complexes, but in general it is single-stranded, and it is from this RNA template that we create proteins, which are the other critical molecules in a cell. RNA has similar biochemical structures to DNA; the uracil (U) molecule is very similar to the T molecule.


We write RNA the same way we write DNA, five-prime to three-prime, only with all the Ts replaced by Us. So if you see a string of letters with Us in it instead of Ts, you know immediately that it's RNA, not DNA. An important distinction genetically is that DNA is the stuff of inheritance: DNA is what cells carry from one cell generation to another, and whenever a cell divides, it replicates the DNA of the original cell. RNA is used as a template to make proteins, but it is not itself the stuff of inheritance, even though in most cases it is a near-identical copy of the DNA it was formed from. We use these molecules to encode how the cell works: DNA is basically a program that gets read out, the read-out starts with RNA, and the RNA is used to make proteins. Proteins are also long molecules, though not nearly as long as DNA; they're typically hundreds, or sometimes thousands, of amino acids long. Amino acids are more complicated molecules that are strung together to make proteins.


The translation rules were worked out in the 1960s, in groundbreaking molecular biology work at the very founding of the field.


Scientists figured out, essentially one codon at a time, how RNA is translated into proteins. I just used the word codon: the way RNA gets turned into protein is that every combination of three letters of RNA encodes an amino acid.



There are 64 possible such combinations. Of those 64, 61 encode amino acids and three are stop codons. The translation machinery reads along the RNA molecule three nucleotides at a time, and for every three nucleotides it adds an amino acid; these are built together into a long string of amino acids, which is what we call a protein. When the translation machinery hits one of the three special stop codons, which don't encode anything, it stops, and that's the end of the protein. That's how it knows where to stop.


So we write protein sequences also in a particular direction.


Here, for example, nine nucleotides are translated to produce three amino acids. We write proteins in the 20-letter alphabet used to abbreviate the amino acids: 20 amino acids comprise essentially all of our proteins. To tell the whole story, there are actually 22 amino acids; the 21st and 22nd were discovered not that long ago, and they are primarily used in other forms of life besides humans. There are exceptions to almost every rule in biology, but in general, the way to think about human biology is that there are 64 possible codons, 61 of them encode amino acids, and they encode exactly 20 amino acids. The 21st amino acid, when it was discovered, turned out to be encoded by one of the stop codons, which is once in a while used to encode an amino acid.

THE HUMAN GENOME PROJECT 05:30 Hours

What was the Human Genome Project? In this lecture, we're going to talk about the history of this groundbreaking project that brought us to where we are today.
The Human Genome Project was first proposed in the late 1980s by scientists at the US Department of Energy. Many people don't realize that it was not the National Institutes of Health (NIH) that proposed it, but the DOE; the DOE was interested in genomics because they were studying the effects of radiation on DNA. When the project was proposed, it was initially considered extremely ambitious, and many scientists were actually against it. The idea was that it would be biology's Manhattan Project, by far the largest project biology had ever taken on. But as scientists started to discuss it in the late 80s, it quickly gained momentum; soon it was approved, the NIH joined in, and then many other countries joined as well. The project officially started in 1989 as a joint effort of the NIH and the DOE in the United States, plus many other countries. Outside of the US, the Sanger Centre in England was the largest sequencing center.
The goal of the project was very simple. The human genome is 3 billion base pairs long. In the 1980s, sequencing was still a very new technology; the automated sequencing available at the time was very slow and expensive, around $10 a base. At that price, sequencing the genome would cost $30 billion. But the scientists proposing the project said: we know things are getting faster and more efficient, so we'll assume prices will drop by a factor of ten, and we'll probably be able to sequence the genome for $1 a base. So they came up with an estimate of $3 billion, which is the number widely reported as what the project cost, and which is probably a bit of an overestimate, because costs went down quite a bit more than that. In any case, that was the goal: sequence all 3 billion base pairs in the human genome for $1 a base, and finish in 15 years, by 2005. Now, one reason people were opposed to this was that we knew already at the time, and certainly know very well now, that only about 1.5% of your DNA encodes proteins. People thought most of the rest of the DNA was just junk, stuff that wasn't really biologically important or useful. We now know that's not really true, but at the time it was widely believed that most of the sequencing would be dedicated to sequences that didn't have any biological function or consequence. So some scientists were opposed because they thought it would be a waste of time and money that would be better spent targeting the genes in the genome. Nonetheless, the project took off and quickly gained momentum.
You might have heard about the Human Genome Project as a race. In the early 1990s, it wasn't really a race yet; I'll get to that in a few minutes. The way the project started was that scientists around the world worked on what were called maps. The idea and plan for the overall project, from the beginning, was that we would take large chunks of DNA, about 150,000 base pairs long, called bacterial artificial chromosomes, or BACs. We could grow those chunks up in E. coli bacteria, making as many copies as we wanted, sequence the chunks, and then stitch the pieces together. This was because, at the time, the best sequencing technology could only sequence tiny fragments taken from a slightly larger chunk, and these BACs were about the largest chunks people thought they could handle. The real problem was whether you could assemble those little fragments, or reads, back together, and we knew we could do it for 150,000 base pair chunks. The catch was that, although it was easy to create these BACs, you had to figure out where each one went in the genome before you sequenced it. So the idea was to develop what were called libraries, with hundreds of thousands or millions of BACs in them, then select BACs, figure out where they went in the genome, and create a tiling path, basically aligning the pieces of BAC DNA across the genome. Finally, when those maps were done, we would sequence the BACs. The idea was that while the mapping went on, the funders would fund that effort, sequencing would meanwhile get more efficient, and by the time we finally got around to sequencing, it would all be $1 a base. That was moving along steadily throughout the early 1990s, along with technology development.

But then something happened that changed the game rather dramatically. In 1995, a small non-profit research institute called TIGR, The Institute for Genomic Research, sequenced the first complete bacterial genome ever to be done: the genome of Haemophilus influenzae, an infectious bacterium. This genome is about 1.8 million bases long and has 1,742 genes. The project was led by Craig Venter, the founder of TIGR, and Hamilton Smith, a professor at Hopkins and a Nobel laureate. So what was different? Why would this change things? This is a tiny genome; bacterial genomes are far smaller than the human genome, about 1,000 times smaller. What was different was that this was done through whole genome shotgun sequencing, where you don't create maps. Instead, you take the whole genome and fragment it into many, many tiny pieces, tens of thousands of them. Then you just randomly sequence those pieces, and by oversampling, that is, by sequencing every part of the genome many times over, you can then use a computer program called an assembler to put it back together. People had never done this for anything on the order of a whole genome, even a whole bacterial genome, before.
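
As a back-of-the-envelope illustration of why oversampling works, here is a sketch of the classic Lander–Waterman estimate: if you sequence N reads of length L from a genome of size G, the average coverage is c = NL/G, and the fraction of the genome expected to be left uncovered is roughly e^(-c). The numbers below are loosely modeled on the Haemophilus influenzae project and are assumptions for illustration only.

# Lander-Waterman back-of-the-envelope: how much of a genome remains
# uncovered if reads are sampled uniformly at random? Illustrative only.
import math

G = 1_800_000   # genome size (~H. influenzae)
L = 500         # read length, typical of 1990s Sanger sequencing
N = 25_000      # number of reads sequenced

coverage = N * L / G               # average depth, c = N*L/G
uncovered = math.exp(-coverage)    # expected uncovered fraction, e^-c
print(f"coverage ~{coverage:.1f}x, uncovered ~{uncovered:.2%}")
# -> coverage ~6.9x, uncovered ~0.10%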
This was dramatic, and it certainly changed the field of microbial genomics at the time. Everybody in the microbial world was very excited and started proposing to sequence microbial genomes this way. Meanwhile, though, the human genome effort continued as planned, mapping and sequencing those 150,000 base pair chunks.

Then things changed again a few years later, in 1998, and this is where the race really began. A new sequencing machine was developed by a company called Applied Biosystems. This machine was not dramatically more efficient than previous machines, but it was significantly faster and easier to use. It used capillaries, tiny plastic straws through which the DNA would flow, and it let you automate the sequencing in a way that wasn't really possible before. With funding from Applied Biosystems, Craig Venter, Ham Smith, and others left TIGR to form a for-profit company called Celera Genomics. This company's goal, its entire purpose for being created, was to sequence the human genome. Not only were they planning to sequence the human genome, but they proposed to do it through whole genome shotgun sequencing. That is, they would take the entire human genome, 1,000 times larger than a bacterial genome, break it up into millions and millions of little pieces, sequence those, and somehow assemble them back together to create the whole genome. This method didn't require the mapping. The BAC mapping was still going on in the publicly funded Human Genome Project, but Celera wasn't going to do that; they were going to skip all of it and go straight to sequencing.
This certainly caused a kerfuffle, and the race began. NIH, which until then had been funding eight large centers in the US, merged its efforts into three even larger centers, which still exist today. The Sanger Institute (the Sanger Centre at the time) ramped up its effort, as did the other groups in the public project, and everyone in the public effort started accelerating their sequencing. They were still doing BAC-by-BAC sequencing, but they started sequencing the BACs much faster than they had before.
So this was really a race. 
So despite that, 
some people were skeptical about Celera's ability to assemble an entire 
animal genome using this whole genome shotgun sequencing technique. 
No one had really done it for anything larger than a, 
than a large bacterial genome, which is millions of base pairs, not billions. 
However, soon after the formation of the company, Celera sequenced and 
published the complete genome of the fruit fly, Drosophila melanogaster. 
Now, drosophila is about 130 million base pairs long, 
so still much smaller than human, about 20 times smaller. 
But much larger, about 20 times larger than any bacterial genome or 
than any genome that had been sequenced and 
assembled through the whole genome shotgun technique up to that time. 
So that was a success, it was published in 2000, and it proved that this whole genome 
shotgun technique would, could scale up by a factor of 20. 
And there was really no technical reason why it wouldn't scale up by another factor 
of 20, and in fact, that's in, that's what eventually happened. 
So this really proved that Celera meant business, and 
it really spurred the public effort to, to accelerate their, their work even further. 
So the race really heated up in 1999 and 2000. 
In 1999, Craig Venter announced that Celera would finish their work by 2001. 
Actually, originally he really announced 2003 because the public effort said they 
were going to finish in 2005. 
The public effort quickly responded by saying they would also finish in 2003. 
Then in 1999 Venter announced that Celera would finish, in fact, in 2001. 
And soon thereafter, within a matter of weeks, NIH and the Sanger Centre 
announced that the public Human Genome Project would finish a draft genome 
by 2001 as well. 
So everybody was racing. 
Now, by the way, that's what the scientific leaders of the projects were doing. 
The actual people doing the work, and I was one of those people, 
were mostly just panicking, because we didn't really have any plan to finish 
that quickly, but we figured we would have to give it a shot. 
So in 2000, as the work really did seem to be getting close to completion for 
a draft genome, NIH, the Sanger Centre, and 
Celera Genomics talked about publishing jointly. 
So there was a considerable effort to make this into one 
final project that everybody would say we all did together. 
However, in late 2000, those talks fell apart and two papers were planned.
So that's what happened. 
In June of 2000, Bill Clinton and Tony Blair, the leaders of the US and 
the UK at the time, jointly announced the completion of the human genome. 
And you can see in the slide that Craig Venter is shaking hands with 
Francis Collins, who was the head of the National Human Genome Research 
Institute at the time. 
So it was announced in the year 2000 that both groups were done and 
that it was a tie. 
And that's kind of how it played out. 
Now at this point, the paper wasn't done, and in fact, 
at the time of this announcement, the genomes weren't done either. 
But we knew we had about six months to get them done, 
those of us who were actually in the trenches doing the work. 
And so everybody put all their effort into finishing up this draft genome 
as quickly as possible. 
So now, whose genome did we sequence, by the way? 
So, when you talk about sequencing the genome: each of us on the planet, 
and there are billions of us, has a different genome. 
Now, our genomes are all very, very similar, probably only differing by about 
one position in a thousand, but they're all different. 
The Human Genome Project sequenced one genome, which was a mosaic of about 
a dozen volunteers who contributed DNA anonymously to the project. 
All of them were Northern European in origin, so 
they all had a similar genetic background. 
The assembly of this original genome represents that one mosaic of 
a small collection of individuals of Northern European descent. 
Since then, we've gone on to sequence the genomes of other people 
from other populations. 
But at the time, that was what we did: 
it wasn't one person's genome, but a few people's genomes.
So what did the genome tell us? 
Why do we do this? 
So, one of the major goals of the Human Genome Project was to identify 
all the genes. 
Identify all their sequences, eventually figure out what they all do, and 
use that to develop better treatments and improve human health. 
What I'm showing you here is one of the first papers ever to attempt 
to estimate the number of human genes. 
This is a paper that appeared back in 1964 in the journal Nature, 
nearly 40 years before the Human Genome Project completed. 
This was very soon after the genetic code had been worked out. 
A scientist named Vogel looked at the first two genes whose sequences 
had been determined, the human hemoglobin subunits. 
These are quite small genes, about 146 amino acids long, 
and he knew how much they weighed. 
We had pretty good measurements of how much those amino acids weighed. 
We also had pretty good measurements of how much DNA weighed. 
So from 146 amino acids, and three nucleotides for each amino acid, 
you could work out roughly how much the DNA encoding one gene weighs. 
You could also measure roughly the weight of the genome in a cell. 
He basically took the weight of one gene, divided it into the weight 
of the genome, and assumed that the genome was just gene after gene, 
encoded end to end. 
Now remember, this was 1964; no one knew otherwise. 
We now know that only about 1 to 2% of the genome encodes genes, but 
that wasn't known at the time. 
So, using this estimate, and not knowing anything about exons and introns and 
about all the intergenic DNA and so-called junk DNA, 
you would come up with a number of around 6.7 million genes, which is wildly off. 
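To make that arithmetic concrete, here is a minimal Python sketch of Vogel's back-of-the-envelope calculation; the 3-billion-base genome size is the round figure used in the lecture, and treating the weight ratio as a simple base-pair ratio is a simplification:

# Vogel-style estimate (1964): assume the genome is genes laid end to end
GENOME_SIZE_BP = 3_000_000_000     # approximate size of the human genome
AMINO_ACIDS_PER_GENE = 146         # length of a hemoglobin subunit
BP_PER_AMINO_ACID = 3              # three nucleotides encode one amino acid

gene_size_bp = AMINO_ACIDS_PER_GENE * BP_PER_AMINO_ACID   # 438 bp per gene
estimated_genes = GENOME_SIZE_BP // gene_size_bp
print(estimated_genes)   # roughly 6.8 million, close to the paper's 6.7 million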
So the whole Human Genome Project presumably would give us a much better 
fix on this number. 
There are many, many other things, of course, that we can learn from the genome, 
but this is one of the simple messages we should be able to get out of it: 
how many genes do we have, and what are they.
So the two papers appeared in February 2001. 
They actually appeared simultaneously, with lots of hoopla. 
The public effort published its genome in Nature, and 
here's a picture of the cover from the time. 
And they estimated 30 to 40,000 genes. 
Now, one interesting thing: that was the official estimate in the paper, 
and it seems like a very imprecise number. 
Ten or fifteen years earlier, no one would have imagined that we could finish 
the genome and still not know precisely how many genes there were. 
But it turns out that it's much harder to figure out exactly what the genes are 
from the DNA sequence than we had realized. 
So they gave a very rough estimate of 30 to 40,000, and that number 
was a lot smaller than estimates that had been discussed even as recently 
as a year earlier, when people were still proposing 100,000 genes 
in the human genome. 
So that was a surprise that there were so few. 
The other paper was published in Science; 
this is the paper led by the group at Celera Genomics, 
which included scientists from around the US and Europe as well. 
And that number was much more precise-seeming: 26,588 genes. 
But there were an additional approximately 12,000 likely genes 
also described in that paper. 
So if you add those together, it's in the high 30,000 range, 
which was consistent with what was in the Nature paper. 
And let me just mention that I was on this paper, 
buried in its long author list.
So we had a new estimate of the number of genes, 
seemingly precise in one paper, imprecise in the other; but 
if you read the Science paper, you see it's also a very imprecise number there. 
So we didn't actually know how many genes there were, 
even though the genome was sort of finished. 
And by the way, this was a draft genome, and 
the draft only covered about 92% of the genome. 
So today the genome that we have is still, technically speaking, a draft. 
It's still not finished, but it covers well over 99.9% of the genome.
Now, that gene count number is an interesting story in itself. 
Look at how it's evolved over the 50 years since that early 1964 paper, 
which estimated millions of genes. 
Starting around 1990, people were bandying about estimates generally 
in the 100,000 range. 
But this chart shows a number of publications that kept moving that number 
around, starting at 100,000 and going down through estimates in the 60,000 
range, the 50,000 range, and so on. 
It gradually decreased until today we believe the number is around 22 to 23,000. 
But we're still not sure of the precise number; even today, 
we don't have a precise count of human genes. 
And an important caveat here: this gene count, the way it was originally 
proposed and the way I'm describing it now, refers to the number of 
protein-coding genes. 
That is, a piece of DNA that gets transcribed into RNA and 
translated into a protein. 
Everyone agrees we call that a gene. 
However, for at least the last 20 years, 
we've known that there are some genes in your genome 
where the DNA gets transcribed into RNA, and the RNA itself has a function. 
We call that an RNA gene; it never gets translated into a protein. 
Starting in the late 2000s, through a new technique called RNA-seq, 
we've learned that there are many thousands, probably tens of thousands, 
of RNA genes in the genome. 
So this gene count that we've been looking at for the past several decades 
only looks at one side of the coin: we're only counting protein-coding genes. 
We know the number's quite a bit higher if you include RNA genes, 
and that number is even less precise today. 
Over the coming decade, we will probably get a better handle on it, 
but I wouldn't promise we'd have a precise answer to this question 
even ten years from now.
So let's just review. 
The Human Genome Project started in the late 80s; it officially began 
in 1990 with the goal of sequencing 3 billion base pairs. 
Did they achieve that? 
Yes, they did. 
The goal was to do it for $1 a base. 
How did they do on that? 
Well, at the time the genome was published, it cost about $1 for a read. 
A read was a single sequence coming off a sequencer, about 700 bases long. 
So not only did they achieve that goal, 
they were 700 times cheaper than they had expected. 
So that was really a dramatic success. 
In terms of time, they wanted to get it done by 2005. 
It was done in 2001; at least the papers, describing a draft genome, 
were published in 2001. 
So you could say they dramatically exceeded all of their goals. 
And the final note is about cost: the cost in 2001, 
$1 per read, or $1 per 700 bases, 
was quite dramatically better than we'd expected. 
And it did continue to slowly drop for a few years thereafter. 
But then, due to dramatic changes in technology, what's called next generation 
sequencing, the cost has dropped far, far more than that. 
So today, it costs about $1 for 3 million bases, 
which is about 4,000-fold cheaper than it cost even when the human genome 
was finished. 
And that's what's led to this tremendous explosion in all 
sorts of genome sequencing experiments that we're experiencing today.

MOLECULAR BIOLOGY STRUCTURE 03:00 Hours


In this lecture, we're going to talk about molecular biology structures, 
by which I mean structures in the DNA and other molecules inside the cell that 
you need to know about to understand the rest of the course. 
These are mostly terms and descriptions of functions that go on inside 
every cell of your body, and that govern how molecular biology works 
and how genetics works. 
So let's start with DNA. 
So DNA itself, as we've said before, is this very, very long molecule. 
It's inside every cell in your body. 
In humans we have 23 chromosome pairs, and each chromosome is a very, 
very long molecule of DNA. 
The DNA itself, if you stretched it out, 
would be about two meters long inside each cell. 
Cells are, of course, microscopic, so in order to fit into a cell 
the DNA has to be wrapped up in a very efficient way. 
It's actually wrapped around other molecules called histones, in what you can 
see as a sort of beads-on-a-string structure on the left. 
Those histones are wrapped together into organized, 
slightly longer structures. 
And those together are then coiled up and supercoiled into even 
bigger structures, which eventually form the chromosomes you see on the right. 
So DNA is coiled around itself, 
and the coils are then coiled around themselves in a very complicated way. 
Now, for DNA to be transcribed and translated, 
it has to unwrap itself a bit, so we'll talk a little more about that next.
Another kind of structure, a very different kind, is not so 
much a physical one but a sequence structure: what we call repeats. 
You'll hear a lot about repeats in data analysis of DNA, 
so we need to know what these terms mean. 
There are many different kinds of repeats, but broadly speaking, 
we can classify them into two types. 
Tandem repeats are repeats of identical sequence 
that occur one after the other. 
Here's an example of a five-base repeat, ATTCG, 
that occurs three times in a row. 
By the way, tandem repeats don't have to be identical, 
and they can be very long. 
There are cases where you have sequences that are hundreds of base pairs long 
that occur thousands of times in a row. 
In fact, the centromeres in each chromosome are composed largely 
of repeats on the order of 180 base pairs long that occur 
hundreds of thousands of times in a row. 
So repeats can get very long and complicated. 
We also have interspersed repeats, which is the same idea, 
repetitive sequence that's identical or near identical, 
but scattered across the chromosomes in widely separated places, 
and across different chromosomes. 
And these repeats will sometimes cause problems for 
other types of analysis, because they all look so similar to each other.
And remember, when we read DNA sequences we're only reading 
a few hundred base pairs at a time, so it's not unusual to get a sequence 
that consists entirely of repeats. 
That means we don't really know where it came from, 
because repeats occur all over the genome. 
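As a toy illustration (not a real repeat-finding tool), here is a short Python sketch that counts how many times a repeat unit occurs back to back, using the lecture's ATTCG example; the surrounding sequence is made up:

def count_tandem_repeats(seq, unit, start):
    """Count how many times unit occurs back to back starting at start."""
    count, i = 0, start
    while seq[i:i + len(unit)] == unit:
        count += 1
        i += len(unit)
    return count

seq = "GGATTCGATTCGATTCGTT"                    # made-up sequence with a repeat
print(count_tandem_repeats(seq, "ATTCG", 2))   # 3, as in the example above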
There are also structures of RNA that are important to understand. 
A typical human messenger RNA, 
which is what we get our proteins from, looks like this. 
We have some sequence, shown in green, which is the coding sequence. 
Sometimes when we talk about mRNA or 
RNA we're only talking about what encodes proteins, which is just the green part. 
But it's important to understand that the part copied from 
the DNA, the transcribed portion that goes into messenger RNA, 
is longer than that. 
Typically you'll have some stretch at the beginning of the messenger RNA 
that's not translated, and we call that the UTR, or untranslated region. 
Because it's at the beginning, 
and the beginning is the 5' end, we call it the 5' UTR. 
On the other end we have the 3' UTR, which is usually a longer stretch, 
for various reasons. 
That's also not translated. 
And somewhere in the middle is the coding sequence. 
We want to know what the protein sequence is, so if we can identify the UTRs 
and get rid of those, then we can read off the coding sequence 
and directly translate it into an amino acid sequence. 
Another very important feature of eukaryotic cells, 
not just human cells, is that messenger RNAs get a poly-A tail added to them. 
After transcription occurs, after the DNA is copied into RNA 
and the introns are removed, a long series of A's gets added. 
And we're going to use that as a kind of hook to grab these molecules 
out of the cell with some technologies. 
Things are a little more complicated than what I just showed: 
the DNA that gets transcribed is actually much longer than that picture. 
DNA gets transcribed into RNA, and 
the DNA that eventually produces a protein includes introns as well. 
In this picture you see a gene with five exons, 
shown as different colored rectangles. 
In between those exons are sequences that we call introns. 
Those are chopped out, or spliced out, and discarded and 
recycled by the cell. 
The remaining parts, the exons, 
are the parts that get translated into a protein. 
So the coding sequence on the previous slide was just the exons 
concatenated together. 
An important feature of this structure, one thing it allows the cell to do, 
is that while the introns are being spliced out and removed, 
the exons can be put together in different combinations. 
That's called alternative splicing, and it's a very common phenomenon. 
When it was first discovered it was considered to be a very unusual and 
probably rare phenomenon, but 
now we know that over 90% of human genes undergo some form of 
alternative splicing. 
This means that even when you know the complete sequence of the DNA, and 
even when you know what gets transcribed into RNA, 
you still have more work to do to figure out exactly what proteins 
might be formed. 
By forming different combinations of exons into different mature messenger 
RNAs, you can make different proteins from the same original gene, 
the same part of the genome.
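Here is a minimal Python sketch of that idea, with invented exon coordinates: splicing keeps the exons and discards the introns, and alternative splicing is just choosing a different subset of exons:

# Hypothetical pre-mRNA with three exons (0-based, end-exclusive coordinates)
pre_mrna = "AAATGGCTTTTGTTCAAGGGACCTTTTTAG"
exons = [(2, 8), (12, 18), (22, 28)]       # the gaps in between are introns

def splice(transcript, exon_coords):
    """Concatenate the chosen exons, discarding everything in between."""
    return "".join(transcript[s:e] for s, e in exon_coords)

full_isoform = splice(pre_mrna, exons)                     # all three exons
skipped_isoform = splice(pre_mrna, [exons[0], exons[2]])   # middle exon skipped
print(full_isoform)
print(skipped_isoform)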
Proteins themselves have structure, and these structures have been studied for 
many decades. 
Proteins are long molecules composed of amino acids. 
They're not nearly as long as DNA; 
typical proteins are a few hundred to a few thousand amino acids long. 
They fold into very complicated structures, and they have secondary 
structure as well; the two common secondary structures of proteins 
are called beta sheets and alpha helices. 
The amino acids can form flat structures called sheets, or beta sheets, 
and they can also coil up into helical structures called alpha helices. 
You see examples of those on the left. 
On the right, you see examples of many different actual protein structures, 
and there are thousands and thousands of proteins whose structures 
have been solved. 
That is, biophysicists have figured out what the mature structure of those 
proteins is. 
Their structure confers upon them their function; 
that's why we study it. 
It's important in understanding protein function to know which parts of 
the protein are, say, buried on the inside of the structure and which ones 
are on the outside, which are probably the more active parts of the protein. 
So that's protein structure. 
Another aspect of molecular structure is transcription factors. 
Even when we know what the genes are, remember that every cell in your body 
has the same genes, yet cells behave differently. 
Skin cells behave very differently from blood cells. 
Why is that? 
It's because different genes are turned on; 
different genes are active in different cells. 
One way the cell controls this activity is through proteins, 
also of course encoded by your genome, that go back and bind to the DNA itself, 
bind to those chromosomes, and control the expression of genes. 
That is, you can have a gene whose product is a protein, which we call 
a transcription factor, that binds upstream or downstream, 
usually upstream, of another gene, 
and can accelerate the production of that gene, or decelerate or inhibit it. 
These are called transcription factors, and they're a very important 
control mechanism that the cell uses to decide 
how much of a protein to produce.
And finally, one other important kind of structure, or 
mechanism, is epigenetic structures or epigenetic mechanisms. 
Epigenetic means, roughly, beyond genetics. 
These are factors outside the DNA itself, the directly inherited material, 
but that nonetheless control something about how your cells function 
and how your genes are expressed. 
One very important epigenetic structure is a methyl group. 
DNA can get tagged with these methyl groups, 
and then we say the DNA is methylated. 
An interesting feature of this methylation is that it can get 
passed on through the cell cycle. 
Once you've tagged the DNA with methylation, when the cell divides, 
the same positions can get methylated in the daughter cells 
that are produced from it. 
And these methylation marks, we now know, are a controlling factor 
that affects how much of a gene is expressed, or when. 
So methylation marks are an important factor, and they are outside 
the normal genetic factors: they are not part of the DNA, 
but they are inherited, if you want to think of it that way, 
from one cell to another. 
It's important to realize they are not inherited when a new individual 
is produced; they're just inherited as cells divide to produce 
new cells of the same type. 
So, epigenetic mechanisms are another type of structure that we study and 
talk about and use as a way of understanding the function of the cell.

FROM GENES TO PHENOTYPES 03:00 Hours


In this lecture, we're going to talk about genotype and phenotype and 
how we move from one to the other. 
What I mean by genotype is the collection of all the sequences of all the genes 
in your cells, or we can think of it as all the mutations of those genes. 
Your genotype is the set of sequences of your genes, which determine how 
your body functions, how your cells function, 
and whether or not you have certain traits or diseases. 
Phenotype, by contrast, refers to basically everything else: 
any trait that we can observe in someone. 
It could be something like your hair color, your eye color, 
your height, or your weight. 
It could be a genetic disease, or some other aspect of your health, 
your physical traits, or even your personality. 
So the phenotype is basically everything other than the genotype. 
And we know that some things about your phenotype are determined by 
your genes and their mutations. 
When we study genomics, we want to understand how the genotype is linked 
to the phenotype. 
There are many kinds of experiments we can do for that, 
but let's first talk about what we mean by genotype and phenotype. 
So a phenotype is something we observe. 
Here's an example based on Mendel's famous pea breeding experiments 
from the 19th century, where green peas and yellow peas are bred together. 
Just like humans and animals, plants are diploid: 
we have two copies of every chromosome, 
which means we have two copies of every gene. 
And those copies can be a little different from one another. 
The green pea here has two copies of the green color gene, 
which we label as lowercase y.
And the yellow pea has one copy of the green gene and 
one copy of the yellow gene, which is labeled as uppercase Y. 
The reason we label it uppercase is to bring across the notion 
of dominant traits. 
Loosely speaking, we think of traits as being recessive or dominant; 
there are things in between as well. 
A recessive trait is one where you only have the trait, the phenotype, 
if you have two copies of the gene variant that carries it. 
A dominant trait, and in this case the yellow phenotype is dominant, 
is one where you'll have the trait even if just one of your gene copies 
carries it. 
So, if I take the yellow pea on the left and breed it with 
the green pea on the top, 
there are four possible combinations of offspring, because each of those 
two peas contributes one of its copies at random to the offspring. 
From the green pea, because both copies of its gene are lowercase y, 
the offspring must get a lowercase green y. 
From the yellow pea, they can get either a lowercase green y or 
an uppercase yellow Y. 
Along the top, when the offspring inherit one copy of the dominant trait, 
the yellow gene, the peas themselves will be yellow. 
Along the bottom, the peas are green, because they've inherited 
two copies of the recessive trait, the green trait.
So that's a simple model of how genotype affects phenotype, 
and it's in fact a good model for many of our genes.
Here we're not talking about specific nucleotides; rather, 
the whole gene is described as recessive or dominant. 
But what we know that really means is that, underneath, 
there is some mutation in that gene that confers this phenotype.
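Here is a small Python sketch of the cross just described, enumerating the four equally likely offspring combinations (lowercase y for the green allele, uppercase Y for the dominant yellow allele):

from itertools import product

green_parent = ("y", "y")     # homozygous green
yellow_parent = ("Y", "y")    # heterozygous yellow (yellow is dominant)

for allele1, allele2 in product(green_parent, yellow_parent):
    genotype = allele1 + allele2
    phenotype = "yellow" if "Y" in genotype else "green"
    print(genotype, phenotype)
# Two of the four combinations are yellow, two are green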
Since the expansion of whole genome sequencing over the past decade or so, 
the scientific community has been engaged in a process to sequence 
thousands of individual genomes around the world. 
So we're mapping out genetic variation across the world. 
And one thing we've learned from these sequencing projects, 
and started to compile, is that people in different parts of the world 
have typical mutations that are common to that region of the world. 
If we look at people in Europe, Africa, Asia, and North America, 
as we're showing here, we'll see different patterns of mutations. 
These mutations might be single nucleotide polymorphisms, or SNPs, 
or they might be larger changes to the DNA, like insertions or deletions 
of larger chunks of DNA. 
As we've been mapping this out, we've been able to characterize 
different populations by their genotype, by the mutations in their genomes. 
Here's an example of a study that was done a few years ago 
using genome sequence from people across Europe; 
you see the map of Europe at the top. 
You can take the genetic information and, 
using a statistical method called principal components analysis, 
extract the axes along which the genotypes vary the most, 
and reduce all the genetic information, which of course comprises 
millions of sequences, down to two dimensions. 
I won't go into how that's done, but you can reduce this to two dimensions 
so that you can just plot it. 
Here we have plotted the two principal components with the most variance 
from this set of samples of people across Europe. 
And you can see, if you tilt it just the right way, that the colors 
corresponding to these individuals match up pretty closely with 
their countries of origin. 
The map in the upper right is colored by country of origin, and 
the people whose genomes were sequenced are colored by the country 
they come from. 
You can see, for example, that people who come from Spain, 
on the lower left, tend to cluster together genetically 
according to these two principal components. 
It's not too surprising that we see this, because people tend to breed, 
that is, to marry and raise children, with other people living 
in the same area, 
which means that genetic traits will tend to be regionalized in this way.
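Here is a minimal sketch of that kind of analysis using scikit-learn. The genotype matrix below is random stand-in data, not the actual study's; rows are individuals and columns are SNPs coded as 0, 1, or 2 copies of the alternate allele:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(100, 5000))   # 100 people x 5,000 SNPs

pca = PCA(n_components=2)
coords = pca.fit_transform(genotypes)   # each person reduced to two numbers
print(coords.shape)                     # (100, 2)
# Plotting coords[:, 0] against coords[:, 1], colored by country of origin,
# gives the kind of map-like picture described above.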
Diving down a little more precisely: using services such as 23andMe, 
you can actually get part of your genome sequenced today. 
You can have your genotype directly measured and learn a little about 
how it affects some traits, where we have good information about 
the connection between genotype and phenotype. 
Here's one example: the genotype at a certain site in a gene 
related to eye color. 
The common genotypes at this site are two A's (AA), an A and a G (AG), 
or two G's (GG). 
This site has been studied; it's in a gene called HERC2.
In Europeans, if you have two A's you have an 85% chance of brown eyes, 
a 14% chance of green eyes, and a 1% chance of blue eyes. 
In contrast, if you have two G's, then you have a 72% chance of blue eyes and 
a 27% chance of green eyes. 
In other words, this trait behaves pretty much like a recessive trait. 
If you have two G's, you have blue or green eyes. 
If you have two A's, you have brown eyes. 
And if you have one A and one G, you still usually have brown eyes.
The way that trait has been studied is through population studies, 
where we measure that particular mutation in that gene in lots of people 
and look at how it's associated with eye color. 
Here's one of the studies that was behind the slide I just showed you.
In a more general sense, what we do today to understand the link 
between genotype and phenotype, now that sequencing is fast and efficient, 
is to collect a large number of people who have a particular trait; 
it might be a disease, or just a trait like eye color. 
We can sequence those people, looking for mutations in particular genes, 
or, in the most general case, sequence their entire genomes. 
Here we've looked at two particular locations, or 
single nucleotide polymorphisms, labeled SNP1 and SNP2 on this slide, 
and we've looked at a bunch of cases to see how many people have a G 
at each site.
The cases are people who have the disease; the people who don't have 
the disease we call controls. 
And we ask: how many people have a G at that site? 
If you look at the numbers here, you can see that SNP1 has a significantly 
higher frequency of G's in the cases than in the controls, 
whereas SNP2 doesn't; its frequency is around 41 or 42% in both groups. 
But for SNP1, the frequency is significantly higher in the cases. 
Based on this study, we would say that that particular SNP, 
that particular genetic variation in that gene, 
tended to be associated with the disease we were studying. 
Now, even though this is a highly significant association, 
it's important to realize that just because you have a G at that location 
doesn't mean you have the disease. 
If you look at the actual raw numbers, the frequency of a G at that site 
is 52.6% in the cases versus around 45% in the controls. 
So having a G doesn't mean you have the disease; 
it just means you're somewhat more likely to have the disease 
than if you don't have a G at that site. 
These are what we call genome-wide association studies, which study, 
in large groups of people, the association between single nucleotide 
polymorphisms, or other small variations, and diseases or traits of interest.
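As a sketch of the kind of test behind such a comparison, here is a chi-square test on hypothetical allele counts, invented to roughly match the 52.6% versus 45% frequencies mentioned above (not the real study's data):

from scipy.stats import chi2_contingency

# Rows: cases, controls; columns: G alleles, non-G alleles (hypothetical counts)
table = [[526, 474],    # cases:    52.6% G
         [450, 550]]    # controls: 45.0% G

chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # a small p-value means the difference is unlikely to be chance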

POLYMERASE CHAIN REACTION 02:00 Hours

In this lecture, we're going to talk about the polymerase chain reaction: 
a remarkable and very powerful technique that lets you make as many copies 
of a piece of DNA as you want, in a really simple way. 
So how do we make copies of DNA? 
We'd like to just feed our DNA to some sort of machine like a copier, 
like I'm showing here, and make copies of it. 
And we use this in various technologies in DNA sequencing 
and genomics all the time. 
There are many, many reasons why we need to make copies of DNA. 
For example, when we're doing RNA sequencing, 
we need to turn our RNA into DNA and make lots of copies of that. 
When we're sequencing someone's genome, we don't just sequence a single cell or 
a single molecule. 
All the technology we use for sequencing requires us to have many, many 
identical copies of the DNA molecules before we get started. 
So, DNA copying is a really important basic tool that we need for 
many of the things we want to do. 
So how do we do it? 
So polymerase chain reaction uses a couple of simple properties of DNA and 
turns it into this wonderful method for DNA replication or copying. 
So recall first that DNA sticks to itself, so DNA's always double-stranded. 
So here we have two strands of DNA. 
We have A, C, G, and Ts on the top, and the complementary strands on the bottom. 
And remember that DNA is directional: 
we always talk about it going from the 5' direction to the 3' direction. 
So the beginning of the DNA sequence will be the 5' end and 
the end will be the 3' end. 
So, the fact that DNA will stick to itself is a very important property that we're 
going to use in PCR.
Another thing we need is something called primers. 
What's a primer? 
A primer is simply a short sequence, usually 15 or 20 bases long. 
It can be a little shorter or a little longer, but it's a sequence of 
DNA bases that's complementary to the DNA we want to copy. 
Here I'm showing an example with two primers, one in green and one in blue. 
Along the top strand, the forward strand as we call it, 
there's a primer at the beginning, shown in green, which is the 
reverse complement, that is, the matching bases, of the top strand of DNA 
right at the beginning. 
And on the other strand we have a different primer, shown in blue, 
going from the other end, which is the reverse complement of the 
reverse strand, annealing to the reverse strand. 
So it goes 3' to 5'. 
If I were just to mix these primers with this DNA sequence, 
they would stick to it, because they're complementary to it. 
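A minimal Python sketch of the complementarity these primers rely on: given one strand, this computes the reverse complement, which is the sequence that would stick to it (the template below is made up):

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pair = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pair[base] for base in reversed(seq))

template = "ACGTGGTCTTAAGCCA"                 # made-up template strand
primer = reverse_complement(template[-8:])    # primer matching its 3' end
print(primer)   # this short sequence would anneal to the end of the template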
So how does PCR work? 
So we start with some DNA and some primers. 
And let's not worry how we get those primers. 
But just if you're curious, 
you can actually easily generate any primer you want of any length. 
You can just order it from a company these days. 
And, in fact, you can order a mixture of all possible combinations of DNA bases, 
of say eight bases or even a little longer so you have primers even for unknown DNA. 
So what do we do with that? 
Well, now we're ready to start the process of PCR. 
We're going to heat up the mixture gently. 
What happens when you heat up DNA is that it melts: 
the two strands separate from one another, they fall apart. 
You see I've just moved them apart physically on the screen here. 
So while the mixture is hot, the primers won't stick to the DNA, and 
the two strands of the DNA won't stick to each other, either.
Then we anneal it: we cool it down gently, and the primers stick to the DNA. 
They tend to stick to the DNA before the two strands find each other, 
because the primers are small and can float around a little more easily. 
So we cool it down and let the primers stick to our DNA. 
If we kept cooling, eventually the two DNA strands would stick back 
to each other, but we don't wait that long; 
we also add another component to the mix. 
We need something now to copy our DNA. 
So we need a copier molecule. 
Fortunately, nature has provided us with a very good copier molecule 
called DNA polymerase. 
It's what all of our cells use to copy their own DNA.
So we can synthesize that and make large quantities of it and 
add that to the mixture as well. 
So, DNA polymerase acts in the following, very straightforward way. 
It looks for a place where the DNA is partly single stranded and partly 
double stranded and it grabs on to the sequence right there and starts to copy. 
The DNA polymerase here will notice that both of these primers are now 
attached to sequences that are single stranded everywhere except where 
the primers have stuck. 
It will find those sites and start to fill in the missing sequence, 
beginning at each primer. 
So that's the property of DNA polymerase that we really need, 
and there are many polymerases. 
Every living organism on the planet actually has one. 
So, we don't actually use the human polymerase for this process, 
but it doesn't matter. 
In theory, you can use any DNA polymerase to do the copying, 
although we do need a specialized, heat-tolerant one for PCR. 
The result, as I show here, is that after one round of copying, 
I'll have completely filled in the sequence across the top 
from my green primer. 
And another polymerase, assuming I added lots of polymerase molecules, 
will have filled in the sequence of the other strand across the bottom, 
starting from the blue primer. 
Now, I also need a mixture of As, Cs, Gs, and Ts, the raw material to make DNA. 
I didn't say that yet, but in addition to adding my DNA polymerase, 
I'm going to add raw As, Cs, Gs, and Ts in large quantities, 
so that the DNA polymerase can incorporate them into the new 
double-stranded DNA it's creating. 
So after one round of PCR, if we let things cool down, these two strands will 
stick together, and we've now created two strands where we only had one before. 
So we can just repeat that whole process. 
And if we repeat the whole process, we get four strands. 
And a very important property of PCR, the reason it's called a chain reaction, 
is that with each round, we double the amount of DNA we had before. 
So, very quickly, after just a few rounds, you go from just one molecule to many, 
many molecules. 
Or many, many copies. 
So, typically we'd repeat this for 30 cycles or more. 
And if you do the math, 2 to the 30th is about a billion, 
so you can take one molecule and turn it into a billion molecules 
after just a few dozen cycles of the polymerase chain reaction. 
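The doubling arithmetic is easy to check in Python (idealized, of course: it assumes every molecule is successfully copied in every cycle):

molecules = 1
for cycle in range(30):     # 30 cycles of perfect doubling
    molecules *= 2
print(molecules)            # 2**30 = 1,073,741,824, about a billion copies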
So let's look at a little cartoon of how this works, just to drive home the point. 
So we start with melting or denaturing the DNA at 94 degrees Celsius. 
So here you see a double strand of DNA with some polymerase. 
You'll notice some little arrows floating around. 
And primers are shown as the little short fragments of DNA that are floating in 
the solution here. 
And as we heat it up, 
you'll see the two strands of DNA in the middle start to denature. 
They're falling apart. 
So, you see they're melting apart there, and 
eventually they're completely separated. 
Now, an important part of PCR is you have to denature for long enough for 
the two strands to fall apart. 
But it doesn't take long at all. 
We're talking only a few minutes to do that. 
Then you cool it down to 54 degrees, and 
at that temperature the primers will stick to the DNA, as you see here. 
And the polymerase can then find these double-stranded pieces and 
start to fill it in. 
So here you see the two polymerases in this picture are starting to 
fill in the existing strand and create double-stranded DNA. 
So they'll just walk along until they get to the end of the molecule, and 
then they'll fall off. 
So now we've created two complete copies, and 
we do that at a slightly higher temperature of 72 degrees.
And then we simply repeat the whole process to make another copy. 
So in summary, the PCR recipe has the following ingredients. 
You need some DNA that you want to copy, 
it could be any DNA at all from any living organism. 
You need primers, you need DNA polymerase, 
a special copier molecule, and you need lots of A's, C's, G's, and T's. 
And then the way you execute the recipe is you melt it at 94 degrees. 
You cool it down to 54 degrees, then warm it back up to 72 degrees. 
And then simply repeat. 
So all you have to do is mix those ingredients in a single mixture and 
go through this process of heating and cooling, and 
the rest of the reaction takes care of itself. 
So that's why it's called polymerase chain reaction. 
It uses the DNA polymerase to cause this chain reaction, 
this explosion in the number of copies of your DNA. 
And it all happens fully automatically just using the properties of DNA itself, 
which hybridizes to itself, and of DNA polymerase, which copies DNA. 
This is such a clever and powerful idea, and it was so revolutionary and 
had such a great impact on the field, that not long after it was discovered, 
the Nobel Prize in Chemistry was awarded to Kary Mullis for the invention 
of PCR. 
This is just a picture from the Nobel site describing his Nobel lecture.

NEXT GENERATION SEQUENCING 02:00 Hours

In this section we're going to talk about Next Generation Sequencing. 


We're just going to cover the basics, but 
it's important that you have an idea of how sequencing works so 
you'll understand the data you're looking at. 
Next generation sequencing is a term we use to describe the latest sequencing 
technology, which has been around since about 2007; 
we'll probably come up with a better name for it over time, 
but for now we call it NGS, or Next Generation Sequencing. 
Sequencing has gone through a number of generations. 
In the 70s, 80s, and 90s we used a technology called Sanger sequencing, 
because it was invented by a scientist named Fred Sanger. 
That sequencing was at first very manual, very painstaking, and slow. 
In the 1980s a number of companies put together automated 
DNA sequencers, letting people sequence things 
much faster and more efficiently than had ever been possible before. 
And that's what we call Sanger sequencing; 
that's really the first generation of sequencing.
As time went on, in the 1990s a technology called DNA microarrays was 
invented, which wasn't exactly sequencing. 
It involved attaching DNA to a tiny slide and then 
measuring other genomes by letting their DNA or RNA stick to that DNA.
Next Generation Sequencing:
Starting approximately 2007 (more often looks at single molecules)
State now: Not mature. 

Most Common DNA Sequencing: Second generation sequencing. 

Second Generation Sequencing.


- DNA sequencing technologies rely on taking a DNA template, the DNA that you
want to sequence, and copying it. 
- The molecular machinery that cells use for copying DNA is a key tool in this
process. 
- DNA is copied by a molecule called DNA polymerase, which takes free
nucleotides, the As, Cs, Gs, and Ts that are floating around in your cell, or
that you can synthesize so they're floating around in your test tube, 
and uses them to copy the DNA that you're trying to sequence. 
And of course when we copy, 
we use the base-pairing rule of DNA: Gs always bind to Cs, and As always bind to Ts. 
So you start with single-stranded DNA, then you add lots of As, Gs, Cs, and 
Ts plus some DNA polymerase, and you can synthesize the complementary strand. 
Now, if you could just watch this copying happening, base by base, 
that would give you the sequence; what sequencing technologies do is find 
a way to observe and measure your DNA while all this is going on. 
So how do you do that? 
The way next-gen sequencing does it 
is we take our template DNA and chop it up into small pieces, 
typically a few hundred bases long, maybe as much as 
1,000 bases long, and we chemically attach those to a slide. 
On the slide, 
there will be millions or tens of millions of these fragments of DNA. 
But never mind; we'll just show you an example where two 
random pieces of DNA are attached to the slide.
Then, on the slide itself, we use the polymerase chain reaction, or 
PCR, to make many identical copies of those pieces. 
You get these little clusters you're seeing here, where each cluster 
might contain a few million copies, a relatively small number, 
of one of these very short DNA fragments. 
They're all single stranded, and 
they're all stuck to these spots on the slide. 
But you still haven't sequenced anything; you've just made copies of 
these fragments. 
You still don't have any idea what's on the slide. 
You need to read it off somehow. 
So how do we do that? 
We're going to use, again, the property that when DNA is being copied, 
a single strand will always pair with the complementary base. 
We can add to the slide some nucleotides, As, Cs, Gs, and Ts, 
that are labeled in a special way so that when we hit them with light 
from a laser, they fluoresce in four different colors. 
If we hit, say, the Ts here with the light, they'll fluoresce in red, 
and we'll know we're looking at a T. 
So we want to add these to the slide and have them hybridize, 
or base pair, with the single-stranded templates that we have on the slide. 
A very important property of next generation sequencing is that 
these nucleotides are specially modified in two different ways. 
One is that they fluoresce: 
each of the four nucleotides fluoresces in a different color. 
The other very important property, and this was a very clever invention, 
is that they have a terminator modification. 
That is, once the DNA polymerase, which is attaching them to these templates, 
has added one of these molecules, it can't add another one. 
It can only add one, because they're chemically modified to have a terminator. 
However, it's a reversible modification. 
So after we take our picture, we can remove the modification and 
let the polymerase go on and add the next base. 
So what these next-gen sequencers do is, in parallel, 
at millions of spots on the slide, add a single base at each spot. 
And every spot, because it has a different template, 
may have a different base added. 
Then, we just take a picture. 
We shine a light while imaging the slide, and that shows us, 
at every spot, the color of the base that's just been added 
to that template.
So we go through cycles of sequencing. 
In each cycle, we sequence another base: we add one more base to the 
new strand that we're synthesizing from the template. 
We get pictures like the ones you're seeing here for cycles one through five; 
we need to register these pictures so they're all lined up. 
If we just look at the spot in the lower left, we'll see it's green, 
and then in the second cycle it might be red, and so on. 
As we go through the cycles, 
we can call the bases, because the colors tell us what those bases are. 
At each spot on the slide we're reading off a different sequence, 
and we read all of them, of course, in parallel, 
all the sequences attached to the slide at once. 
With this technology there will be millions of these short fragments 
attached to the same slide. 
So when we're done, we'll have millions of sequences all at once.
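As a toy illustration of base calling (the color-to-base mapping here is arbitrary, not the actual dye chemistry of any instrument), here is how the per-cycle colors at one spot turn into a read:

# Hypothetical mapping from fluorescence color to base
COLOR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}

# Colors observed at one cluster over five sequencing cycles
cycle_colors = ["green", "red", "red", "blue", "yellow"]

read = "".join(COLOR_TO_BASE[color] for color in cycle_colors)
print(read)   # "ATTCG" -- one base called per cycle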
An important reason you need to understand this process, and this data, 
is that there are errors in it. 
There are a number of sources of error. 
One property is that errors increase in later cycles. 
The reason this happens is that, remember, we're making a few million 
identical copies in one of these clusters, and 
we're sequencing that cluster one base at a time. 
This process of DNA polymerase adding a base isn't perfect. 
We hope the polymerase will add a single base to every fragment 
in the little cluster in each cycle. 
But once in a while, it'll add an extra base, and 
that fragment will get ahead. 
And once in a while it'll fail to add a base, and that fragment will get behind. 
What that means is that instead of all these molecules 
showing the same color, a few of them, at first a very tiny number, 
will have the wrong color when we hit the cluster with the light. 
As time goes on, the number of these template strands 
that are out of sync will increase, and that increases your error. 
That's why we can't just read thousands of bases using this technology: 
eventually the error is too great and we can't read the sequence accurately.
At the end, the readout you get from this process is, 
at every spot on the slide, a read. 
A read is just a sequence of As, Cs, Gs, and Ts, and 
here's one of the formats we commonly see it in. 
Associated with every one of those As, Cs, Gs, and Ts is a quality value, 
which is the base calling software's estimate of how likely it is 
that there's an error at that position. 
It computes that by measuring how pure the color signal was. 
As the sequence goes on, the color isn't quite as pure, and 
the base calling software can tell. 
So it will still give you its best guess of the base, 
but with a higher likelihood of error, and the quality will go down.
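Here is a small Python sketch of such a read in FASTQ format, decoding each quality character into an error probability with the standard Phred+33 encoding (the record itself is made up):

# One made-up FASTQ record: name, sequence, separator, quality string
name, seq, _, quals = "@read_001", "GATTACA", "+", "IIIIFF#"

for base, q_char in zip(seq, quals):
    q = ord(q_char) - 33            # Phred+33: quality score from the character
    p_error = 10 ** (-q / 10)       # probability the base call is wrong
    print(base, q, round(p_error, 5))
# 'I' is Q40 (1 error in 10,000); '#' is Q2 (very unreliable)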

APPLICATION OF SEQUENCING 02:00 Hours

So in this lecture, we're going to talk about next generation sequencing 
applications. 
The introduction of next generation sequencing technology, which has made 
sequencing so fast and so cheap, has allowed scientists to come up with all 
sorts of creative new types of experiments that they can use sequencing to do. 
So, another way to think about it is that we can now ask scientific questions and 
answer them with sequencing. 
These are questions we've had for decades in many cases, but 
sequencing was simply too expensive or too slow to answer them before. 
Today, through next generation sequencing applications 
and some clever new experimental methodologies, 
we can answer all kinds of interesting questions using sequencing alone. 
So let's look at some of those methods right now. 
The basic idea is that we need to create DNA, 
because if we're going to sequence, we need DNA. 
We might start with DNA and just sequence it, or 
we might start with RNA, convert it to DNA, and sequence that. 
Then we apply second generation sequencing to measure something. 
So, let's look at a few applications like this.
One very popular application today is exome sequencing. 
So what's the exome? 
The exome is the collection of all the exons in your genome. 
And what are exons? 
We've talked about that in other lectures, but let me quickly review. 
Your DNA gets transcribed into RNA, 
and the RNA then gets divided into exons and introns. 
The introns get thrown away, the exons that remain are concatenated together, 
and those exons then get translated into proteins. 
So if you want to know what proteins are being turned on in a cell, 
or in a collection of cells, you need to know what the exons are. 
In particular, in the world of genetics, when we're looking for 
genetic mutations, we're usually interested first 
in mutations that affect proteins. 
Those mutations should occur in exons. 
So we can capture just the exons in a cell and sequence those. 
You might say, well, why would we do that 
when we could sequence the whole genome? 
Well, the exons comprise only about 1.5% of your genome. 
So you can sequence much less DNA and still get a picture of 
your entire exome. 
So how do we do that? 
Only some parts of the DNA, like I said about 1.5% of your genome, or 
on the order of 30 to 60 million bases, will be captured as exons, 
and there are kits that will capture this for you. 
We want the protein-coding exons, so we fragment the whole 
genomic DNA from the person whose exome we're sequencing; 
some of those fragments will be exons and pieces of exons, 
and some won't. 
We want to capture just the exons. 
The kits that have been developed use a bead, typically a magnetic bead, 
to which pieces of single-stranded DNA found only in exons are attached. 
When you're preparing your DNA for 
sequencing, you make it single stranded by heating it up a little bit. 
The DNA that belongs to exons will then hybridize 
to the complementary DNA attached to those beads. 
You can then pull the beads down, 
remove the captured DNA attached to them, and sequence it. 
That way you have sequenced just the exons. 
So that's exome sequencing: you only sequence the exonic parts of the DNA, 
and kits today will capture on the order of 50 to 60 million base 
pairs out of a person's genome when they're doing exome sequencing.
Another technology is RNA-seq, or RNA sequencing. 
This involves trying to capture 
all the genes that are being turned on in a cell, or in a collection of cells. 
As I just said, to produce a protein, 
DNA first gets transcribed into RNA, which is then translated into protein. 
So if we can capture the RNA, that gives us a picture of which genes are being 
expressed, or turned on, in a particular cell, set of cells, or cell type. 
A very important feature of the mRNA molecule 
is that after transcription, the cell attaches a long string of A's to it. 
That's the basis of RNA-seq technology: all the molecules 
we're interested in have these long stretches of A's on the end. 
Anything that doesn't have a long stretch of A's we can ignore.
We capture that poly-A tail by various techniques. 
Basically, we use a string of T's that we know will stick to all those A's. 
We attach those T's to something we can grab hold of, 
and through that we capture the mature mRNA by its poly-A tail.
And once we've done that, there's a catch: we can't sequence RNA directly. 
We have to turn it into DNA. 
Fortunately, we have a very useful molecular mechanism, 
invented by evolution, that does reverse transcription; 
there are a number of viruses that do this as their way of surviving. 
So we have reverse transcriptase, and there are a number of different reverse 
transcriptase molecules we can use that will take RNA and copy it back into DNA. 
Rather than going from DNA to RNA, you can go from RNA to DNA using this 
special molecule called reverse transcriptase, and 
that gives us DNA that matches the RNA we've just captured. 
Once we have DNA, we just sequence it. 
From that point on, it's a computational problem to figure out 
which genes were turned on in those cells. 
A very complicated computational problem, but in trying to solve that 
problem, it's important to realize where this data came from.
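A minimal Python sketch of the information flow (not the biochemistry): capturing a molecule by its poly-A tail and reverse-transcribing the RNA into complementary DNA:

def reverse_transcribe(rna):
    """Return the cDNA strand complementary to an RNA sequence."""
    pair = {"A": "T", "U": "A", "C": "G", "G": "C"}
    return "".join(pair[base] for base in reversed(rna))

mrna = "AUGGCCUAAAAAAAAA"       # made-up mRNA ending in a poly-A tail
if mrna.endswith("AAAA"):       # the poly-A tail is the capture handle
    cdna = reverse_transcribe(mrna)
    print(cdna)                 # starts with a run of T's, from the tail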
A third technology that's become very popular since next generation 
sequencing was introduced is ChIP-seq. 
ChIP-seq addresses a different problem: 
understanding where on the DNA certain proteins bind. 
The way our cells control gene expression, so that 
cells can behave differently from one another, is that some genes are 
inhibited in certain cell types, or enhanced in others. 
The way that happens is through transcription factors: proteins, 
themselves encoded by genes, that bind to the DNA and control 
the expression of the genes near the place where the protein is binding. 
And we'd like to know where that's happening. 
Of course, we don't have microscopes today that let us just 
look at the chromosome and see where there are proteins bound to it, 
but we can do something indirect, again using sequencing. 
We can basically freeze the proteins right onto 
the DNA through a process called cross-linking. 
We take a set of cells that we're interested in, a particular cell type, 
and cross-link the proteins to the DNA in those cells, so 
now each protein is stuck in place. 
Then we want to know: where was it stuck?
So what we can do is take that DNA with the proteins stuck to it and fragment it into millions and millions of fragments. Most of them will not have any protein stuck to them, but some of these short fragments will. We can then grab those proteins: in ChIP-seq, antibodies have been designed that will pull those proteins out of the mixture. And when we pull them out, because they're frozen to the DNA, we pull out the little fragments of DNA that those proteins were bound to at the same time. We can then remove the protein and sequence the DNA. Again, we've turned our problem into a sequencing problem. The sequences that come out are short fragments, and we know, because of the way we did the experiment, that those fragments were protein binding sites for whatever we were targeting with that antibody.
Then finally, let me talk about one more technique, which is called bisulfite sequencing, or methyl-seq. This is a way of determining where on the genome the DNA has been methylated. Methylation is an important epigenetic modification that also affects which proteins are being expressed in the cell. And these methylation marks, or methyl groups, can be passed on from one cell cycle to another as cells divide. So how do we figure out where the DNA was methylated? One way to do this experiment is to split your DNA into two identical samples: you take two very small samples, or aliquots, of DNA. Then you treat one of them in a special way, doing something called bisulfite conversion. This biochemical process converts all of the C's that are not methylated into U's. One thing I didn't mention is that methyl groups are always attached to C's; that's the only nucleotide they get attached to. So now we have two identical samples, in one of which we've converted the DNA in a special way so that many of the C's are now U's. Then we sequence again, and we have to compare the two. This requires very specialized programs that can do not just the comparison but also the alignment, because the converted DNA no longer matches the reference genome we usually align to very well. So we need to use a special aligner that allows these U's to match what would have been C's in the original genome. So that's methyl-seq; it's a way of measuring methylation on a set of cells or tissues.
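To make the logic of that conversion concrete, here's a toy Python sketch (not a real methyl-seq pipeline; the sequence and the methylated positions are made up). It simulates bisulfite conversion, where unmethylated C's become U's that the sequencer reads as T's, which is exactly why the converted reads stop matching the reference:

def bisulfite_convert(seq, methylated_positions):
    # Simulate bisulfite conversion: unmethylated C -> U (read as T);
    # methylated C's are protected and stay as C.
    out = []
    for i, base in enumerate(seq):
        if base == 'C' and i not in methylated_positions:
            out.append('T')   # converted to U, sequenced as T
        else:
            out.append(base)  # methylated C's and A/G/T are unchanged
    return ''.join(out)

genome = "ACGTCCGATC"
read = bisulfite_convert(genome, methylated_positions={4})
print(read)  # ATGTCTGATT: every C except the methylated one now reads as T

Comparing the converted read back to the original: wherever the genome has a C but the read has a T, that C was unmethylated; wherever both have a C, it was methylated. A bisulfite-aware aligner has to tolerate exactly these C-to-T mismatches.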

WHAT IS COMPUTER SCIENCE? 01:00 Hours


This lecture is about what computer science is. Computer science, broadly speaking, comprises many different activities. Within the field, we sometimes divide those activities into Theory, Systems, and Applications, so let me say quickly what those three categories mean. By theory we mean theoretical work on what computers can do, what kinds of problems can be computed. By systems we mean the study of computers themselves: the building of the software that runs computers, the operating systems, the design of the programming languages used to program computers, and things like that. And applications are sort of everything else, the things we use computers for. In fact, as computer science has matured as a field, many of the applications within it have spread to other fields and sometimes aren't even called computer science anymore. So applications is a very broad area: using computers, generally in a sophisticated way, to do something else, rather than studying the computer itself.
An important aspect of computer science is to think computationally. Computer scientists approach a problem in a way they have been trained to, which requires an understanding of what a computer is and what computers can do. It's not merely knowing how to program. Thinking computationally is about thinking of a problem in terms of dividing it into tasks that can be done by a computer and that can be very precisely described, because the computer is going to do exactly what you tell it and nothing else; you need to think through a problem very carefully if you're really thinking computationally about it. Now, turning to one of those major areas of computer science, the systems area: one of the things computer scientists spend a lot of time on, though we won't talk about it in this course, is building the software that runs the computer. This is invisible to most of you, but for those of you using a Macintosh computer right now, underneath that computer is an operating system called UNIX, which is the most mature and oldest operating system out there; it's actually been around for well over 30 years now. The computer's operating system has to do things like move data from the disc drive, or other devices, to the internal memory of the computer, which is called RAM, or Random Access Memory. The operating system has to handle running the programs; the programs themselves have to be loaded into memory. It goes out and fetches the data that the programs will process, and then sends data out in the form of output on your screen, or printouts to a printer, and things like that. So the operating system has to manage all of that. And today's computers are essentially all multiprocessor computers, so the operating system manages that, too. That's what allows you to run multiple programs at the same time and have the computer respond robustly. So that's what we mean by the systems-level aspect of computer science.
Now, another aspect of computer systems is designing programming languages. You're going to learn about Python later in this course; that's one programming language. There are many programming languages out there, dozens if not hundreds. Popular ones today are things like Python, Perl, C++, Java, and others. A programming language is really how we talk to computers. They aren't really anything like natural languages, such as the English I am speaking to you in. They're very highly constrained languages. They have specific defined terms that have a specific meaning. And when you write programming language code, you're telling the computer what to do, in a very precise way.
So programming languages are designed to let us talk to the computer and get it to do the things we want it to do. But they're limited; they're not like natural languages. Now, another aspect of computer science is engineering, and there's going to be a whole other lecture on software engineering. Writing good code is not easy to do. Just because you can write a few lines of code in a language like Python or Perl doesn't mean that you're engineering the code well. We'll talk about that more later, but really, what engineering means is testing your code. After you've written code, you need to test it and make sure it works in all the cases and all the conditions it needs to work under, and that it delivers what it's supposed to do. Because when someone is running a program, they shouldn't have to think about what the program is supposed to do, and the program shouldn't, for example, crash, or hang the computer, leaving you with, say, a spinning wheel that makes you wait forever.
The last aspect of computer science I'll talk about, and there are certainly many that I'm leaving out here, is the hardware aspect. Someone has to worry about how the software I've been talking about so far, that is, the programs that you write, whether operating systems or applications programs, talks to the actual device itself. Many of you are probably using a laptop or a desktop computer, but there are now mobile computers that you're walking around with. Those include your phone, probably most commonly, but more and more commonly we also see little robots like the one I'm showing you in this picture. Devices that are primarily controlled by computers are also one of the main aspects of the study of computer science today. And as time goes on, we expect to see more and more of the devices in our world controlled by programs. If you want to understand how these devices are working, and why they're doing what they're doing, it helps to think like a computer scientist.

ALGORITHMS 01:00 Hours

This lecture is going to describe what an algorithm is. 


So, algorithms are a way of describing what a computer can do, 
and we talk about them in the context of computer science and 
computational biology all the time, but algorithms don't really need computers. 
An algorithm is really just a very clear, 
step-by-step series of instructions on how to do something. 
So here is an example of an algorithm. 
It's a recipe that tells you how to make chocolate brownies. 
So you preheat the oven, you melt some chocolate chips and some butter together, 
cool it off, stir in some eggs, then stir in the additional ingredients, 
spread them out in a pan, and bake them in an oven. 
Now presuming you know what all those words mean and 
you know what some of the instructions require you to do, 
such as preheat the oven, this is a complete set of instructions that lets 
you go from a written description to an actual product. 
So algorithms don't need computers. 
They're just a way of describing in full detail something that we want to get done. 
Now, switching to a more computational example: another kind of algorithm that we often deal with is how to find a maximum in some space. This is a common sort of task in a variety of optimization problems, in computer science and computational biology and in other fields. One way to find a maximum: imagine you're actually walking around in a landscape and you want to find the highest place in that landscape. One way you could do that is to look around you, find the direction in which you would be going uphill most rapidly, in other words the steepest slope, take a step in that direction, and just keep doing that. With that algorithm, if you look at this little cartoon example, you would go from a valley up to the top of the nearest peak. At some point, following that algorithm, you wouldn't be able to proceed. So remember, the algorithm is: look around, find the direction with the steepest upward slope, and go that way. If you're in an area where everywhere you look is downhill, you can no longer proceed, so the algorithm halts at that point. Now, one interesting thing about this algorithm, and there's a whole family of algorithms like it, is that it's not actually guaranteed to find the highest point in the world. It will just find the highest point in your local environment. But still, it's a useful kind of algorithm and it's a way to think about the problem.
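Here's a minimal Python sketch of that idea on a made-up one-dimensional landscape (the terrain values and starting points are purely illustrative). It steps to the higher neighbor until everywhere it looks is downhill:

def hill_climb(heights, start):
    # Walk uphill on a list of heights until no neighbor is higher.
    # Returns the index of a local maximum (not necessarily the global one).
    pos = start
    while True:
        neighbors = [i for i in (pos - 1, pos + 1) if 0 <= i < len(heights)]
        best = max(neighbors, key=lambda i: heights[i])
        if heights[best] <= heights[pos]:
            return pos          # everywhere we look is downhill: halt
        pos = best              # step in the steepest upward direction

terrain = [1, 3, 5, 4, 2, 6, 9, 7]
print(hill_climb(terrain, 0))   # 2: stuck on the local peak of height 5
print(hill_climb(terrain, 4))   # 6: reaches the global peak of height 9

Note how the answer depends on where you start: from one valley you get stuck on a local peak, from another you reach the global one, which is exactly the limitation described above.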
Let me talk about one more algorithm. A very common problem we have in computer science is sorting. There are all kinds of reasons for sorting: sometimes we want to put things in alphabetical order, or in numeric order, or we want to rank things in some other way. There are many algorithms for sorting; here I'm going to illustrate just one of them. Let's consider the digits 0 through 9, and suppose they're given to us out of order, and we want to sort them. They might be attached to something else; maybe they're rankings of candidates for something, and you want to put them in the right order, where 0 will be the minimum. One algorithm for doing that goes as follows. First, put all the objects in some sort of list data structure where you can scan through the list from beginning to end. Now go to the beginning of the list and compare the first two objects, in this case 6 and 3. If they're in the wrong order, simply flip them. Now move on and compare 6 to the next item in the list. If they're in the right order, we go back to the beginning and scan forward until we find two items that are in the wrong order again. So we do that: we get to 8 and 1 and say, oh, wrong order, so we flip those, and then keep going from there. We look at 8 and 9; they're in the right order. So we go back again to the beginning of the list and scan forward until we get to the first pair that's in the wrong order. There's 6 followed by 1, so we flip those. We move on and continue this process, always flipping adjacent items that are out of order, moving back to the beginning of the list and scanning forward, until eventually the entire list is sorted.
This algorithm is guaranteed to work. If you write it down in code, it will always sort your list. It's not the most efficient algorithm possible, because it actually makes many passes through the list, and depending on how scrambled the numbers were in the beginning, it might or might not give you an answer very quickly. We'll be talking about efficiency in another lecture, but this is a way of introducing the idea of efficient algorithms. An algorithm is a complete and precise description of how to do something. But if the amount of data you're dealing with is large, you also want to think about how to do it as fast as possible. Just because an algorithm completes and gives you the correct answer doesn't mean it's always the best algorithm to use. And that bubble sort algorithm, which is what I've just described, is in fact not the fastest way to sort a list of numbers. There are better ways to do it.
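In Python, the bubble sort just described, usually written so that each pass simply continues through the list rather than restarting after every flip, looks something like this sketch:

def bubble_sort(items):
    # Repeatedly scan the list, flipping adjacent pairs that are out of
    # order, until one full pass makes no flips. This takes many passes
    # over the data, which is why bubble sort is slow on large lists.
    items = list(items)            # work on a copy
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
    return items

print(bubble_sort([6, 3, 8, 1, 9, 0, 5, 2, 7, 4]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]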

MEMORY AND DATA STRUCTURES 02:30 Hours

This lecture is about how we store data: data structures and computer memory. Data structures are an important part of computer science that let us store data efficiently, and especially when we're dealing with very large data sets, we have to think carefully about how we store data. For example, in the context of genomics, we deal with large amounts of sequence data, sometimes aligned, as you can see on this slide. We have sequences for many people or many species, and we might have many sequences from each of them; we have to think about how to get those into memory. Now, the way the sequences come at us from a sequencing machine is just as a long string of text characters. Computers are very good at storing text; that's what they were originally designed to do in many cases, so computers already have a way to store that. But sometimes we can think about more efficient ways to store it. For example, when we're looking at a multiple alignment of lots of sequences, there are lots of things those sequences have in common, so we might want to store something a little bit smaller than all the separate sequences and just store the differences between them.
Another important aspect of designing data structures and memory is that we want to be able to find things. When we're looking at DNA sequences, they all pretty much look alike: they're a long string of As, Cs, Gs, and Ts. And when we're looking at the human genome, we have three billion base pairs to search through. So, for example, an interesting problem to think about from the data structure point of view in genomics is: I've got a little sequence of, say, 100 nucleotides, 100 DNA letters, and I want to store it in such a way that I can go and quickly find it again, in the context of a three-billion base pair sequence. Rather than throwing it in a random pile, I'd like to store it in a way where I can go back and quickly find it. For example, with a DNA sequence from the human genome, we might want to store some kind of tag that says what chromosome it's on and at what location in that chromosome it starts.
The kinds of data structures that computer scientists have designed over the years vary widely, but one of the most common is a tree. There's also something called a list, something called a linked list, and many variants on these data structures. These are simply ways of keeping track of the objects you've stored and having objects point to one another, so that once you get to one object you can quickly find another. To understand how these structures work, it's important to understand that in computer memory we not only have the data itself, we also have an address. Every piece of memory has an address, and we can find any object in memory if we know that address. In programming language terms, those addresses are called pointers, and a pointer will take us directly to a piece of memory. So if we store a piece of sequence data somewhere in the computer's memory, to retrieve it again we simply need a pointer. And one way to think about an efficient data structure is: if we have lots of sequences that are near each other in the genome, we might want to store pointers from one of those sequences to the next one, so that once we're in the right place we can quickly find these other sequences without having to start over again from scratch.
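Here's a toy Python sketch of that idea (the class and field names are mine, purely for illustration): each stored sequence carries a pointer, which in Python is just an object reference, to the next sequence along the genome, so once we've found one we can walk to its neighbors without searching from scratch:

class SeqNode:
    # One stored sequence plus a "pointer" to its neighbor.
    def __init__(self, chrom, start, seq):
        self.chrom = chrom   # which chromosome the sequence is on
        self.start = start   # where it starts on that chromosome
        self.seq = seq       # the DNA letters themselves
        self.next = None     # pointer to the next node, if any

a = SeqNode("chr1", 1000, "ACGTACGT")
b = SeqNode("chr1", 1050, "TTGACCA")
a.next = b                   # link a to its neighbor

node = a
while node is not None:      # walk the list by following pointers
    print(node.chrom, node.start, node.seq)
    node = node.next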
Another aspect of thinking about data structures is making them efficient. As I said before, in the genomics world we're mostly dealing with sequence data. Sequence data has a natural representation as letters, and computers typically represent a letter as one byte. Any letter of the alphabet, a through z, and any numeral, zero through nine, is represented the same way inside the computer, using one byte. The word byte actually comes from the word bit; a bit is a binary digit, just a zero or a one, and that's the most fundamental level at which computers represent information. If you take eight bits in a row, you can consider that an eight-bit binary number, which can store 256 values. The standard representation of text inside the computer, ASCII, uses 128 of those values, usually considered to be the values 0 to 127, and represents every character as one of them. With that much space, we can represent all the lower case letters, all the upper case letters (that's another 26), and the ten single digits (that's another ten), and then we have room for all the special characters. So basically everything on your computer keyboard is represented as a single byte.
However, if you look at DNA, you see right away that there are actually only four letters. So we can do much, much better when we're representing DNA, and this is how most serious, highly efficient programs for processing lots of DNA operate internally. Instead of representing the four DNA letters as one byte each, we can represent them as just two bits: represent A by the two bits 00, C by 01, G by 10, and T by 11. By doing it this way, we get a fourfold compression. Instead of using eight bits per letter of DNA, we're only using two bits. Because we're storing gigabytes or even terabytes of DNA sequence data, a fourfold compression right out of the box is an important efficiency we can gain from that representation.
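As a minimal sketch of that encoding, here's Python code that packs a DNA string into an integer at two bits per base and unpacks it again (a real tool would pack into a byte array, but the bit arithmetic is the same):

ENCODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(seq):
    # Pack A/C/G/T into a single integer, two bits per base.
    bits = 0
    for base in seq:
        bits = (bits << 2) | ENCODE[base]
    return bits

def unpack(bits, length):
    # Recover the sequence; the length must be remembered separately.
    bases = []
    for _ in range(length):
        bases.append(DECODE[bits & 0b11])
        bits >>= 2
    return ''.join(reversed(bases))

packed = pack("GATTACA")
print(packed, unpack(packed, 7))  # the 7 bases now fit in just 14 bits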
Finally, to look at a slightly more sophisticated way of representing DNA: one thing we like to capture in analyzing DNA is patterns of sequence that have some biological function. Here I'm showing you a picture of the ends of an intron. Introns are the interrupting sequences in the genes in our genome that don't encode protein, but get snipped out and thrown away in the process of going from DNA to RNA to protein. Introns almost always start with the letters GT, and they almost always end with the letters AG. If you collect lots of them together and notice what these patterns have in common, you can create a probabilistic picture of which letters are most likely to occur at the beginnings and ends of introns. These two pictures show exactly that, for the beginning of an intron, which is called the donor site, and the end, which is called the acceptor site. Now, we could represent all the donor sites we've ever seen as a big set of strings, say ten letters long, if we chopped out a window of ten bases around those sites. Or we could be much more efficient about it, and capture much more interesting information, by computing, for every position in that little window, the probability that the letter is A, C, G, or T. The logos you see across the top use the height of each letter to represent the probability that it appears at that location. With this kind of representation we've essentially compressed the information from hundreds or even thousands of sequences into a simple pattern, which we can then use to process other data, for example to recognize these patterns when we see them again.
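Here's a small Python sketch of how such a probabilistic picture is built; the four example donor-site windows are made up, and a real profile would be computed from hundreds or thousands of sites:

sites = ["CAGGTAAGTA",
         "AAGGTGAGTC",
         "CAGGTATGTA",
         "TAGGTAAGGA"]

length = len(sites[0])
profile = []
for pos in range(length):
    # Count how often each base appears in this column of the alignment
    column = [s[pos] for s in sites]
    probs = {b: column.count(b) / len(sites) for b in "ACGT"}
    profile.append(probs)

# e.g., the probability that position 3 is a G (it always is, here):
print(profile[3]["G"])   # 1.0

The resulting table of probabilities is exactly the information the logo draws as letter heights, and it can be used to score new sequences for how well they match the pattern.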

EFFICIENCY 01:00 Hours


In this lecture I'm going to talk about efficiency: the concept of computational efficiency, or algorithmic efficiency. Efficiency is a key topic in computer science, and it's important for all the algorithms you try to implement or run, especially with big data. And genomics today really does deliver big data. I'm going to illustrate this topic with just one example. Consider the problem of how we deliver the mail every day. Mail is collected from various sites and brought to a central location, which we'll call a warehouse, and the warehouse for a particular region will contain all the mail that's supposed to be delivered that day. One algorithm for delivering the mail would be that the mail carrier goes in his truck to the warehouse, picks up all the mail for, say, my house, drives over to my house, delivers the mail, and then goes back to the warehouse; picks up the mail for my neighbor, drives to the neighbor's house, delivers the mail to the neighbor, and then drives back; and so on. Now, that algorithm will definitely get the job done: it will get all the mail delivered to all the people who need their mail. But, as most of you probably realized right away, if we're talking about more than a few houses, this mail carrier is not going to get much mail delivered in the course of a day.

The problem is that the mail truck is going back and forth many more times than it needs to. Clearly, it would be more efficient if the mail truck could go to the warehouse, pick up the mail for my house and my neighbor's house, drive to my house, deliver my mail, and then drive, or just walk if the houses are close to each other, to the next house, deliver the mail there, and then go back to the warehouse. In the context of algorithmic efficiency, this is actually a class of problems sometimes called traveling salesman problems, where what you want to do is much more sophisticated than what I just described. What you want to do is think about how much mail the truck can fit, go to the warehouse and fill the truck each time, and deliver all that mail before going back to the warehouse. To do that efficiently, you'd like the truck to drive to as few places as possible, covering as little distance as possible and getting as much mail delivered as possible in the same amount of time. To do that, you really want to look at a map of all the houses in an area, and here's an overhead picture of a neighborhood. You want to look at an area and say: okay, if I can put mail for a hundred houses or a thousand houses in the truck at once, can I find a route that allows the truck to visit all those houses as efficiently as possible, delivering a little bit of mail at each house, and only at the end, when the truck is empty, going back to the warehouse and refilling it. So that's an example, not of the algorithm itself, but of how we think about making an algorithm efficient. We look at what the computer is doing; we look at how often, for example, it's going back and forth to memory to retrieve data; and we ask, is there a way to do that in fewer steps, in less time, using less memory? That's mainly what we mean when we talk about efficient algorithms.
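As an illustration, here's a toy Python sketch of one cheap heuristic for that kind of routing problem: from the warehouse, repeatedly drive to the closest house that still has mail. The coordinates are made up, and this is a greedy heuristic, not an exact traveling-salesman solver:

import math

warehouse = (0, 0)
houses = [(2, 3), (5, 1), (1, 7), (6, 6), (3, 2)]

def dist(p, q):
    # Straight-line distance between two points
    return math.hypot(p[0] - q[0], p[1] - q[1])

route, here = [], warehouse
remaining = list(houses)
while remaining:
    nearest = min(remaining, key=lambda h: dist(here, h))
    route.append(nearest)      # deliver to the closest undelivered house
    remaining.remove(nearest)
    here = nearest

print(route)  # one plausible delivery order, ending back at the warehouse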

SOFTWARE ENGINEERING 03:00 Hours


This lecture is about software engineering in the context of genomic data science. Software engineering is often given short shrift in the world of computer science and programming, and in the world at large. But software engineering is critical to almost everything we do in computational analysis of data. By engineering, I mean paying attention not only to what the software does, but to how reliable it is, how many cases it handles, and whether it's really performing the way you expect it to perform.
When we write programs to do a lot of data analysis, our programs contain equations, which describe relationships between variables. Here's a very simple equation: z equals x over y. Everybody sees equations like this when they're taking algebra in middle school or high school. So this is math; we're familiar with that.
But programming is a different thing. We have these same equations in our programs, but we have to do more. Of course, you can assign a variable like z the value x over y in a program. But you have to check to make sure that y isn't equal to 0. Here's, in pseudo-code, how we would write that in a program: if y is not equal to 0, then we can assign the value x over y to z. Why do we have to worry about things like that? Because we know that undefined values are going to cause problems; they're going to make our program crash. So we have to check for conditions that are unexpected but that might come up and would mess up our program in some way. Software engineering is all about thinking about all the different cases your program is going to be handling, or trying to think of all those cases, and writing code to make sure those cases are handled.
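In Python, that pseudo-code might look like the following sketch, shown both as an explicit check and in the exception-handling style that's idiomatic in Python:

def safe_divide(x, y):
    # Guard against the unexpected case before computing
    if y != 0:
        return x / y
    return None          # signal "undefined" instead of crashing

def safe_divide_v2(x, y):
    # Same idea, using exception handling
    try:
        return x / y
    except ZeroDivisionError:
        return None

print(safe_divide(6, 3))   # 2.0
print(safe_divide(6, 0))   # None: handled, not a crash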
When we deal with big data sets, all sorts of bizarre cases that you might think rarely happen, and that in fact are very rare, do happen, because the data sets are so large. So good software engineering is critical to reliable computational biology and analysis programs. Why do you need to understand genomic software? If you move into the field of computational biology, or computational data science, you're going to need to run programs on very large datasets. You need to understand what those programs are doing and not treat them simply as black boxes that feed you an answer. If you don't understand what the programs are doing under the hood, you can be very confused by the output. So I'm going to give you one example, to give you an appreciation for software engineering and also for the kinds of results a program can report that can initially confuse you.
So, a little background here. There's an interesting process in the world of genomics called RNA editing. What is RNA editing? It has nothing to do with software engineering. DNA, as we've already explained, gets converted into proteins by first being copied, or transcribed, into RNA, and that RNA gets translated into proteins. Now, there was an interesting phenomenon discovered quite a while ago, more than a decade ago: once in a while, in very rare circumstances, some of the nucleotides in the RNA can actually be edited by the cell. That is, they can be changed. So you have a piece of RNA that doesn't quite match your DNA; there may be one or two or three positions in your RNA sequence that are different, and therefore your protein might be different. This is a really interesting phenomenon. We actually understand some of its molecular basis, but it only affects one or two of the nucleotides, which can each be changed into only one or two others; it can't happen to just any nucleotide, and not any nucleotide can be changed into any other. So it's very interesting, and it's used to regulate genes.
So why do I mention this problem?
If you wanted to study this phenomenon, you could take a big data set and try to find new examples of RNA editing. This would be an interesting scientific discovery, and there are people who study this as part of their research programs. So how do we detect something like RNA editing? We sequence DNA, lots of it, because these days, when we sequence DNA, we always get lots of it. And we can also sequence RNA. If I sequence DNA and RNA from the same person, I can then align those sequences to one another. With alignment, now, we're talking about a computational problem. Alignment is a well-defined problem; we've been studying it for years, and there are many programs you can use to do alignment. It's important that you understand how those programs operate. If you just run those programs and align DNA and RNA from the same person to each other, you would expect, not knowing anything else, that everything would match perfectly. And anywhere there's a mismatch, an RNA-DNA difference, you could say: aha, I found RNA editing. Now, as I said a minute ago, RNA editing does occur. It's rare, but it affects a few nucleotides; A-to-I is the main kind of RNA editing, where an A nucleotide gets changed to something called inosine, or I, which gets read as a G by the sequencing machine. So you'll see these differences, and you can conclude that those genes underwent RNA editing.
So why am I giving this example? Well, it turns out that a couple of years ago, some scientists who were looking for new types of RNA editing did exactly what I'm describing, using some of the current best software for doing alignment. And they found thousands of new examples of RNA editing in many different genes, none of which were known to undergo RNA editing before. The way they found them was exactly how I described. They used software they hadn't developed themselves, they didn't quite understand all the details of how it worked, and the software found for them lots of RNA-DNA differences. They were very excited. This was published, and it generated a lot of discussion and a lot of controversy, because they had discovered a new mechanism for RNA editing, or so it seemed. Well, it turns out that when you're dealing with big data sets and complex software, surprising things can happen, and you shouldn't be that surprised. The question here is: if I align a large data set of RNA to a large data set of DNA, even from the same person, and I find lots of mismatches or misalignments, the question you want to ask, before writing your paper and getting excited about a new discovery, is: are those differences real? Most of them are. The software that does alignment works most of the time; when I say most of the time, it might work 99.999% of the time.
But even one mistake in a million, one misalignment in a million, leads to hundreds of errors in a large data set. And this is something you really have to work with these data sets to gain an appreciation for.
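The arithmetic is worth making explicit. On a hypothetical data set of 200 million reads:

reads = 200_000_000         # a large (hypothetical) sequencing data set
error_rate = 1 / 1_000_000  # one misalignment per million reads
print(reads * error_rate)   # 200.0 spurious "differences" from alignment alone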
So you might get very excited at seeing what, in this example, you believe are hundreds of novel RNA edits that no one has ever observed before. And you might think: aha, I've discovered something. But the first question you need to ask yourself is: what did that software do? Was the software engineered to handle all possible cases? Because I just fed it 200 million reads to align against, say, 200 million other reads, and asked it to show me all the differences. So even though the software may seem very reliable, it doesn't crash, and it gives me lots of answers that seem to be correct, if it's not handling all possible cases, and occasionally makes an error on a big data set, you can get misleading conclusions.
So the message of this lecture is: software engineering is critical for genomic software.
Those of us who develop it, including me, do our best to handle all possible cases, but you have to realize that the software you're using, even if it's been carefully engineered, can sometimes do the wrong thing when you run it on a big data set. And just because a program runs to completion and produces answers doesn't mean it's free of bugs. Of course, if a program crashes, you know something went wrong. But the more dangerous cases for our own analyses are when your software runs, gives you answers, and you then run with those answers and draw more conclusions, without checking to be sure that you understood exactly what the software was doing and that all the outputs were reliable.

WHAT IS COMPUTATIONAL BIOLOGY SOFTWARE? 03:30 Hours

What is computational biology software? This is a big topic; we're only going to give you a brief introduction, to try to give you a flavor of what kinds of software we mean when we talk about computational biology software. But basically, computational biology software is what we use to transform raw data into information: information that you can use to make biological discoveries and guide experiments. The data that comes in from large-scale genome experiments is generally just DNA sequence data, long strings of As, Cs, Gs and Ts, which, if you just look at them, are mind-numbing and not very informative. So you need software to go from this very raw form, which comes in several formats and very large files, into something like, say, the picture on this slide of how a certain section of DNA might be duplicated in different people, causing different phenotypes. We need different types of software to do that. Software takes us from this raw, uninterpretable, mind-numbing data into something that a student, a postdoc, or an investigator can look at and make sense of.
One thing we often talk about when we're discussing analyzing data with computational biology software is analysis pipelines. You'll hear this word a lot. A pipeline is not literally a pipeline that carries water or oil; a pipeline is a way of taking raw data and feeding it through a whole series of programs that each transform that data, resulting in conclusions, or condensed data, at the end, from which you can make your biological discoveries. We might start with a raw data file and do something like clean it up to remove noise; there are multiple programs that do that kind of task. Then you might summarize the data another way: you might assemble it, you might compare the sequences to each other, you might compare them to a reference genome, and from there go on and condense them further. There are literally hundreds if not thousands of computational biology programs out there, as well as pipelines. There are programs that let you visualize data in different ways; we have a course here on Galaxy, which puts together many of these programs and gives you one way of visualizing things. There are pipelines that do things like discovery, which I'm not going to talk about today, except to say what I mean by discovery: if we're looking at someone's DNA, one of the things we sometimes want to figure out is what's different about their genome from the reference genome. The way we do that is to run the data through a computational biology pipeline, which does all sorts of preprocessing, taking raw reads, 100 or 200 base pair short sequences, and converting them into a readout of what's different about this person from the reference genome. That's what I mean by software pipelines.
Just to give you the flavor of what a computational biology software pipeline is, we're going to talk about one example that does something called RNA-seq. RNA-seq is probably one of the most popular experimental protocols in genomics today. RNA-seq is the name of a protocol that takes RNA from a cell or collection of cells and essentially sequences it to figure out which genes are turned on in those cells. There are literally thousands of kinds of experiments you can imagine doing with RNA-seq. We can use RNA-seq to measure the difference between, say, two cell types, to see which genes are turned on in one cell type and not another. We can use RNA-seq to look at cancer cells to see what's gone wrong with those. Essentially, with RNA-seq, we take our cells, extract RNA, and turn that RNA into raw sequences, which, as with all sequencing, are very short reads. In this case, because the source was RNA, they represent the genes that were turned on in the sample we were looking at. So how do you go from those raw reads, those short 100 or 200 base pair sequences, to the readout you're interested in from your RNA-seq experiment, which is a list of genes and their expression levels?
There's a pipeline that my group and others have been involved in developing that is sometimes called the Tuxedo tools, comprising three main programs called Bowtie, TopHat, and Cufflinks. (The names are why it's called the Tuxedo tools, and there are actually more programs than just these.) To give you the overall flavor of what this pipeline does: we start with RNA sequencing reads, these short reads. We align them to the human genome with Bowtie, which gives you a bunch of alignments. Because these reads come from RNA, and RNA is spliced, with the introns removed, some reads will span two or more exons. A single short read might actually align to two places on the genome that are separated by an intron, and in humans, introns can be thousands of base pairs long, even tens of thousands or hundreds of thousands of base pairs long. That's a different alignment problem. The reads that span multiple exons have to be aligned differently, so there's another program to do that, called TopHat. And then you have all these alignments, but you still don't have something you could give to a biologist and make any sense of. You need to take those alignments and assemble them together, by comparing them to each other, into the genes that were in the original sample. There's a program called Cufflinks that does that assembly. And then you have to go further than that, because now you can perhaps see which genes were present, but what you're really interested in is their expression levels, so Cufflinks computes those as well. And beyond that, when you're doing experiments, you want more than just the levels of expression in one sample. Usually, or almost always, you're comparing two or more samples to each other. So you have to take the genes and their expression levels from one set of data, compare them to the genes and their expression levels from another set of data, and see which genes went up and which went down. From there, you can start to make biological conclusions. The Cufflinks package includes a program called Cuffdiff to do that. So that's a pipeline: it goes from raw reads to, basically, tables showing you genes that went up and down, and that's exactly what you want if you're the experimenter. You may not really care about what's going on in the software underneath, but it's important to understand these software tools.
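To make the word "pipeline" concrete, here's a sketch in Python of chaining the stages together. The file names are hypothetical, the argument lists are minimal placeholders, and you should check each tool's manual for the real options before running anything like this:

import subprocess

def run(tool, args):
    # Run one pipeline stage; check=True stops the pipeline if a stage fails
    print("running:", tool, args)
    subprocess.run([tool] + args, check=True)

run("tophat", ["genome_index", "reads.fastq"])                 # spliced alignment
run("cufflinks", ["tophat_out/accepted_hits.bam"])             # assemble + quantitate
run("cuffdiff", ["merged.gtf", "sample1.bam", "sample2.bam"])  # compare samples

The essential point is that each stage's output file becomes the next stage's input, and a failure anywhere should stop the whole chain rather than silently feeding bad data downstream.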
Here's a picture of what Tuxedo can produce. Actually, it doesn't produce output in quite this graphical form, we're not there yet, but with visualization tools you can get a view very much like this. The idea is you go from raw reads to a set of genes. One reason this is important in our studies of the human genome is that we now know that almost all human genes, well over 90% of them, have more than one splice variant, or isoform. What we call a gene is an interval on the genome that has some function and is usually translated into a protein. But the exons that comprise these genes can be chopped up and combined in different ways; we call those splice variants, or splice isoforms. And each of those different isoforms can be expressed at different levels. So not only do we have to figure out from these raw reads which genes were expressed, but we also need to figure out which isoforms of those genes were expressed and at what levels. We'd like our software to do all of that, so we don't have to worry about those details, and we'd like it to do it all correctly. This is a very complicated process, which is why you need the software pipeline.
Pipelines change, and it's important that we keep up with those changes. The programs I just described have already been superseded by even newer programs, some of them just being published, that make up the next generation of the Tuxedo pipeline. Bowtie is now Bowtie2. TopHat is still around, but its core engine is being replaced by a faster engine called HISAT. The Cufflinks assembler has now been superseded by a program called StringTie, which does the same thing, assembles transcripts and quantitates them, but does it somewhat faster and somewhat more accurately. And for the differential expression task, you can now use a program called Ballgown. There are papers describing all these programs, and you can go read those papers and decide for yourself which are the best. For some tasks, it may not matter that much; but very often it does matter. These programs produce different answers. The biologists you may work with, if you're the computational analyst for a project, aren't going to understand how these programs work or which ones are best. So it's up to you to keep up with the state of the art in computational biology software if you want to do analysis.
And you might think: can it really matter that much? These are well-defined, discrete problems. It's computing, and computing seems like a well-defined, discrete task where the input and output are always going to be pretty accurate, right? Well, I would argue that that's not true at all. These are very complicated datasets, and we've found that the programs that operate on them produce very different answers, even for something as well defined as alignment. Alignment is one of the most basic, in some ways the most basic, of the computational problems we deal with. By alignment, I mean: take a short read and align it to the human genome. You would think that, okay, if the read is long enough to align to one place in the genome, then all programs will give me the same answer, so it doesn't really matter so much for my accuracy which program I use, as long as the program is fast enough. You might think: I'll just choose the fastest program, and they'll all give me the same answers.
Well, no, that's not true. We've done tests comparing two of the leading aligners, Bowtie2 and BWA, on many datasets, and we found that they don't actually align the same reads. Here's just one example from some exome data, showing there are about 660,000 reads in this dataset that Bowtie2 aligned that BWA didn't align, and another 300,000 reads that BWA aligned that Bowtie2 didn't. And there are a few reads that are unaligned by both. You can say, well, okay, 98% of the reads were aligned by both, so maybe it doesn't matter. But maybe the reads that got aligned by only one of the programs are the ones you care about, so you at least need to be aware of that. And this slide doesn't even show you that when the two programs both align a read, they don't always align it to the same place. So your choice of software matters a lot, and it's critical, if you want to be a computational biologist, that you keep up with the latest software and are aware of the differences between the different software packages that you might apply to a dataset.
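The bookkeeping behind a comparison like this is simple set arithmetic on the read IDs each aligner placed; here's a tiny Python sketch with made-up IDs:

bowtie2_aligned = {"read1", "read2", "read3", "read5"}
bwa_aligned     = {"read1", "read2", "read4", "read5"}

only_bowtie2 = bowtie2_aligned - bwa_aligned   # aligned by Bowtie2 only
only_bwa     = bwa_aligned - bowtie2_aligned   # aligned by BWA only
both         = bowtie2_aligned & bwa_aligned   # aligned by both

print(len(only_bowtie2), len(only_bwa), len(both))  # 1 1 3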
The overall message I want to leave you with is that software is changing because technology is changing. Sequencing technology has been changing more rapidly over the past decade than almost any other technology we know of, and the software has been changing rapidly to keep up with it. Today there is new sequencing technology for generating longer and longer reads, and in many cases there isn't even software yet to process those kinds of sequences. The sequencing technology we use for the highest-throughput sequencing, which currently is Illumina sequencing, is also changing, getting faster and higher throughput, and the nature of the data itself is changing. This means that if we use software from three or four years ago on the latest sequencing data, we might get the wrong answer. And what's important to realize is that some of these programs are pretty well engineered, so you'll get an answer. You'll get alignments, you'll get genes, you'll get expression levels, you'll get differences between genes in experiments. Everything will look like it's okay, but if the technology has changed, that is, the sequencing technology has changed, the software may no longer be doing the right thing, and you might be getting misleading results. So you have to keep up with the technology, and you have to keep up with the software.

WHY CARE ABOUT STATISTICS? 01:00 Hours


This lecture is about why you should care about statistics. As Steven already told you, genomic data science consists of three different components: biology, computer science, and statistics. When people talk about genomic data science, they often think about the biology and the computer science, and statistics often ends up being the third wheel. So this lecture will hopefully motivate you as to why statistics is a very important component of genomic data science.

A really exciting result came out in the journal Nature Medicine. The results suggested that it's possible to take genomic measurements and predict which chemotherapies are going to work for which people. This was an incredibly exciting result in genomic data science, because it was sort of the holy grail: using genomic measurements to personalize therapy, in particular personalized therapy for cancer. Everybody was very excited about this, and people at institutions all over the world tried to go back and reproduce the result. One of those groups was at MD Anderson Cancer Center; that group included two statisticians, Keith Baggerly and Kevin Coombes. Those statisticians tried to chase down all of the details and re-perform the analysis. They did this because their collaborators were really excited about it and actually wanted to use it at MD Anderson to tailor therapy. But it turned out that there were all sorts of problems with the analysis, and they had trouble getting hold of the data. Because of these problems, they were actually unable to reproduce most of the analysis. This ended up being a huge scandal in the world of genomic data science, because this very high-profile result, the result that everybody was chasing after, turned out not to work once all the details were checked.
This is actually an ongoing saga. It started off as a discussion between the statisticians at MD Anderson and the group at Duke that performed the original analysis. Over time, they had a long series of interactions in which they tried to settle the details of how the analysis was performed. It turned out that, due to a lack of transparency by the people who did the original analysis, clinical trials actually got started using this technology. Chemotherapy was being assigned to people using an incorrect data analysis, because the statistics weren't actually well worked out. This is so serious that there are now ongoing lawsuits between some of the people who were involved in those clinical trials, who had been assigned therapy, and Duke, the institution that was behind the creation of these signatures. So missing out on statistics as part of the genomic data science pipeline caused an issue so big that lawsuits were generated.

This actually spurred an Institute of Medicine report, which laid out a whole new set of standards by which people should develop genomic data technologies. Much of this report focused on statistical issues: reproducibility, how to build statistical models, how to lock those statistical models down, and so forth. So the first thing I hope to motivate is that we should care about statistics, and I've got a couple of silly examples here. This one is from the published abstract of a paper. In the abstract, you can see where I've highlighted that it says, "insert statistical method here." The authors of this paper cared so little about the statistical analysis that they left in a generic placeholder for the statistical method they were using. This suggests the relative ranking of where statistics falls in people's minds when they're thinking about genomic data science, and that sort of attitude can cause major problems, as we saw with the Potti scandal.

This is not just a problem in genomics; it's actually a more general one. This next example is from a flyer from Berkeley listing all the different areas that are applying data science these days. If you notice, statistics is listed, but with no application area. This again suggests that people don't necessarily think of statistics as something that's important for data science. And that lack of statistical thinking is a major contributor to problems in genomic data analysis, both at the level of major projects and at the level of individual investigators. So the question is: how do we change this perspective, and how do we make sure that people know that caring about statistics is just as important as caring about the biology or the computer science when doing genomic data science?

WHAT WENT WRONG? 01:00 Hours


In the previous lecture, I motivated why we might care about statistics with an analysis that went wrong. In that analysis, as you'll recall, they used genomic measurements to try to decide which chemotherapies would work best for which people. That failure boiled down to two specific reasons. The first is lack of transparency. Right from the start, the data and code used to perform the analysis were not made available. In other words, the paper was published, and people looking at the paper and trying to redo the analysis themselves couldn't get hold of the raw data or the processed data, and they couldn't get hold of the actual computer code or statistical code that was used to perform the analysis. This is related to the idea of reproducibility: can you actually re-perform the analysis that someone else reported in their original paper? For the analysis done in the Duke example I talked about, there was no reproducible code; it was not available, and you couldn't get hold of it. Similarly, there was a lack of cooperation. This is not true in general, but in this particular case, not only were the code and data unavailable, but the people in charge of them, the principal investigator and the lead authors on the study, were very reluctant to hand the data over to statisticians and other people to take a look at.

Every data analysis invariably has some problems; there is always some little issue that people didn't notice. But if the data and code aren't available, and on top of that the people who performed the analysis aren't cooperative about sharing them, it can take a very long time to discover whether any of those problems are serious. 

That is exactly what happened in the case of the genomic signatures. 
The second problem people noticed was a lack of expertise, in this case specifically statistical expertise. 

The analysts used prediction rules that defined probabilities in ways that were not only wrong but recognizably silly; for example, one probability formula contains a minus one-fourth that appears out of nowhere. 

That reflects a lack of statistical expertise: the people developing the model had not taken a statistics class or performed enough analyses to recognize when they were doing something that was not just incorrect but silly on its face.
They also had major study design problems, which we'll talk about in a future lecture. They had what we call batch effects: samples were run on different days, and the day a sample was run was related to which outcome it was likely to have. 

This is called a confounder, which we'll also discuss. These study design problems, present before any analysis was performed, set them up to fail: the experimental design was not in place to allow the analysis they were hoping to do. 

Finally, the predictions weren't locked down. The prediction rules stated which chemotherapy to apply to which person on the basis of the genomic measurements. 

But because the prediction rules had a random component, if you predicted on one day, the probability of being assigned to one treatment might be one number; on another day, running the exact same algorithm on the exact same code would give a totally different prediction about which chemotherapy the patient should get. 

This wasn't due to changes in the data or in the statistical algorithm; it was due only to the day the algorithm was run. 

Obviously, if you're running a clinical trial, you don't want people assigned to therapies by random chance. 
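To make the "locked down" idea concrete, here is a minimal sketch in R (the course's scripting language). The expression scores, gene names, and decision rule are all invented for illustration; they are not the Duke study's actual rule.

    ## Hypothetical sketch: a prediction rule with a random component gives
    ## different answers on different runs unless the randomness is locked down.
    score <- c(geneA = 0.51, geneB = 0.49)   # invented expression scores

    predict_therapy <- function(s) {
      # invented rule: near the decision boundary, break the tie at random
      if (abs(s["geneA"] - s["geneB"]) < 0.05) {
        sample(c("therapyA", "therapyB"), 1)
      } else if (s["geneA"] > s["geneB"]) "therapyA" else "therapyB"
    }

    predict_therapy(score)   # may say "therapyA" on one run...
    predict_therapy(score)   # ...and "therapyB" on another

    set.seed(2016)           # fixing the seed locks the prediction down
    predict_therapy(score)   # now the same answer every time the script runs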

These are all issues of lacking statistical expertise. It turns out the analysis was ultimately made fully reproducible: the statisticians at MD Anderson chased down all of the details and released all of the code and data originally used in the paper. 

The problem is that the data analysis was simply wrong, and the reason it was wrong was a severe lack of statistical expertise among the people performing the study. 

So I hope I've shown that having statistical expertise, at the experimental design level and at the level of creating correct statistical procedures, can help you avoid, before they even start, the problems that led to the scandal at Duke.

THE CENTRAL DOGMA OF STATISTICS 01:00 Hours


Just like biology has a central dogma, statistics, and 
in particular statistical inference, has a central dogma as well. 
The central dogma is the central idea that explains what you're trying to do in the field. 
And so, the central dogma of statistics has to do with this specific problem. 
Suppose you have a huge population, like you see in the top left-hand corner, and 
you might want to know something about that population. 
In this case, it's an idealized example, so we might want to know how many pink and 
how many gray samples are there. 
So in general, the problem might be that measuring the whole population, or 
taking measurements on the whole population, might be really expensive, or 
it might be very hard to do for a number of different reasons. 
And so, what we want to do is take advantage of, basically, probability to be 
able to say something about the population without measuring the whole population. 
So what we do is use probability to take a small sample from that population. 
You may have heard of a randomized sample; there are a number of ways to use 
probability to get this sample, but the idea is that you'd like it to somehow 
represent the larger population. 
Once you've taken that sample, we can make measurements on the smaller number of 
objects we've collected. We have these symbols in the lower right-hand corner, 
only three of them, so it might be relatively cheap or easy to take measurements 
on them. 
So we see that there are two pink symbols and one gray symbol, and 
so then what we use is statistical inference 
to make a guess about what the population looks like. 
So we might say that, on average, there are going to be more pink symbols 
than gray symbols in the whole population, because that's what happened in our sample. 
And if we did the probability sampling right, that best guess might be pretty good. 
Another important component of the central dogma of statistics is that 
this best guess isn't quite enough. 
We took a sample; we didn't measure everything in the population, only a subset. 
It turns out that our best guess is therefore potentially quite variable. 
And so, it could be that the best guess is off in one direction, 
we might actually have more gray symbols in the population. 
Or, it could be in the other direction, 
that it might be more pink symbols in the population.
So the question is: how do we quantify that variability? Given the sample we took, 
how do we see what's actually going on in the population? That is the central idea 
behind statistical inference. Knowing the population is one of the most fundamental 
ideas in statistics, and it's central to the central dogma. 
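As a minimal sketch of this idea in R, with an invented population (the true pink fraction here is an assumption for illustration):

    ## Central dogma sketch: sample from a population, estimate a quantity,
    ## and quantify the variability of that estimate.
    set.seed(42)
    population <- sample(c("pink", "gray"), size = 1e6,
                         replace = TRUE, prob = c(0.6, 0.4))  # invented truth

    s <- sample(population, size = 100)      # a probability sample
    p_hat <- mean(s == "pink")               # best guess at the pink fraction
    se <- sqrt(p_hat * (1 - p_hat) / 100)    # how variable that guess is

    p_hat + c(-1.96, 1.96) * se              # rough 95% confidence interval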
So, in this same example, suppose we have a population that consists of pink and 
gray symbols and we take a sample from that population. 
Then it turns out that between the time we took the sample and the time we actually 
do the inference, the population changes: all of a sudden, some purple symbols have 
been introduced.
Now, if we want to do that same inference, we're in trouble, because the sample no 
longer represents the population. Knowing what the population is turns out to be a 
very common and under-appreciated problem in statistical inference. 
So here's an example of that. 
You may have heard about Google Flu Trends. 
Google Flu Trends tries to use search terms to predict 
flu activity in the United States and in other places. 
It got a lot of press because it's a clever and very efficient way of predicting 
the flu: you just need the search terms to create the prediction. But it turns out 
that Google Flu Trends, despite performing well when it was first released, ended 
up being quite bad at predicting when flu outbreaks would occur. The reason was 
that the population changed: the way people searched for symptoms of the flu 
changed over time. That is one of the major reasons the prediction algorithm they 
originally developed no longer worked. 
So the central dogma of statistics is this: we have a population, we take a smaller 
sample from that population using probability, and then we use statistical inference 
to say something about the population, and in particular about the variability of 
our estimate for it.

DATA SHARING PLANS 01:00 Hours


One of the main issues in the original motivating example I gave you was that the 
data weren't available; they couldn't be analyzed by other people. So it's important 
to have a data sharing plan; it's a key component of the statistical analysis of any 
genomic data set. 
A shared data set consists of four components. First is the raw data. In the case of 
sequencing, this is often the raw sequencing reads: something like a FASTQ file, or 
an aligned file such as a BAM file of aligned reads. 
Then there's a tidy data set. The tidy data set, which we'll talk about in just a 
minute, is one where you've done some processing and cleaning so the data are easy 
to analyze and to work with interactively. Then you need a code book, which 
describes each variable and its values in the tidy data set. And finally, you need 
an explicit, exact recipe for going from the raw data to the tidy data set and the 
code book. Without all four parts, a shared data set is incomplete.
The first thing to keep in mind is the raw data. Here I'm showing a FASTQ file as an 
example of raw data in genomics. You know it's the raw data if you did no processing, 
no computing, no summarizing, and no deleting to the data set; you can't have done 
any kind of analysis on the raw data for it to still be raw. It's also important to 
know that this is a relative term. For example, here you're seeing the sequencing 
reads you might get from the machine, but as Stephen explained, there are also images 
underlying those sequence reads, and you most often don't get access to them. The 
images are the raw data to someone else. So the raw data is, for you, the rawest 
form of the data you have available.
A tidy data set is described by four properties: one variable per column, one 
observation per row, one table per kind of data, and a linking indicator if you 
have multiple tables. For example, here I'm showing a data set where each 
observation we've collected is in a row, and the variables are the problem ID, 
the subject ID, and so forth. In general, in a genomics study you might have the 
genomic part of the data in one tidy data set and, say, the metadata or phenotype 
data in another file. It's important that both are tidy and that there is a 
linking indicator between them. 
So the tidy data set goes along with the raw data set when 
distributing your results. 
You also need a code book. The code book should contain things like variable names, 
descriptions, and units, information you couldn't put into the raw or tidy data 
themselves. For example, if you measured height, the code book should say whether 
it's in feet or meters. We know there have been major disasters, such as the loss 
of a Mars probe, because people didn't know what units they were working in. All 
of this needs to be recorded in the code book.
You also need a recipe. The recipe takes the raw data, executes some commands, and 
produces the tidy data set. The best way to create a recipe is a script: a set of 
commands that anyone can run, without any intervention from the original analyst, 
to produce the tidy data set. Scripts are typically written in R or Python. The 
input is the raw data, the output is the tidy data, and no user-set parameters are 
allowed; the script has to run without the user interfering with it at all.
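As a minimal sketch of what such a recipe script might look like in R (the file names and column names here are hypothetical):

    ## raw_to_tidy.R -- hypothetical recipe: raw data in, tidy data out,
    ## with no user-set parameters, so anyone can re-run it unchanged.
    raw <- read.csv("raw_counts.csv", stringsAsFactors = FALSE)  # raw data in

    tidy <- data.frame(
      subject_id = raw$subject,           # one variable per column
      gene_id    = raw$gene,
      count      = as.numeric(raw$count)  # one observation per row
    )
    tidy <- tidy[!is.na(tidy$count), ]    # every cleaning step recorded in code

    write.csv(tidy, "tidy_counts.csv", row.names = FALSE)        # tidy data out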
If you're not comfortable creating a script, you could instead write an explicit 
list of instructions describing how you went from the raw data to the processed 
data. There's a lot of danger here, because you have to be extremely explicit: you 
must list every parameter and every version of every piece of software you ran. If 
you did something in Excel, you may even need to record a video of what you did so 
that people have a complete record, and you have to distribute all of that along 
with the processed and raw data. I've coded this option in orange here: it's an 
acceptable thing to do, but you have to pay very careful attention.
You have to avoid vague instructions, missing versions, and skipped steps, which 
are common when the recipe lives outside a script. So it's highly recommended that 
you use scripts for creating processed data from raw data. If you need a data 
sharing plan, I've created one; it's available on GitHub, and it explains much of 
what I've said in this lecture in even more detail, if you'd like to get into how 
to actually share data with people.

GETTING HELP WITH STATISTICS 01:00 Hours


So I hope I've convinced you of the importance of statistics in genomic data science. 
So, then the next question might be how do you get help in statistics? 
I'm a firm advocate that the best way to get help in statistics is to actually go 
out and learn a little bit of statistics yourself, and there are a large number of 
online resources that are available to you to do this. 
There's a statistics class as part of this specialization, and we also offer a JHU 
Data Science Specialization with a large statistical component. There are also many 
other online resources and courses available if you want to learn to do the 
statistics yourself. 
If you know a little statistics and just need help with the analysis you're working 
on at the moment, one option is Q&A sites. I'd recommend sites like Cross Validated 
and Stack Overflow, where you can post questions about specific packages, about 
which model to fit, or about whether the model you fit makes sense. If you need 
more help than that, go out and get more expertise; in other words, find somebody 
to help you. 
One way to do that, if you're the principal investigator of a lab, is to hire a 
single "lonely bioinformatician": someone hired to do computational biology who 
sits in a biology lab but isn't supported by other people doing computational 
science. This is a very hard job. It's possible to do it, and there are some 
excellent lonely bioinformaticians out there, but it's a hard way to perform 
computational analysis in general, because this person won't have access to all 
the help and resources that someone in a center for computational biology would 
have. 
If you need deep statistical or computational expertise, often the best way to get 
it is to start a long-term collaboration. Such collaborations can be formed by 
identifying a center for computational biology or biostatistics where a large 
number of computational genomic data scientists work; they can help you actually 
perform your analysis. Here at Johns Hopkins we have such a center, the Center for 
Computational Biology, which brings together people from biology, biostatistics, 
and computer science who all work on computational biology and dive deep into the 
problems. Forming these long-term collaborations can really help solve statistical 
and computational problems.

PLOTTING YOUR DATA 01:00 Hours

Most statistical analysis of genomic data should be done interactively, and what we 
usually mean by that is with plots. Another way of saying this is that you should 
take big data and make it as small as possible, as quickly as possible: a quote 
attributed to Robert Gentleman at Genentech. What he meant is that if you have a 
gigantic data set, say millions or billions of reads, it's very hard to visualize 
what's going on with those data or to understand their properties and 
characteristics. So the idea is to summarize them down just enough that you can 
plot them, visualize them, and try to figure out what's going on.
It's really important to do interactive analysis with lots of plotting, because the 
statistical summary measures often reported, say, for associations, can be 
deceptive. As an example, here are four plots, each corresponding to a different 
data set. These data sets are statistically identical if you look at the slope 
coefficient, the p-value for the slope, or the correlation coefficient. From the 
perspective of those summary measures, they are the same. But as you can see, very 
different patterns are going on: the pattern in the upper right-hand corner looks 
roughly quadratic, the panel in the lower left-hand corner has a clear outlier, and 
so does the panel in the lower right-hand corner, in a totally different direction.
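The classic illustration of this phenomenon is Anscombe's quartet, which ships with base R; a few lines reproduce the point (whether or not the lecture's figure uses this exact data set):

    ## Anscombe's quartet: four data sets with near-identical summaries.
    sapply(1:4, function(i) {
      fit <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
      round(coef(fit)[2], 2)     # the slope is ~0.5 in all four data sets
    })

    op <- par(mfrow = c(2, 2))   # but the plots tell four different stories
    for (i in 1:4) {
      plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
           xlab = paste0("x", i), ylab = paste0("y", i))
    }
    par(op)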
So the idea is to take summaries of your raw data and plot them as quickly as 
possible, to identify characteristics or features of the data that might be 
important.
Interactivity allows for more discovery. One thing you want to do in an interactive 
analysis, where you're trying to discover what's going on, is show as much of the 
data as you can in your plots. Here's an example. On the left is what I would call 
a bad plot. Many people, particularly statisticians, dislike plots like this: bar 
plots with confidence bounds that don't show any of the actual raw data. It's very 
hard to know whether an outlier is driving the analysis, or even how many data 
points were used to create the bars. On the right-hand side is a much better plot. 
It shows the same information, the average value and the confidence bounds, but it 
also shows the raw data, so you can actually see differences in the distributions. 
In this case, there are only three points in each group, so you might have less 
faith in any conclusion than the bad plot would suggest, since the bad plot never 
tells you how much data is behind it.
Another common plot in interactive analysis is a plot of replicates. Here you might 
want to compare technical replicates: say you ran the same sample through the 
technology twice and want to see whether the two replicates produce similar 
results. In this plot, the x-axis is replicate one and the y-axis is replicate two, 
and they look very correlated. You might see this and be comforted: okay, this 
technology is doing very well. There are a couple of tricky things about plotting 
replicates, though. The first is to be careful of scale. In this plot, 99% of the 
data is in the tiny lower left-hand corner, below and to the left of the light 
blue line; what you're mostly seeing is the 1% of the data that gets spread out. 
One way to deal with such tightly clustered data, particularly for replicates, is 
to use a transform. One example is the log transform: if you take the log of the 
data and make the same plot, you can still see the correlation, but the data that 
was tightly clustered in the lower left-hand corner is now spread out, and you get 
a much better idea of what's going on in the plot.
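A minimal sketch in R with simulated counts (the Poisson model here is an assumption, chosen only to mimic tightly clustered replicate data):

    ## Simulated technical replicates: the raw scale hides most of the data;
    ## the log scale spreads it out.
    set.seed(1)
    true_expr <- rexp(5000, rate = 1/100)    # invented expression levels
    rep1 <- rpois(5000, lambda = true_expr)  # technical replicate 1
    rep2 <- rpois(5000, lambda = true_expr)  # technical replicate 2

    op <- par(mfrow = c(1, 2))
    plot(rep1, rep2, main = "raw scale")     # most points crushed near zero
    plot(log2(rep1 + 1), log2(rep2 + 1),     # +1 avoids taking log(0)
         main = "log scale")
    par(op)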
Another thing people typically do when comparing two samples or two replicates, 
instead of plotting one replicate against the other, is make what's called a 
Bland-Altman plot. In genomics this is often called an MA plot, and it is one of 
the most common and widely used plots across technologies. The idea is this: 
rather than putting replicate one on the x-axis and replicate two on the y-axis, 
you plot the sum (or average) of the two replicates on the x-axis and their 
difference on the y-axis. What does that mean? In this case each dot represents a 
gene. Moving from left to right, the genes on the left are very lowly expressed 
and the genes on the right are very highly expressed. On the y-axis, the distance 
from zero shows how different the two replicates are from each other. What you 
would like to see is all the points lying exactly on the zero line, which 
obviously never happens in real life. But here you can see, for example, a trend: 
there seem to be more differences between the replicates for the lowly expressed 
genes. This is a very common phenomenon, and it is much easier to see in the 
MA-style plot than in a plot of one replicate versus the other.
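Continuing the simulated-replicates sketch above, an MA plot takes only a few more lines:

    ## MA (Bland-Altman) plot: average log intensity (A) against
    ## the log difference (M) for the simulated replicates.
    A <- (log2(rep1 + 1) + log2(rep2 + 1)) / 2
    M <- log2(rep1 + 1) - log2(rep2 + 1)
    plot(A, M)
    abline(h = 0, col = "blue")   # good replicates hug the zero line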
One thing you should be very careful of, and a very common problem in genomics, is 
what has been coined the "ridiculogram." A ridiculogram has been defined, somewhat 
tongue in cheek, as a network plot that looks beautiful but communicates very 
little information and appears on the cover of Science or Nature. You often see 
these hairball-type plots, and other plots just like them, that don't actually 
communicate a message; they just look beautiful. The point of statistical graphics 
is partly to be enjoyable to look at, but more importantly to communicate 
scientific information to the reader. Being wary of ridiculograms is an important 
part of making sure your plots are interpretable.

SAMPLE SIZE AND VARIABILITY 02:00 Hours

This lecture is about experimental design, specifically sample size and 
variability. If you remember from the previous lecture, the central dogma of 
statistics is that we have a big population, and it's expensive to take whatever 
measurement we want, genomic or otherwise, on the whole population, so we take a 
sample using probability. We then make our measurements on that sample and use 
statistical inference to say something about the population. We also talked about 
how the best guess we get from our sample isn't all we get: we also get an 
estimate of variability. So let's talk about variability and its relationship to 
good experimental design. 
There's a tongue-in-cheek sample size formula you may have heard: if N is the 
number of measurements you can take, or the number of people you can sample, then 
because scientific research runs on grant money, N ends up being the number of 
dollars you have divided by the cost of one measurement. While this is one way to 
arrive at a sample size, it's not the best way. The real idea behind sample size 
is to understand variability in the population. 
Here's a quick example of what I mean. Here are two synthetic, made-up data sets, 
Y and X. The plot shows the measurement values for both, with two lines: the red 
line is the mean of the Y values and the blue line is the mean of the X values. 
You can see that the means differ, but there is also quite a bit of variability 
around them: some measurements are lower, some are higher, and the two data sets 
overlap. So the question is: given the variation around the means we've estimated, 
how confident can we be that the two means really differ? This determines how many 
samples you need to collect, and how much variability you can observe and still be 
able to say whether two things are different.
The way people handle this in advance, in experimental design, is with power. 
Power is the probability that, if there is a real effect in the data, you will 
detect it. It depends on a few things: the sample size, how different the means of 
the two groups are (like the red and blue lines we saw), and how variable the 
measurements are around those means. This comes from code in the R statistical 
programming language; you don't have to worry about the code in this lecture. For 
example, if we want to do a t-test comparing the two groups (a particular kind of 
statistical test), the probability that we'll detect an effect of size 5 (the 
delta), with a standard deviation of 10 in each group and 10 samples per group, 
is 18%. 
So even if there is an effect, we're not very likely to detect it. You can also 
run the calculation the other way: say, as is customary, that we want 80% power, 
in other words, an 80% chance of detecting an effect if it's really there. For an 
effect size of 5 and a standard deviation of 10, we can back-calculate how many 
samples we need to collect: in this case, 64 samples in each group. Similarly, you 
can ask how many samples you need per group if you will only test in one 
direction. Suppose I know expression levels will always be higher in the cancer 
samples than in the control samples; then it's possible to collect fewer samples 
and still get the same power, because you have a little more information. 
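The calculations described above match what base R's power.t.test() reports; here is a sketch using the delta and standard deviation values quoted in the lecture:

    ## Power calculations for a two-sample t-test.
    power.t.test(n = 10, delta = 5, sd = 10)         # power is about 0.18
    power.t.test(delta = 5, sd = 10, power = 0.80)   # needs ~64 per group
    power.t.test(delta = 5, sd = 10, power = 0.80,
                 alternative = "one.sided")          # one-sided test: fewer needed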
Later statistics classes will talk more about power and how to calculate it. But 
the basic idea to keep in mind is that power is a curve, never just one number, 
even though you may hear 80% thrown around quite a bit. In this plot, the x-axis 
shows all the potential sizes of an effect: 0 at the center, very high or very low 
at the edges. The y-axis shows power, for different sample sizes: the black line 
corresponds to a sample size of 5, the blue line to 10, and the red line to 20. As 
you move out from the center of the plot, power goes up: the bigger the effect, 
the easier it is to detect. As the sample size goes up, from the black to the blue 
to the red curve, you also get more power. So as you vary these parameters, you 
get different power, and a power calculation is a hypothetical calculation based 
on what you think the effect size might be and what sample size you can get. It's 
important to pay attention to power before performing a study, so you don't run 
the study and end up unable to detect a difference even when one was really 
there.
There are three types of variability. We've been talking about the sampling 
variability you get when you take a sample and relate it back to the population, 
but there are three kinds of variability commonly considered when performing 
experiments in genomics. The variability of a genomic measurement breaks down 
into: first, phenotypic variability. Imagine you're comparing cancers and 
controls: there is variability in the genomic measurements between the cancer 
patients and the control patients. This is often the variability we care about; 
we want to detect differences between groups. 
Second, there is measurement error. All genomic technologies, whether measuring 
gene expression, methylation, or the alleles in a DNA study, measure with error, 
so we have to take into account how well the machine actually makes the 
measurement, how we quantify the reads, and so forth. Third, there is a component 
of variation that often gets ignored or missed: natural biological variation.
For every kind of genomic measurement we take, there is natural variation between 
people. Even if two people are healthy and matched in every possible way, the same 
sex, the same age, eating the same breakfast, there will still be variation 
between them, and that natural biological variability has to be accounted for in 
statistical modeling as well. 
An important consideration is that when new technologies appear, there is often a 
rush to claim that the new technology is much better than the previous one. One 
way people do that is by claiming the variability is much lower. That may be true 
for the technical, or measurement error, component of variability, but it does 
not eliminate biological variability. Here is an example. There are four plots in 
this picture: the top two show data collected with next generation sequencing, 
and the bottom two show data collected with microarrays, an older technology. 
Each dot corresponds to a sample, and the same samples appear in all four plots. 
For the gene on the left, colored pink, there is low variability across people, 
and this is true whether it is measured with sequencing (top) or with arrays 
(bottom). Similarly, the gene on the right, colored blue, is highly variable 
whether measured with sequencing or with arrays. This suggests that biological 
variation is a natural phenomenon, always a component of modeling genomic data, 
and not eliminated by technology.
So what we've talked about here is variability, sample size calculations, and how 
they relate. One of the most important components of statistics is paying 
attention to the variation in both your sample and the population you measure.

STATISTICAL SIGNIFICANCE 02:00 Hours

One of the things you've almost certainly heard about in statistics is the idea of 
statistical significance, or p-values. This lecture covers some of the high-level 
thinking behind statistical significance. The basic idea is that we want to know 
whether observed differences in our sample are replicable, or, as people often 
say, "real." "Real" is a somewhat fuzzy concept, but what it's supposed to imply 
is that there is a difference between the two groups, usually in the mean value 
of the measurements you're taking. 
Here is an example with three genes. For each gene there are measurements from two 
groups, red and blue, with three dots from each group, and the y-axis shows log 
expression values. So for each gene you see a plot of the six data points 
corresponding to that gene. For Gene 1, there's not much difference in the means, 
and most of what difference there is comes from one outlier in the blue group; 
the variability is high enough that it's hard to conclude any difference. For 
Gene 2, there appears to be a pretty clear difference: the three red dots and the 
three blue dots are each tightly clustered, and they clearly have different mean 
levels. Gene 3 is another case where it looks like there might be a difference: 
the red dots seem a little higher than the blue dots, and both groups are very 
tightly clustered, but they're not very far apart, and the variability isn't much 
smaller than the difference in means. So how do we distinguish these cases? How 
do we know when an observed difference is large enough that we would call it a 
real difference? 
The most common statistic people use, and the one you've almost certainly heard 
about, is the t-statistic. The t-statistic has a general form that is also used in 
a number of other statistics. Imagine you have measurements labeled Y for one 
group and X for the other. Then the t-statistic equals the average of the Y values 
minus the average of the X values, divided by a measure of variability. We 
estimate how variable the Y values are with S squared of Y, and how variable the 
X values are with S squared of X. These are estimates we can cover in more detail 
in a statistics class, but for now you can think of the denominator as scaling 
the difference between Y and X into units of variability. 
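In symbols, writing $n_Y$ and $n_X$ for the two group sizes (this is the unequal-variance, or Welch, form; a pooled-variance version is also common):

    $$ t = \frac{\bar{Y} - \bar{X}}{\sqrt{\dfrac{S_Y^2}{n_Y} + \dfrac{S_X^2}{n_X}}} $$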
If the means are far apart in variability units, we might believe the difference 
is real; if not, maybe not. A big t-statistic suggests a difference is more 
likely; a small t-statistic suggests it is less likely. 
So how do we actually quantify how statistically significant a result is? The most 
common approach, and probably the most widely used and known statistic ever 
created, is the p-value. Suppose we've calculated a t-statistic comparing two 
groups and it equals two. Is that a big value or a little value? One commonly 
used way to figure that out is a permutation test. You take the group labels, the 
values you're calling X and the values you're calling Y, and scramble them: some 
X values get labeled Y, some Y values stay labeled X, randomly, over and over 
again. Each time you create a random labeling, you recalculate the statistic. 
What's going on here? We've broken the relationship between the labels and the 
data, because we've randomly scrambled them, so we wouldn't expect any 
association. We can then make a histogram, like the one here, of all the 
statistics from these random scrambles, and see where the original statistic 
lands in that distribution. 
To calculate a p-value, you count how many times the scrambled statistics were 
larger than your observed statistic. Usually you do this in absolute value, 
because you don't care about the direction of the difference: you calculate the 
fraction of times the absolute value of a scrambled statistic exceeds the 
absolute value of the observed statistic. That fraction is the p-value. 
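A minimal permutation-test sketch in R, with made-up measurements for the two groups:

    ## Permutation test: scramble labels, recompute the statistic, compare.
    set.seed(10)
    x <- c(5.1, 4.8, 5.3, 5.0)               # hypothetical group X values
    y <- c(6.0, 6.4, 5.9, 6.2)               # hypothetical group Y values
    obs <- t.test(y, x)$statistic            # observed t-statistic

    perm <- replicate(10000, {
      scrambled <- sample(c(x, y))           # break the label/data relationship
      t.test(scrambled[1:4], scrambled[5:8])$statistic
    })

    mean(abs(perm) >= abs(obs))              # permutation p-value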
The p-value is widely used to calculate statistical significance. It is also 
widely misused and misinterpreted, though it has some very useful properties. In 
general, low p-values, closer to zero, are reported as statistically significant; 
the usual cutoff is 0.05, and p-values above that are often considered not 
statistically significant. 
It's important to know what a p-value is and what it isn't; in fact, the best way 
to get a statistician's blood pressure up is to misinterpret a p-value. The 
p-value is the probability of observing a statistic as extreme or more extreme 
than the one you calculated in the real data, if the null hypothesis is true. 
That's a mouthful, and it's a hard concept to think about; it's basically asking 
how often, in the null data where we scrambled the labels, the statistic is 
bigger than the one we actually calculated. 
A few things the p-value is not, and that will almost certainly get you in 
trouble with statisticians: it is not the probability that the null hypothesis is 
true (that there's no difference between the groups); it is not the probability 
that the alternative is true (that there is a difference); and it is not a 
measure of statistical evidence. If you use any of these interpretations, you're 
potentially walking into a world of hurt, so stick with the standard, if slightly 
unwieldy, definition of what a p-value means. 
A common mistake is to misinterpret the p-value. Here's an example from the New 
York Times where they try to describe what the p-value means, and you see 0.05 
there: the common cutoff. If a p-value is less than 0.05, people often call the 
result significant. There is no principled reason why 0.05 is the cutoff, other 
than that one time someone asked one of the original developers and users of 
p-values what a good cutoff would be, and he said, "I guess 0.05 might be all 
right." That has since propagated throughout the entire medical establishment as 
the defining cutoff. In general, people over-interpret or misinterpret p-values, 
and that is what gets us into so much trouble with statistical significance, and 
why you've heard claims like "most published medical research is false."

MULTIPLE TESTING 02:00 Hours

Previously we talked about statistical significance. But in genomic studies 
you're often considering more than one data set at a time: you might be analyzing 
the expression of every gene in the genome, or looking at hundreds of thousands 
or millions of variants in the DNA, or facing many other multiple testing 
scenarios. In these scenarios you're calculating a measure of association between 
some phenotype you care about, say, cancer versus control, and every single data 
set you collected, say, one for each gene. 
What's happened is that people still apply the hypothesis testing framework, 
using p-values and so on, but that framework wasn't built for doing many, many 
hypothesis tests at once. Remember that a p-value is the probability of observing 
a statistic as extreme or more extreme than the one you calculated in the 
original sample. One very important property of p-values is that if nothing is 
happening, if there is absolutely no difference between the two groups you're 
comparing, the p-values are what's called uniformly distributed. 
This is a histogram of uniformly distributed p-values: the x-axis shows the 
p-value and the y-axis shows how many p-values fall into each bin. This is what 
the uniform distribution looks like. A uniform distribution means that 5% of the 
p-values will be less than 0.05, 20% will be less than 0.20, and so forth. In 
other words, when there is no signal, the p-value distribution is flat. 
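You can see this property in a couple of lines of R, simulating tests where the null is true by construction:

    ## 10,000 t-tests comparing two groups drawn from the same distribution.
    set.seed(7)
    pvals <- replicate(10000, t.test(rnorm(10), rnorm(10))$p.value)
    hist(pvals)          # flat: the uniform distribution
    mean(pvals < 0.05)   # ~5% fall below 0.05 even though nothing is going on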
What does that mean, and how does it play out in a multiple testing problem? 
Here's an example from a cartoon. Imagine you're investigating whether jelly 
beans are associated with acne. You could perform a study comparing people who 
eat a lot of jelly beans with people who don't, and look to see whether they have 
acne. If you do that, you probably won't find anything: you collect data on the 
whole sample, calculate the statistic, get a p-value greater than 0.05, and 
conclude there is no statistically significant association between jelly beans 
and acne. But you might think: maybe it's just one kind of jelly bean. So you go 
back and test brown jelly beans, yellow jelly beans, and so forth, and in each 
case, most of the time, the p-value is greater than 0.05, not statistically 
significant, and you don't report it. But since p-values are uniformly 
distributed, about one out of every 20 tests you do, even when there is 
absolutely no association between jelly beans and acne, will still show a p-value 
less than 0.05. The danger is that you run many, many tests, find the one with a 
p-value less than 0.05, and report just that one. In the cartoon, a news article 
declares that green jelly beans have been linked to acne: a result reported using 
a significance measure designed for a single hypothesis test, when in reality 
many tests were performed. 
So how do we deal with this? How do we adapt the hypothesis testing framework to 
the situation where you're doing many hypothesis tests? The answer is with 
different error rates. The two most commonly used error rates you'll hear about 
in genomic data analysis are the family-wise error rate and the false discovery 
rate. The family-wise error rate says that across many hypothesis tests, we 
control the probability of even one false positive. This is a very strict 
criterion: if you find many significant results at a very low family-wise error 
rate, you're saying the probability of even one false positive among them is 
very small.
Another very commonly used error measure is the false discovery rate: the 
expected number of false positives divided by the total number of discoveries. It 
quantifies what fraction of the things you call statistically significant appear 
to be false positives. The false discovery rate is more liberal than the 
family-wise error rate: you're no longer controlling the probability of even one 
false positive, you're allowing some false positives in order to make more 
discoveries, while quantifying the rate at which those errors occur. You have to 
be careful interpreting these error rates, because they mean different things: 
you do different things to the data, and you interpret the results differently. 
Just because you find more statistically significant results with the false 
discovery rate than with the family-wise error rate doesn't mean that, magically, 
more results became truly different; it means the analysis has a different 
interpretation. 
I'm going to give you a very simple example. Suppose you're doing a differential 
gene expression analysis with 10,000 genes, and you discover that 550 of them are 
significant at the 0.05 level.
Now, "significant at the 0.05 level" could mean one of three different things. 
Suppose first that the 550 were discovered just by thresholding the p-values at 
0.05: we called every p-value below 0.05 significant. Remember that p-values are 
uniform when nothing is going on, so we would expect about 0.05 × 10,000 = 500 
false positives among the 550 total discoveries. Even though we found 
statistically significant results, they might be mostly false positives.
Alternatively, suppose we declared those 550 significant while controlling the 
false discovery rate at 0.05. Then we're quantifying the error rate among the 
discoveries we made: about 5% of the 550 discoveries, roughly 27.5, would be 
expected to be false positives. We discovered the same number of things, but 
using this error rate means the error level is controlled much lower than just 
thresholding p-values at 0.05. Finally, suppose we used the family-wise error 
rate. If we found 550 differentially expressed genes out of 10,000 at a 
family-wise error rate of 0.05, the probability that even one of those 550 is a 
false positive is less than 0.05, which means almost all of them are probably 
true positives. 
These three scenarios illustrate three different ways you could declare 
statistical significance. "Statistically significant" means something totally 
different depending on which error rate you are controlling. 
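A sketch of the three thresholds in R, using base R's p.adjust() on invented p-values (the mix of null and signal genes here is an assumption for illustration):

    ## The same p-values, three error rates.
    set.seed(3)
    p <- c(runif(9000),              # 9,000 null genes: uniform p-values
           rbeta(1000, 0.5, 10))     # 1,000 genes with signal (invented)

    sum(p < 0.05)                                   # raw 0.05 threshold
    sum(p.adjust(p, method = "BH") < 0.05)          # false discovery rate
    sum(p.adjust(p, method = "bonferroni") < 0.05)  # family-wise error rate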
One last thing to consider with multiple hypothesis tests is an inevitable 
scenario: everybody who has done real science has calculated a p-value that comes 
out just above 0.05. The natural reaction is to be sad, think "game over," and 
start all over again. It's a really good idea not to do that. First, it's 
important to report negative results, even if you can't get them into the best 
journals, to avoid what's called publication bias. More importantly, you must be 
careful to avoid p-value hacking.
A typical email a statistician might get after reporting a p-value above 0.05 is 
the one my friend Ingo got: "Curse you, Ingo! Yet another disappearing act!" 
because the p-value was greater than 0.05 after a multiple testing correction. 
While this was a joke, said entirely in jest, in general there is pressure to 
make discoveries look more statistically significant. It's very important to 
resist that temptation, because it leads to p-value hacking: doing things to the 
data, or changing how you do the calculations, in order to manufacture a 
statistically significant result that your original analysis didn't produce. One 
paper showed that by taking a very simple simulated data set and applying 
seemingly sensible transformations and statistical methods, almost any result 
could be turned into a statistically significant one. The way to avoid this is to 
specify a data analysis plan in advance of looking at the data, and stick to it.
STUDY DESIGN, BATCH EFFECTS AND CONFOUNDING 02:00 Hours
Other important components of study design and experimental design are 
confounding and batch effects.
So, what is confounding? I'll give you an example using a very simple data set: a 
picture of me and my son. My son is three years old, has small shoes, and is not 
very literate yet. I have bigger shoes, and I guess you could say I am somewhat 
literate. Using this data set, we might conclude that shoe size is associated 
with literacy. But is that really true? Do we really believe that small shoes 
mean low literacy and big shoes mean high literacy? The reason to doubt it is 
that there's one piece of data we haven't included in our analysis: my son is 
young and I'm middle-aged, and age is more closely, causally, related to 
literacy. If you plot the relationships between shoe size, literacy, and age, 
you see that age is related to shoe size (when you're young you have small 
shoes; when you're older you have bigger shoes) and also to literacy (when 
you're young you're not very literate; when you're older you become more 
literate). This variable, related to both shoe size and literacy, is what's 
called a confounder: a variable that is related to two other variables and can 
make it look like there is a relationship between them even when there isn't. 
This is a very common problem in genomics, and the most common confounder, the 
one that trips up the most people, is the batch effect. Here's an example. A 
paper was published that looked for differences in gene expression between 
ethnic groups, and the authors identified 78% of genes as differentially 
expressed between the two groups. You can see that in the p-value histogram on 
the lower left: there are tons of tiny p-values, so lots of genes look 
differentially expressed. This seems like a really important and big result, 
because there is actually very little genetic variation between ethnic groups, 
so it's surprising that almost all genes would be differentially expressed. 
It turns out that if you go back and look at when the data were collected, all of 
the European samples were collected in 2003, 2004 and 2005, whereas all of the 
Asian samples were collected later, in 2006. There is just enough overlap to 
distinguish the genes that differ because of the date from the genes that differ 
because of the population. Looking at between-population differences, 78% of 
genes appear differentially expressed; looking at differences between the years 
when the samples were taken, 96% of genes appear differentially expressed. And 
once you adjust for the fact that the samples were taken in different years, all 
the difference between the populations goes away. 
This is what's called a batch effect. It suggests there is a confounder: the date 
the samples were taken. Why would the date matter? The technology might change, 
the assays might change, the aliquots might change, or maybe the freezer broke 
between the two batches. There are many reasons the date might be associated 
with differential expression, and in almost every study this is a major effect. 
This isn't true just of gene expression studies; it's also true of genetic 
studies. Another example is a genetic study that looked for relationships between 
SNPs (single nucleotide polymorphisms) and human longevity. The authors claimed a 
small set of genes could predict whether you would live to be 100. But it turns 
out they measured all the younger people with one technology and all the older 
people with another, and the study was subsequently retracted. Similarly, in 
proteomics, a predictor was developed based on proteomic patterns to predict 
ovarian cancer, and it also fell apart, largely because of study design: the 
ovarian cancer patients were sampled at a different time than the healthy 
patients, so it was impossible to distinguish whether the signal was due to the 
confounding batch or to the actual difference in biology we care about.
This ends up being a huge problem across many technologies; there is a paper 
discussing how batch effects impact almost every kind of genomic measurement. 
How do we deal with these potential confounders? One way is randomization. 
Imagine we are trying to do a comparison; without randomization, this is what you 
might see. The experimental units, the samples, are shown as circles, and the 
treatments they might be given are the red and green circles around those 
samples in the right-hand column. Now suppose there's another confounding 
variable, the age or the date or whatever variable you might consider, and that 
this variable is related to the treatment: the darker circles more often get the 
green treatment, and the lighter circles more often get the red treatment.
One way you could address this is by simply randomly assigning treatments: every time a new patient comes in, you assign them to either red or green with the toss of a coin. This breaks down the relationship between the treatment and the confounding variable. And because the assignment is random, it breaks down that relationship regardless of what the confounding variable is, even if you don't know about it. So randomization is one way to address the potential problem of confounding.
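A minimal sketch of that coin-toss assignment (the function name and sample IDs are just for illustration):

```python
# Sketch: assign each new sample to a treatment arm by a fair coin toss.
import numpy as np

rng = np.random.default_rng(42)

def randomize(sample_ids):
    """Assign each incoming sample to 'red' or 'green' at random."""
    return {s: rng.choice(["red", "green"]) for s in sample_ids}

# Because the toss ignores everything about the sample, treatment ends up
# unrelated (on average) to any confounder, known or unknown.
assignments = randomize(range(20))
print(assignments)
```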
Another approach is stratification. In addition to randomization, you can design your experiment around confounders that you already know about. Here's an example: a study in mice with 20 males and 20 females, where half will be treated and the other half left untreated, and you can only run the experiment on four mice per day. The question is: how do you assign individuals to treatment groups and to days?
A bad design would be to run all the controls, all female, in the first week, and all the treated mice, all male, in the second week. Here you have all sorts of confounding: treatment status is related to the date the samples were collected, and it's also related to the sex of the mice, so it's very difficult to separate out the different sources of signal.
A stratified design, on the other hand, might look like this: you run both treated and control mice in both week one and week two, and within each week you include both male and female mice in both the treated and the control groups. When we balance the variables this way, because we knew the potential confounders, the date and the sex of the mice, we can design the experiment around them and estimate their effects independently of each other.
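Here's one way that stratified assignment could be sketched in code. It's a toy illustration of the design just described, not a prescribed protocol: each of the ten run days gets exactly one mouse from each (sex, treatment) stratum, so day, sex, and treatment stay balanced against one another.

```python
# Sketch: stratified assignment of 40 mice to 10 run days (4 mice per day).
import itertools
import random

random.seed(0)

# 40 mice: for each sex, 10 treated and 10 controls.
mice = [{"sex": sex, "treatment": trt}
        for sex, trt in itertools.product(["M", "F"], ["treated", "control"])
        for _ in range(10)]

# Group the mice into the four (sex, treatment) strata and shuffle each,
# so assignment within a stratum is still randomized.
strata = {}
for m in mice:
    strata.setdefault((m["sex"], m["treatment"]), []).append(m)
for group in strata.values():
    random.shuffle(group)

# Each day takes one mouse from every stratum, so both sexes and both
# treatment arms appear on every day.
for day in range(10):
    for group in strata.values():
        group[day]["day"] = day + 1

print(mice[:4])  # each dict now carries sex, treatment, and assigned day
```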
There are a few other characteristics of good studies; in fact, an entire class could be taught on experimental design, and we'll talk a little more about it in the statistics class. But to give you an idea of a couple of other things that matter: in general, it's better to have a balanced design. In other words, if you're comparing treated and control groups, you should have about equal numbers of treated and control samples.
A study should also be replicated. If you take only one sample from one person, you have no idea about the variability, either the technical variability or the inter-person biological variability. So it's a good idea to take both technical replicates, that is, running two experiments on the exact same sample to measure how well your technology works, and biological replicates, samples taken from different individuals, so you can measure the inter-person biological variability.
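As a toy illustration of why the two kinds of replicates answer different questions, the sketch below (with made-up numbers) uses the spread among technical replicates to estimate measurement noise, and the spread among per-person averages to estimate between-person biological variability:

```python
# Sketch: separating technical from biological variability (hypothetical data).
import numpy as np

rng = np.random.default_rng(7)

# 5 people (biological replicates), each measured 3 times with
# the same assay (technical replicates).
n_people, n_tech = 5, 3
true_levels = rng.normal(10, 2, size=n_people)  # biology varies a lot
measurements = true_levels[:, None] + rng.normal(0, 0.5, size=(n_people, n_tech))

# Spread among repeat runs of the same sample -> technology noise.
tech_var = measurements.var(axis=1, ddof=1).mean()
# Spread among per-person averages -> mostly biological variability.
bio_var = measurements.mean(axis=1).var(ddof=1)
print(f"technical variance ~ {tech_var:.2f}, between-person variance ~ {bio_var:.2f}")
```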
Good designs also have controls, both negative and positive, to make sure your technology is working and that any effects you've detected aren't just an artifact of the computation or of the experimental design.
