Bioinformatics
o Understand,
o Analyze, and
o Interpret data
o Common tools
▪ command line
▪ software implementation
● Python, R intro
● Genome analysis
Learning resources for students: presentations, links to tools, datasets for analysis.
Overview of algorithms
Importance of statistics
2. Documentation of study
Logbook entry required; completion of posting to be signed by the faculty in-charge and the HOD.
Other comments: for doctors in these fields seeking to gain familiarity with data science and statistical tools, so they can interact better with the data in their genetic datasets.
Bioinformatics
Course Introduction
Introduces you to the basic biology of modern genomics and the experimental tools used to
measure this biology.
It will cover technology used to "read" DNA or RNA, which in turn creates the data that
provides the raw material for genome analysis.
The course will also give a brief introduction to the key concepts in computing and statistics that you will need to understand how next-generation sequencing data are analyzed.
Content Overview and Time Estimate
WHY GENOMICS?
● We all develop: single cell > a few apparently identical cells > embryo > whole person.
● The entire program of development is encoded in our genome, but we don't yet understand it well.
● Code in our cells > determines all the different cell types
● Example neuron or skin cell.
- Genome inside of a neuron = Genome inside of your skin cells.
● What is going on in that cell even though it has the same program?
- Same code > executing a different program > neuron vs skin cell.
● Genomics in cancer?
● Cancer =~ a genetic disease.
● Normal cell genetic code =~ cancer cell genetic code.
● Cancer = cells dividing without any check on their division.
● We define cancers by the type of cell that started the cancer.
- Skin cancer - melanoma.
- Blood cancers - leukaemia.
However, this is not the whole picture (it is not one-way).
o Started in 1989 with the goal of sequencing one human genome in 15 years
o Published the human genome in 2001 - 12 years (3 years early)
o Today a single investigator with a sequencer in a single lab can do it in a few days = ~100 times more efficient
Future Feasibility
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
Where is all this data?
So that is the structure of the genome. But the function of a genome involves many, many other things, and that is really what we are getting at when we sequence a genome: trying to understand the function. The function is all the things the genome does. In that long string of letters is encoded everything your body can do: the instructions for how you make all the organs in your body, for how your body develops from a single embryo into all these complicated tissues, and for how your body does things like respiration and metabolism, and even such complicated things as building a brain.
Another aspect of genomics is evolution. Evolution is a very broad topic; here we mean change over evolutionary time, and those are very, very long time periods. We know from having sequenced many individual human genomes that genomes barely change from one generation to the next, and most of those changes are small, random changes which don't have any effect. But we can also compare our genomes to the genomes of much more distant organisms like fruit flies, nematodes, or even bacteria. When we do that, we discover that even things as remote from us as bacteria have something fundamental in common with us: even though we are very different from bacteria and do very different things, every living thing on the planet uses DNA as its basic code, and because of that every living thing has certain things in common. For example, every living thing has to copy its DNA, and the basic mechanism for copying DNA is shared, so we find genes in bacteria that correspond to genes in humans.
When we talk about mapping genomes, part of what we are doing is trying to figure out where the genes are, and we'll talk a lot more about that later in the course. The word "gene" has changed a lot over the past 50 to 100 years. One way to think about it is as an inheritable unit, something that you can inherit from your parents. Another way to think about it is as a small section of the genome that encodes a protein, or, more generally, a section that can turn into functional elements.
Here's another definition of genomics that emphasizes not just the study of genomes themselves but also their application to human genomes, agriculture, and other areas of science. Genomics is a relatively young field, and it has a big overlap with genetics. One way you can distinguish genomics from genetics is that, in the past, when we studied genomes or genes, we studied them one at a time. Not that long ago, back in maybe the 1980s, a common experiment was to study one gene and spend months or years studying that one gene; people would spend whole careers studying a single gene, writing about it, and trying to figure out what its function was. Today, because we can capture the whole genome, we often look at all the genes at once; that is the genomics way of studying. Technology has been the real driver of this, so that's a very big difference, and it's really what's allowed genomics to grow the way it has over the past 20 years.
In the past, we could only do targeted experiments, and that is one of the big differences between genomics and earlier ways of doing biology. The hard part has shifted, too. In the past, you still had to be clever and understand biology and genetics well, and experiments were not necessarily easy to do; they might take years. Today you can still do the same kinds of experiments, and some of them are easier to do: very clever, very efficient experiments where we're now looking at tens of thousands of genes all at once, in multiple tissues, in multiple samples. But when we get the data, we discover that there is so much of it that it's hard to figure out where the interesting results are. So we have to deal with big data problems that we didn't have to deal with before, with statistical uncertainty in ways that we didn't have to deal with before, and with large-scale computation.
So what is genomic data science? I'm going to try to quickly introduce the major disciplines that comprise genomic data science. It sits at the intersection of biology, statistics, and computer science, and we're calling it genomic data science, but it actually goes by many other names, some of which you might have heard, such as computational genomics, computational biology, bioinformatics, or statistical genomics. Regardless of which name you use, the activities of genomic data scientists all fall into the same broad categories.
Let's talk about a typical study; we start with some subjects. By the way, it doesn't have to be humans, it could be mice or other model organisms, but let's say we're talking about humans. We'd want to collect some samples from those humans, which could just be skin cells from normal people, and then prepare those samples in the lab. Because you're reading the DNA, I'll use the term "reads" for the sequences that come off a sequencer, each of which is a small part of someone's genome. We take those reads and align them to the reference human genome. There is one reference genome, for now, which represents, roughly, an average northern European male; in the near future we're probably going to have many other reference genomes as well. In any case, we align these reads to the reference genome, and that tells us how they differ from the reference genome. It also lets us pile the reads up on one another and see what sort of differences there are even within one genome. Remember that for every person's genome you actually have two copies of every chromosome, because you got one copy from mom and one from dad. So one of the things we study at this phase, for example, is how your two copies of each gene differ from one another. Then, once we've aligned and processed the data, we can draw some kind of conclusions about what's going on. That, very broadly, is the overall workflow.
Before any of that, you have to ask: how much data do I need? How many subjects do I need? What kind of data should we collect? I've been talking about sequence data so far, but there are many experimental protocols, with names like ChIP-seq and methyl-seq, that let you study other things about the genome as well. So the first thing you do is come up with an experimental design which, if everything goes well, you hope will allow you to answer those questions. That's really critical: even though we're generating a lot of data far more cheaply than before, these experiments are still expensive, and you don't want to get to the end of your experiment and discover that you simply don't have the data you need.
Then you'll take those big data sets and, as I was saying a few minutes ago, the first thing you would typically do is align them to the reference genome, find out how they differ from that genome, and assemble them together in some form. That might mean measuring, for every gene in the genome, how much of that gene was present. That's the kind of thing you'd be doing at the aligning and assembling stage; you might be assembling the genes that were expressed in the set of cells or tissue you were studying. Another important step in this process is preprocessing and normalization of the data, because these data come with biases; the law of large numbers comes into play a lot here. All of the technology we're talking about, even though I'm describing it as if it works perfectly, sometimes produces errors that are not random and are hard to identify, and sometimes produces other errors that are random, which we can identify and get rid of. So we basically want to remove any kinds of systematic errors or bias that are in the data
before we move on to the next steps. Once we've got the data normalized, we use statistical methods, which we'll talk about later in this course, to go from this preprocessed, normalized data to conclusions. I'll take one experiment as an example, one that I'm very familiar with: RNA-seq. RNA-seq, which we'll talk about later, measures which genes are being turned on in a set of cells, and you measure those by sequencing. It's a very popular experimental paradigm: you can use it to study cancer, development, evolution, and many other things, and there is software that will, in a more or less standardized way, go from the raw sequence data to something more interpretable: what genes were turned on, and at what levels were those genes present?
There are individual sorts of questions, like what causes a cancer to be a cancer. But we can study other things too: we can look at a population and ask what makes a particular group of people more susceptible to a particular type of disease, or why some people have resistance to it. Rather than looking at one person at a time, or at small groups, we can look at whole populations to see how they differ, and how those differences lead to recognizable differences in the people in those populations. Another kind of activity is to take data of many different types and pull them all together. You might call these statistical activities, where we're taking data that come from sequencing experiments of different types, as well as other kinds of measurements, such as measurements on what we call the proteome, measurements separate from the DNA, and integrating it all together. That's called integrative genomics.
This lecture is about cell biology, and it's just enough cell biology for this course. It is not a detailed background in cell biology, which is a very dense and complex topic; you can spend years studying it, and in fact people spend their whole careers studying it. But cell biology and molecular biology include many complicated-sounding terms, and it's important that you understand a few of them. Bacteria and archaea are grouped together and called prokaryotes; eukaryotes and prokaryotes are the two very fundamental groups of living things. Eukaryotes split off early in the evolution of life on Earth, and bacteria and archaea separated from one another a little bit later, after the eukaryotes had split off; but bacteria and archaea, in many ways, at the cellular level, seem very similar. The big difference between eukaryotes and prokaryotes is that prokaryotes do not have a cell nucleus and eukaryotes do, and the cell nucleus evolved a long, long time ago as a way to sequester our DNA from the rest of the cell. So at the cellular level, eukaryotic cells and prokaryotic cells look rather different. Even very simple eukaryotes such as yeast, which you probably eat every day if you have bread, have nuclei just like our cells do. So humans, yeast, and everything in between are eukaryotes. Our cells have a nucleus, which has its own membrane, or wall, surrounding it, and other small organelles floating around inside the cell (they're not really floating, but you can picture it that way). One of those organelles, the mitochondrion, is worth a special mention when we're talking about your DNA, because it carries a tiny percentage of our DNA. You might have heard mitochondria called the powerhouses of the cell, because the genes inside your mitochondria are responsible for a good bit of energy metabolism. So even though the mitochondrial genome is very small and doesn't have that many genes, it's absolutely critical to life. And, as an interesting side note, we believe that the mitochondrion, a very long time ago, billions of years ago, was originally an independent prokaryote that was absorbed by the ancestor of all eukaryotic cells, and it has since become part of every eukaryote on the planet. So all eukaryotes have mitochondria; mitochondrial genomes started to diverge as species diverged, and most of them are very small. Back to the cell itself.
Cells undergo a characteristic cell cycle that many people study. It's not important that you know all the names of the parts of the cycle, like metaphase and anaphase, but cells go through a cycle in which they undergo different processes, and one of the most critical is the process of division. That's how you go from a single cell to a whole organism: the cell has to divide many times, and throughout your life your cells are constantly dying and have to be replaced. In order to replace damaged tissue, cells have to divide to replace the cells that are damaged. That part of the process is called mitosis, where a cell separates into two daughter cells, producing two essentially identical cells. It's important that you understand just a couple of words here. When the cell undergoes mitosis, the DNA in the cell has to replicate; that is, it has to make two copies where there was one before. So inside the cell, before dividing, the DNA replicates. Now you have two copies of every chromosome where there was just one; well, actually you start off with two copies of every chromosome, so after replication you have four copies. There is a complicated procedure during mitosis for separating those copies, and that's what happens: the copies that are made are separated in the cell as it's dividing, and as a result you get two daughter cells, as we call them, that are identical to the original cell. Both daughter cells are also diploid, so there's another word I haven't used yet: diploid means having two copies of every chromosome. Our DNA exists in two copies, one copy from mom and one copy from dad, and of course, in diploid organisms it's important that every time a cell divides it maintains these same two copies.
Another important aspect of cell biology is that cells don't always divide to produce identical copies. When you're replacing existing cells that have been damaged, say you've got a cut in your skin and new skin cells have to be grown, you want the new skin cells to be just like the ones that were there before. But during the course of development, cells need to develop into many different types. Our bodies start from more basic cells that we call stem cells, and growing from stem cells involves a slightly different cell division process where the two daughter cells are not identical to each other; they go down paths of differentiation into mature cells of different types. For example, as this slide is showing, all the cells in your blood start as what we call a multipotent hematopoietic stem cell, which can divide and produce many different blood cell types, and that's also how the cells in your body are maintained over time. There's also a special type of cell division, meiosis, where something different happens, and that's the main point we want to get across here. When you're producing egg cells or sperm cells for the next generation of your species, you start off with copies of chromosomes from the mother and the father, and those are all put together before meiosis in a single compartment. Then there's a special process: because the two copies of each chromosome are very, very similar, they can stick to each other, and we get this process called crossing over, where part of, let's say, chromosome one from mom can cross over with a part of chromosome one from dad. As a result, in the daughter cell you can have a copy of chromosome one which is partly mom and partly dad. Now, these recombination events are relatively rare, but not that rare; in humans, in fact, there is roughly one crossover per chromosome. So although you got every bit of your chromosomal DNA from your parents, your actual chromosomes end up being shuffled mixtures of theirs. This recombination, besides mutation, is one of the other reasons why family members are not alike: every child gets different recombinations. So even though every child of the same set of parents has essentially the same DNA, tracing the pieces of each child's chromosomes back to both parents gives a different pattern every time, and that's one of the big sources of variation between siblings.
The other critical molecule for how our bodies and our genomes work is RNA. RNA is almost exactly like DNA except for a couple of important differences. The most obvious difference is that we don't have the T anymore: instead of thymine we have uracil. So when DNA gets copied, or transcribed, into RNA, the As get replaced by As, Gs get replaced by Gs, Cs get replaced by Cs, but Ts get replaced by Us, or uracil. You then get an RNA molecule which, unlike DNA, is single-stranded. RNA is not double-stranded, although it can form double-stranded complexes; in general RNA is single-stranded, and it is from this RNA template that we create proteins, which are the other critical molecules in a cell. RNA has a similar biochemical structure to DNA; the uracil, or U, molecule is very similar to the T molecule.
We write RNA the same way we write DNA, five prime to three prime, only we replace all the Ts with Us. So if you see a string of letters with Us in it instead of Ts, you know immediately that it's RNA, not DNA. An important distinction genetically is that DNA is the stuff of inheritance: DNA is what cells carry with them from one cell generation to another, and whenever a cell divides, it creates DNA that replicates the DNA in the original cell. RNA is used as a template to make proteins, but RNA is not actually the stuff of inheritance, even though it is, in most cases, an identical copy of the DNA it was formed from. So we use these molecules to encode how the cell works: DNA is basically a program that we read out, and the read-out starts with RNA, and the RNA is used to make proteins. Proteins are also long molecules, though not nearly as long as DNA; they're typically hundreds or sometimes thousands of amino acids long. Amino acids are more complicated molecules (there's a picture of one here) that are strung together to make proteins.
The translation rules were worked out in the 1960s, in groundbreaking molecular biology work at the very founding of the field of molecular biology. Scientists figured out, essentially one codon at a time, how RNA is translated into proteins. I just used the word "codon": the way RNA gets turned into protein is that every combination of three letters of RNA, a codon, encodes an amino acid.
Here is a translation of a particular set of nine nucleotides producing three amino acids. We write proteins in the 20-letter alphabet we use to abbreviate the amino acids; there are 20 amino acids that comprise essentially all of our proteins. Actually, to tell the whole story, there are more than 20 amino acids, there are 22. The 21st amino acid was discovered not that long ago, and the 22nd also not that long ago, and these extra amino acids are primarily used in other forms of life besides humans. There are a few exceptions to almost every rule in biology, but in general, the way to think about human biology is that we have 64 possible codons; 61 of them encode amino acids, and they encode exactly 20 amino acids (the remaining three are stop codons). The 21st amino acid, when it was discovered, turned out to be encoded by one of the stop codons, which is once in a while used to encode an amino acid.
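As a rough illustration of transcription and codon translation, here is a small Python sketch. The tiny codon table and the 12-base sequence are toy examples made up for this note, not part of the lecture; a real table has all 64 codons.

# Sketch of transcription (DNA -> RNA) and codon translation (RNA -> protein).
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop
}

def transcribe(dna):
    # Coding-strand convention from the lecture: A->A, C->C, G->G, T->U.
    return dna.replace("T", "U")

def translate(rna):
    protein = []
    for i in range(0, len(rna) - 2, 3):          # read three letters (one codon) at a time
        aa = CODON_TABLE.get(rna[i:i + 3], "?")  # "?" marks codons missing from this toy table
        if aa == "*":                            # a stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("ATGGCTTGGTAA")   # -> "AUGGCUUGGUAA"
print(translate(rna))              # -> "MAW"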
What was the Human Genome Project?
In this lecture, we're going to talk about the history of this groundbreaking project
that brought us where we are today.
So the Human Genome Project was first proposed in the late 1980s
by scientists at the US Department of Energy.
Many people don't realize that it was not the National Institutes of Health,
the NIH, that proposed it, but rather the DOE.
And the reason the DOE was interested in genomics was they were studying
effects of radiation on DNA.
But never mind that.
When the project was first proposed, it was considered extremely ambitious,
and many scientists were actually against it.
The idea was that it would be biology's Manhattan Project, by far,
the largest project that biology had ever taken on.
But as scientists started to discuss it in the late 80s,
it quickly gained momentum and soon it was approved, and
the NIH joined in and then many other countries joined as well.
So the project officially started in 1989 as a joint effort of the NIH and
the DOE in the United States, plus many other countries.
Outside of the US, the Sanger Centre in England was the largest sequencing center.
So the goal of the project was very simple.
The human genome is 3 billion base pairs long.
In the 1980s, sequencing was still very new technology.
The automated sequencing that was available at the time was very slow and
expensive.
It was around $10 a base to sequence in the 1980s.
So that was really expensive, and that would cost $30 billion.
But the scientists who were proposing it said, well,
we know things are getting faster and more efficient, so
we're going to assume that prices will drop by a factor of ten.
And we'll probably be able to sequence the genome for $1 a base.
So they came up with an estimate of $3 billion,
which is the number that has been widely reported as what the project cost, and is
probably a bit of an overestimate because cost went down quite a bit more than that.
But anyway, that was the goal, sequence all 3 billion base pairs
in the human genome for $1 a base, and finish it in 15 years, by 2005.
Now, one reason people were opposed to this was that we knew already at the time,
and we certainly know very well now,
that only about 1.5% of your DNA encodes proteins.
So people said that most of the DNA, they thought, was just junk:
stuff that wasn't really biologically important or useful.
We now know that that's not really true, but at the time it was widely believed
that most of the sequencing would be dedicated to learning
sequences that didn't really have any biological function or consequence.
So some scientists were opposed to it because they thought it
would be a waste of time and money that would be better spent
trying to target the genes in the genome.
Nonetheless, the project took off and quickly gained momentum.
So, you might have heard about the Human Genome Project as a race.
Well, in the early 1990s it wasn't really a race.
I'm going to get to that in just a few minutes.
The way the project started was that scientists around the world worked
on what were called maps.
So the idea of the project and the plan for the overall project from the beginning
was that we would take large chunks of DNA, and these were about 150,000 base
pairs long, that were called bacterial artificial chromosomes or BACs.
So we take these chunks and
we could grow those chunks up in E. coli bacteria, make as many copies as we wanted,
and we could sequence those chunks, and then stitch those pieces together.
This seems roundabout, but that was because at the time the best we could do
in terms of sequencing technology was to sequence
little tiny fragments from a slightly larger chunk.
And these bacterial artificial chromosomes,
or BACs, were about the largest chunks people thought they could handle.
The real problem was whether you could assemble those little tiny fragments, or
reads, back together, and we knew we could do it for 150,000 base pair chunks.
The remaining problem, though, was that while it was easy to create these BACs,
you had to figure out where they went on the genome before you sequenced them.
So the idea was to develop what were called libraries,
with hundreds of thousands or millions of BACs in them.
Then select those BACs, figure out where they went in the genome, and create
a tiling path, basically aligning the, the pieces of, of BAC DNA across the genome.
And then finally, when those maps were done, we would sequence the BACs.
The idea was that as mapping went on, the funders would fund that effort, and
sequencing, meanwhile, would get more efficient,
and when we finally got around to sequencing, it would all be $1 a base.
So that was the idea, and it was moving along steadily throughout
the early 1990s, along with technology development.
But then, something happened that kind of changed the game rather dramatically.
In 1995, a small non-profit research institute called TIGR, The Institute for
Genomic Research, sequenced the first complete bacterial genome ever to be done,
the genome of Haemophilus influenzae, which is an infectious bacteria.
This genome is about 1.8 million bases long and has 1,742 genes.
And this project was led by Craig Venter, who was the founder of TIGR, and
Hamilton Smith, a professor at Hopkins, who is also a Nobel laureate.
So what was different?
Why would this change things?
This is a tiny genome; bacterial genomes are far smaller than humans',
about 1,000 times smaller.
What was different was that this was done through whole-genome shotgun sequencing,
where you didn't create these maps, but instead
took the whole genome and fragmented it into many, many tiny pieces,
tens of thousands of tiny pieces.
Then you just randomly sequenced those pieces, and by oversampling,
that is by sequencing every part of the genome many times over, you could
then use a computer program called an assembler to put it back together.
And people had never done this for
something on the order of a whole genome, even a whole bacterial genome before.
So this was dramatic, and it
certainly changed the field of microbial genomics at the time.
And everybody in the microbial world was very excited and
started proposing to sequence microbial genomes this way.
Meanwhile though, the human genome continued as planned, sequencing or
mapping these 150,000 base pair chunks.
So then things changed again a few years later in 1998,
and this is where the race really began.
So a new sequencing machine was developed by a company called Applied Biosystems.
And this machine was not dramatically more efficient than the other,
than the previous machines, but it was significantly faster, and it was easier.
It used capillaries to do sequencing.
That is, tiny little plastic straws that the DNA would flow through.
And it let you automate the sequencing in a way that wasn't really possible before.
So with funding from Applied Biosystems, Craig Venter,
Ham Smith and others left TIGR to form a for-profit company called Celera Genomics.
And this company's goal, its entire purpose for
being created was to sequence the human genome.
Not only were they planning to sequence the human genome, but what they proposed
at the time was that they would do it through whole genome shotgun sequencing.
That is, they would take the entire human genome, 1,000 times larger than
a bacterial genome, they would break that up in to lots of little pieces,
millions and millions of little pieces, sequence those, and
somehow assemble them back together to create the whole genome.
Now this method didn't have to do the mapping.
The mapping was still going on, the BAC mapping was still going on in the publicly
funded Human Genome Project, but Celera wasn't going to do that.
They were going to skip all that, and go straight to sequencing.
So this certainly caused a kerfuffle, and the race began.
So NIH, which up until then had been funding eight large centers in the US,
merged its efforts into three even larger centers, which still exist today.
The Sanger Institute, or the Sanger Centre as it was called at the time, ramped up
its effort, and the other groups in the public project did as well;
everyone in the public effort started accelerating their sequencing.
Now, they were still doing BAC-by-BAC sequencing, but they started sequencing
the BACs much faster than they had been before.
So this was really a race.
Despite that,
some people were skeptical about Celera's ability to assemble an entire
animal genome using this whole-genome shotgun sequencing technique.
No one had really done it for anything larger than a
large bacterial genome, which is millions of base pairs, not billions.
However, soon after the formation of the company, Celera sequenced and
published the complete genome of the fruit fly, Drosophila melanogaster.
Now, drosophila is about 130 million base pairs long,
so still much smaller than human, about 20 times smaller.
But much larger, about 20 times larger than any bacterial genome or
than any genome that had been sequenced and
assembled through the whole genome shotgun technique up to that time.
So that was a success. It was published in 2000, and it proved that this whole-genome
shotgun technique could scale up by a factor of 20.
And there was really no technical reason why it wouldn't scale up by another factor
of 20; in fact, that's what eventually happened.
So this really proved that Celera meant business, and
it really spurred the public effort to accelerate its work even further.
So the race really heated up in 1999 and 2000.
In 1999, Craig Venter announced that Celera would finish their work by 2001.
Actually, originally he really announced 2003 because the public effort said they
were going to finish in 2005.
The public effort quickly responded by saying they would also finish in 2003.
Then in 1999 Venter announced that Celera would finish, in fact, in 2001.
And soon thereafter, within a matter of weeks, NIH and the Sanger Centre announced
that the public Human Genome Project would finish a draft genome by 2001 as well.
So everybody was racing.
Now, by the way,
these announcements were what the scientific leaders of the projects were making.
The actual people doing the work,
and I was one of those people, were mostly just panicking,
because we didn't really have any plan to finish that quickly, but
we figured we would have to give it a shot.
So in 2000, as the work really did seem to be getting close to completion for
a draft genome, NIH, the Sanger Centre, and
Celera Genomics talked about publishing jointly.
So there was a considerable effort to make this into one
final project that everybody would say we all did together.
However, in late 2000, those talks fell apart and two papers were planned.
So that's what happened.
In June of 2000 Bill Clinton and Tony Blair, the leaders of the US and
the UK at the time, jointly announced the completion of the human genome.
And you can see in the slide that Craig Venter is shaking hands with
Francis Collins,
who was the head of the Human Genome Research Institute at the time.
So it was announced in the year 2000 that both groups were done and
that it was a tie.
And that's kind of how it played out.
Now at this point, the paper wasn't done and in fact,
at the time of this announcement, the genomes weren't done either.
But we knew we had about six months to get them done,
those of us who were actually in the trenches doing the work.
And so everybody put all their effort into
finishing up this draft genome as quickly as possible.
So now, whose genome did we sequence, by the way?
When you talk about sequencing "the" genome: each of us on the planet, and
there are billions of us, has a different genome.
Now, our genomes are all very, very similar, probably only differing by about
one position in a thousand, but they're all different.
The Human Genome Project sequenced one genome which was a mosaic of about a dozen
volunteers who contributed DNA, all anonymously to the Human Genome Project.
All of them were Northern European in origin, so
they all had a similar genetic background.
And the assembly of this original genome represents that one mosaic of
a small collection of individuals of Northern European descent.
Now since then,
we've gone on to sequence the genomes of other people from other populations.
But at the time, that was what we did.
It wasn't one person's genome, but a few people's genome.
So what did the genome tell us?
Why do we do this?
So, one of the major goals of the Human Genome Project was to identify all the genes:
identify all their sequences, eventually figure out what they all do, and
use that to develop better treatments and improve human health.
What I'm showing you here is one
of the first papers ever to attempt to estimate the number of human genes.
This is a paper that appeared back in 1964 in the journal Nature, so
40 years before the Human Genome Project was completed,
and very soon after the genetic code was worked out.
What a scientist named Vogel did was to look at the first two genes whose
sequences had been determined, which were human hemoglobin subunits.
These are fairly small genes,
about 146 amino acids long, and he knew how much they weighed:
we had pretty good measurements of how much those amino acids weighed,
and we also had pretty good measurements of how much DNA weighed.
So from 146 amino acids, you'd know how much the DNA encoding them weighs,
at 3 nucleotides for each amino acid.
You could also measure roughly the weight of the genome in a cell.
So he basically took the weight of one gene, divided it into the weight
of the genome, and assumed that basically the genome was
just gene after gene, end to end.
Now remember, this was 1964; no one knew otherwise.
We now know that only about 1 to 2% of the genome encodes genes, but
that wasn't known at the time.
that wasn't known at the time.
So, using this estimate and not knowing anything about exons and introns and
about all the intergenic DNA and so-called junk DNA,
you would come up with a number of around 6.7 million genes, which is wildly off.
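To see roughly how a 1964-style estimate works, here is a hedged back-of-the-envelope version in Python. Vogel actually worked from molecular weights; using lengths instead, as below, is a simplification, but it rests on the same assumption that the genome is genes packed end to end.

amino_acids_per_gene = 146          # length of a hemoglobin subunit
bases_per_amino_acid = 3            # one codon per amino acid
gene_length_bp = amino_acids_per_gene * bases_per_amino_acid   # 438 base pairs

genome_size_bp = 3_000_000_000      # the roughly 3 billion base pair human genome

print(round(genome_size_bp / gene_length_bp))   # ~6.8 million, close to the 6.7 million figure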
So the whole Human Genome Project presumably would give us a much better
fix on this number.
There are many, many other things, of course, that we can learn from the genome,
but this is one of the simple answers we should be able to get out of
the genome: how many genes do we have, and what are they?
So the two papers appeared in February 2001.
They actually appeared simultaneously, with lots of hoopla.
The public effort published its genome in Nature, and
here's a picture of the cover from the time.
And they estimated 30,000 to 40,000 genes.
One interesting thing about that official estimate is
that it seems like a very imprecise number.
Ten or fifteen years earlier, no one would have imagined that the genome
would be finished and we still wouldn't know precisely how many genes there were.
But it turns out that it's much harder to figure out exactly what the genes are from
the DNA sequence than we had realized.
So they gave a very rough estimate of 30 to 40,000, and that number
was a lot smaller than estimates that had been discussed even as recently as a year
earlier when people were still proposing 100,000 genes in the human genome.
So that was a surprise that there were so few.
The other paper was published in Science,
this is the paper led by the group at Celera Genomics,
which included scientists from around the US and, and Europe as well.
And that number was much more precise-seeming, 26,588 genes.
But there was an additional approximately 12,000 likely genes
that were also described in that paper.
So if you add those two numbers together, it's in the high 30,000 range.
So that, that was consistent with what was in the Nature paper.
And let me just mention that I was on this paper,
buried in this long author list as well.
So we had new estimates of the number of genes,
seemingly precise in one paper and imprecise in the other, but
if you read the Science paper, you see it's also a very imprecise number there.
So we didn't actually know how many genes there were
even though the genome was sort of finished.
And by the way, this was a draft genome, and
the draft only covered about 92% of the genome.
So today the genome that we have is still, technically speaking, a draft.
It's still not finished, but it covers well over 99.9% of the genome.
That gene count number is an interesting story in itself.
Look at how it has evolved over the 40, now 50, years since
that early 1964 paper, which estimated millions of genes:
starting around 1990, people were
bandying about estimates generally in the 100,000 range.
This chart shows a number of publications that kept moving that number
around, starting at 100,000 and going down, with
estimates published in the 60,000 range, the 50,000 range, and so on.
And it gradually decreased until today we believe the number is around 22 to 23,000.
But we're still not sure of the precise number, even today,
we don't have a precise number of human genes.
An important caveat here is that this gene count, the way it was originally proposed
and the way I'm describing it now, refers to the number of protein-coding genes.
That is, a piece of the human genome, a piece of DNA,
that gets transcribed into RNA and then translated into a protein.
Everyone agrees we call that a gene.
However, for at least the last 20 years,
we've known that there's some number of genes in your genome
where the DNA gets transcribed into the RNA, and the RNA itself has a function.
We would call that an RNA gene, it never gets translated into a protein.
Starting in the late 2000s,
through a new technique called RNA-seq, we've learned that there are many thousands,
probably tens of thousands, of RNA genes in the genome.
So this gene count that we've been looking at for the past several decades
only looks at one side of the coin, we're only looking at protein-coding genes.
We know the number's quite a bit higher if you look at RNA genes.
And that number is even less precise today.
Over the coming decade, we probably will get a better handle on it.
But I wouldn't promise we'd have a precise answer to this question
even ten years from now.
So let's just review.
The Human Genome Project started in the late 80s.
It officially began in 1989 and
1990, with the goal of sequencing 3 billion base pairs.
Did they achieve that?
Yes, they did.
The goal was to do it for $1 a base.
How did they do on that?
Well, at the time the genome was published, it cost about $1 for a read.
A read, a single sequence coming off a sequencer, was about 700 bases long.
So not only did they achieve that goal,
they were 700 times cheaper than they thought it would be.
So that was really a dramatic success.
In terms of time, they wanted to get it done by 2005.
It was done in 2001, at least the papers were published in 2001, a draft genome.
So you could say they dramatically exceeded all of their goals.
And now the final note, on cost: the price in 2001,
$1 per read, or $1 per 700 bases,
seemed quite dramatically better than we'd expected,
and it did continue to slowly drop for a few years thereafter.
But then, due to dramatic changes in technology, what's called next generation
sequencing, the cost has dropped far, far more than that.
So today, it costs about $1 for 3 million bases,
which is 4,000 fold cheaper than it cost even when the human genome was finished.
And that's what's led to this tremendous explosion in all
sorts of genome sequencing experiments that we're experiencing today.
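Just to check the cost figures quoted above against each other (all numbers are the ones given in this lecture), here is a quick Python calculation:

bases_per_dollar_goal = 1            # the Human Genome Project goal: $1 per base
bases_per_dollar_2001 = 700          # 2001: roughly $1 per 700-base read
bases_per_dollar_today = 3_000_000   # "today": roughly $1 per 3 million bases

print(bases_per_dollar_2001 // bases_per_dollar_goal)    # 700  -> 700 times cheaper than the goal
print(bases_per_dollar_today // bases_per_dollar_2001)   # 4285 -> the roughly 4,000-fold further drop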
In this lecture, we're going to talk about polymerase chain reaction.
It's a remarkable and very powerful way to copy DNA, and in fact to make
as many copies of DNA as you want, with a really simple but powerful technique.
So how do we make copies of DNA?
We'd like to just feed our DNA to some sort of machine like a copier,
like I'm showing here, and make copies of it.
And we use this in various aspects about technology and DNA sequencing,
in genomics, all the time.
There are many, many reasons why we need to make copies of DNA.
For example, when we're doing RNA sequencing,
we need to turn our RNA into DNA and make lots of copies of that.
When we're sequencing someone's genome, we don't just sequence a single cell or
a single molecule.
All the technology we use for sequencing requires us to have many many identical
copies of the DNA molecules before we get started.
So, DNA copying is a really important, sort of, basic tool that we need for
many of the things we want to do.
So how do we do it?
So polymerase chain reaction uses a couple of simple properties of DNA and
turns them into this wonderful method for DNA replication, or copying.
First, recall that DNA sticks to itself: DNA is always double-stranded.
So here we have two strands of DNA.
We have A, C, G, and Ts on the top, and the complementary strands on the bottom.
And remember that the DNA is directional.
And the direction that it goes,
we always talk about it going from the 5' direction to the 3' direction.
So the beginning of the DNA sequence will be the 5' end and
the end will be the 3' end.
So, the fact that DNA will stick to itself is a very important property that we're
going to use in PCR.
So, another thing we need is something called primers.
So what's a primer?
So primer is simply a short sequence, usually they are 15 or 20 bases long.
They can be a little shorter, or a little longer, but it's a sequence of
DNA bases that's complementary to the DNA that we want to copy.
So, here I'm showing an example with two primers, one in green, and one in blue.
Along the top strand, which we call the forward strand, there's a primer you
see at the beginning, shown in green, whose bases match
the top strand of DNA right at the beginning.
And on the other strand we have a different primer, shown in blue,
going from the other end: it matches the reverse strand and runs along
the reverse strand.
So it goes 3' to 5'.
So these primers, if I were just to mix these primers with this DNA sequence,
they would stick to it, because they're complementary to it.
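To make the idea of primers sticking by complementarity concrete, here is a small Python sketch; the template and primer sequences are invented purely for illustration and are not from the lecture.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    # Complement each base and reverse, giving the strand that would pair with seq.
    return "".join(COMPLEMENT[base] for base in reversed(seq))

template = "ACGGTTCAGGTACCTTAAGC"                   # top (forward) strand of the region to copy
forward_primer = template[:8]                        # same sequence as the start of the top strand
reverse_primer = reverse_complement(template[-8:])   # pairs with the end of the top strand

print(forward_primer, reverse_primer)
# A primer sticks where its reverse complement appears in the target strand:
print(reverse_complement(reverse_primer) in template)   # True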
So how does PCR work?
So we start with some DNA and some primers.
And let's not worry how we get those primers.
But just if you're curious,
you can actually easily generate any primer you want of any length.
You can just order it from a company these days.
And, in fact, you can order a mixture of all possible combinations of DNA bases,
of say eight bases or even a little longer so you have primers even for unknown DNA.
So what do we do with that?
Well, now we're ready to start the process of PCR.
We're going to heat up the mixture gently.
What happens when you heat up DNA is that it melts, or
rather, the two strands separate from one another; they fall apart.
So you see, I've just moved them apart physically on the screen here.
So if the mixture is hot, then the primers won't stick to the DNA and
the two strands of the DNA won't stick to each other, either.
Then we cool it down, or anneal it, gently, and we let the primers stick to our DNA.
The primers tend to stick to the DNA before the two strands find each other,
because the primers are small and can float around a little bit more easily.
And if we kept cooling, eventually the two strands of DNA would stick back to each other.
But we don't wait that long; we also add another ingredient to the mix.
We need something now to copy our DNA.
So we need a copier molecule.
Fortunately, nature has provided us with a very good copier molecule
called DNA polymerase.
It's what all of our cells use to copy their own DNA.
So we can synthesize that and make large quantities of it and
add that to the mixture as well.
So, DNA polymerase acts in the following, very straightforward way.
It looks for a place where the DNA is partly single stranded and partly
double stranded and it grabs on to the sequence right there and starts to copy.
So this DNA polymerase here will notice both of these two primers are now attached
to sequences that are single stranded, except where the primers have stuck.
So, the DNA polymerase will go and it will find those sites,
and it will start to fill in the missing sequence starting at the primer.
So that's the property of DNA polymerase that we really need,
and there are many polymerases.
Every living organism on the planet actually has one.
So, we don't actually use the human polymerase for this process,
but it doesn't matter.
You can use, in theory, any DNA polymerase to do the copying.
Although we do need a specialized one for PCR.
The result, after one round of copying, is that the sequence across
the top will have been completely filled in starting from my green primer,
as you see here. And then another polymerase, assuming I added lots of
polymerase molecules, will have filled in the sequence
of the other strand across the bottom, starting with the blue primer.
Now, I also needed a mixture of As, Cs, Gs, and Ts, the raw material to make DNA.
So, I didn't say that yet, but in addition to adding my DNA polymerase,
I'm going to add raw As, Cs, Gs, and Ts in large quantities.
So that the DNA polymerase can incorporate them into the new double-stranded
DNA that it's creating.
So after one round of PCR, if we let things cool down, these two strands will
stick together, and we've now created two strands where we only had one before.
So we can just repeat that whole process.
And if we repeat the whole process, we get four strands.
And a very important property of PCR, the reason it's called a chain reaction,
is that with each round, we double the amount of DNA we had before.
So, very quickly, after just a few rounds, you go from just one molecule to many,
many molecules.
Or many, many copies.
So, typically we'd repeat this for 30 cycles or more.
And if you do the math, 2 to the 30th is about a billion,
so you can take one molecule and turn it into billions of molecules
after just a few dozen cycles of polymerase chain reaction.
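The doubling arithmetic is easy to check with a few lines of Python; this is just the math, not a simulation of the chemistry.

starting_molecules = 1
cycles = 30
copies = starting_molecules * 2 ** cycles   # each cycle doubles the number of molecules
print(copies)                               # 1073741824, i.e. about a billion copies

for cycle in range(1, 6):                   # the first few rounds: 2, 4, 8, 16, 32
    print(cycle, starting_molecules * 2 ** cycle)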
So let's look at a little cartoon of how this works, just to drive home the point.
So we start with melting or denaturing the DNA at 94 degrees Celsius.
So here you see a double strand of DNA with some polymerase.
You'll notice some little arrows floating around.
And primers are shown as the little short fragments of DNA that are floating in
the solution here.
And as we heat it up,
you'll see the two strands of DNA in the middle start to denature.
They're falling apart.
So, you see they're melting apart there, and
eventually they're completely separated.
Now, an important part of PCR is you have to denature for long enough for
the two strands to fall apart.
But it doesn't take long at all.
We're talking only a few minutes to do that.
Then you cool it down to 54 degrees, and
at that temperature the primers will stick to the DNA, as you see here.
And the polymerase can then find these double-stranded pieces and
start to fill it in.
So here you see the two polymerases in this picture are starting to
fill in the existing strand and create double-stranded DNA.
So they'll just walk along until they get to the end of the molecule, and
then they'll fall off.
So now we've created two complete copies, and
we do that at a slightly higher temperature of 72 degrees.
And then we simply repeat the whole process to make another copy.
So in summary, the PCR recipe is the following ingredients.
You need some DNA that you want to copy,
it could be any DNA at all from any living organism.
You need primers, you need DNA polymerase,
a special copier molecule, and you need lots of A's, C's, G's, and T's.
And then the way you execute the recipe is you melt it at 94 degrees.
You cool it down to 54 degrees, then warm it back up to 72 degrees.
And then simply repeat.
So all you have to do is mix those ingredients in a single mixture and
go through this process of heating and cooling, and
the rest of the reaction takes care of itself.
So that's why it's called polymerase chain reaction.
It uses the DNA polymerase to cause this chain reaction,
this explosion in the number of copies of your DNA.
And it all happens fully automatically just using the properties of DNA itself,
which hybridizes to itself, and of DNA polymerase, which copies DNA.
So this is such a clever and powerful idea, and was so revolutionary and
had such a great impact on the field, that not long after it was discovered,
the Nobel Prize in Chemistry was awarded to Kary Mullis for the invention of PCR.
And this is just a picture from the Nobel site describing his Nobel lecture.
This lecture is about how we store data: data structures and computer memory.
Data structures are an important part of computer science that lets us store data
efficiently, and especially when we're dealing with very large data sets,
we have to think carefully about how we store data.
So, for example, in the context of genomics, we deal with large amounts
of sequence data, sometimes aligned, as you can see on this slide.
We might have sequences for many people or many species,
and many sequences from each of them,
and we have to think about how to get those into memory.
Now the way the sequences come at us from a sequencing machine is
just as a long string of characters of text.
And computers are very good at storing text; that's what they were originally
designed to do in many cases, so computers already have a way to store that.
But sometimes we can think about more efficient ways to store it.
For example, when we're looking at a multiple alignment of lots of sequences,
those sequences have a lot in common, so we might want to
store something that's a little bit smaller than all the separate
sequences and just store the differences between them.
Another important aspect of data structures and memory, when we're
designing them, is that we want to be able to find stuff.
When we're looking at DNA sequences, they all pretty much look alike.
They're a long string of As, Cs, Gs, and Ts.
And when we're looking at the human genome,
we have three billion base pairs to search through.
So, for example, an interesting problem to think about
from the data structure point of view in genomics is this:
I've got a little sequence of, say, 100 nucleotides, 100 DNA letters, and
I want to store it in such a way that I can go and quickly find it again,
and I'm storing it in the context of a three billion base pair sequence.
Rather than throwing it in a random pile,
I'd like to store it in a way where I can go back and quickly find it.
So, for example, with a DNA sequence that's from the human genome,
we might want to store some kind of tag that says what chromosome it's on and
where it starts in that chromosome.
Play video starting at :1:57 and follow transcript1:57
The kinds of data structures that computer scientists have designed over
the years vary widely, but
one of the most common sorts of data structures is a tree.
There's also something called a list, and something called a linked list, and
many variants on these data structures.
And these are simply ways of keeping track of things,
keeping track of objects you stored, and having objects point to one another,
so that once you go to one object you can quickly find another object.
To understand how these objects work, it's important to understand that in
computer memory, we not only have the data itself, but we also have an address.
Every piece of memory has an address, and
we can find any object in memory if we know that address.
In programming language terms, those addresses are called
pointers, and a pointer will take us directly to a piece of memory.
So, if we store a piece of sequence data somewhere in the computer's memory,
to retrieve that again, we simply need a pointer.
And one way to make a data structure efficient is, if we have
lots of sequences that are near each other in the genome,
to store pointers from one of those sequences to the next one,
so that once we're in the right place we can
quickly find the other sequences without having to start over again from scratch.
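Here is a hedged Python sketch of that idea: a small linked structure where each stored read keeps a pointer to the next read along the genome. The class name, fields, and coordinates are invented for illustration, not taken from the lecture.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReadNode:
    chrom: str                           # chromosome the read aligns to
    start: int                           # starting coordinate on that chromosome
    sequence: str                        # the read itself
    next: Optional["ReadNode"] = None    # pointer to the next read along the genome

# Build a tiny chain of reads sorted by position on chromosome 1.
r3 = ReadNode("chr1", 1520, "GGATCCA")
r2 = ReadNode("chr1", 1490, "TTGACCG", next=r3)
r1 = ReadNode("chr1", 1450, "ACGTACG", next=r2)

# Once we've located the first read, we can walk to its neighbors by following
# pointers instead of searching the whole data set again.
node = r1
while node is not None:
    print(node.chrom, node.start, node.sequence)
    node = node.next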
Another aspect of thinking about data structures is making them efficient.
As I said before, in the genomics world we're mostly dealing with sequence data.
Sequence data has a natural representation as letters and
computers represent letters typically as as one byte.
So any, any letter in the alphabet, a through z, any nu, any numeral zero
through nine is represented the same way inside the computer using one byte.
The word byte actually comes from the word bit. A bit is a binary digit, just a zero or a one, and that's how computers represent information at the most fundamental level. If you take eight bits in a row, you can consider that an eight-bit binary number, which can store 256 distinct values. The standard text encoding actually only uses 128 of those values, usually taken to be 0 through 127, and the standard representation of text inside the computer is to represent every character as one of those values between 0 and 127.
With that much space, we can represent all 26 lower case letters, all 26 upper case letters, the ten digits, and still have room for all the special characters. So basically everything on your computer keyboard is represented as a single byte.
However, if you look at DNA, you see right away that there are only four letters, so we can do much better when representing DNA, and this is how most serious, highly efficient programs for processing lots of DNA operate internally. Instead of representing the four DNA letters as one byte each, we can represent them as just two bits: A becomes 00, C is 01, G is 10, and T is 11. By doing it this way we get a fourfold compression; instead of using eight bits per letter of DNA, we're only using two. Because we're storing gigabytes or even terabytes of DNA sequence data, a fourfold compression right out of the box is an important efficiency we gain from that representation.
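A minimal sketch of that two-bit encoding in R (illustrative only, not a production implementation) packs four bases into each byte:

    # Map each base to its two-bit code: A=00, C=01, G=10, T=11
    base_to_bits <- c(A = 0L, C = 1L, G = 2L, T = 3L)

    encode_dna <- function(seq) {
      codes <- base_to_bits[strsplit(seq, "")[[1]]]
      # pad to a multiple of four bases so each byte holds exactly four codes
      pad <- (4 - length(codes) %% 4) %% 4
      codes <- c(codes, rep(0L, pad))
      m <- matrix(codes, nrow = 4)                  # one column per output byte
      as.raw(m[1, ] * 64L + m[2, ] * 16L + m[3, ] * 4L + m[4, ])
    }

    encode_dna("GATTACA")   # 7 letters stored in 2 bytes instead of 7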
Finally, to look at a slightly more sophisticated way of representing DNA: one thing we like to capture when analyzing DNA is patterns of sequence that have some biological function. Here I'm showing you a picture of the ends of an intron. Introns are the interrupting sequences within the genes in our genome that don't encode proteins, but get snipped out and thrown away in the process of going from DNA to RNA to protein.
Introns almost always start with the letters GT and almost always end with the letters AG. If you collect lots of them together and look at what the patterns have in common, you can create a probabilistic picture of which letters are most likely to occur at the beginnings and ends of introns. These two pictures show exactly that for the beginning of an intron, which is called the donor site, and for the end, which is called the acceptor site.
We could represent all the donor sites we've ever seen as a big set of strings, say ten letters long, if we chopped out a window of ten bases around those sites. Or we could be much more efficient, and capture much more interesting information, by computing for every position in that little window the probability that the letter is A, C, G, or T. The logos you see across the top use the height of each letter to represent the probability that the letter appears at that location. With this kind of representation we've essentially compressed the information from hundreds or even thousands of sequences into a simple pattern, which we can then use to process other data, for example to recognize these patterns when we see them again.
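A minimal sketch of that idea in R, using a handful of made-up donor-site windows (the real pictures in the lecture are built from many thousands of introns), computes the per-position probabilities directly:

    # Hypothetical ten-base windows around donor sites (made up for illustration)
    donor_windows <- c("CAGGTAAGTA", "AAGGTGAGTC", "CAGGTAAGGA", "TTGGTAAGTA")

    # Position probability matrix: P(A), P(C), P(G), P(T) at each of the 10 positions
    mat <- do.call(rbind, strsplit(donor_windows, ""))
    ppm <- apply(mat, 2, function(col)
      table(factor(col, levels = c("A", "C", "G", "T"))) / length(col))
    round(ppm, 2)   # each column sums to 1; tall probabilities correspond to tall logo letters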
One algorithm for delivering the mail would be this: the mailman drives his truck to the warehouse, picks up all the mail for, say, my house, drives over to my house, delivers the mail, and then goes back to the warehouse. Then he picks up the mail for my neighbor, drives to the neighbor's house, delivers it, and drives back, and so on. Now, that algorithm will definitely get the job done; it will get all the mail delivered to all the people who need it.
But, as most of you probably realized right away, if we're talking about more than a few houses, this mailman is not going to get much mail delivered in the course of a day. The problem is that the mail truck is going back and forth many more times than it needs to. Clearly it would be more efficient if the truck went to the warehouse, picked up the mail for my house and my neighbor's house, drove to my house, delivered my mail, then drove, or just walked if the houses are close together, to the next house, delivered the mail there, and only then went back to the warehouse.
In terms of algorithmic efficiency, this belongs to a class of problems sometimes called traveling salesman problems, where what you want to do is much more sophisticated than what I just described. You want to think about how much mail the truck can fit, fill the truck at the warehouse each time, and deliver all of that mail before going back. To do that efficiently, you'd like the truck to drive to as few places as possible, covering as little distance as possible, and getting as much mail delivered as possible in the same amount of time.
To do that, you really want to look at a map of all the houses in an area, and here's an overhead picture of a neighborhood. You want to look at an area and ask: if I can put the mail for a hundred houses, or a thousand houses, in the truck at once, can I find a route that lets the truck visit all those houses as efficiently as possible, delivering a little bit of mail at each house, and only at the end, when the truck is empty, going back to the warehouse to refill?
That's not the algorithm itself; it's an example of how we think about making an algorithm efficient. We look at what the computer is doing, we look at how often, for example, it goes back and forth to memory to retrieve data, and we ask: is there a way to do that in fewer steps? Is there a way to do it in less time and with less memory? That is mainly what we mean when we talk about efficient algorithms.
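To make the routing idea concrete, here is a small sketch in R of a greedy "nearest house next" heuristic (the coordinates are made up, and real traveling salesman solvers are far more sophisticated than this):

    # Made-up house coordinates; the warehouse is at (0, 0)
    houses <- data.frame(x = c(2, 5, 1, 4, 3), y = c(1, 2, 4, 3, 0))

    route <- integer(0)
    current <- c(0, 0)
    remaining <- seq_len(nrow(houses))
    while (length(remaining) > 0) {
      # distance from the current position to every unvisited house
      d <- sqrt((houses$x[remaining] - current[1])^2 +
                (houses$y[remaining] - current[2])^2)
      nxt <- remaining[which.min(d)]
      route <- c(route, nxt)
      current <- c(houses$x[nxt], houses$y[nxt])
      remaining <- setdiff(remaining, nxt)
    }
    route   # a visit order that greedily minimises each hop before returning to the warehouse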
When people talk about genomic data science, they often think about biology and computer
science, and I think statistics often ends up being the third wheel.
And so this lecture will hopefully motivate you as to why statistics is a very important component of genomic data science.
This is a really exciting result that came out in the journal Nature Medicine. The results suggested that it's possible to take genomic measurements and predict which chemotherapies are going to work for which people. This was an incredibly exciting result in genomic data science, because it was sort of the holy grail: using genomic measurements to personalize therapy, particularly personalized therapy for cancer.
Everybody was very excited about this, and groups at institutions all over the world tried to go back and reproduce the result. One of those groups was at MD Anderson Cancer Center, and it included two statisticians, Keith Baggerly and Kevin Coombes. Those statisticians tried to chase down all of the details and re-perform the analysis. They did this because their collaborators were really excited about it and actually wanted to use it at MD Anderson to tailor therapy.
But it turned out that there were all sorts of problems with the analysis, and they had trouble
getting a hold of the data.
And so because of these problems, they were actually unable to reproduce most of the
analysis.
And this ended up being a huge scandal in the world of genomic data science, because this very high-profile result, the one everybody was chasing after, turned out not to hold up once all the details were checked.
This is actually an ongoing saga. It started off as a discussion between the statisticians at MD Anderson and the group at Duke that performed the original analysis. Over time, they had a long series of interactions in which they tried to settle on the details of how the analysis was performed. It turned out that, due to a lack of transparency by the people who did the original analysis, clinical trials actually got started using this technology. They were assigning chemotherapy to people using what was, in effect, an incorrect data analysis, because the statistics weren't really well worked out.
This is so serious that there are now ongoing lawsuits between some of the people involved in those clinical trials, who had been assigned therapy, and Duke, the institution behind the creation of these signatures. So neglecting the statistical part of the genomic data science pipeline caused a major issue, one so big that it generated lawsuits.
This spurred an Institute of Medicine report, which laid out a whole new set of standards by which people should develop genomic data technologies.
And much of this report focused on statistical issues, reproducibility, how to build statistical
models, how to lock those statistical models down, and so forth.
So the first thing I hope to convince you of is that we should care about statistics.
And I've just got a couple of silly examples here. This is actually from a published abstract of a
paper.
In the abstract, you can see where I've highlighted that it says "insert statistical method here". The authors of this paper cared so little about the statistical analysis that they left in a placeholder instead of naming the statistical method they were using. This suggests where statistics tends to rank in people's minds when they're thinking about genomic data science.
And that sort of issue can cause major problems like we saw with the Potti scandal.
This is not just a problem in genomics; it's a more general one. Here is a flyer from Berkeley listing all the different areas that are applying data science these days. If you notice, statistics is listed, but with no application area. Again, this suggests that people don't necessarily think of statistics as something that's important for data science.
That lack of statistical thinking is a major contributor to problems in genomic data analysis, both at the level of major projects and at the level of individual investigators. So the question is, how do we change this perspective and make sure that people know that caring about the statistics is just as important as caring about the biology or the computer science when doing genomic data science?
So in that analysis, as you'll recall, they used genomic measurements to try to target chemotherapeutics, that is, to decide which chemotherapies would work best for which people. It boiled down to two specific reasons why that analysis went wrong.
Right from the start, the data and code used to perform that analysis were not made available. In other words, the paper was published, and people looking at the paper and trying to do the analysis themselves couldn't get hold of the raw data or the processed data, and they couldn't get hold of the actual computer code or statistical code used to perform the analysis.
So this is related to the idea of reproducibility. Can you actually re-perform the analysis that
someone else created in their original paper?
For the analysis in the Duke example I talked about, you couldn't get hold of any of that. Similarly, there was a lack of cooperation. This is not true in general, but in this particular case, not only were the code and data unavailable, the people in charge of them, the principal investigator and the lead authors on the study, were very reluctant to hand the data over to statisticians and others to take a look at.
Now, in every data analysis there are invariably some problems, some little issue that maybe people didn't notice. But if the data and code aren't available, and on top of that the people who performed the analysis aren't cooperative and aren't sharing them, then it can take a very long time to discover whether a data analysis has any serious problems, as it did in the case of the genomic signatures.
The second thing people noticed was a lack of expertise, and in this case the lack of expertise dealt specifically with statistics. One of the things they used were very silly prediction rules: rules in which probabilities were defined in ways that were not only not right but recognizably silly. Here's an example, a probability formula with a minus one-fourth in it, seemingly out of nowhere. Their prediction rules were based on probability definitions that not only weren't right, but were kind of silly. That relates to a lack of statistical expertise: the people developing the model hadn't taken a statistics class, or performed enough analyses, to recognize when they were doing something that was not only wrong but silly on its face.
Another thing is that they had major study design problems, which we'll talk about in a future lecture. They had what we call batch effects: they ran samples on different days, and the day a sample was run was related to which outcome it was likely to have. That's called a confounder, which we'll also discuss. These study design problems, present at the very beginning before any analysis was performed, set them up to fail, in the sense that the experimental design wasn't in place in a way that would allow them to do the analysis they were hoping to do.
Finally, the predictions weren't locked down. They had prediction rules that stated which chemotherapy should be applied to which person on the basis of the genomic measurements. But because the prediction rules had a random component, the probability of being assigned to one treatment might be one number on one day, and on another day, running the exact same algorithm with the exact same code, you would get a totally different prediction of which chemotherapy a person should get. This wasn't due to changes in the data or the statistical algorithm; it was due just to the day on which the algorithm was run. And obviously, if you're running a clinical trial, you don't want people assigned to therapies based on random chance.
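A small illustration in R of what "not locked down" means (entirely made up, not the actual Duke prediction rule): a prediction with an unseeded random component gives different answers on different runs, even with identical input.

    # Hypothetical prediction rule with a random component (illustrative only)
    predict_treatment <- function(score) {
      prob <- 1 / (1 + exp(-score)) + runif(1, -0.1, 0.1)   # random jitter: not locked down
      if (prob > 0.5) "chemo A" else "chemo B"
    }

    predict_treatment(0.1)   # may say "chemo A" today...
    predict_treatment(0.1)   # ...and "chemo B" tomorrow, with the exact same input

    # Locking the rule down (removing or seeding the random component) makes it reproducible
    set.seed(42); predict_treatment(0.1)
    set.seed(42); predict_treatment(0.1)   # always the same answer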
These are all issues where there was a lack of statistical expertise, and it turns out the analysis was ultimately totally reproducible: the statisticians at MD Anderson were able to chase down all the details and put out all of the code and data originally used in the paper. The problem is that the data analysis was just wrong, and the reason it was wrong was a severe lack of statistical expertise among the people performing the study.
So I hope I've shown that having statistical expertise can help you avoid these problems before they even start, at the level of experimental design and at the level of creating correct statistical procedures.
This lecture is about experimental design, and in particular about sample size and variability.
If you remember from the previous lecture, the central dogma of statistics is that we have a big population, and it's expensive to take whatever measurement we want, genomic or otherwise, on the whole population, so we take a sample using probability. Then on that sample we make our measurements and use statistical inference to say something about the population.
We talked a little about how the best guess we get from our sample isn't all we get; we also get an estimate of variability. So let's talk about variability and its relationship to good experimental design.
There's a sample size formula you may have heard of: if N is the number of measurements you can take, or the number of people you can sample, and you're doing scientific research where you often have to ask for grant money, then N ends up being the number of dollars you have divided by the cost of making one measurement. While this is one way to arrive at a sample size, it's maybe not the best way.
The real idea behind sample size is basically to understand variability in the population.
Here's a quick example of what I mean by that. Here are two synthetic, made-up data sets, one called Y and one called X. The plot shows the measurement values on the x-axis, with the two data sets, Y and X, on the y-axis. I've drawn two lines: the red line is the mean of the Y values and the blue line is the mean of the X values. What you can see is that the means are different from each other, but there's also quite a bit of variability around those means; some measurements are lower, some are higher, and the two data sets overlap.
So the idea is: if the two means are different, how confident can we be about that? Given the variation around the measurements we've taken and the means we have, how confident can we be that these two means are different from each other? This comes down to how many samples you need to collect, and how much variability you observe, to be able to say whether the two things are different or not.
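A small simulation in R in the spirit of that X/Y example (the numbers are made up): two groups whose means differ by 5 but whose measurements overlap considerably.

    set.seed(1)
    x <- rnorm(20, mean = 10, sd = 10)   # made-up "X" measurements
    y <- rnorm(20, mean = 15, sd = 10)   # made-up "Y" measurements

    mean(x); mean(y)   # the sample means differ...
    t.test(x, y)       # ...but the variability around them determines how confident we can be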
The way people do this in advance, as part of experimental design, is with power. Power is the probability that, if there's a real effect in the data set, you'll be able to detect it. It depends on a few different things: the sample size, how different the means are between the two groups (like the red and blue lines we saw), and how variable the data are (we saw there was variation around the means in both the X and Y data sets).
This is code from the R statistical programming language; you don't have to worry about the code in this lecture. You can just see that, for example, if we want to do a t-test comparing the two groups, which is a certain kind of statistical test, the probability that we'll detect an effect of size 5 (that's the delta there) with a standard deviation of 10 in each group and 10 samples per group is about 18%.
So even if there's an effect, it's not very likely we'll detect it. But you can also run the calculation the other way: say, as is customary, we want 80% power, in other words an 80% chance of detecting an effect if it's really there. For an effect size of 5 and a standard deviation of 10, we can calculate back out how many samples we need to collect. In this case, doing the calculation, we see we need 64 samples in each group to have an 80% chance of detecting this particular effect.
Similarly, you can do the calculation asking how many samples you need in each group if you're only going to test in one direction. Suppose I know that expression levels will always be higher in the cancer samples than in the control samples. Then it's possible to collect fewer samples and still get the same power, because you actually have a little bit more information.
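The lecture doesn't reproduce the R code itself, but a sketch of the kind of calculation it describes can be done with power.t.test from base R; the numbers line up with those quoted above.

    # Power for n = 10 per group, effect size 5, standard deviation 10 (two-sided t-test)
    power.t.test(n = 10, delta = 5, sd = 10)             # power is about 0.18

    # Turn it around: sample size needed per group for 80% power
    power.t.test(power = 0.8, delta = 5, sd = 10)        # about 64 per group

    # One-sided test (we know which direction the effect goes): fewer samples needed
    power.t.test(power = 0.8, delta = 5, sd = 10, alternative = "one.sided")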
Later classes, and statistics classes, will talk more about power and how to calculate it. The basic idea to keep in mind is that power is actually a curve. It's never just one number, even though you might hear 80% thrown around quite a bit when talking about power; the idea is that there is a curve.
In this plot, I'm showing on the x-axis all the different potential sizes of an effect: it could be 0, at the center of the plot, or very high or very low. On the y-axis is power, for different sample sizes. The black line corresponds to a sample size of 5, the blue line to a sample size of 10, and the red line to a sample size of 20.
As you move out from the center of the plot, the power goes up: the bigger the effect, the easier it is to detect. Also, as the sample size goes up, from the black to the blue to the red curve, you get more power. As you vary these parameters you get different power, so a power calculation is a hypothetical calculation based on what you think the effect size might be and what sample size you can get.
So it's important, before performing a study, to pay attention to the power you might have, so that you don't run the study and end up at the end of the day unable to detect a difference even when there might have been one there.
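A sketch in R of how such a power curve could be drawn (the standard deviation of 1 here is an assumption; the lecture's exact plot settings aren't given):

    # Power as a function of effect size for sample sizes of 5, 10 and 20 per group
    effects <- seq(0.1, 3, by = 0.1)
    pw <- sapply(c(5, 10, 20), function(n)
      sapply(effects, function(d) power.t.test(n = n, delta = d, sd = 1)$power))

    plot(effects, pw[, 3], type = "l", col = "red", ylim = c(0, 1),
         xlab = "effect size", ylab = "power")
    lines(effects, pw[, 2], col = "blue")
    lines(effects, pw[, 1], col = "black")
    # bigger effects and bigger sample sizes both push the curve toward a power of 1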
There are three types of variability. We've been talking about variability in terms of the sampling variability you get when you take a sample and ask how it relates to the population, but there are actually three kinds that are commonly measured or considered when performing experiments in genomics.
The variability of a genomic measurement can be broken down into three types. The first is phenotypic variability: imagine you're doing a comparison between cancers and controls; there is variability in the genomic measurements between the cancer patients and the control patients. This is often the variability we care about, because we want to detect differences between groups.
There's also measurement error: everything genomic technologies measure, whether it's gene expression, methylation, or the alleles we measure in a DNA study, is measured with error, so we have to take into account how well the machine actually measures the reads, how well we quantify the reads, and so forth.
There's also a component of variation that often gets ignored or missed, which is natural biological variation.
For every kind of genomic measurement we take, there's natural variation between people. Even if you have two people who are healthy and have the same phenotypes in every possible way, the same sex, the same age, they eat the same breakfast, there is still going to be variation between them, and that natural biological variability has to be accounted for when performing statistical modeling as well.
An important consideration is that when there's a new technology, there's often a rush to claim that it's much better than the previous technology. One way people do that is by saying that the variability is much lower. That may be true for the technical, or measurement error, component of variability, but it doesn't eliminate biological variability.
Here I'm showing an example of that. There are four plots in this picture. The top two show data collected using next generation sequencing; the bottom two show data collected with microarrays, an older technology. Each dot corresponds to the same sample, so it's the same samples in all four plots.
What you can see is that for the gene on the left, the one colored pink, there's lower variability across people, and this is true whether you measure it on the top with sequencing or on the bottom with arrays. Similarly, the gene on the right, colored blue, is highly variable whether measured with sequencing or with arrays.
What this suggests is that biological variation is a natural phenomenon that is always a component of genomic data, and it does not get eliminated by better technology.
So what we've talked about here is variability, sample size calculations, and how they relate. One of the most important components of statistics is paying attention to the variation that exists both in your sample and in the population you measure.