Course Notes For Unit 2 of The Udacity Course ST101 Introduction To Statistics
Contents
Probability
Conditional Probability
Bayes Rule
Programming Bayes Rule (optional)
Correlation vs. Causation
Answers
Probability
In this unit we will be talking about probability. In a sense, probability is just the opposite of statistics. Put differently, in statistics we are given data and try to infer possible causes, whereas in probability we are given a description of the causes and we try to predict the data.
The reason that we are studying probability rather than statistics is that it will give us a language to describe the relationship between data and the underlying causes.
Flipping coins
Flipping a coin creates data. Each flip of the coin will result in either a head or a tail result. A fair coin is one that has a 50% chance of coming up heads and a 50% chance of coming up tails. Probability is a method for describing the anticipated outcomes of these coin flips.
Fair Coin
The probability of a coin coming up heads is written using the notation P(Heads). For a fair coin, the chance of a flip coming up heads is 50%. In probability, this is given as a probability of 0.5:

P(Heads) = 0.5

A probability of 1 means that the outcome will always happen. A probability of 0 means that it will never happen. For a fair coin, the probability that a flip will come up tails is:

P(Tails) = 0.5

The sum of the probabilities of all possible outcomes is always 1, so:

P(Heads) + P(Tails) = 1
Loaded Coin
A loaded coin is one that comes up with one outcome much more frequently than the other.
Complementary Outcomes
If we know the probability of an outcome, A, then the probability of the opposite outcome, ¬A (not A), is given by:

P(¬A) = 1 - P(A)

This is a very basic law of probability.
Two Flips
So what happens when we flip the same, unbiased, coin twice? What is the probability of getting two heads in a row, assuming that P(H) = 0.5? We can derive the answer to this type of problem using a truth table. A truth table enumerates every possible outcome of the experiment. In this case:

Flip-1:       H     H     T     T
Flip-2:       H     T     H     T
Probability:  0.25  0.25  0.25  0.25
In the case of two coin flips, there are four possible outcomes, and because heads and tails are equally likely, each of the four outcomes is equally likely. Since the total probability must equal 1, the probability of each outcome is 0.25. Another way to consider this is that the probability that we will see a head, followed by another head is the product of the probabilities of the two events: P(H, H) = P(H) x P(H) = 0.5 x 0.5 = 0.25 So what happens if the coin is loaded? Well, if the probability of getting a head, P(H), is 0.6, then the probability that we will see a tail, P(T), is going to be 0.4, and the truth table will be:
Flip-1:       H     H     T     T
Flip-2:       H     T     H     T
Probability:  0.36  0.24  0.24  0.16

Notice that the total probability is still 1: 0.36 + 0.24 + 0.24 + 0.16 = 1. The truth table lists all possible outcomes, so the sum of the probabilities will always be 1.
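As a quick check, here is a minimal Python sketch (not part of the original notes) that enumerates this two-flip truth table for the loaded coin and confirms that the four probabilities come from the product rule and sum to 1:

# A minimal sketch: enumerate the truth table for two independent flips
# of the loaded coin described above (P(H) = 0.6, P(T) = 0.4).
from itertools import product

p = {'H': 0.6, 'T': 0.4}

# Each two-flip outcome has probability P(flip1) * P(flip2), by independence.
table = {(f1, f2): p[f1] * p[f2] for f1, f2 in product('HT', repeat=2)}

print(table[('H', 'H')])    # 0.36
print(sum(table.values()))  # 1.0 (up to floating-point rounding)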
One Head
The truth table can get more interesting when we ask different questions. Suppose we flip the coin twice, but what we care about is whether exactly one of the two flips comes up heads. For a fair coin, where P(H) = 0.5, the probability is:

P(Exactly one H) = 0.5

We can see this from the truth table: there are exactly two outcomes with exactly one head (HT and TH), each with probability 0.25, so the answer is 0.25 + 0.25 = 0.5.

Flip-1:       H     H     T     T
Flip-2:       H     T     H     T
Probability:  0.25  0.25  0.25  0.25
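A small sketch of the same calculation, assuming the fair coin above: enumerate the truth table and add up the probabilities of the outcomes containing exactly one head.

# Sum the probabilities of the two-flip outcomes with exactly one head.
from itertools import product

p = {'H': 0.5, 'T': 0.5}
outcomes = list(product('HT', repeat=2))

p_one_head = sum(p[f1] * p[f2] for f1, f2 in outcomes
                 if (f1, f2).count('H') == 1)
print(p_one_head)  # 0.5 (the outcomes HT and TH, each with probability 0.25)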
Doubles Quiz
Suppose we throw a fair die twice. What is the probability that we throw the same number on each throw (i.e. a double)?
Summary
In this section we learned that if we know the probability of an event, P(A), the probability of the opposite event is just 1 - P(A). We also learned about composite events, where the probability of n independent repetitions of the event is the product:

P(A) x P(A) x ... x P(A)

Technically, treating composite events this way assumes independence. This just means that the outcome of the second coin flip does not depend on the outcome of the first. In the next section we will look at dependence.
Conditional Probability
In real life, things depend on each other. For example, people can be born smart or dumb. For simplicity, let's assume that whether they're born smart or dumb is just nature's equivalent of the flip of a coin. Now, whether they become a Stanford professor is not entirely independent of their intelligence. In general, becoming a Stanford professor is not very likely; the probability may only be 0.0001, but it also depends on their intelligence. If they are born smart, the probability may be higher.
In the previous section, subsequent events like coin tosses were independent of what had happened before. We are now going to look at some more interesting cases where the outcome of the first event does have an impact on the probability of the outcome of the second.
Cancer Example
Let's suppose that there is a patient who may be suffering from cancer. Let's say that the probability of a person getting this cancer is 0.1:

P(Cancer) = 0.1
P(¬Cancer) = 0.9

Now, we don't know whether the person actually has cancer, but there is a blood test that we can give. The outcome of the test may be positive, or it may be negative, but like any good test, it tells us something about the thing we really care about, in this case whether or not the person has cancer. Let's say that the probability of a positive test when a person has cancer is 0.9:

P(Positive | Cancer) = 0.9
P(Negative | Cancer) = 0.1

The sum of the probabilities of the possible test outcomes, for a person with cancer, is always equal to 1. P(Positive | Cancer) is called the sensitivity of the test. This notation says that the result of the test depends on whether or not the person has cancer; this is known as a conditional probability. In order to fully specify the test, we also need to specify the probability of a positive test in the case of a person who doesn't have cancer. In this case, we will say that this is 0.2:

P(Positive | ¬Cancer) = 0.2
P(Negative | ¬Cancer) = 0.8

P(Negative | ¬Cancer) = 0.8 is the specificity of the test. We now have all the information we need to derive the truth table:

Cancer   Test       Probability
Y        Positive   0.1 x 0.9 = 0.09
Y        Negative   0.1 x 0.1 = 0.01
N        Positive   0.9 x 0.2 = 0.18
N        Negative   0.9 x 0.8 = 0.72
                    Total     = 1.0
We can now use the truth table to find the probability that we will see a positive test result:

P(Positive) = 0.09 + 0.18 = 0.27
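Here is a minimal Python sketch (not part of the original notes) that rebuilds the truth table above and recovers the same total probability of a positive test:

# Joint probabilities for the cancer example: P(C) = 0.1, sensitivity 0.9,
# false-positive rate P(+ | not C) = 0.2.
p_cancer = 0.1
p_pos_given_cancer = 0.9
p_pos_given_no_cancer = 0.2

joint = {
    ('Y', 'Positive'): p_cancer * p_pos_given_cancer,
    ('Y', 'Negative'): p_cancer * (1 - p_pos_given_cancer),
    ('N', 'Positive'): (1 - p_cancer) * p_pos_given_no_cancer,
    ('N', 'Negative'): (1 - p_cancer) * (1 - p_pos_given_no_cancer),
}

p_positive = joint[('Y', 'Positive')] + joint[('N', 'Positive')]
print(p_positive)  # 0.27 (up to rounding): the total probability of a positive test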
Total Probability
Let's put this into mathematical notation. We were given the probability of having cancer, P(C), from which we were able to derive the probability of not having cancer:

P(¬C) = 1 - P(C)

We also had the two conditional probabilities, P(+ | C) and P(+ | ¬C), and from these we were able to derive the probabilities of a negative test:

P(- | C) = 1 - P(+ | C)
P(- | ¬C) = 1 - P(+ | ¬C)

Then, the probability of a positive test result was:

P(+) = P(C) x P(+ | C) + P(¬C) x P(+ | ¬C)

This is known as total probability. Let's consider another example.
Two Coins
Imagine that we have a bag containing two coins. We know that coin 1 is fair and coin 2 is loaded, so that:

P1(H) = 0.5 and P1(T) = 0.5
P2(H) = 0.9 and P2(T) = 0.1

We now pick a coin from the bag; each coin has an equal probability of being picked. We then flip the coin once.
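By total probability, the chance that this single flip comes up heads combines the two coins, weighted by the chance of picking each. The sketch below is my own illustration of that sum, not taken from the notes:

# Total probability of heads when a coin is picked at random and flipped once:
# P(H) = P(coin1) * P1(H) + P(coin2) * P2(H)
p_coin1, p_coin2 = 0.5, 0.5
p1_heads, p2_heads = 0.5, 0.9

p_heads = p_coin1 * p1_heads + p_coin2 * p2_heads
print(p_heads)  # 0.7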
Bayes Rule
In this section, we introduce what may be the Holy Grail of probabilistic inference: Bayes Rule. The rule is based on work by Reverend Thomas Bayes, who used the principle to infer the existence of God. In doing so, he created a new family of methods that have vastly influenced artificial intelligence and statistics. Let's think about the cancer example from the previous section. Say that there is a specific cancer that occurs in 1% of the population. There is a test for this cancer that has a 90% chance of a positive result if the person has cancer. The specificity of the test is 90%, i.e. there is a 90% chance of a negative test result if the person doesn't have cancer:

P(C) = 0.01
P(+ | C) = 0.9
P(- | ¬C) = 0.9

So, here is the question. What is the probability that a person has cancer, given that they have had a positive test?
Only 1% of the people have this cancer; 99% are cancer-free. The test catches 90% of those who have the cancer, which is 90% of the cancer circle. But the test can also give a positive result when the person doesn't have cancer: in our case a false positive occurs in 10% of the cancer-free cases. The remaining area represents people who don't have the cancer and get a negative result from the test. In fact, the region of people who have cancer and test positive is only about 8.3% of the total area representing a positive test result. So a positive test has only raised the probability that the person has cancer by a factor of about 8. This is the basis of Bayes Rule. We start with some prior probability before we run the test, and then we get some evidence from the test itself, which leads us to what is known as a posterior probability.
In our example, we have the prior probability, P(C), and we obtain the posterior probabilities as follows. First we calculate what are known as the joint probabilities:

P(C, pos) = P(C) x P(pos | C)
P(¬C, pos) = P(¬C) x P(pos | ¬C)

Given the values in our example we get:

P(C, pos) = 0.01 x 0.9 = 0.009
P(¬C, pos) = 0.99 x 0.1 = 0.099

These values are non-normalised: they do not sum to 1. In terms of our diagram above, they are the absolute areas of the regions representing a positive test.
We obtain the posterior probabilities by normalising the joint probabilities. To do this, we divide each of the joint probabilities by the probability of a positive test result:

P(pos) = P(C, pos) + P(¬C, pos)

So the posterior probabilities are:

P(C | pos) = P(C) x P(pos | C) / P(pos)
P(¬C | pos) = P(¬C) x P(pos | ¬C) / P(pos)
Let's say we get a positive test. We have a prior probability, and a test with a given sensitivity and specificity:

Prior: P(C)
Sensitivity: P(pos | C)
Specificity: P(neg | ¬C)

We multiply the prior P(C) by the sensitivity, and the prior P(¬C) by P(pos | ¬C), which is 1 minus the specificity. This gives us, for each of the cases (cancer or non-cancer), a number that combines the cancer hypothesis with the test result. We add these numbers (normally, they do not add up to 1) to get the total probability of a positive test. Now all we need to do to obtain the posterior probabilities is to normalise the two numbers by dividing each by the total probability, P(pos).
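The sketch below is my own illustration of the algorithm just described (not taken from the course materials), packaged as a small Python function and applied to the example at the start of this section:

# Bayes Rule for a positive test result, with two hypotheses (cancer / no cancer).
def bayes_positive(prior, sensitivity, specificity):
    """Return (P(C | pos), P(not C | pos)) after a positive test."""
    joint_c = prior * sensitivity                   # P(C, pos)
    joint_not_c = (1 - prior) * (1 - specificity)   # P(not C, pos)
    normaliser = joint_c + joint_not_c              # P(pos)
    return joint_c / normaliser, joint_not_c / normaliser

# Example from the start of the section: P(C) = 0.01, sensitivity 0.9,
# specificity 0.9, so P(pos | not C) = 0.1.
print(bayes_positive(0.01, 0.9, 0.9))  # approximately (0.083, 0.917)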
This is our algorithm for Bayes Rule, and we can follow an exactly analogous procedure for a negative test result, as shown below.
Let's work through an example. We start with our prior probability, sensitivity and specificity:

P(C) = 0.01
P(pos | C) = 0.9
P(neg | ¬C) = 0.9
We need the two joint probabilities:

P(C, neg) = P(C) x P(neg | C)        (the combined probability of having cancer and getting a negative test result)
P(¬C, neg) = P(¬C) x P(neg | ¬C)     (the combined probability of being cancer-free and getting a negative test result)
Normaliser Quiz
Calculate the normaliser, P(neg)
What is remarkable about the result is what the posterior probabilities actually mean. Before the test, we had a 1% chance of having cancer. After a negative test result this has gone down by about a factor of 9. Conversely, before the test there was a 99% chance that we were cancer-free. That number has now gone up to 99.89%, greatly increasing our confidence that we are cancer-free. Let's consider another example. In this case the prior probability, sensitivity and specificity are:

P(C) = 0.1
P(pos | C) = 0.9
P(neg | ¬C) = 0.5

So the sensitivity is high, but the specificity is much lower.
Bayes Rule then applies the algorithm we saw earlier to calculate the posterior probabilities for the variable given a test outcome:

Positive test: P(C, pos) = 0.1 x 0.9 = 0.09 and P(¬C, pos) = 0.9 x 0.5 = 0.45, so P(pos) = 0.54, giving P(C | pos) = 0.09 / 0.54 ≈ 0.167 and P(¬C | pos) ≈ 0.833.
Negative test: P(C, neg) = 0.1 x 0.1 = 0.01 and P(¬C, neg) = 0.9 x 0.5 = 0.45, so P(neg) = 0.46, giving P(C | neg) = 0.01 / 0.46 ≈ 0.022 and P(¬C | neg) ≈ 0.978.
Robot Sensing
Let's practice using Bayes Rule with a different example. Consider a robot living in a world that has exactly two places: a red place, R, and a green place, G.
Initially, the robot has no idea of its location, so the prior probabilities are P(R) = P(G) = 0.5. The robot has sensors that allow it to see its environment, but these sensors are somewhat unreliable:

P(see R | in R) = 0.8
P(see G | in G) = 0.8
Now suppose the world has three places, A, B and C, so the hidden variable has three states. We will assume that each place has the same prior probability:

P(A) = P(B) = P(C) = 1/3

The robot sees red, and we know that:

P(R | A) = 0.9
P(G | B) = 0.9
P(G | C) = 0.9

so P(R | B) = P(R | C) = 0.1. We can solve for the posterior probabilities exactly as before:

P(A, R) = P(A) x P(R | A) = 1/3 x 0.9 = 0.3
P(B, R) = P(B) x P(R | B) = 1/3 x 0.1 = 0.0333
P(C, R) = P(C) x P(R | C) = 1/3 x 0.1 = 0.0333

So, the normaliser is P(R) = 0.3 + 0.0333 + 0.0333 = 0.3667
which gives the posterior probabilities:

P(A | R) = 0.3 / 0.3667 ≈ 0.82
P(B | R) = 0.0333 / 0.3667 ≈ 0.09
P(C | R) = 0.0333 / 0.3667 ≈ 0.09
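A small sketch of my own (not from the notes) of the same normalisation step, written so that it works for any number of hidden states:

# Posterior probabilities over an arbitrary number of hidden states.
def posteriors(priors, likelihoods):
    """priors[i] = P(state i); likelihoods[i] = P(observation | state i)."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    normaliser = sum(joints)          # total probability of the observation
    return [j / normaliser for j in joints]

priors = [1/3, 1/3, 1/3]              # P(A), P(B), P(C)
likelihoods = [0.9, 0.1, 0.1]         # P(R | A), P(R | B), P(R | C)
print(posteriors(priors, likelihoods))  # approximately [0.82, 0.09, 0.09]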
Generalising
The last example showed that there may be more than two states of the hidden variable that we are interested in. There may be 3, 4, 5 or any other number. We can solve these cases using exactly the same methods, but we have to keep track of more values. In fact, there can be more than just two outcomes of the test. For example, the robot may see red, green or blue. This means that our measurement probabilities will be more elaborate, but the actual method for calculating the posterior probabilities will remain the same. We can now deal with very large problems, that have many possible hidden causes, by applying Bayes Rule to determine the posterior probabilities.
Correlation vs. Causation

The statement "The chances of dying in hospital are 40 times greater than dying at home" shows that there is a correlation between whether or not you die and whether or not you are in hospital. Whereas the statement "Being in a hospital increases your probability of dying by a factor of 40" is a causal statement: it says that being in hospital causes you to die, not just that being in hospital coincides with the fact that you die. People frequently get this wrong. They observe a correlation, but they suggest that the correlation is causal. To understand why this could be wrong, let's look a little deeper into our example.
Considering Health
Let's say that of the people in the hospital, 36 were sick and 4 of these died. The other 4 people in the hospital were actually healthy, and they all survived. Of the people who were at home, 40 were actually sick and 20 of these people died. The remaining 7960 were healthy, but 20 of these people also died (perhaps due to accidents etc.). These statistics are consistent with our earlier statistics; we have just added another variable, whether a person is sick or healthy. The numbers and percentages of people who died are tabulated below:

            In Hospital          At Home
            Sick     Healthy     Sick     Healthy
Total       36       4           40       7960
Died        4        0           20       20
Died (%)    11.11%   0%          50%      0.2513%
Now, if you are sick, your chances of dying at home are 50% compared with about 11% in the hospital, so you should really make your way to the hospital.
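As a quick check (my own sketch, not from the notes), the death rates in the table can be recomputed directly from the counts:

# Recompute the death rate in each (location, health) group.
groups = {
    ('hospital', 'sick'):    (36, 4),
    ('hospital', 'healthy'): (4, 0),
    ('home', 'sick'):        (40, 20),
    ('home', 'healthy'):     (7960, 20),
}
for group, (total, died) in groups.items():
    print(group, f"{100 * died / total:.2f}% died")
# hospital/sick 11.11%, hospital/healthy 0.00%, home/sick 50.00%, home/healthy 0.25%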
Correlation
So why does the hospital example lead us to draw such a wrong conclusion? We looked at two variables, being in hospital and the chance of dying, and we rightfully observed that these two things are correlated. If we made a scatter-plot with two categories, whether a person was in hospital and whether or not that person died, we would see an increased occurrence of data points as shown below:
This shows that the data is correlated. So what is correlation? Well, in any plot, data is correlated if knowledge about one variable tells us something about the other.
Correlation Quiz
Are the following data plots (A, B, C and D) correlated?
Causation Structure
In the example above, there is clearly a correlation between whether a person is in hospital, and whether or not they die. But we initially left out an important variable: whether or not a person was sick. In fact, it was the sickness that caused people to die. If we add arcs of causation to our diagram, we find that sickness causes death, and that sickness also causes people to go into hospital:
In fact, in our example, once a person knew that they were sick, being in a hospital negatively correlated with them dying. That is, being in a hospital made it less likely that they would die, given that they were sick. In statistics, we call this a confounding variable. It can be very tempting to just omit this from your data, but if you do, you might find correlations that have absolutely nothing to do with causation.
Fire Correlation
Suppose we study a number of fires, recording the number of fire-fighters sent and the size (surface area) of the fire:

# Fire-fighters:   10     40     200     70
Size of fire:      100    400    2000    700
Clearly, the number of fire-fighters is correlated with the size of the fire. But fire-fighters don't cause the fires! Getting rid of all the fire-fighters would not get rid of all the fires. This is actually a case of reverse causation: the size of the fire determines the number of fire-fighters that will be sent to deal with it.
However, it is impossible to know this just from the data. We only know that this is a case of reverse causation because we already know something about fire and firefighters.
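To make "correlated" concrete, here is a small sketch of my own (not from the notes) that computes the Pearson correlation coefficient for the fire data above; in this data set the two columns are perfectly correlated, even though one does not cause the other:

# Pearson correlation between the number of fire-fighters and the fire size.
firefighters = [10, 40, 200, 70]
fire_size = [100, 400, 2000, 700]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson(firefighters, fire_size))  # 1.0, but correlation is not causation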
Assignment
Check out old news articles, in newspapers or online, and find some that take data showing a correlation and then suggest causation from that data, or tell you what to do based on it. You will find that the news is full of such abuses of statistics.
Answers
Loaded Coin Quiz
P(Tails) = 1 - P(Heads) = 0.25
Doubles Quiz
The truth table has 36 possible outcomes, each with probability 1/36. Six of these outcomes are doubles, so the probability of a double is: 6 x 1/36 = 1/6 ≈ 0.167
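For completeness, a tiny sketch (not from the notes) that checks this by enumerating all 36 outcomes:

# Count the doubles among all outcomes of two dice throws.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
doubles = [o for o in outcomes if o[0] == o[1]]
print(len(doubles) / len(outcomes))  # 0.1666... = 1/6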
Two Coins Quiz
P(T, T) = 0.08
Normaliser Quiz
P(neg) = 0.01 x 0.1 + 0.99 x 0.9 = 0.892
Correlation Quiz
A. Yes   B. No   C. No   D. Yes