Conditional probability is the sine qua non of data science and
statistics. There are many useful explanations and examples of
conditional probability and Bayes’ Theorem. In this article, I will
explain the background of Bayes’ Theorem with examples that use only
simple math.
Bayes’ Theorem looks simple when written as an equation:
P(A|B) = P(B|A)P(A)/P(B)
Here P(A|B) is the probability of A given that B has occurred, and
P(B) must be greater than zero.
In data science, remembering the equation matters less than knowing
how to apply it to a problem stated in words. So, I will solve a simple
conditional probability problem both with logic and with Bayes’
Theorem.
Problem 1:
Let’s work on a simple NLP problem with Bayes’ Theorem. By using
NLP, I can detect spam e-mails in my inbox. Assume that the word
‘offer’ occurs in 80% of the spam messages in my account. Also, let’s
assume ‘offer’ occurs in 10% of my desired e-mails. If 30% of the
received e-mails are spam and I receive a new message that contains
‘offer’, what is the probability that it is spam?
Now, assume that I received 100 e-mails. Spam makes up 30% of all
e-mails, so I have 30 spam e-mails and 70 desired e-mails. The word
‘offer’ occurs in 80% of the spam e-mails, that is, in 80% of 30
e-mails, which makes 24. So, of the 30 spam e-mails, 24 contain
‘offer’ and 6 do not.
The word ‘offer’ occurs in 10% of the desired e-mails. It means 7 of
them (10% of 70 desired e-mails) contain the word ‘offer’ and 63 of
them do not.
Now, we can see this logic in a simple chart.
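[Chart: 100 e-mails split into 30 spam (24 with ‘offer’, 6 without)
and 70 desired (7 with ‘offer’, 63 without). Image by author.]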
The question was: what is the probability that a mail is spam, given
that it contains the word ‘offer’?
1. Find the total number of mails that contain ‘offer’:
24 + 7 = 31 mails contain the word ‘offer’
2. Find the probability of spam if the mail contains ‘offer’:
Of these 31 mails, 24 are spam, so 24/31 = 0.774 (a probability of
77.4%)
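The counting steps above can be sketched in a few lines of Python. This is only an illustration: the function name `posterior_from_counts` and its parameter names are my own, and the inbox of 100 e-mails is the hypothetical one used in the text.

```python
def posterior_from_counts(total, spam_rate, offer_in_spam, offer_in_desired):
    """Count-based estimate of P(spam | 'offer') for a hypothetical inbox."""
    spam = total * spam_rate                          # 30 of 100 e-mails
    desired = total - spam                            # 70 of 100 e-mails
    spam_with_offer = spam * offer_in_spam            # 24 spam e-mails with 'offer'
    desired_with_offer = desired * offer_in_desired   # 7 desired e-mails with 'offer'
    all_with_offer = spam_with_offer + desired_with_offer  # 31 e-mails with 'offer'
    return spam_with_offer / all_with_offer


result = posterior_from_counts(100, 0.30, 0.80, 0.10)
print(f"{result:.3f}")  # 0.774
```

Changing `total` does not change the result, because only the proportions matter.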
NOTE: In this example, I chose percentages that give integers after
the calculation. As a general approach, you can think of the 100 items
at the beginning as 100 units, so a non-integer result does not create
a problem. For example, we cannot say 15.3 e-mails, but we can say
15.3 units.
Solution with Bayes’ Equation:
A = Spam
B = Contains the word ‘offer’
P(spam|contains offer) = P(contains offer|spam)*P(spam)/P(contains offer)
P(contains offer|spam) = 0.8 (given in the question)
P(spam) = 0.3 (given in the question)
Next, we find the probability that an e-mail contains the word ‘offer’.
We can compute it by adding the share of all e-mails that are spam
with ‘offer’ to the share that are desired e-mails with ‘offer’:
P(contains offer) = 0.3*0.8 + 0.7*0.1 = 0.31
P(spam|contains offer) = 0.8*0.3/0.31 = 0.774
Both approaches give the same result. In the first part, I solved the
question with a simple chart, and in the second part, I solved it with
Bayes’ Theorem.
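The same calculation with Bayes’ equation can be written as a small Python function. This is a minimal sketch: `bayes_posterior` and its argument names are hypothetical names chosen for this illustration, and P(B) is expanded over the two cases "A" and "not A", as in the spam example.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# A = spam, B = contains the word 'offer'
p_spam_given_offer = bayes_posterior(
    prior=0.3,                 # P(spam)
    likelihood=0.8,            # P(contains offer | spam)
    likelihood_given_not=0.1,  # P(contains offer | desired)
)
print(f"{p_spam_given_offer:.3f}")  # 0.774
```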
Problem 2:
I want to solve one more example from a popular topic: Covid-19. As
you know, Covid-19 tests are common nowadays, but some test results
are wrong. Let’s assume a diagnostic test is 99% accurate, both for
people who have Covid-19 and for people who do not, and that 60% of
all people have Covid-19. If a patient tests positive, what is the
probability that they actually have the disease?
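[Chart: 100 units split into 60 with Covid-19 (59.4 test positive, 0.6
test negative) and 40 without (0.4 test positive, 39.6 test negative).
Image by author.]
The total units which have positive results = 59.4 + 0.4 = 59.8
Of these 59.8 units, 59.4 are true positives, so 59.4/59.8 = 0.993
(a probability of 99.3%)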
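With Bayes’ Theorem:
P(covid19|positive) = P(positive|covid19)*P(covid19)/P(positive)
P(positive|covid19) = 0.99
P(covid19) = 0.6
P(positive) = 0.6*0.99 + 0.4*0.01 = 0.598
P(covid19|positive) = 0.99*0.6/0.598 = 0.993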
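As a rough check, the unit counts for the test example can be reproduced in Python. The sketch assumes that "99% accuracy" means the test is right for 99% of infected people and for 99% of non-infected people, which is what the 0.4*0.01 term above implies. The function and parameter names (`positive_test_breakdown`, `sensitivity`, `specificity`) are my own labels, not terms from the problem statement.

```python
def positive_test_breakdown(prevalence, sensitivity, specificity, units=100):
    """Split the positive results into true and false positives.

    Returns (true_positives, false_positives, P(disease | positive)).
    """
    infected = units * prevalence                 # 60 units
    healthy = units - infected                    # 40 units
    true_pos = infected * sensitivity             # 59.4 units
    false_pos = healthy * (1 - specificity)       # 0.4 units
    return true_pos, false_pos, true_pos / (true_pos + false_pos)


true_pos, false_pos, p = positive_test_breakdown(0.6, 0.99, 0.99)
print(f"{true_pos:.1f} {false_pos:.1f} {p:.3f}")  # 59.4 0.4 0.993
```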
Again, we find the same answer as with the chart. There are many
other examples for learning the applications of Bayes’ Theorem, such
as the Monty Hall problem, a little puzzle with 3 doors. Behind the
doors, there are 2 goats and 1 car. You are asked to select one door to
find the car. After you select a door, the host opens one of the doors
you did not select and reveals a goat. Then, you are asked whether you
want to switch doors or stick with your first choice. By simulating
this process a thousand times, you can estimate the probability of
winning and get a feel for Bayes’ Theorem and Bayesian statistics in
general.
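One possible way to run such a simulation is sketched below in Python. The function names `play` and `win_rate`, the door numbering and the fixed seed are assumptions made for this illustration. The sketch also assumes the standard rules: the host always opens a door that hides a goat and that you did not pick.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=1000, seed=0):
    """Fraction of rounds won over the given number of simulated rounds."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


print("switch:", win_rate(switch=True))
print("stick: ", win_rate(switch=False))
```

With enough trials, the win rate should come out close to 2/3 when switching and close to 1/3 when sticking with the first choice. These are the known exact probabilities for the puzzle.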