Principles-of-Data-Science-WEB-5

The document discusses measures of central tendency, including mode, mean, and median, and their applicability to qualitative and quantitative data. It highlights the influence of outliers on these measures, demonstrating that the median is often a better representation of the center when outliers are present. Additionally, it introduces measures of variation such as range, variance, and standard deviation, emphasizing their importance in understanding data dispersion.


3.1 Measures of Center

The mode can also be applied to non-numeric (qualitative) data, whereas the mean and the median can only
be applied for numeric (quantitative) data. For example, a restaurant manager might want to determine the
mode for responses to customer surveys on the quality of the service of a restaurant, as shown in Table 3.1.

Customer Service Rating    Number of Respondents
Excellent                  267
Very Good                  410
Good                       392
Fair                       107
Poor                       18

Table 3.1 Customer Survey Results for Customer Service Rating

Based on the survey responses, the mode is the Customer Service Rating of “Very Good,” since this is the data
value with the greatest frequency.
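
The mode calculation above can be sketched in Python. This is a minimal illustration that assumes the counts from Table 3.1 are stored in a dictionary; the variable names are chosen here for illustration and are not part of the original text.

```python
from statistics import mode

# Survey counts from Table 3.1 (rating -> number of respondents).
survey_counts = {
    "Excellent": 267,
    "Very Good": 410,
    "Good": 392,
    "Fair": 107,
    "Poor": 18,
}

# With summarized counts, the mode is the rating with the largest count.
mode_from_counts = max(survey_counts, key=survey_counts.get)

# With raw responses (one entry per respondent), statistics.mode works directly,
# even though the data values are not numeric.
raw_responses = [rating for rating, count in survey_counts.items() for _ in range(count)]
mode_from_raw = mode(raw_responses)

print(mode_from_counts, mode_from_raw)
```

Both approaches identify the same rating; neither a mean nor a median could be computed for these qualitative labels.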

Influence of Outliers on Measures of Center


As mentioned earlier, when outliers are present in a dataset, the mean may not represent the center of the
dataset, and the median will provide a better measure of center. The reason is that the median focuses on the
middle value of the ordered dataset. Thus, any outliers at the lower end of the dataset or any outliers at the
upper end of the dataset will not affect the median. Note: A formal method for identifying outliers is presented
in Measures of Position when measures of position are discussed. The following example illustrates the point
that the median is a better measure of central tendency when potential outliers are present.

EXAMPLE 3.5

Problem

Suppose that in a small company of 40 employees, one person earns a salary of $3 million per year, and the
other 39 individuals each earn $40,000. Which is the better measure of center: the mean or the median?

Solution

The mean, in dollars, is calculated as follows:

\[ \bar{x} = \frac{3{,}000{,}000 + 39(40{,}000)}{40} = \frac{4{,}560{,}000}{40} = 114{,}000 \]

However, the median would be $40,000 since this is the middle data value in the ordered dataset. There are
39 people who earn $40,000 and one person who earns $3,000,000.

Notice that the mean is not representative of the typical value in the dataset since $114,000 is not reflective
of the average salary for most employees (who are earning $40,000). The median is a much better measure
of the “average” than the mean in this case because 39 of the values are $40,000 and one is $3,000,000. The
data value of $3,000,000 is an outlier. The median result of $40,000 gives us a better sense of the center of
the dataset.
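
The comparison in Example 3.5 can be checked with Python's built-in statistics module. This is a minimal sketch; the list name salaries is an illustrative choice.

```python
from statistics import mean, median

# 39 employees earning $40,000 and one earning $3,000,000 (Example 3.5).
salaries = [40_000] * 39 + [3_000_000]

mean_salary = mean(salaries)      # pulled upward by the single outlier
median_salary = median(salaries)  # middle of the ordered data, unaffected by the outlier

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
```

Replacing the $3,000,000 salary with any other large value changes the mean but leaves the median at $40,000.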

Using Python for Measures of Center


We learned in What Are Data and Data Science? how the DataFrame.describe() method is used to
summarize data. Recall that describe() is defined for DataFrame objects, so it should be called on a
DataFrame variable (e.g., given a DataFrame d, use d.describe()).

Figure 3.2 shows the output of DataFrame.describe() on the “Movie Profit” dataset we used in What Are
Data and Data Science?, movie_profit.csv. The mean and 50% rows show the average and median of each
column. For example, the average worldwide gross earnings are $410.14 million, and the median earnings are
$309.35 million. Note that the average and median of some columns are less meaningful than others. The
first column, Unnamed: 0, was simply used as an identifier of each item in the dataset, so the average and
median of this column are not useful. DataFrame.describe() still computes these values because the column
is numeric; it has no way of knowing which columns are meaningful to summarize.

Figure 3.2 The Output of DataFrame.describe() with the Movie Profit Dataset
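
As a minimal sketch of how the mean and 50% rows can be read from the output of describe(), the following uses a small made-up DataFrame. The column names and values are illustrative assumptions and are not the contents of movie_profit.csv. The example assumes the pandas library is installed.

```python
import pandas as pd

# Hypothetical data: these values are NOT taken from movie_profit.csv.
movies = pd.DataFrame({
    "item_id": [0, 1, 2, 3, 4],                               # identifier only
    "worldwide_gross": [100.0, 200.0, 300.0, 400.0, 1000.0],  # millions of dollars
})

summary = movies.describe()

# The "mean" row holds the average; the "50%" row holds the median.
mean_gross = summary.loc["mean", "worldwide_gross"]
median_gross = summary.loc["50%", "worldwide_gross"]
print(mean_gross, median_gross)

# describe() also summarizes the identifier column, although its mean is not meaningful.
print(summary.loc["mean", "item_id"])
```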

EXPLORING FURTHER

Working with Python


See the Python website (https://openstax.org/r/python) for more details on using, installing, and working
with Python. See this additional documentation for more specific information on the statistics module
(https://openstax.org/r/docspython).

3.2 Measures of Variation


Learning Outcomes
By the end of this section, you should be able to:
• 3.2.1 Define and calculate the range, the variance, and the standard deviation for a dataset.
• 3.2.2 Use Python to calculate measures of variation for a dataset.

Providing some measure of the spread, or variation, in a dataset is crucial to a comprehensive summary of the
dataset. Two datasets may have the same mean but can exhibit very different spread, and so a measure of
dispersion for a dataset is very important. While measures of central tendency (like mean, median, and mode)
describe the center or average value of a distribution, measures of dispersion give insights into how much
individual data points deviate from this central value.

The following two datasets are the exam scores for a group of three students in a biology course and in a
statistics course.

Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71

Access for free at openstax.org



Notice that the mean score for both Dataset A and Dataset B is 70.

However, the datasets are significantly different from one another:

Dataset A has larger variability where one student scored 30 points below the mean and another student
scored 30 points above the mean.
Dataset B has smaller variability where the exam scores are much more tightly clustered around the mean of
70.

This example illustrates that publishing the mean of a dataset is often inadequate to fully communicate the
characteristics of the dataset. Instead, data scientists will typically include a measure of variation as well.

The three primary measures of variability are range, variance, and standard deviation, and these are described
next.

Range
Range is a measure of dispersion for a dataset that is calculated by subtracting the minimum from the
maximum of the dataset:

\[ \text{Range} = \text{maximum} - \text{minimum} \]

Range is a straightforward calculation but makes use of only two of the data values in a dataset. The range can
also be affected by outliers.

EXAMPLE 3.6

Problem

Calculate the range for Dataset A and Dataset B:

Dataset A: Exam scores for students in a biology course: 40, 70, 100
Dataset B: Exam scores for students in a statistics course: 69, 70, 71

Solution

For Dataset A, the maximum data value is 100 and the minimum data value is 40.
The range is then calculated as:

\[ \text{Range} = 100 - 40 = 60 \]

For Dataset B, the maximum data value is 71 and the minimum data value is 69.
The range is then calculated as:

\[ \text{Range} = 71 - 69 = 2 \]

The range clearly indicates that there is much less spread in Dataset B as compared to Dataset A.
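
The calculation in Example 3.6 can be sketched in Python as follows. The helper name data_range is an illustrative choice, not a function from a library.

```python
from statistics import mean

def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

# Both datasets share the same mean but have very different ranges.
print(mean(dataset_a), mean(dataset_b))
print(data_range(dataset_a), data_range(dataset_b))
```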

One drawback to the use of the range is that it doesn’t take into account every data value. The range only uses
two data values from the dataset: the minimum (min) and the maximum (max). Also the range is influenced by
outliers since an outlier might appear as a minimum or maximum data value and thus skew the results. For
these reasons, we typically use other measures of variation, such as variance or standard deviation.

Variance
The variance provides a measure of the spread of data values by using the squared deviations from the mean.
The more the individual data values differ from the mean, the larger the variance.

A financial advisor might use variance to determine the volatility of an investment and therefore help guide
financial decisions. For example, a more cautious investor might opt for investments with low volatility.

The formula used to calculate variance also depends on whether the data is collected from a sample or a
population. The notation \(s^2\) is used to represent the sample variance, and the notation \(\sigma^2\) is
used to represent the population variance.

Formula for the sample variance:

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

Formula for the population variance:

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \]

In these formulas:
\(x\) represents the individual data values
\(\bar{x}\) represents the sample mean
\(n\) represents the sample size
\(\mu\) represents the population mean
\(N\) represents the population size

ALTERNATE FORMULA FOR VARIANCE

An alternate formula for the variance is available. It is sometimes used for more efficient computations:

\[ s^2 = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1} \]

In the formulas for sample variance and population variance, notice the denominator for the sample variance
is \(n - 1\), whereas the denominator for the population variance is \(N\). Using \(n - 1\) in the
denominator of the sample variance provides the best estimate of the population variance, in the sense that
if repeated samples of size \(n\) are taken and the sample variance is computed each time, then the average
of those sample variances will tend to the population variance as the number of repeated samples increases.
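
To make the difference between the two denominators concrete, the sketch below computes both variances for Dataset A (40, 70, 100) by hand and compares them with the variance() and pvariance() functions from Python's statistics module. The helper names are illustrative.

```python
from statistics import mean, pvariance, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

dataset_a = [40, 70, 100]

# Squared deviations are 900, 0, and 900, so the sum of squares is 1800.
print(sample_variance(dataset_a), variance(dataset_a))        # divides by 2
print(population_variance(dataset_a), pvariance(dataset_a))   # divides by 3
```

The sample variance is always at least as large as the population variance computed from the same values, because it divides the same sum of squares by a smaller number.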

It is important to note that in many data science applications, population data is unavailable, and so we
typically calculate the sample variance. For example, if a researcher wanted to estimate the percentage of
smokers for all adults in the United States, it would be impractical to collect data from every adult in the United
States.

Notice that the sample variance is based on a sum of squared deviations, so its units of measurement are
the squares of the units of the original data. For example, if the data are measured in centimeters, the
variance is in square centimeters, which can be confusing to interpret. By contrast, standard deviation is
measured in the same units as the original dataset, and thus the standard deviation is more commonly used
to measure the spread of a dataset.

Standard Deviation
The standard deviation of a dataset provides a numerical measure of the overall amount of variation in a
dataset in the same units as the data; it can be used to determine whether a particular data value is close to or
far from the mean, relative to the typical distance from the mean.

The standard deviation is always positive or zero. It is small when the data values are all concentrated close to
the mean, exhibiting little variation, or spread. It is larger when the data values are spread out more from the
mean, exhibiting more variation. A smaller standard deviation implies less variability in a dataset, and a larger
standard deviation implies more variability in a dataset.

Suppose that we are studying the variability of two companies (A and B) with respect to employee salaries. The
average salary for both companies is $60,000. For Company A, the standard deviation of salaries is $8,000,
whereas the standard deviation for salaries for Company B is $19,000. Because Company B has a higher
standard deviation, we know that there is more variation in the employee salaries for Company B as compared
to Company A.

There are two different formulas for calculating standard deviation. Which formula to use depends on whether
the data represents a sample or a population. The notation \(s\) is used to represent the sample standard
deviation, and the notation \(\sigma\) is used to represent the population standard deviation. In the formulas
shown, \(\bar{x}\) is the sample mean, \(\mu\) is the population mean, \(n\) is the sample size, and \(N\) is
the population size.

Formula for the sample standard deviation:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Formula for the population standard deviation:

\[ \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \]

Notice that the sample standard deviation is calculated as the square root of the variance. This means that
once the sample variance has been calculated, the sample standard deviation can then be easily calculated as
the square root of the sample variance, as in Example 3.7.

EXAMPLE 3.7

Problem

A biologist calculates that the sample variance for the amount of plant growth for a sample of plants is 8.7
cm². Calculate the sample standard deviation.

Solution

The sample standard deviation (\(s\)) is calculated as the square root of the variance.

\[ s = \sqrt{s^2} = \sqrt{8.7} \approx 2.95 \text{ cm} \]

EXAMPLE 3.8

Problem

Assume the sample variance (\(s^2\)) for a dataset is calculated as 42.2. Based on this, calculate the sample
standard deviation.

Solution

The sample standard deviation (\(s\)) is calculated as the square root of the variance.

\[ s = \sqrt{s^2} = \sqrt{42.2} \approx 6.5 \]

This result indicates that the standard deviation is about 6.5 years.

Notice that the sample variance is the square of the sample standard deviation, so if the sample standard
deviation is known, the sample variance can easily be calculated.
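
The square-root relationship between variance and standard deviation can be checked in Python. This sketch reuses Dataset A from earlier in the section and the variances given in Examples 3.7 and 3.8; the variable names are illustrative.

```python
import math
from statistics import stdev, variance

dataset_a = [40, 70, 100]

sample_var = variance(dataset_a)  # sample variance (n - 1 denominator)
sample_sd = stdev(dataset_a)      # sample standard deviation

# The standard deviation is the square root of the variance.
print(sample_var, sample_sd, math.sqrt(sample_var))

# Examples 3.7 and 3.8: only the variance is known, so take its square root.
sd_plants = math.sqrt(8.7)        # variance of 8.7 square centimeters
sd_example_3_8 = math.sqrt(42.2)  # variance of 42.2
print(round(sd_plants, 2), round(sd_example_3_8, 1))
```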

USE OF TECHNOLOGY FOR CALCULATING MEASURES OF VARIABILITY

Due to the complexity of calculating variance and standard deviation, technology is typically utilized to
calculate these measures of variability. For example, refer to the examples shown in Coefficient of Variation
on using Python for measures of variation.

Coefficient of Variation
A data scientist might be interested in comparing the variation of datasets that have different units of
measurement or different means, and in these scenarios the coefficient of variation (CV) can be used. The
coefficient of variation measures the variation of a dataset by expressing the standard deviation as a
percentage of the mean:

\[ CV = \frac{s}{\bar{x}} \times 100\% \]

Note: The coefficient of variation is typically expressed as a percentage.

EXAMPLE 3.9

Problem

Compare the relative variability for Company A versus Company B using the coefficient of variation, based
on the following sample data:

Company A:

Company B:

Solution

Calculate the coefficient of variation for each company:

Company A exhibits more variability relative to the mean as compared to Company B.
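
The coefficient of variation (the sample standard deviation expressed as a percentage of the mean) can be sketched in Python as follows. Because the data for Example 3.9 are not reproduced here, this sketch instead uses the exam-score Datasets A and B from earlier in the section; the function name is an illustrative choice.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

dataset_a = [40, 70, 100]  # biology exam scores
dataset_b = [69, 70, 71]   # statistics exam scores

cv_a = coefficient_of_variation(dataset_a)
cv_b = coefficient_of_variation(dataset_b)
print(f"CV for Dataset A: {cv_a:.1f}%")
print(f"CV for Dataset B: {cv_b:.1f}%")
```

Because the CV is a ratio, it has no units, which is what allows datasets measured on different scales to be compared.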

Using Python for Measures of Variation


DataFrame.describe() also computes the standard deviation of each column of a dataset. The std row lists
the standard deviation of each column (see Figure 3.3).


Figure 3.3 The Output of DataFrame.describe() with the Movie Profit Dataset

3.3 Measures of Position


Learning Outcomes
By the end of this section, you should be able to:
• 3.3.1 Define and calculate percentiles, quartiles, and z-scores for a dataset.
• 3.3.2 Use Python to calculate measures of position for a dataset.

Common measures of position include percentiles and quartiles as well as z-scores, all of which are used to
indicate the relative location of a particular data point.

Percentiles
If a student scores 47 on a biology exam, it is difficult to know if the student did well or poorly compared to the
population of all other students taking the exam. Percentiles provide a way to assess and compare the
distribution of values and the position of a specific data point in relation to the entire dataset by indicating the
percentage of data points that fall below it. Specifically, a percentile is a value on a scale of one hundred that
indicates the percentage of a distribution that is equal to or below it. Let’s say the student learns they scored in
the 90th percentile on the biology exam. This percentile indicates that the student has an exam score higher
than 90% of all other students taking the test. This is the same as saying that the student’s score places the
student in the top 10% of all students taking the biology test. Thus, this student scoring in the 90th percentile
did very well on the exam, even if the actual score was 47.

To calculate percentiles, the data must be ordered from smallest to largest and then the ordered data divided
into hundredths. If you score in the 80th percentile on an aptitude test, that does not necessarily mean that
you scored 80% on the test. It means that 80% of the test scores are the same as or less than your score and
the remaining 20% of the scores are the same as or greater than your score.

Percentiles are useful for comparing many types of values. For example, a stock market mutual fund might
report that the performance for the fund over the past year was in the 80th percentile of all mutual funds in
the peer group. This indicates that the fund performed better than 80% of all other funds in the peer group.
This also indicates that 20% of the funds performed better than this particular fund.

To calculate the percentile for a specific data value in a dataset, first order the dataset from smallest to
largest and count the number of data values in the dataset. Locate the measurement of interest and count how
many data values fall below the measurement. Then the percentile for the measurement is calculated as follows:

\[ \text{Percentile} = \frac{\text{number of data values below the measurement}}{\text{total number of data values}} \times 100 \]

EXAMPLE 3.10

Problem

The following ordered dataset represents the scores of 15 employees on an aptitude test:

51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95

Determine the percentile for the employee who scored 88 on the aptitude test.

Solution

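There are 15 data values in total, and there are 10 data values below 88.

\[ \text{Percentile} = \frac{10}{15} \times 100 \approx 67 \]

The employee who scored 88 is at about the 67th percentile.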
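
The counting procedure in Example 3.10 can be sketched in Python. The function name percentile_of is an illustrative choice; it follows the "count the values below the measurement" rule described in this section, which may differ slightly from the interpolation rules used by some library functions.

```python
def percentile_of(value, data):
    """Percentage of data values that fall strictly below the given value."""
    below = sum(1 for x in data if x < value)
    return below / len(data) * 100

# Aptitude test scores for 15 employees (Example 3.10).
scores = [51, 63, 65, 68, 71, 75, 75, 77, 79, 82, 88, 89, 89, 92, 95]

percentile_88 = percentile_of(88, scores)
print(round(percentile_88))
```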

Quartiles
While percentiles separate data into 100 equal parts, quartiles separate data into quarters, or four equal
parts. To find the quartiles, first find the median, or second quartile. The first quartile, \(Q_1\), is the
middle value, or median, of the lower half of the data, and the third quartile, \(Q_3\), is the middle value of
the upper half of the data.

Note the following correspondence between quartiles and percentiles:

• The first quartile (\(Q_1\)) corresponds to the 25th percentile.
• The second quartile (\(Q_2\), which is the median) corresponds to the 50th percentile.
• The third quartile (\(Q_3\)) corresponds to the 75th percentile.

EXAMPLE 3.11

Problem

Consider the following ordered dataset, which represents the time in seconds for an athlete to complete a
40-yard run:

5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7

Solution

The median, or second quartile, is the middle value in this dataset, which is 7.2. Notice that 50% of the data
values are below the median, and 50% of the data values are above the median. The lower half of the data
values are 5.4, 6.0, 6.3, 6.8, 7.1. Note that these are the data values below the median. The upper half of the
data values are 7.4, 7.5, 7.9, 8.2, 8.7, which are the data values above the median.

To find the first quartile, \(Q_1\), locate the middle value of the lower half of the data (5.4, 6.0, 6.3, 6.8,
7.1). The middle value of the lower half of the dataset is 6.3. Notice that one-fourth, or 25%, of the data
values are below this first quartile, and 75% of the data values are above this first quartile.

To find the third quartile, \(Q_3\), locate the middle value of the upper half of the data (7.4, 7.5, 7.9, 8.2,
8.7). The middle value of the upper half of the dataset is 7.9. Notice that one-fourth, or 25%, of the data
values are above this third quartile, and 75% of the data values are below this third quartile.

Thus, the quartiles \(Q_1\), \(Q_2\), \(Q_3\) for this dataset are 6.3, 7.2, 7.9, respectively.
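
The median-of-halves procedure used in Example 3.11 can be sketched in Python. The function name quartiles is an illustrative choice, and the sketch assumes the dataset has at least two values. Library routines may use different interpolation conventions and so may not always return the same quartiles as this method.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the text."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Times in seconds for a 40-yard run (Example 3.11).
times = [5.4, 6.0, 6.3, 6.8, 7.1, 7.2, 7.4, 7.5, 7.9, 8.2, 8.7]

q1, q2, q3 = quartiles(times)
print(q1, q2, q3)
```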


The interquartile range (IQR) is a number that indicates the spread of the middle half, or the middle 50%, of
the data. It is the difference between the third quartile, \(Q_3\), and the first quartile, \(Q_1\).

\[ \text{IQR} = Q_3 - Q_1 \]

Note that the IQR provides a measure of variability that excludes outliers.

In Example 3.11, the IQR can be calculated as:

\[ \text{IQR} = 7.9 - 6.3 = 1.6 \]

Quartiles and the IQR can be used to flag possible outliers in a dataset. For example, if most employees at a
company earn about $50,000 and the CEO of the company earns $2.5 million, then we consider the CEO’s
salary to be an outlier data value because this salary is significantly different from all the other salaries in the
dataset. An outlier data value can also be a value much lower than the other data values in a dataset, so if one
employee only makes $15,000, then this employee’s low salary might also be considered an outlier.

To detect outliers, you can use the quartiles and the IQR to calculate a lower and an upper bound for outliers.
Then any data values below the lower bound or above the upper bound will be flagged as outliers. These data
values should be further investigated to determine the nature of the outlier condition and whether the data
values are valid or not.

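To calculate the lower and upper bounds for outliers, use the following formulas:

\[ \text{Lower bound} = Q_1 - 1.5 \times \text{IQR} \]
\[ \text{Upper bound} = Q_3 + 1.5 \times \text{IQR} \]

These formulas typically use 1.5 as a cutoff value to identify outliers in a dataset.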

EXAMPLE 3.12

Problem

Calculate the IQR for the following 13 home prices and determine if any of the home prices values are
potential outliers. Data values are in US dollars.

389950, 230500, 158000, 479000, 639000, 114950, 5500000, 387000, 659000, 529000, 575000, 488800,
1095000

Solution

Order the data from smallest to largest.

114950, 158000, 230500, 387000, 389950, 479000, 488800, 529000, 575000, 639000, 659000, 1095000,
5500000

First, determine the median of the dataset. There are 13 data values, so the median is the middle data
value, which is 488,800.

Next, calculate \(Q_1\) and \(Q_3\).

For the first quartile, look at the data values below the median. The two middle data values in this lower half
of the data are 230,500 and 387,000. To determine the first quartile, find the mean of these two data values.

\[ Q_1 = \frac{230{,}500 + 387{,}000}{2} = 308{,}750 \]

For the third quartile, look at the data values above the median. The two middle data values in this upper
half of the data are 639,000 and 659,000. To determine the third quartile, find the mean of these two data
values.

\[ Q_3 = \frac{639{,}000 + 659{,}000}{2} = 649{,}000 \]

Now, calculate the interquartile range (IQR):

\[ \text{IQR} = Q_3 - Q_1 = 649{,}000 - 308{,}750 = 340{,}250 \]

Calculate the value of 1.5 × IQR:

\[ 1.5 \times \text{IQR} = 1.5 \times 340{,}250 = 510{,}375 \]

Calculate the lower and upper bound for outliers:

\[ \text{Lower bound} = Q_1 - 1.5 \times \text{IQR} = 308{,}750 - 510{,}375 = -201{,}625 \]
\[ \text{Upper bound} = Q_3 + 1.5 \times \text{IQR} = 649{,}000 + 510{,}375 = 1{,}159{,}375 \]

The lower bound for outliers is −201,625. Of course, no home price is less than −201,625, so no outliers are
present for the lower end of the dataset.

The upper bound for outliers is 1,159,375. The data value of 5,500,000 is greater than the upper bound of
1,159,375. Therefore, the home price of $5,500,000 is a potential outlier. This is important because the
presence of outliers could potentially indicate data errors or some other anomalies in the dataset that
should be investigated. For example, there may have been a data entry error and a home price of $550,000
was erroneously entered as $5,500,000.
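
The outlier check in Example 3.12 can be sketched in Python. The helper names quartiles and outlier_bounds are illustrative choices, and the quartiles are computed with the median-of-halves method described earlier in this section rather than with a library routine.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the text."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the median itself is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def outlier_bounds(data, cutoff=1.5):
    """Return (lower bound, upper bound) based on Q1, Q3, and the IQR."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - cutoff * iqr, q3 + cutoff * iqr

# The 13 home prices from Example 3.12, in US dollars.
home_prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
               387000, 659000, 529000, 575000, 488800, 1095000]

lower_bound, upper_bound = outlier_bounds(home_prices)
potential_outliers = [p for p in home_prices if p < lower_bound or p > upper_bound]
print(lower_bound, upper_bound, potential_outliers)
```

Values flagged this way are only candidates for review; as the example notes, they should be investigated before being corrected or removed.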

z-scores
The z-score is a measure of the position of an entry in a dataset that makes use of the mean and standard
deviation of the data. It represents the number of standard deviations by which a data value differs from the
mean. For example, suppose that in a certain neighborhood, the mean selling price of a home is $350,000 and
the standard deviation is $40,000. A particular home sells for $270,000. Based on the selling price of this home,
we can calculate the relative standing of this home compared to other home sales in the same neighborhood.

The corresponding z-score of a measurement considers the given measurement in relation to the mean and
standard deviation for the entire population. The formula for a z-score calculation is as follows:

\[ z = \frac{x - \mu}{\sigma} \]

Where:
\(x\) is the measurement
\(\mu\) is the mean
\(\sigma\) is the standard deviation

Notice that when a measurement is below the mean, the corresponding z-score will be a negative value. If the
measurement is exactly equal to the mean, the corresponding z-score will be zero. If the measurement is
above the mean, the corresponding z-score will be a positive value.

z-scores can also be used to identify outliers. Since z-scores measure the number of standard deviations from
the mean for a data value, a z-score of 3 would indicate a data value that is 3 standard deviations above the
mean. This would represent a data value that is significantly displaced from the mean, and typically, a z-score
less than −3 or a z-score greater than +3 can be used to flag outliers.


EXAMPLE 3.13

Problem

For the home sale described earlier in this section, the value \(x\) is the home price of $270,000, the mean
\(\mu\) is $350,000, and the standard deviation \(\sigma\) is $40,000. Calculate the z-score.

Solution

The z-score can be calculated as follows:

\[ z = \frac{x - \mu}{\sigma} = \frac{270{,}000 - 350{,}000}{40{,}000} = \frac{-80{,}000}{40{,}000} = -2 \]

This z-score of −2 indicates that the selling price for this home is 2 standard deviations below the mean,
which represents a data value that is significantly below the mean.
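
The z-score calculation and the ±3 outlier rule can be sketched in Python. The function names are illustrative choices, and the cutoff of 3 is the typical value mentioned in this section rather than a fixed requirement.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mean) / std_dev

def is_potential_outlier(z, cutoff=3):
    """Flag z-scores below -cutoff or above +cutoff."""
    return abs(z) > cutoff

# Home sale from Example 3.13: price $270,000, mean $350,000, standard deviation $40,000.
z_home = z_score(270_000, 350_000, 40_000)
print(z_home, is_potential_outlier(z_home))
```

A home priced exactly at the mean would have a z-score of zero, and one priced above the mean would have a positive z-score.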

Using Python to Calculate Measures of Position for a Dataset


DataFrame.describe() computes different measures of position as well on each column of a dataset. See
min, 25%, 50%, 75%, and max in Figure 3.4.

Figure 3.4 The Output of DataFrame.describe() with the Movie Profit Dataset

3.4 Probability Theory


Learning Outcomes
By the end of this section, you should be able to:
• 3.4.1 Describe the basic concepts of probability and apply these concepts to real-world applications
in data science.
• 3.4.2 Apply conditional probability and Bayes’ Theorem.

Probability is a numerical measure that assesses the likelihood of occurrence of an event. Probability
applications are ubiquitous in data science since many decisions in business, science, and engineering are
based on probability considerations. We all use probability calculations every day as we decide, for instance,
whether to take an umbrella to work, the optimal route for a morning commute, or the choice of a college
major.

Basic Concepts of Probability


We have all used probability in one way or another on a day-to-day basis. Before leaving the house, you might
want to know the probability of rain. The probability of obtaining heads on one flip of a coin is one-half, or 0.5.

A data scientist is interested in expressing probability as a number between 0 and 1 (inclusive), where 0
indicates impossibility (the event will not occur) and 1 indicates certainty (the event will occur). The probability
of an event falling between 0 and 1 reflects the degree of uncertainty associated with the event.

Here is some terminology we will be using in probability-related analysis:

• An outcome is the result of a single trial in a probability experiment.
• The sample space is the set of all possible outcomes in a probability experiment.
• An event is some subset of the sample space. For example, an event could be rolling an even number on a
six-sided die. This event corresponds to three outcomes, namely rolling a 2, 4, or 6 on the die.

To calculate probabilities, we can use several approaches, including relative frequency probability, which is
based on actual data, and theoretical probability, which is based on theoretical conditions.

Relative Frequency Probability


Relative frequency probability is a method of determining the likelihood of an event occurring based on the
observed frequency of its occurrence in a given sample or population. A data scientist conducts or observes a
procedure and determines the number of times a certain Event \(A\) occurs. The probability of Event \(A\),
denoted as \(P(A)\), is then calculated based on data that has been collected from the experiment, as follows:

\[ P(A) = \frac{\text{number of times Event } A \text{ occurs}}{\text{number of times the procedure was repeated}} \]

EXAMPLE 3.14

Problem

A polling organization asks a sample of 400 people if they are in favor of increased funding for local schools;
312 of the respondents indicate they are in favor of increased funding. Calculate the probability that a
randomly selected person will be in favor of increased funding for local schools.

Solution

Using the data collected from this polling, a total of 400 people were asked the question, and 312 people
were in favor of increased school funding. The probability for a randomly selected person being in favor of
increased funding can then be calculated as follows (notice that Event \(A\) in this example corresponds to
the event that a person is in favor of the increased funding):

\[ P(A) = \frac{312}{400} = 0.78 = 78\% \]

EXAMPLE 3.15

Problem

A medical patient is told they need knee surgery, and they ask the doctor for an estimate of the probability
of success for the surgical procedure. The doctor reviews data from the past two years and determines
there were 200 such knee surgeries performed and 188 of them were successful. Based on this past data,
the doctor calculates the probability of success for the knee surgery (notice that Event \(A\) in this example
corresponds to the event that a patient has a successful knee surgery result).


Solution

Using the data collected from the past two years, there were 200 surgeries performed, with 188 successes.
The probability can then be calculated as:

\[ P(A) = \frac{188}{200} = 0.94 = 94\% \]

The doctor informs the patient that there is a 94% chance of success for the pending knee surgery.
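
The relative frequency calculations in Examples 3.14 and 3.15 can be sketched in Python. The function name relative_frequency is an illustrative choice; Fraction is used so that the results are exact.

```python
from fractions import Fraction

def relative_frequency(event_count, total_trials):
    """Observed count of the event divided by the total number of observations."""
    if not 0 <= event_count <= total_trials:
        raise ValueError("event_count must be between 0 and total_trials")
    return Fraction(event_count, total_trials)

p_funding = relative_frequency(312, 400)  # Example 3.14: poll on school funding
p_surgery = relative_frequency(188, 200)  # Example 3.15: successful knee surgeries

print(float(p_funding), float(p_surgery))
```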

Theoretical Probability
Theoretical probability is the method used when the outcomes in a probability experiment are equally
likely—that is, under theoretical conditions.

The formula used for theoretical probability is similar to the formula used for relative frequency
(empirical) probability:

\[ P(A) = \frac{\text{number of outcomes in Event } A}{\text{total number of outcomes in the sample space}} \]

Theoretical probability considers all the possible outcomes for an experiment that are known ahead of time so
that past data is not needed in the calculation for theoretical probability.

For example, the theoretical probability of rolling an even number when rolling a six-sided die is
\(\frac{3}{6}\) (which is \(\frac{1}{2}\), or 0.5). There are 3 outcomes corresponding to rolling an even
number, and there are 6 outcomes total in the sample space. Notice this calculation can be done without
conducting any experiments since the outcomes are equally likely.

EXAMPLE 3.16

Problem

A student is working on a multiple-choice question that has 5 possible answers. The student does not have
any idea about the correct answer, so the student randomly guesses. What is the probability that the
student selects the correct answer?

Solution

Since the student is guessing, each answer choice is equally likely to be selected. There is 1 correct answer
out of 5 possible choices. The probability of selecting the correct answer can be calculated as:

\[ P(\text{correct answer}) = \frac{1}{5} = 0.2 = 20\% \]

Notice in Example 3.16 that probabilities can be written as fractions, decimals, or percentages.
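
Theoretical probability with equally likely outcomes can be sketched in Python by representing the sample space and the event as sets. The function name, the answer labels, and the choice of "C" as the correct answer are hypothetical and used only for illustration.

```python
from fractions import Fraction

def theoretical_probability(event, sample_space):
    """P(event) when every outcome in sample_space is equally likely."""
    favorable = event & sample_space
    return Fraction(len(favorable), len(sample_space))

# Rolling an even number on a six-sided die.
die = {1, 2, 3, 4, 5, 6}
even = {n for n in die if n % 2 == 0}
p_even = theoretical_probability(even, die)

# Example 3.16: guessing on a question with 5 answer choices.
choices = {"A", "B", "C", "D", "E"}  # hypothetical labels
correct = {"C"}                      # hypothetical correct answer
p_guess = theoretical_probability(correct, choices)

# The same probability written as a fraction, a decimal, and a percentage.
print(p_guess, float(p_guess), f"{float(p_guess):.0%}")
```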

Also note that any probability must be between 0 and 1 inclusive. An event with a probability of zero will never
occur, and an event with a probability of 1 is certain to occur. A probability greater than 1 is not possible, and a
negative probability is not possible.

Complement of an Event
The complement of an event is the set of all outcomes in the sample space that are not included in the event.
The complement of Event \(A\) is usually denoted by \(A'\) (A prime). To find the probability of the
complement of Event \(A\), subtract the probability of Event \(A\) from 1.

\[ P(A') = 1 - P(A) \]

EXAMPLE 3.17

Problem

A company estimates that the probability that an employee will provide confidential information to a hacker
is 0.1%. Determine the probability that an employee will not provide any confidential information during a
hacking attempt.

Solution

Let Event \(A\) be the event that the employee will provide confidential information to a hacker. Then the
complement of this Event \(A\) is the event that an employee will not provide any confidential information
during a hacking attempt.

\[ P(A') = 1 - P(A) = 1 - 0.001 = 0.999 \]

There is a 99.9% probability that an employee will not provide any confidential information during a hacking
attempt.
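
The complement rule used in Example 3.17 can be checked with a short Python sketch. The variable names are illustrative assumptions, not part of the original example.

```python
# Probability that an employee provides confidential information (0.1%)
p_a = 0.001

# Complement rule: P(A') = 1 - P(A)
p_not_a = 1 - p_a

print(f"P(A') = {p_not_a:.3f}")  # P(A') = 0.999
```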

Conditional Probability and Bayes’ Theorem


Data scientists are often interested in determining conditional probabilities, or the occurrence of one event
that is conditional or dependent on another event. For example, a medical researcher might be interested to
know if an asthma diagnosis for a patient is dependent on the patient’s exposure to air pollutants. In addition,
when calculating conditional probabilities, we can sometimes revise a probability estimate based on additional
information that is obtained. As we’ll see in the following section, Bayes’ Theorem allows new information to
be used to refine a probability estimate.

Conditional Probability
A conditional probability is the probability of an event given that another event has already occurred. The
notation for conditional probability is $P(A|B)$, which denotes the probability of Event $A$, given that Event $B$
has occurred. The vertical line between $A$ and $B$ denotes the “given” condition. (In this notation, the vertical
line does not denote division).

For example, we might want to know the probability of a person getting a parking ticket given that a person
did not put any money in a parking meter. Or a medical researcher might be interested in the probability of a
patient developing heart disease given that the patient is a smoker.

If the occurrence of one event affects the probability of occurrence for another event, we say that the events
are dependent; otherwise, the events are independent. Dependent events are events where the occurrence of
one event affects the probability of occurrence of another event. Independent events are events where the
probability of occurrence of one event is not affected by the occurrence of another event. The dependence of
events has important implications in many fields such as marketing, engineering, psychology, and medicine.

EXAMPLE 3.18

Problem

Determine if the two events are dependent or independent:

1. Rolling a 3 on one roll of a die, rolling a 4 on a second roll of a die

2. Obtaining heads on one flip of a coin and obtaining tails on a second flip of a coin
3. Selecting five basketball players from a professional basketball team and a player’s height is greater
than 6 feet
4. Selecting an Ace from a deck of 52 cards, returning the card back to the original stack, and then
selecting a King
5. Selecting an Ace from a deck of 52 cards, not returning the card back to the original stack, and then
selecting a King

Solution

1. The result of one roll does not affect the result for the next roll, so these events are independent.
2. The results of one flip of the coin do not affect the results for any other flip of the coin, so these events
are independent.
3. Typically, basketball players are tall individuals, and so they are more likely to have heights greater than
6 feet as opposed to the general public, so these events are dependent.
4. By selecting an Ace from a deck of 52 cards and then replacing the card, this restores the deck of cards
to its original state, so the probability of selecting a King is not affected by the selection of the Ace. So
these events are independent.
5. By selecting an Ace from a deck of 52 cards and then not replacing the card, this will result in only 51
cards remaining in the deck. Thus, the probability of selecting a King is affected by the selection of the
Ace, so these events are dependent.

There are several ways to use conditional probabilities in data science applications.

Conditional probability can be defined as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

When assessing the conditional probability $P(A|B)$, if the two events are independent, this indicates that
Event $A$ is not affected by the occurrence of Event $B$, so we can write that $P(A|B) = P(A)$ for independent
events.

If we determine that $P(A|B)$ is not equal to $P(A)$, this indicates that the events are dependent.

$P(A|B) = P(A)$ implies independent events, where $P(A \text{ and } B) = P(A) \cdot P(B)$.

$P(A|B) \neq P(A)$ implies dependent events.
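
As a rough illustration of this independence test, the card-drawing scenarios from Example 3.18 (parts 4 and 5) can be compared in Python. This is a sketch under the assumption of a standard 52-card deck with 4 Kings; the variable names are hypothetical.

```python
from fractions import Fraction

# P(King) on a single draw from a full 52-card deck
p_king = Fraction(4, 52)

# With replacement: the Ace goes back, so the deck is restored to 52 cards
p_king_given_ace_replaced = Fraction(4, 52)

# Without replacement: 51 cards remain and all 4 Kings are still in the deck
p_king_given_ace_not_replaced = Fraction(4, 51)

# Equal conditional and unconditional probabilities indicate independence
print(p_king_given_ace_replaced == p_king)      # True  (independent)
print(p_king_given_ace_not_replaced == p_king)  # False (dependent)
```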

EXAMPLE 3.19

Problem

Table 3.2 shows the number of nursing degrees and non-nursing degrees at a university for a specific year,
and the data is broken out by age groups. Calculate the probability that a randomly chosen graduate
obtained a nursing degree, given that the graduate is in the age group of 23 and older.

Age Group       Nursing Degrees   Non-Nursing Degrees   Total
22 and under    1036              1287                  2323
23 and older    986               932                   1918
Total           2022              2219                  4241

Table 3.2 Number of Nursing and Non-Nursing Degrees at a University by Age Group

Solution

Since we are given that the group of interest are those graduates in the age group of 23 and older, focus
only on the second row in the table.

Looking only at the second row in the table, we are interested in the probability that a randomly chosen
graduate obtained a nursing degree. The reduced sample space consists of 1,918 graduates, and 986 of
them received nursing degrees. So the probability can be calculated as:

$$P(\text{nursing degree} \mid \text{23 and older}) = \frac{986}{1918} \approx 0.514$$

Another method to analyze this example is to rewrite the conditional probability using the equation for
$P(A|B)$, where Event $A$ is that the graduate obtained a nursing degree and Event $B$ is that the graduate is in
the age group of 23 and older, as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

We can now use this equation to calculate the probability that a randomly chosen graduate obtained a
nursing degree, given that the graduate is in the age group of 23 and older. The probability of $A$ and $B$ is
the probability that a graduate received a nursing degree and is also in the age group of 23 and older. From
the table, there are 986 graduates who earned a nursing degree and are also in the age group of 23 and
older. Since this number of graduates is out of the total sample size of 4,241, we can write the probability of
Events $A$ and $B$ as:

$$P(A \text{ and } B) = \frac{986}{4241} \approx 0.2325$$

We can also calculate the probability that a graduate is in the age group of 23 and older. From the table,
there are 1,918 graduates in this age group out of the total sample size of 4,241, so we can write the
probability for Event $B$ as:

$$P(B) = \frac{1918}{4241} \approx 0.4523$$

Next, we can substitute these probabilities into the formula for $P(A|B)$, as follows:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{986/4241}{1918/4241} = \frac{986}{1918} \approx 0.514$$
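
Both methods from Example 3.19 can be reproduced with a short Python sketch. The counts come from Table 3.2; the variable names are assumptions made for this illustration.

```python
# Counts taken from Table 3.2
nursing_23_and_older = 986   # nursing degree AND age 23 and older
total_23_and_older = 1918    # all graduates age 23 and older
total_graduates = 4241       # all graduates

# Method 1: restrict the sample space to the "23 and older" row
p_method_1 = nursing_23_and_older / total_23_and_older

# Method 2: conditional probability formula, P(A and B) / P(B)
p_a_and_b = nursing_23_and_older / total_graduates
p_b = total_23_and_older / total_graduates
p_method_2 = p_a_and_b / p_b

print(round(p_method_1, 3))  # 0.514
print(round(p_method_2, 3))  # 0.514
```

The two methods agree because the total sample size of 4,241 cancels in the ratio.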


Probability of At Least One


The probability of at least one occurrence of an event is often of interest in many data science applications. For
example, a doctor might be interested to know the probability that at least one surgery to be performed this
week will involve an infection of some type.

The phrase “at least one” implies the condition of one or more successes. From a sample space perspective,
one or more successes is the complement of “no successes.” Using the complement rule discussed earlier, we
can write the following probability formula:

$$P(\text{at least one}) = 1 - P(\text{none})$$

As an example, we can find the probability of rolling a die 3 times and obtaining at least one four on any of the
rolls. This can be calculated by first finding the probability of not observing a four on any of the rolls and then
subtracting this probability from 1. The probability of not observing a four on a roll of the die is 5/6. Thus, the
probability of rolling a die 3 times and obtaining at least one four on any of the rolls is $1 - \left(\frac{5}{6}\right)^3 = 1 - \frac{125}{216} \approx 0.421$.

EXAMPLE 3.20

Problem

From past data, hospital administrators determine the probability that a knee surgery will be successful is
0.89.

1. During a certain day, the hospital schedules four knee surgeries to be performed. Calculate the
probability that all four of these surgeries will be successful.
2. Calculate the probability that none of these knee surgeries will be successful.
3. Calculate the probability that at least one of the knee surgeries will be successful.

Solution

1. For all four surgeries to be successful, we can interpret that as the first surgery will be successful, and
the second surgery will be successful, and the third surgery will be successful, and the fourth surgery
will be successful. Since the probability of success for one knee surgery does not affect the probability
of success for another knee surgery, we can assume these events are independent. Based on this, the
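probability that all four surgeries will be successful can be calculated using the probability formula for
$P(A \text{ and } B)$ by multiplying the probabilities together:

$$P(\text{all four successful}) = (0.89)(0.89)(0.89)(0.89) = 0.89^4 \approx 0.627$$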

There is about a 63% chance that all four knee surgeries will be successful.

2. The probability that a knee surgery will be unsuccessful can be calculated using the complement rule. If
the probability of a successful surgery is 0.89, then the probability that the surgery will be unsuccessful
is 0.11:

$$P(\text{unsuccessful}) = 1 - P(\text{successful}) = 1 - 0.89 = 0.11$$

Based on this, the probability that all four surgeries will be unsuccessful can be calculated using the
probability formula for $P(A \text{ and } B)$ by multiplying the probabilities together:

$$P(\text{none successful}) = 0.11^4 \approx 0.000146$$

Since this is a very small probability, it is very unlikely that none of the surgeries will be successful.

3. To calculate the probability that at least one of the knee surgeries will be successful, use the probability
formula for “at least one,” which is calculated as the complement of the event “none are successful.”

$$P(\text{at least one successful}) = 1 - P(\text{none successful}) = 1 - 0.000146 \approx 0.9999$$

This indicates there is a very high probability that at least one of the knee surgeries will be successful.
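
The three parts of Example 3.20 can be sketched in Python as follows. This assumes, as the example does, that the surgeries are independent; the variable names are illustrative.

```python
p_success = 0.89   # probability that one knee surgery is successful
n_surgeries = 4    # surgeries are assumed to be independent

# Part 1: all four successful (multiply the individual probabilities)
p_all = p_success ** n_surgeries

# Part 2: none successful (complement rule for one surgery, then multiply)
p_none = (1 - p_success) ** n_surgeries

# Part 3: at least one successful (complement of "none successful")
p_at_least_one = 1 - p_none

print(f"{p_all:.3f}")           # 0.627
print(f"{p_none:.6f}")          # 0.000146
print(f"{p_at_least_one:.4f}")  # 0.9999
```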

Bayes’ Theorem
Bayes’ Theorem is a statistical technique for revising probability estimates based on new
information or evidence, which supports more accurate and efficient decision-making in uncertain situations.
Bayes’ Theorem is often used to help assess probabilities associated with medical diagnoses such as the
probability a patient will develop cancer based on test screening results. This can be important in medical
analysis to help assess the impact of a false positive, which is the scenario where the patient does not have the
ailment but the screening test gives a false indication that the patient does have the ailment.

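Bayes’ Theorem allows the calculation of the conditional probability $P(A|B)$. There are several forms of Bayes’
Theorem, as shown:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A') \cdot P(A')}$$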

EXAMPLE 3.21

Problem

Assume that a certain type of cancer affects 3% of the population. Call the event that a person has cancer
“Event $A$,” so:

$$P(A) = 0.03$$

A patient can undergo a screening test for this type of cancer. Assume the probability of a true positive from
the screening test is 75%, which indicates that the probability that a person has a positive test result given that
they actually have cancer is 0.75. Also assume the probability of a false positive from the screening test is
15%, which indicates that the probability that a person has a positive test result given that they do not have
cancer is 0.15.

A medical researcher is interested in calculating the probability that a patient actually has cancer given that
the screening test shows a positive result.

The researcher is interested in calculating $P(A|B)$, where Event $A$ is the event that the person actually has cancer and
Event $B$ is the event that the person shows a positive result in the screening test. Use Bayes’ Theorem to
calculate this conditional probability.


Solution

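From the example, the following probabilities are known:

$$P(A) = 0.03, \quad P(A') = 1 - 0.03 = 0.97, \quad P(B|A) = 0.75, \quad P(B|A') = 0.15$$

The conditional probabilities can be interpreted as follows:

$P(B|A) = 0.75$ is the probability of a positive test result given that the person has cancer (a true positive).
$P(B|A') = 0.15$ is the probability of a positive test result given that the person does not have cancer (a false positive).

Substituting these probabilities into the formula for Bayes’ Theorem results in the following:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|A') \cdot P(A')} = \frac{(0.75)(0.03)}{(0.75)(0.03) + (0.15)(0.97)} = \frac{0.0225}{0.168} \approx 0.134$$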

This result from Bayes’ Theorem indicates that even if a patient receives a positive test result from the
screening test, this does not imply a high likelihood that the patient has cancer. There is only a 13% chance
that the patient has cancer given a positive test result from the screening test.
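
The calculation in Example 3.21 can be sketched as a small Python function. The function name `bayes_posterior` and its parameter names are hypothetical, introduced only for this illustration; the inputs are the values given in the example.

```python
def bayes_posterior(prior, p_pos_given_a, p_pos_given_not_a):
    """Return P(A|B) from P(A), P(B|A), and P(B|A') using Bayes' Theorem."""
    numerator = p_pos_given_a * prior
    # Total probability of a positive test: true positives + false positives
    denominator = numerator + p_pos_given_not_a * (1 - prior)
    return numerator / denominator


# Values from Example 3.21: 3% prevalence, 75% true positive, 15% false positive
posterior = bayes_posterior(prior=0.03, p_pos_given_a=0.75, p_pos_given_not_a=0.15)
print(f"{posterior:.3f}")  # 0.134
```

Because the condition is rare, false positives from the large healthy group outnumber true positives, which is why the result is so much lower than the 75% true positive rate.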

3.5 Discrete and Continuous Probability Distributions


Learning Outcomes
By the end of this section, you should be able to:
• 3.5.1 Describe fundamental aspects of probability distributions.
• 3.5.2 Apply discrete probability distributions including binomial and Poisson distributions.
• 3.5.3 Apply continuous probability distributions including exponential and normal distributions.
• 3.5.4 Use Python to apply various probability distributions for probability applications.

Probability distributions are used to model various scenarios to help with probability analysis and
predictions, and they are used extensively to help formulate probability-based decisions. For example, if a
doctor knows that the weights of newborn infants follow a normal (bell-shaped) distribution, the doctor can
use this information to help identify potentially underweight newborn infants, which might indicate a medical
condition warranting further investigation. Using a normal distribution, the doctor can calculate that only a
small percentage of babies have weights below a certain threshold, which might prompt the doctor to further
investigate the cause of the low weight. Or a medical researcher might be interested in the probability that a
person will have high blood pressure or the probability that a person will have type O blood.

Overview of Probability Distributions


To begin our discussion of probability distributions, some terminology will be helpful:

• Random variable—a variable where a single numerical value is assigned to a specific outcome from an
experiment. Typically the letter $x$ is used to denote a random variable. For example, assign the numerical
values 1, 2, 3, … 13 to the cards selected from a standard 52-card deck of Ace, 2, 3, … 10, Jack, Queen, King.
Notice we cannot use “Jack” as the value of the random variable since by definition a random variable must
be a numerical value.
• Discrete random variable—a random variable is considered discrete if there is a finite or countable
number of values that the random variable can take on. (If there are infinitely many values, the number of
values is countable if it is possible to count them individually.) Typically, a discrete random variable is the
result of a count of some kind. For example, if the random variable $x$ represents the number of cars in a
parking lot, then the values that $x$ can take on can only be whole numbers since it would not make sense
to have a fractional number of cars in the parking lot.
• Continuous random variable—a random variable is considered continuous if the value of the random
variable can take on any value within an interval. Typically, a continuous random variable is the result of a
measurement of some kind. For example, if the random variable $x$ represents the weight of a bag of
apples, then $x$ can take on any value within an interval, including fractional numbers of pounds.

To summarize, the difference between discrete and continuous probability distributions has to do with the
nature of the random variables they represent. Discrete probability distributions are associated with
variables that take on a finite or countably infinite number of distinct values. Continuous probability
distributions deal with random variables that can take on any value within a given range or interval. It is
important to identify and distinguish between discrete and continuous random variables since different
statistical methods are used to analyze each type.

EXAMPLE 3.22

Problem

A coin is flipped three times. Determine a possible random variable that can be assigned to represent the
number of heads observed in this experiment.

Solution

One possible random variable assignment could be to let $x$ count the number of heads observed in each
possible outcome in the sample space. When flipping a coin three times, there are eight possible outcomes,
and $x$ will be the numerical count corresponding to the number of heads observed for each outcome.
Notice that the possible values for the random variable $x$ are 0, 1, 2, and 3, as shown in Table 3.3.

Result for Flip #1   Result for Flip #2   Result for Flip #3   Value of Random Variable
Heads                Heads                Heads                3
Heads                Heads                Tails                2
Heads                Tails                Heads                2
Heads                Tails                Tails                1
Tails                Heads                Heads                2
Tails                Heads                Tails                1
Tails                Tails                Heads                1
Tails                Tails                Tails                0

Table 3.3 Result of Three Random Coin Flips
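
The sample space and the random variable values in Table 3.3 can be generated programmatically. The following is a minimal sketch using Python's standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Enumerate the sample space for three flips of a coin (8 outcomes)
outcomes = list(product(["Heads", "Tails"], repeat=3))

# Random variable: the number of heads observed in each outcome
x_values = [outcome.count("Heads") for outcome in outcomes]

for outcome, x in zip(outcomes, x_values):
    print(outcome, x)

# Number of outcomes that map to each value of the random variable
print(sorted(Counter(x_values).items()))  # [(0, 1), (1, 3), (2, 3), (3, 1)]
```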

EXAMPLE 3.23

Problem

Identify the following random variables as either discrete or continuous random variables:

1. The amount of gas, in gallons, used to fill a gas tank
2. Number of children per household in a certain neighborhood
3. Number of text messages sent by a certain student during a particular day
4. Number of hurricanes affecting Florida in a given year
5. The amount of rain, in inches, in Detroit, Michigan, in a certain month

Solution

1. The number of gallons of gas used to fill a gas tank can take on any value, such as 12.3489, so this
represents a continuous random variable.
2. The number of children per household in a certain neighborhood can only take on certain discrete
values such as 0, 1, 2, 3, etc., so this represents a discrete random variable.
3. The number of text messages sent by a certain student during a particular day can only take on certain
discrete values such as 26, 10, 17, etc., so this represents a discrete random variable.
4. The number of hurricanes affecting Florida in a given year can only take on certain values such as 0, 1,
2, 3, etc., so this represents a discrete random variable.
5. The number of inches of rain in Detroit, Michigan, in a certain month can take on any value, such as
2.0563, so this represents a continuous random variable.

Discrete Probability Distributions: Binomial and Poisson


Discrete random variables are of interest in many data science applications, and there are several probability
distributions that apply to discrete random variables. In this chapter, we present the binomial distribution and
the Poisson distribution, two probability distributions commonly used to model discrete
random variables for different types of events.

Binomial Distribution
The binomial distribution is used in applications where there are two possible outcomes for each trial in an
experiment and the two possible outcomes can be considered as success or failure. For example, when a
baseball player is at-bat, the player either gets a hit or does not get a hit. There are many applications of
binomial experiments that occur in medicine, psychology, engineering, science, marketing, and other fields.

There are many statistical experiments where the results of each trial can be considered as either a success or
a failure. For example, when flipping a coin, the two outcomes are heads or tails. When rolling a die, the two
outcomes can be considered to be an even number appears on the face of the die or an odd number appears
on the face of the die. When conducting a marketing study, a customer can be asked if they like or dislike a
certain product. Note that the word “success” here does not necessarily imply a good outcome. For example, if
a survey was conducted of adults and each adult was asked if they smoke, we can consider the answer “yes” to
be a success and the answer “no” to be a failure. This means that the researcher can define success and failure
in any way; however, the binomial distribution is applicable when there are only two outcomes in each trial of
an experiment.

The requirements to identify a binomial experiment and apply the binomial distribution include:

• The experiment of interest is repeated for a fixed number of trials, and each trial is independent of other
trials. For example, a market researcher might select a sample of 20 people to be surveyed where each
respondent will reply with a “yes” or “no” answer. This experiment consists of 20 trials, and each person’s
response to the survey question can be considered as independent of another person’s response.
• There are only two possible outcomes for each trial, which can be labeled as “success” or “failure.”
• The probability of success remains the same for each trial of the experiment. For example, from past data
we know that 35% of people prefer vanilla as their favorite ice cream flavor. If a group of 15 individuals are
surveyed to ask if vanilla is their favorite ice cream flavor, the probability of success for each trial will be
0.35.
• The random variable $x$ will count the number of successes in the experiment. Notice that since $x$ will count
the number of successes, this implies that $x$ will be a discrete random variable. For example, if the
researcher is counting the number of people in the group of 15 that respond to say vanilla is their favorite
ice cream flavor, then $x$ can take on values such as 3 or 7 or 12, but $x$ could not equal 5.28 since $x$ is
counting the number of people.

When working with a binomial experiment, it is useful to identify two specific parameters in a binomial
experiment:

1. The number of trials in the experiment. Label this as $n$.
2. The probability of success for each trial (which is a constant value). Label this as $p$.

We then count the number of successes of interest as the value of the discrete random variable.
Label this as $x$.

EXAMPLE 3.24

Problem

A medical researcher is conducting a study related to a certain type of shoulder surgery. A sample of 20
patients who have recently undergone the surgery is selected, and the researcher wants to determine the
probability that 18 of the 20 patients had a successful result from the surgery. From past data, the
researcher knows that the probability of success for this type of surgery is 92%.

1. Does this experiment meet the requirements for a binomial experiment?
2. If so, identify the values of $n$, $p$, and $x$ in the experiment.

Solution

1. This experiment does meet the requirements for a binomial experiment since the experiment will be
repeated for 20 trials, and each response from a patient will be independent of other responses. Each
reply from a patient will be one of two responses—the surgery was successful or the surgery was not
successful. The probability of success remains the same for each trial at 92%. The random variable $x$
can be used to count the number of patients who respond that the surgery was successful.
2. The number of trials is 20 since 20 patients are being surveyed, so $n = 20$.
The probability of success for each surgery is 92%, so $p = 0.92$.
The number of successes of interest is 18 since the researcher wants to determine the probability that
18 of the 20 patients had a successful result from the surgery, so $x = 18$.

When calculating the probability for $x$ successes in a binomial experiment, a binomial probability formula can
be used, but in many cases technology is used instead to streamline the calculations.

The probability mass function (PMF) for the binomial distribution describes the probability of getting exactly
$x$ successes in $n$ independent Bernoulli trials, each with a probability $p$ of success. The PMF is given by the
formula:

$$P(x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!} \, p^x (1-p)^{n-x}$$

Where:
$P(x)$ is the probability that the random variable takes on the value of exactly $x$ successes
$n$ is the number of trials in the experiment
$p$ is the probability of success
$x$ is the number of successes in the experiment
$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ refers to the number of ways to choose $x$ successes from $n$ trials

Note: The notation $n!$ is read as $n$ factorial and is a mathematical notation used to express the multiplication of
$1 \cdot 2 \cdot 3 \cdots n$. For example, $5! = 1 \cdot 2 \cdot 3 \cdot 4 \cdot 5 = 120$.

EXAMPLE 3.25

Problem

For the binomial experiment discussed in Example 3.24, calculate the probability that 18 out of the 20
patients will respond to indicate that the surgery was successful. Also, show a graph of the binomial
distribution to show the probability distribution for all values of the random variable $x$.

Solution

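In Example 3.24, the parameters of the binomial experiment are:

$$n = 20, \quad p = 0.92, \quad x = 18$$

Substituting these values into the binomial probability formula, the probability for 18 successes can be
calculated as follows:

$$P(18) = \frac{20!}{18!\,(20-18)!} (0.92)^{18} (1 - 0.92)^{20-18} = 190 \cdot (0.92)^{18} (0.08)^{2} \approx 0.271$$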

Based on this result, the probability that 18 out of the 20 patients will respond to indicate that the surgery
was successful is 0.271, or approximately 27%.

Figure 3.5 illustrates this binomial distribution, where the horizontal axis shows the values of the random
variable $x$, and the vertical axis shows the binomial probability for each value of $x$. Note that values of $x$ less
than 14 are not shown on the graph since these corresponding probabilities are very close to zero.

Figure 3.5 Graph of the Binomial Distribution for $n = 20$ and $p = 0.92$

Since these computations tend to be complicated and time-consuming, most data scientists will use
technology (such as Python, R, Excel, or others) to calculate binomial probabilities.
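
As one possible way to do this in Python, the binomial PMF can be written directly with the standard library. This is a minimal sketch, not the text's own implementation; the function name `binomial_pmf` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    # comb(n, x) counts the ways to choose x successes from n trials
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example 3.25: n = 20 patients, p = 0.92, x = 18 successes
n, p = 20, 0.92
print(f"{binomial_pmf(18, n, p):.3f}")  # 0.271

# The probabilities over all possible values of x (0 through n) sum to 1
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
```

Libraries such as SciPy offer an equivalent call (`scipy.stats.binom.pmf`), which is usually preferable in practice once that dependency is available.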

Poisson Distribution
The goal of a binomial experiment is to calculate the probability of a certain number of successes in a specific
number of trials. However, there are certain scenarios where a data scientist might be interested to know the
probability of a certain number of occurrences for a random variable in a specific interval, such as an interval
of time.

For example, a website developer might be interested in knowing the probability that a certain number of
users visit a website per minute. Or a traffic engineer might be interested in calculating the probability of a
certain number of accidents per month at a busy intersection.

The Poisson distribution is applied when counting the number of occurrences in a certain interval. The
random variable $x$ then counts the number of occurrences in the interval.

A common application for the Poisson distribution is to model arrivals of customers for a queue, such as when
there might be 6 customers per minute arriving at a checkout lane in the grocery store and the store manager
wants to ensure that customers are serviced within a certain amount of time.

The Poisson distribution is a discrete probability distribution used in these types of situations where the
interest is in a certain number of occurrences for a random variable in a certain interval such as time
or area.

The Poisson distribution is used where the following conditions are met:

• The experiment is based on counting the number of occurrences in a specific interval where the interval
could represent time, area, volume, etc.
• The number of occurrences in one specific interval is independent of the number of occurrences in a
different interval.

Notice that when we count the number of occurrences in a specific interval,
the count is a discrete random variable. For example, the count of the number of customers that arrive
per hour to a queue for a bank teller might be 21 or 15, but the count could not be 13.32 since we are counting
the number of customers and hence the random variable $x$ will be discrete.

To calculate the probability of $x$ occurrences, the Poisson probability formula can be used, as follows:

$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$

Where:
$\mu$ is the average or mean number of occurrences per interval
$e$ is the constant 2.71828…

EXAMPLE 3.26

Problem

From past data, a traffic engineer determines the mean number of vehicles entering a parking garage is 7
per 10-minute period. Calculate the probability that the number of vehicles entering the garage is 9 in a
certain 10-minute period. Also, show a graph of the Poisson distribution to show the probability distribution
for various values of the random variable $x$.

Solution

This example represents a Poisson distribution in that the random variable $x$ is based on the number of
vehicles entering a parking garage per time interval (in this example, the time interval of interest is 10
minutes). Since the average is 7 vehicles per 10-minute interval, we label the mean $\mu$ as 7. Since the
engineer wants to know the probability that 9 vehicles enter the garage in the same time period, the value of
the random variable $x$ is 9.

Thus, in this example, the parameters of the Poisson distribution are:

$$\mu = 7, \quad x = 9$$

Substituting these values into the Poisson probability formula, the probability for 9 vehicles entering the
garage in a 10-minute interval can be calculated as follows:

$$P(9) = \frac{7^9 \, e^{-7}}{9!} \approx 0.101$$

Thus, there is about a 10% probability of 9 vehicles entering the garage in a 10-minute interval.

Figure 3.6 illustrates this Poisson distribution, where the horizontal axis shows the values of the random
variable $x$ and the vertical axis shows the Poisson probability for each value of $x$.

Figure 3.6 Poisson Distribution for $\mu = 7$

As with calculations involving the binomial distribution, data scientists will typically use technology to solve
problems involving the Poisson distribution.
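
A minimal Python sketch of the Poisson probability calculation is shown here, using only the standard library. The function name `poisson_pmf` and the parameter name `mu` are assumptions made for this illustration.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean per interval is mu."""
    return mu**x * exp(-mu) / factorial(x)


# Example 3.26: mean of 7 vehicles per 10-minute period, 9 vehicles observed
print(f"{poisson_pmf(9, 7):.3f}")  # 0.101
```

SciPy provides the same calculation as `scipy.stats.poisson.pmf`, which may be more convenient when many values of the random variable are needed at once.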

Normal Continuous Probability Distributions


Recall that a random variable is considered continuous if the value of the random variable can take on any of
infinitely many values. Earlier, we used the example that if the random variable $x$ represents the weight of a
bag of apples, then $x$ can take on any value within an interval, including fractional numbers of pounds.

Many probability distributions apply to continuous random variables. These distributions rely on determining
the probability that the random variable falls within a distinct range of values, which can be calculated using a
probability density function (PDF). The area under the probability density curve over a range of values
gives the probability that the random variable will fall within that
specific range of values. For example, to determine the probability that a salary falls between $50,000 and
$70,000, we can calculate the area under the probability density function between these two salaries.

Note that the total area under the probability density function will always equal 1. The probability that a
continuous random variable takes on a specific value is 0, so we will always calculate the probability for a
random variable falling within some interval of values.

In this section, we will examine an important continuous probability distribution that relies on the probability
density function, namely the normal distribution. Many variables, such as heights, weights, salaries, and blood
pressure measurements, follow a normal distribution, making it especially important in statistical analysis. In
addition, the normal distribution forms the basis for more advanced statistical analysis such as confidence
intervals and hypothesis testing, which are discussed in Inferential Statistics and Regression Analysis.

The normal distribution is a continuous probability distribution that is symmetrical and bell-shaped. It is used
when the frequency of data values decreases with data values above and below the mean. The normal
distribution has applications in many fields including engineering, science, finance, medicine, marketing, and
psychology.

The normal distribution has two parameters: the mean, $\mu$, and the standard deviation, $\sigma$. The mean represents
the center of the distribution, and the standard deviation measures the spread, or dispersion, of the
distribution. The variable $x$ represents the realization, or observed value, of the random variable $X$ that
follows a normal distribution.

The typical notation used to indicate that a random variable follows a normal distribution is as follows:
$X \sim N(\mu, \sigma)$ (see Figure 3.7). For example, the notation $X \sim N(5.2, 3.7)$ indicates that the random variable $X$
follows a normal distribution with mean of 5.2 and standard deviation of 3.7.

A normal distribution with mean of 0 and standard deviation of 1 is called the standard normal distribution
and can be notated as $N(0, 1)$. Any normal distribution can be standardized by converting its values to
$z$-scores. Recall that a $z$-score tells you how many standard deviations from the mean there are for a given
measurement.

Figure 3.7 Graph of the Normal (Bell-Shaped) Distribution

The curve in Figure 3.7 is symmetric on either side of a vertical line drawn through the mean, \(\mu\). The mean is
the same as the median, which is the same as the mode, because the graph is symmetric about \(\mu\). As the
notation indicates, the normal distribution depends only on the mean and the standard deviation. Because the
area under the curve must equal 1, a change in the standard deviation, \(\sigma\), causes a change in the shape of the
normal curve; the curve becomes fatter and wider or skinnier and taller depending on \(\sigma\). A change in \(\mu\) causes
the graph to shift to the left or right. This means there are an infinite number of normal probability
distributions.
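The effect of the two parameters can be checked numerically. The sketch below, which assumes the scipy.stats library and uses hypothetical values for the mean and standard deviation, evaluates the height of the curve at its center for several standard deviations and confirms that the total area stays at 1 in each case.

```python
from math import isclose
from scipy.stats import norm

mu = 0.0
peaks = {}
areas = {}
for sigma in (0.5, 1.0, 2.0):  # hypothetical standard deviations
    # height of the curve at its center, x = mu
    peaks[sigma] = norm.pdf(mu, loc=mu, scale=sigma)
    # area within 10 standard deviations of the mean (effectively the whole curve)
    areas[sigma] = (norm.cdf(mu + 10 * sigma, mu, sigma)
                    - norm.cdf(mu - 10 * sigma, mu, sigma))
    print(sigma, round(peaks[sigma], 4), round(areas[sigma], 6))

# changing the mean shifts the curve without changing its shape:
# the peak height is the same for a mean of 3 as for a mean of 0
same_height = isclose(norm.pdf(3.0, loc=3.0, scale=1.0),
                      norm.pdf(0.0, loc=0.0, scale=1.0))
print(same_height)
```

A smaller standard deviation gives a taller peak and a larger one gives a lower peak, while the area under each curve remains 1.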

To determine probabilities associated with the normal distribution, we find specific areas under the normal
curve. There are several methods for finding this area under the normal curve, and we typically use some form
of technology. Python, Excel, and R all provide built-in functions for calculating areas under the normal curve.

EXAMPLE 3.27

Problem

Suppose that at a software company, the mean employee salary is $60,000 with a standard deviation of
$7,500. Assume salaries at this company follow a normal distribution. Use Python to calculate the
probability that a random employee earns more than $68,000.

Solution

A normal curve can be drawn to represent this scenario, in which the mean of $60,000 would be plotted on
the horizontal axis, corresponding to the peak of the curve. Then, to find the probability that an employee
earns more than $68,000, calculate the area under the normal curve to the right of the data value $68,000.

Figure 3.8 illustrates the area under the normal curve to the right of a salary of $68,000 as the shaded-in
region.

Figure 3.8 Bell-Shaped Distribution for Example 3.27. The shaded region under the normal curve corresponds to the probability
that an employee earns more than $68,000.

To find the actual area under the curve, a Python command can be used to find the area under the normal
probability density curve to the right of the data value of $68,000. See Using Python with Probability
Distributions for the specific Python program and results. The resulting probability is calculated as 0.143.

Thus, there is a probability of about 14% that a random employee has a salary greater than $68,000.

The empirical rule is a method for determining approximate areas under the normal curve for measurements
that fall within one, two, and three standard deviations from the mean for the normal (bell-shaped)
distribution. (See Figure 3.9).

Figure 3.9 Normal Distribution Showing Mean and Increments of Standard Deviation

If \(X\) is a continuous random variable and has a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then
the empirical rule states that:

• About 68% of the \(x\)-values lie between \(-1\sigma\) and \(+1\sigma\) units from the mean (within one standard deviation
of the mean).
• About 95% of the \(x\)-values lie between \(-2\sigma\) and \(+2\sigma\) units from the mean (within two standard
deviations of the mean).
• About 99.7% of the \(x\)-values lie between \(-3\sigma\) and \(+3\sigma\) units from the mean (within three standard
deviations of the mean). Notice that almost all the \(x\)-values lie within three standard deviations of the
mean.
• The \(z\)-scores for \(-1\sigma\) and \(+1\sigma\) are \(-1\) and \(+1\), respectively.
• The \(z\)-scores for \(-2\sigma\) and \(+2\sigma\) are \(-2\) and \(+2\), respectively.
• The \(z\)-scores for \(-3\sigma\) and \(+3\sigma\) are \(-3\) and \(+3\), respectively.
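The percentages in the empirical rule are rounded values of exact areas under the standard normal curve. As a short check, assuming the scipy.stats library, the area between z-scores of −k and +k can be computed for k = 1, 2, and 3:

```python
from scipy.stats import norm

areas = {}
for k in (1, 2, 3):
    # area between z = -k and z = +k under the standard normal curve
    areas[k] = norm.cdf(k) - norm.cdf(-k)
    print(k, round(areas[k], 4))
```

The results are close to 0.68, 0.95, and 0.997, which is why the rule is described as giving approximate areas.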

EXAMPLE 3.28

Problem

An automotive designer is interested in designing automotive seats to accommodate the heights for about
95% of customers. Assume the heights of adults follow a normal distribution with mean of 68 inches and
standard deviation of 3 inches. For what range of heights should the designer model the car seats to
accommodate 95% of drivers?

Solution

According to the empirical rule, the area under the normal curve within two standard deviations of the
mean is 95%. Thus, the designer should design the seats to accommodate heights that are within two standard
deviations of the mean. The lower bound of heights would be \(68 - 2(3) = 62\) inches, and the upper
bound of heights would be \(68 + 2(3) = 74\) inches. Thus, the car seats should be designed to accommodate driver
heights between 62 and 74 inches.
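The bounds in Example 3.28 can be reproduced in a few lines of Python. The sketch below also compares them with the exact central 95% interval, computed with norm.interval() from scipy.stats; that comparison is an addition for illustration and is not part of the original solution.

```python
from scipy.stats import norm

mean, sd = 68, 3  # height parameters from Example 3.28, in inches

# empirical rule: about 95% of values lie within two standard deviations
lower_rule = mean - 2 * sd
upper_rule = mean + 2 * sd

# exact central 95% interval for comparison
lower_exact, upper_exact = norm.interval(0.95, loc=mean, scale=sd)

# actual area covered by the empirical-rule interval
coverage = norm.cdf(upper_rule, mean, sd) - norm.cdf(lower_rule, mean, sd)

print(lower_rule, upper_rule)
print(round(lower_exact, 2), round(upper_exact, 2))
print(round(coverage, 4))
```

The exact interval is slightly narrower than 62 to 74 inches because two standard deviations cover a little more than 95% of the area, so the empirical rule gives a convenient, slightly conservative design range.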

EXPLORING FURTHER

Statistical Applets to Explore Statistical Concepts


Applets are very useful tools to help visualize statistical concepts in action. Many applets can simulate
statistical concepts such as probabilities for the normal distribution, use of the empirical rule, creating box
plots, etc.

Visit the Utah State University applet website (https://openstax.org/r/usu) and experiment with various
statistical tools.

Using Python with Probability Distributions


The SciPy library for Python provides a number of functions for calculating probabilities associated with both discrete and
continuous probability distributions such as the binomial distribution and the normal distribution. These functions
are part of a module called scipy.stats (https://openstax.org/r/scipy).

Here are a few of the probability distributions available within scipy.stats:

• binom(): calculate probabilities associated with the binomial distribution
• poisson(): calculate probabilities associated with the Poisson distribution
• expon(): calculate probabilities associated with the exponential distribution
• norm(): calculate probabilities associated with the normal distribution

To use one of these distributions within Python, import it from scipy.stats with the import command. For example, to import
binom, use the following command:


from scipy.stats import binom

Using Python with the Binomial Distribution


The binom object from scipy.stats allows calculations of binomial probabilities. The probability mass function for
the binomial distribution within Python is referred to as binom.pmf().

The syntax for using this function is binom.pmf(x, n, p)

Where:
n is the number of trials in the experiment
p is the probability of success
x is the number of successes in the experiment

Consider the previous Example 3.24 worked out using the Python binom.pmf() function. A medical
researcher is conducting a study related to a certain type of shoulder surgery. A sample of 20 patients who
have recently undergone the surgery is selected, and the researcher wants to determine the probability that
18 of the 20 patients had a successful result from the surgery. From past data, the researcher knows that the
probability of success for this type of surgery is 92%. Round your answer to 3 decimal places.

In this example:
n = 20 is the number of trials in the experiment
p = 0.92 is the probability of success
x = 18 is the number of successes in the experiment

The corresponding function in Python is written as:

binom.pmf(18, 20, 0.92)

The round() function is then used to round the probability result to 3 decimal places.

Here is the input and output of this Python program:

PYTHON CODE

# import the binom object from the scipy.stats library
from scipy.stats import binom

# define parameters x, n, and p:
x = 18
n = 20
p = 0.92

# use the binom.pmf() function to calculate the binomial probability
# use the round() function to round the answer to 3 decimal places
# use the print() function to display the result
print(round(binom.pmf(x, n, p), 3))

The resulting output will look like this:

0.271
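As a cross-check, the same probability can be computed directly from the binomial formula, and a related question (the probability that at least 18 of the 20 surgeries succeed) can be answered with the survival function binom.sf(), which returns the probability of more than a given number of successes. The "at least 18" question is an illustrative extension and is not part of the original example.

```python
from math import comb
from scipy.stats import binom

n, p = 20, 0.92  # trials and probability of success from the surgery example
x = 18           # number of successes of interest

# binomial formula: C(n, x) * p^x * (1 - p)^(n - x)
manual = comb(n, x) * p**x * (1 - p)**(n - x)
p_exactly_18 = binom.pmf(x, n, p)

# P(X >= 18) = P(X > 17), which is what binom.sf(17, n, p) returns
p_at_least_18 = binom.sf(x - 1, n, p)

print(round(manual, 3), round(p_exactly_18, 3))
print(round(p_at_least_18, 3))
```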

Using Python with the Normal Distribution


The norm object from scipy.stats allows calculations of normal probabilities. Areas under the normal curve are given by
the cumulative distribution function (CDF), which is distinct from the probability density function and is referred
to as norm.cdf() within Python. The norm.cdf() function returns the area under the normal probability density
function to the left of a specified measurement.

The syntax for using this function is


norm.cdf(x, mean, standard_deviation)

Where:
x is the measurement of interest
mean is the mean of the normal distribution
standard_deviation is the standard deviation of the normal distribution
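In the SciPy documentation, the mean and standard deviation arguments are named loc and scale, so they can also be passed as keywords. The sketch below, which uses the salary figures from Example 3.27, shows that the positional and keyword forms agree, and that norm.sf() returns the area to the right directly. The variable names are chosen only for illustration.

```python
from scipy.stats import norm

# salary figures from Example 3.27
x = 68000
mean = 60000
standard_deviation = 7500

# area to the left of x: positional and keyword forms are equivalent
left_positional = norm.cdf(x, mean, standard_deviation)
left_keyword = norm.cdf(x, loc=mean, scale=standard_deviation)

# area to the right of x: norm.sf() is the survival function, 1 - cdf
right_tail = norm.sf(x, loc=mean, scale=standard_deviation)

print(round(left_positional, 3), round(left_keyword, 3), round(right_tail, 3))
```

Using the keyword form can make it clearer which argument is the mean and which is the standard deviation.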

Let’s work out the previous Example 3.27 using the Python norm.cdf() function.

Suppose that at a software company, the mean employee salary is $60,000 with a standard deviation of $7,500.
Use Python to calculate the probability that a random employee earns more than $68,000.

In this example:
x = 68000 is the measurement of interest
mean = 60000 is the mean of the normal distribution
standard_deviation = 7500 is the standard deviation of the normal distribution

The corresponding function in Python is written as:


norm.cdf(68000, 60000, 7500)

The round() function is then used to round the probability result to 3 decimal places.

Notice that since this example asks to find the area to the right of a salary of $68,000, we can first find the area
to the left using the norm.cdf() function and subtract this area from 1 to then calculate the desired area to
the right.

Here is the input and output of the Python program:

PYTHON CODE

# import the norm object from the scipy.stats library
from scipy.stats import norm

# define parameters x, mean and standard_deviation:
x = 68000
mean = 60000
standard_deviation = 7500

# use the norm.cdf() function to calculate the normal probability - note this is
# the area to the left
# subtract this result from 1 to obtain the area to the right of the x-value
# use the round() function to round the answer to 3 decimal places
# use the print() function to display the result
print(round(1 - norm.cdf(x, mean, standard_deviation), 3))

The resulting output will look like this:

0.143

Access for free at openstax.org
