Statistics Introduction
Statistics
What is statistics?
Statistics is the study and manipulation of data, including ways to gather, review, analyze, and draw conclusions from data.
Why study statistics?
Statistical analysis provides a basis for making better decisions and choosing among courses of action. Statistical
reasoning and methods help you obtain information efficiently and draw useful conclusions from it.
For example, city officials might want to know whether the level of lead in the water supply is within safety standards. Because
not all of the water can be checked, answers must be based on the partial information from samples of water that are collected
for this purpose.
Descriptive Statistics
Interest in the field grew in the 18th century.
At first, descriptive statistics consisted merely of the presentation of data in tables
and charts.
Nowadays, it includes the summarization of data by means of numerical
descriptions and graphs.
Statistical Inference
Statistical inference is concerned with generalizations based on sample data.
When making a statistical inference, always proceed with caution.
One must decide carefully how far to go in generalizing from a given set of data.
Careful consideration must be given to determining whether such generalizations
are reasonable and whether it might be wise to collect more data.
Population vs. Sample
Population:
A population is the collection of all items of interest to our study.
A measurable characteristic of the population, such as the mean or standard deviation, is called a parameter.
A survey of an entire population has no sampling margin of error; the only inaccuracy comes from human error in the responses. However, surveying the entire population is not always possible.
Example: all the students in a class form a population.
Sample:
A sample is a subset of the population. Ideally, it is representative of the population.
A measurable characteristic of the sample is called a statistic.
A survey using a sample of the population gives accurate results only after the margin of error and the confidence interval are taken into account.
Example: the students who regularly attend the class form a sample.
Frequency Distributions
A frequency distribution is a table that divides a set of data into a suitable number
of classes (categories), showing also the number of items belonging to each class.
The table sacrifices some of the information contained in the data.
Instead of knowing the exact value of each item, we only know that it belongs to a
certain class.
Example
Data
245 333 296 304 276 336 289 234 253 292 366 323 309 284 310 338 297 314 305 330 266 391 315 305 290
300 292 311 272 312 315 355 346 337 303 265 278 276 373 271 308 276 364 390 298 290 308 221 274 343
Class        Frequency
(205, 245]   3
(245, 285]   11
(285, 325]   23
(325, 365]   9
(365, 405]   4
Total        50
Note that the class limits are given to as many decimal places as the original data. Had the original data been given to one
decimal place, we would have used the class limits 205.1–245.0, 245.1–285.0, …, 365.1–405.0.
Class Mark and Class Interval
Class Mark: The class marks of a frequency distribution are obtained by averaging
successive class boundaries.
Class Interval: If the classes of a distribution are all of equal length, then
subtracting the lower boundary of a class from its upper boundary gives the class interval.
Class mark: 225, 265, 305, 345, 385
Class Interval: 40
Cumulative Distribution ("less than or equal to" variant)
Interval     Cumulative Frequency
(205, 245]   3
(245, 285]   14
(285, 325]   37
(325, 365]   46
(365, 405]   50
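The tabulation above can be reproduced with a short script. This is a minimal sketch, assuming half-open classes (lower, upper] with boundaries 205, 245, ..., 405; the names `frequency_table`, `boundaries`, and `class_marks` are illustrative and do not come from the slides.

```python
# The 50 observations from the example.
data = [
    245, 333, 296, 304, 276, 336, 289, 234, 253, 292, 366, 323, 309, 284, 310,
    338, 297, 314, 305, 330, 266, 391, 315, 305, 290, 300, 292, 311, 272, 312,
    315, 355, 346, 337, 303, 265, 278, 276, 373, 271, 308, 276, 364, 390, 298,
    290, 308, 221, 274, 343,
]

# Assumed class boundaries for the classes (205, 245], ..., (365, 405].
boundaries = [205, 245, 285, 325, 365, 405]


def frequency_table(values, bounds):
    """Count how many values fall in each class (lower, upper]."""
    classes = zip(bounds[:-1], bounds[1:])
    return [sum(1 for v in values if lo < v <= hi) for lo, hi in classes]


frequencies = frequency_table(data, boundaries)

# Running totals give the "less than or equal to" cumulative distribution.
cumulative = []
running_total = 0
for f in frequencies:
    running_total += f
    cumulative.append(running_total)

# A class mark is the average of the two boundaries of its class.
class_marks = [(lo + hi) / 2 for lo, hi in zip(boundaries[:-1], boundaries[1:])]

print(frequencies)   # [3, 11, 23, 9, 4]
print(cumulative)    # [3, 14, 37, 46, 50]
print(class_marks)   # [225.0, 265.0, 305.0, 345.0, 385.0]
```

The cumulative frequencies agree with the table, and the class marks and the common class interval of 40 follow directly from the boundaries.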
Descriptive Measures: Sample Mean
Given n measurements (data points) x1, x2, ..., xn:
Mean: the sum of the observations divided by the sample size, x̄ = (x1 + x2 + ... + xn) / n.
Descriptive Measures: Sample Median
When is the median preferred over the mean?
Sometimes it is preferable to use the sample median as a descriptive measure of
the center, or location, of a set of data.
This is particularly true when we want to minimize the calculations or to
eliminate the effect of extreme (very large or very small) values.
Question
A sample of six university students responded to the question “How much time, in
minutes, did you spend on the social network site yesterday?”
100 45 60 130 30 35
Find the mean and median.
Mean: 66.67
Median: 52.5
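The two answers can be checked with Python's standard `statistics` module. The sketch below also replaces the largest value with a much larger one to illustrate why the median resists extreme values; that modified data set is hypothetical and is not part of the question.

```python
from statistics import mean, median

# Minutes spent on the social network site (the six responses in the question).
minutes = [100, 45, 60, 130, 30, 35]

print(round(mean(minutes), 2))  # 66.67
print(median(minutes))          # 52.5

# Hypothetical variation: one respondent reports 1300 minutes instead of 130.
with_extreme_value = [100, 45, 60, 1300, 30, 35]

print(round(mean(with_extreme_value), 2))  # 261.67
print(median(with_extreme_value))          # 52.5
```

The mean moves substantially when one value becomes extreme, while the median of the six ordered values stays at the average of the third and fourth observations.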
Descriptive Measures: Deviations from Mean
Data set 1: 1, 2, 3, 4, 5 (mean 3)
Data set 2: −7, −3, 3, 10, 12 (mean 3)
We observe that the dispersion of a set of data is small if the values are closely
bunched about their mean, and that it is large if the values are scattered widely
about their mean.
It would seem reasonable, therefore, to measure the variation of a set of data in
terms of the amounts by which the values deviate from their mean.
The sum of the deviations about the mean is always zero.
Because the deviations sum to zero, we need to remove their signs before combining them.
Absolute value and square are two natural choices.
Taking absolute values, so that each negative deviation is treated as positive,
gives one measure of variation.
However, to obtain the most common measure of variation, we square each
deviation.
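A small sketch can make these points concrete for the two data sets with mean 3 (1, 2, 3, 4, 5 and −7, −3, 3, 10, 12). The helper name `deviation_summary` is an illustrative choice, not something defined in the slides.

```python
from statistics import mean


def deviation_summary(values):
    """Return (sum of deviations, mean absolute deviation, sum of squared deviations)."""
    center = mean(values)
    deviations = [v - center for v in values]
    total = sum(deviations)
    mean_abs = sum(abs(d) for d in deviations) / len(values)
    sum_sq = sum(d ** 2 for d in deviations)
    return total, mean_abs, sum_sq


tight = [1, 2, 3, 4, 5]
spread = [-7, -3, 3, 10, 12]

print(deviation_summary(tight))   # (0, 1.2, 10)
print(deviation_summary(spread))  # (0, 6.4, 266)
```

Both sums of deviations are zero, so the raw sum cannot distinguish the two data sets, whereas the absolute and squared versions are clearly larger for the widely scattered values.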
Descriptive Measures: Variance
The sample variance of n observations x1, x2, ..., xn is s² = Σ (xi − x̄)² / (n − 1), where the sum runs over i = 1, ..., n.
The reason for dividing by n − 1 instead of n is that there are only n − 1 independent deviations
xi − x̄.
Because their sum is always zero, the value of any particular one is always equal to the
negative of the sum of the other n − 1 deviations.
If many of the deviations are large in magnitude, either positive or negative, their squares
will be large and s² will be large. When all the deviations are small, s² will be small.
Example
The delay times (handling, setting, and positioning the tools) for cutting 6 parts on
an engine lathe are 0.6, 1.2, 0.9, 1.0, 0.6, and 0.8 minutes. Calculate s².
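One way to work this example is to apply the definition directly, dividing the sum of squared deviations by n − 1. The sketch below does that and, as a cross-check, compares the result with `statistics.variance` from the standard library, which also uses the n − 1 divisor.

```python
from statistics import mean, variance

# Delay times in minutes for the 6 parts.
delays = [0.6, 1.2, 0.9, 1.0, 0.6, 0.8]

n = len(delays)
x_bar = mean(delays)

# Sum of squared deviations about the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in delays) / (n - 1)

print(round(x_bar, 2))  # 0.85
print(round(s2, 3))     # 0.055

# Cross-check against the library's sample variance.
assert abs(s2 - variance(delays)) < 1e-12
```

By hand, the squared deviations sum to 0.275, and 0.275 / 5 = 0.055 (minute)².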
Descriptive Measures: Standard Deviation
Notice that the units of s² are not those of the original observations.
In the previous example, the data are delay times in minutes, but s² has the unit
(minute)².
Consequently, we define the standard deviation of n observations x1, x2, ..., xn as
the square root of their variance.
The standard deviation is by far the most generally useful measure of variation. Its
advantage over the variance is that it is expressed in the same units as the
observations.
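For the lathe delay times 0.6, 1.2, 0.9, 1.0, 0.6, and 0.8 minutes, the standard deviation can be sketched as the square root of the sample variance. The snippet uses `statistics.stdev`, which is the sample (n − 1) version.

```python
from math import sqrt
from statistics import stdev, variance

# Delay times in minutes from the variance example.
delays = [0.6, 1.2, 0.9, 1.0, 0.6, 0.8]

s = stdev(delays)

print(round(s, 3))  # 0.235

# The standard deviation is the square root of the variance.
assert abs(s - sqrt(variance(delays))) < 1e-12
```

The result, about 0.235, is expressed in minutes, the same unit as the observations.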
Descriptive Measures: Quartiles
In addition to the median, which divides a set of data into halves, we can consider
other division points.
When an ordered data set is divided into quarters, the resulting division points are
called sample quartiles.
The first quartile, Q1, is a value that has one-fourth, or 25%, of the observations
below its value. The first quartile is also the sample 25th percentile P0.25.
Descriptive Measures: Percentile
More generally, the sample 100p-th percentile is a value such that at least 100p% of the
observations are at or below this value, and at least 100(1 − p)% are at or above
this value.
Question
Given the data
136 143 147 151 158 160 161 163 165 167 173 174 181 181 185 188 190 205
Obtain the quartiles and the 10th percentile.
n = 18
First quartile: 18 × 0.25 = 4.5, which is not an integer, so round up to 5.
Q1 = 5th ordered observation = 158
Number of observations at or below 158 = 5 (at least 4.5 required by the definition)
Number of observations at or above 158 = 14 (at least 13.5 required by the definition)
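Second quartile: 18 × 0.5 = 9, which is an integer, so we average the 9th and 10th ordered values.
Q2 = (165 + 167)/2 = 166
Third quartile: 18 × 0.75 = 13.5, round up to 14, so Q3 = 14th ordered observation = 181
10th percentile: 18 × 0.10 = 1.8, round up to 2, so P0.10 = 2nd ordered observation = 143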
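The rule used in these answers (compute n × p; if it is not an integer, round up and take that ordered value; if it is an integer k, average the k-th and (k+1)-th ordered values) can be sketched as a function. The name `sample_percentile` and the choice to pass the percentage as a whole number are assumptions made for this illustration. Library routines such as `statistics.quantiles` use different interpolation rules and may give slightly different values.

```python
import math
from fractions import Fraction


def sample_percentile(values, percent):
    """Sample percentile by the round-up / average rule; percent is a whole number in 1..99."""
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Exact arithmetic avoids floating-point surprises when testing for an integer.
    position = Fraction(n * percent, 100)
    if position.denominator != 1:
        # Not an integer: round up and take that ordered observation.
        return ordered[math.ceil(position) - 1]
    # An integer k: average the k-th and (k+1)-th ordered observations.
    k = int(position)
    return (ordered[k - 1] + ordered[k]) / 2


data = [136, 143, 147, 151, 158, 160, 161, 163, 165,
        167, 173, 174, 181, 181, 185, 188, 190, 205]

print(sample_percentile(data, 25))  # 158
print(sample_percentile(data, 50))  # 166.0
print(sample_percentile(data, 75))  # 181
print(sample_percentile(data, 10))  # 143
```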
Descriptive Measures: Range & Interquartile Range
The minimum and maximum observations also convey information concerning the
amount of variability present in a set of data. Together, they describe the interval
containing all of the observed values.
range = maximum − minimum
The amount of variation in the middle half of the data is described by the
interquartile range.
interquartile range = third quartile − first quartile = Q3 − Q1
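As a brief illustration, both measures can be computed for the 18 observations of the percentile question, taking Q1 = 158 and Q3 = 181 from that worked answer rather than recomputing them.

```python
data = [136, 143, 147, 151, 158, 160, 161, 163, 165,
        167, 173, 174, 181, 181, 185, 188, 190, 205]

# Quartiles taken from the worked percentile question.
q1, q3 = 158, 181

data_range = max(data) - min(data)
interquartile_range = q3 - q1

print(data_range)           # 69
print(interquartile_range)  # 23
```

The range depends only on the two most extreme observations, while the interquartile range ignores the lowest and highest quarters of the data.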
Descriptive Measures: Box Plots
Probability
Experiment
An experiment is any activity or process whose outcome is subject to uncertainty.
Examples:
tossing a coin once or several times
selecting a card or cards from a deck
Sample Space
The sample space of an experiment, denoted by S, is the set of all possible
outcomes of that experiment.
Toss a coin → Sample space {H, T}
Toss two coins → Sample space {HH, HT, TH, TT}
Toss three coins → Sample space {HHH, HHT, HTH, HTT, ...}
Toss n coins → 2^n possible outcomes
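The coin-tossing sample spaces can be enumerated with `itertools.product`, which also confirms the count of 2^n outcomes for small n. The function name `coin_sample_space` is an illustrative choice.

```python
from itertools import product


def coin_sample_space(n):
    """List every outcome of tossing a coin n times, e.g. 'HT' for heads then tails."""
    return ["".join(outcome) for outcome in product("HT", repeat=n)]


print(coin_sample_space(1))  # ['H', 'T']
print(coin_sample_space(2))  # ['HH', 'HT', 'TH', 'TT']

# The number of outcomes doubles with each additional toss.
for n in range(1, 6):
    assert len(coin_sample_space(n)) == 2 ** n
```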
Simple and Compound Event
An event is any collection (subset) of outcomes contained in the sample space S.
An event is simple if it consists of exactly one outcome and compound if it consists
of more than one outcome.
Coin tossed twice
Simple Event → {HH}
Compound Event → {HT,TH}
Set Theory
An event is just a set, so relationships and results from elementary set theory can
be used to study events.
Mutually Exclusive or Disjoint Events
Events A and B are mutually exclusive (disjoint) if they have no outcomes in common, so that
the intersection of A and B contains no outcomes.
Exercise: De Morgan's Laws
Prove the following:
(A ∪ B)′ = A′ ∩ B′
(A ∩ B)′ = A′ ∪ B′
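Because events are sets, Python's built-in `set` type can be used to experiment with them. The sketch below checks De Morgan's laws, (A ∪ B)′ = A′ ∩ B′ and (A ∩ B)′ = A′ ∪ B′, for every pair of events in the two-coin sample space, and shows a pair of disjoint events. The events `A` and `B` are illustrative, and an exhaustive check on one small sample space is a sanity check, not a proof.

```python
from itertools import chain, combinations, product

# Sample space for tossing a coin twice.
S = {"".join(outcome) for outcome in product("HT", repeat=2)}


def complement(event):
    """All outcomes of S that are not in the event."""
    return S - event


A = {"HH", "HT"}  # first toss is heads
B = {"HH", "TH"}  # second toss is heads

print(complement(A | B))  # {'TT'}
print(sorted(complement(A & B)))  # ['HT', 'TH', 'TT']

# Check both laws for every pair of events (subsets of S).
outcomes = sorted(S)
all_events = [set(c) for c in chain.from_iterable(
    combinations(outcomes, r) for r in range(len(outcomes) + 1))]
for E in all_events:
    for F in all_events:
        assert complement(E | F) == complement(E) & complement(F)
        assert complement(E & F) == complement(E) | complement(F)

# Mutually exclusive events share no outcomes.
print({"HH"}.isdisjoint({"TT"}))  # True
print(A.isdisjoint(B))            # False
```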
Properties
Given an experiment and a sample space S, the objective of probability is to assign
to each event A a number P(A), called the probability of the event A, which will
give a precise measure of the chance that A will occur.
Question
The computers of six faculty members in a certain department are to be replaced. Two of
the faculty members have selected laptop machines and the other four have chosen
desktop machines.
Suppose that only two of the setups can be done on a particular day, and the two
computers to be set up are randomly selected from the six (implying 15 equally likely
outcomes; if the computers are numbered 1, 2, . . . , 6, then one outcome consists of
computers 1 and 2, another consists of computers 1 and 3, and so on).
a. What is the probability that both selected setups are for laptop computers? 2C2/15 = 1/15
b. What is the probability that both selected setups are desktop machines? 4C2/15 = 6/15
c. What is the probability that at least one selected setup is for a desktop computer?
(15 − 1)/15 = 14/15
d. What is the probability that at least one computer of each type is chosen for setup?
(2 × 4)/15 = 8/15
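These four answers can be verified by listing the 15 equally likely pairs and counting the favorable ones. In this sketch, computers 1 and 2 are assumed to be the laptops; the numbering is arbitrary and does not affect the probabilities. The helper `probability` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Assumed labelling: computers 1 and 2 are laptops, 3 to 6 are desktops.
computers = {1: "laptop", 2: "laptop",
             3: "desktop", 4: "desktop", 5: "desktop", 6: "desktop"}

# Every unordered pair of computers is one equally likely outcome.
outcomes = list(combinations(computers, 2))


def probability(event):
    """Fraction of outcomes whose pair of machine types satisfies the event."""
    favorable = sum(1 for pair in outcomes if event([computers[c] for c in pair]))
    return Fraction(favorable, len(outcomes))


both_laptops = probability(lambda kinds: kinds.count("laptop") == 2)
both_desktops = probability(lambda kinds: kinds.count("desktop") == 2)
at_least_one_desktop = probability(lambda kinds: "desktop" in kinds)
one_of_each = probability(lambda kinds: len(set(kinds)) == 2)

print(len(outcomes))         # 15
print(both_laptops)          # 1/15
print(both_desktops)         # 2/5
print(at_least_one_desktop)  # 14/15
print(one_of_each)           # 8/15
```

`Fraction` reduces 6/15 to 2/5 automatically, which is the same value as the answer to part b.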
Propositions
A homeowner doing some remodeling requires the services of both a plumbing contractor
and an electrical contractor. If there are 12 plumbing contractors and 9 electrical contractors
available in the area, in how many ways can the contractors be chosen? 12 × 9 = 108
Permutation and Combination(nCk)
A permutation is an ordered arrangement of items (the order matters), whereas a combination
is an unordered selection of items (the order does not matter). The number of combinations of k items
chosen from n is nCk = n! / (k! (n − k)!).
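The difference between the two counts can be seen with `math.perm` and `math.comb` (available in Python 3.8 and later). The example of choosing 2 items from 6 is chosen for illustration.

```python
from math import comb, factorial, perm

n, k = 6, 2

ordered_selections = perm(n, k)    # order matters
unordered_selections = comb(n, k)  # order does not matter

print(ordered_selections)    # 30
print(unordered_selections)  # 15

# Each unordered selection of k items can be arranged in k! different orders.
assert ordered_selections == unordered_selections * factorial(k)

# nCk agrees with the factorial formula n! / (k! (n - k)!).
assert unordered_selections == factorial(n) // (factorial(k) * factorial(n - k))
```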
Question
A production facility employs 20 workers on the day shift, 15 workers on the swing
shift, and 10 workers on the graveyard shift. A quality control consultant is to select
6 of these workers for in-depth interviews.
Suppose the selection is made in such a way that any particular group of 6
workers has the same chance of being selected as does any other group (drawing
6 slips without replacement from among 45).
a. How many selections result in all 6 workers coming from the day shift? What is
the probability that all 6 selected workers will be from the day shift?
Number of selections: 20C6. Probability: 20C6/45C6.
b. What is the probability that all 6 selected workers will be from the same shift?
(20C6 + 15C6 + 10C6)/45C6
c. What is the probability that at least two different shifts will be represented among the
selected workers?
1 − (20C6 + 15C6 + 10C6)/45C6
d. What is the probability that at least one of the shifts will be unrepresented in the
sample of workers?
Define the events:
A1 = day shift workers unrepresented
P(A1) = 25C6 / 45C6
A2 = swing shift workers unrepresented
P(A2) = 30C6 / 45C6
A3 = graveyard shift workers unrepresented
P(A3) = 35C6 / 45C6
P(A1 ∩ A2) = 10C6 / 45C6
P(A1 ∩ A3) = 15C6 / 45C6
P(A2 ∩ A3) = 20C6 / 45C6
P(A1 ∩ A2 ∩ A3) = 0
P(D unrepresented ∪ S unrepresented ∪ G unrepresented) = P(D unrepresented) + P(S unrepresented) + P(G unrepresented)
− P(D unrepresented ∩ S unrepresented) − P(D unrepresented ∩ G unrepresented)
− P(G unrepresented ∩ S unrepresented) + P(D unrepresented ∩ G unrepresented ∩ S unrepresented)
= (25C6 + 30C6 + 35C6 − 10C6 − 15C6 − 20C6 + 0)/45C6 ≈ 0.2885
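All four parts of the production facility question can be evaluated numerically with `math.comb`. This is a sketch under the question's assumption that every group of 6 workers is equally likely; the variable names are illustrative. Exact fractions are used so that the inclusion-exclusion result in part d can be cross-checked without rounding error.

```python
from fractions import Fraction
from math import comb

day, swing, graveyard, k = 20, 15, 10, 6
total = comb(day + swing + graveyard, k)  # number of equally likely groups of 6

# a. All 6 workers from the day shift.
p_all_day = Fraction(comb(day, k), total)

# b. All 6 workers from the same shift (three mutually exclusive cases).
p_same_shift = Fraction(comb(day, k) + comb(swing, k) + comb(graveyard, k), total)

# c. At least two different shifts represented (complement of part b).
p_two_or_more_shifts = 1 - p_same_shift

# d. At least one shift unrepresented, by inclusion-exclusion.
p_no_day = Fraction(comb(swing + graveyard, k), total)
p_no_swing = Fraction(comb(day + graveyard, k), total)
p_no_graveyard = Fraction(comb(day + swing, k), total)
p_only_graveyard = Fraction(comb(graveyard, k), total)  # no day and no swing
p_only_swing = Fraction(comb(swing, k), total)          # no day and no graveyard
p_only_day = Fraction(comb(day, k), total)              # no swing and no graveyard
# All three shifts cannot be unrepresented at once, so the triple term is 0.
p_some_shift_missing = (p_no_day + p_no_swing + p_no_graveyard
                        - p_only_graveyard - p_only_swing - p_only_day)

print(total)                                  # 8145060
print(comb(day, k))                           # 38760
print(round(float(p_all_day), 4))             # 0.0048
print(round(float(p_same_shift), 4))          # 0.0054
print(round(float(p_two_or_more_shifts), 4))  # 0.9946
print(round(float(p_some_shift_missing), 4))  # 0.2885
```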