Statistic Analysis
PRELIMS

STATISTICS
● the branch of mathematics that transforms data into useful information for decision makers.
  ➜ DESCRIPTIVE STATISTICS
    ★ Collecting, summarizing, and describing data
  ➜ INFERENTIAL STATISTICS
    ★ Drawing conclusions and/or making decisions concerning a population based only on sample data

WHAT IS STATISTICS? (Doane & Seward, 2019)
● is the science of collecting, organizing, analyzing, interpreting, and presenting data.
● A STATISTIC is a single measure, reported as a number, used to summarize a sample data set; for example, the average height of students in a university.
● Examples of statistics:
  ➜ average height for the length of the gowns
  ➜ maximum height to design the height of the doorways of the classrooms, etc.

USES OF STATISTICS (Doane & Seward, 2019)
● AUDITING
  ➜ The firm has learned that some invoices are being paid incorrectly, but it doesn’t know how widespread the problem is. A sample of invoices can be used to estimate the proportion of incorrectly paid invoices.
● MARKETING
  ➜ Many companies use Customer Relationship Management (CRM) to analyze customer data from multiple sources. With statistical and analytics tools such as correlation and data mining, they identify specific needs of different customer groups, and this helps them market their products and services more effectively.
● HEALTH CARE
  ➜ Evaluate 100 incoming patients using a 42-item physical and mental assessment questionnaire.
● QUALITY IMPROVEMENT
  ➜ Initiate a triple inspection program, setting penalties for workers who produce poor-quality output.
● PURCHASING
  ➜ A food producer purchases plastic containers for packaging its product. Inspection of the most recent shipment of 500 containers found that 3 of the containers were defective. The supplier’s historical defect rate is .005. Has the defect rate really risen or is this simply a “bad” batch?
● MEDICINE
  ➜ Determine whether a new drug is really better than the placebo or if the difference is due to chance.
● OPERATIONS MANAGEMENT
  ➜ Manage inventory by forecasting consumer demand.
● PRODUCT WARRANTY
  ➜ Determine the average dollar cost of engine warranty claims on a new hybrid engine.

KINDS OF STATISTICS (Doane & Seward, 2019)
● DESCRIPTIVE STATISTICS
  ➜ refers to the collection, presentation, and summary of data (either using charts and graphs or using numerical summary).
● INFERENTIAL STATISTICS
  ➜ refers to the generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions.

SOME DEFINITIONS
● VARIABLE
  ➜ is any characteristic of a person or an object that may vary across persons or across different time points.
Age Student number Asset size
MODE
● It is simply the observation or value that occurs most frequently in the data set.
● What is the mode given the following: A, A, B, C, B, C, D, A, A? What do you call the distribution with respect to the mode?
  ➜ Answer: Mode = A; the distribution is unimodal.
● What is the mode of the distribution: 10, 10, 20, 20, 30, 30, 20, 30? What do you call the distribution with respect to the mode?
  ➜ Answer: Modes = 20 and 30 (each occurs three times); the distribution is bimodal.
● What is the mode of the distribution: 500, 500, 700, 600, 700, 600? What do you call the distribution with respect to the mode?
  ➜ Answer: Mode = { } or none. The distribution has no mode.
● The mode is suitable for nominal, ordinal, interval and ratio level variables.
● It is not affected by extreme values.
● There may be no mode.
● There may be several modes.

GEOMETRIC MEAN
● This measure is useful for growth rates.
● It mitigates high extremes.
● It is, however, less familiar.
● It requires that the data be positive.
● Formula: $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
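The mode questions and the geometric mean formula in these notes can be checked with a short Python sketch. The helper names (`modes`, `modality`, `average_growth_rate`) are hypothetical, the "no mode when every value is equally frequent" convention is assumed from the third worked answer, and the revenue figures are made-up illustration data.

```python
from collections import Counter
from statistics import geometric_mean


def modes(values):
    """Return the most frequent value(s); an empty list means "no mode"."""
    counts = Counter(values)
    top = max(counts.values())
    # Convention assumed from the notes: when several distinct values all
    # occur equally often, the distribution is treated as having no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return [v for v, c in counts.items() if c == top]


def modality(values):
    """Label the distribution by its number of modes."""
    k = len(modes(values))
    return {0: "no mode", 1: "unimodal", 2: "bimodal"}.get(k, "multimodal")


def average_growth_rate(series):
    """Geometric mean of period-over-period ratios, minus 1."""
    ratios = [later / earlier for earlier, later in zip(series, series[1:])]
    return geometric_mean(ratios) - 1


print(modes(list("AABCBCDAA")), modality(list("AABCBCDAA")))
print(modes([10, 10, 20, 20, 30, 30, 20, 30]))
print(modes([500, 500, 700, 600, 700, 600]))
print(geometric_mean([2, 8]))                # close to 4, the square root of 2 * 8
print(average_growth_rate([100, 110, 121]))  # hypothetical revenues; close to 0.10
```

`statistics.geometric_mean` raises an error for non-positive inputs, which matches the requirement that the data be positive.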
MEDIAN
● It is simply the middle score or middle value when scores are ranked in order of magnitude.
● It is unique.
● It is relatively unaffected by extreme scores at either end of the distribution.
● Not all values in the distribution contribute to the value of the median.
● It can be used with ordinal, interval and ratio data.
● It is not suitable for nominal data because nominal data have no numerical order.

MEAN
● The mean of a set of numerical data is unique.
● It is the only measure of central tendency where the sum of the deviation of each value from the mean will always be zero.

TRIMMED MEAN
● Computed in the same manner as the arithmetic mean but it omits the highest and lowest k% of data values (e.g., 5%).
● It mitigates the effects of extreme values.
● It has the disadvantage of excluding some data values that could be relevant.

MEASURES OF POSITION
● A measure of position, or quantile, is a general descriptive measurement used to separate quantitative data into distinct groups. To compute quartiles of ungrouped data, the values must first be arranged either in ascending or descending order.
● Quartiles divide the values into four groups of equal size, each comprising 25% of
3 diamla, foronda, gan
observations. If n = 50, 25% of the values are less than or equal to Q1.
● Deciles divide the values into ten groups of equal size, each comprising 10% of observations. If n = 50, 30% of the values are less than or equal to D3.
● Percentiles divide the values into 100 groups of equal size, each comprising 1% of observations. If n = 200, 65% of the values are less than or equal to P65.

TABLES AND CHARTS FOR CATEGORICAL DATA
Summary Table
● A summary table indicates the frequency, amount, or percentage of items in a set of categories so that differences in categories can be seen.

Bar Chart
● A bar chart shows each category as a bar, the length of which represents the amount, frequency, or percentage of values falling under that category.

Pie Chart
● A pie chart shows a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.

TABLES AND CHARTS FOR NUMERICAL DATA
Stem and Leaf Display
● A stem and leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.

Frequency Distribution Table
● A frequency distribution table is a summary table in which the data are arranged into numerically ordered class groupings.

Histogram
● A histogram is the graph of data in a frequency distribution where the class boundaries are shown on the horizontal axis while the vertical axis is either frequency, relative frequency or percentage. Bars of the appropriate heights are used to represent the number of observations within each class.

Line Graph
● A percentage polygon is formed by having the midpoint of each class represent the data in that class and then connecting the sequence of midpoints at their respective class percentages.

MEASURES OF VARIABILITY
● Variation measures the spread, or dispersion, of values in a data set.
  ➜ Range
  ➜ Quartile Deviation
  ➜ Variance
  ➜ Standard deviation
  ➜ Coefficient of Variation
● It measures how far the values lie from the mean.
● It functions as a measure of risk or uncertainty in the field of finance.
● It provides a measure of volatility in considering alternatives for pricing commodities.
● It may be used as a measure of error in the field of forecasting.

RANGE
● The range of a set of data with n observations is defined as the difference between the highest and lowest values.
● The quartile deviation, QD, is the amount of spread within the middle half of the items arranged in an ordered array. It is also called the semi-interquartile range. It is used for ordinal data.

VARIANCE
● The variance is the average (approximately) of squared deviations of values from the mean.

COEFFICIENT OF VARIATION
● The coefficient of variation is the standard deviation divided by the mean, multiplied by 100.
● It is always expressed as a percentage, %.
● It shows variation relative to the mean.
● The CV can be used to compare two or more sets of data measured in different units (e.g. weight in kgs and height in meters).

SUMMARY CHARACTERISTICS
● The more the data are spread out, the greater the range, quartile deviation, variance, and standard deviation.
● The more the data are concentrated, the smaller the range, quartile deviation, variance, and standard deviation.
● If the values are all the same (no variation), all these measures will be zero.
● None of these measures are ever negative.
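A minimal Python sketch of the measures of variability described in these notes (range, quartile deviation, variance, standard deviation, coefficient of variation). The function name `variability_summary` and the sample data are hypothetical. Quartiles come from `statistics.quantiles` with its default "exclusive" method, which is only one of several quartile conventions and may differ slightly from hand rules.

```python
from statistics import mean, quantiles, stdev, variance


def variability_summary(data):
    """Return range, quartile deviation, sample variance, sample SD and CV (%)."""
    q1, _median, q3 = quantiles(data, n=4)  # assumption: default "exclusive" method
    s = stdev(data)                          # sample standard deviation (n - 1)
    return {
        "range": max(data) - min(data),
        "quartile_deviation": (q3 - q1) / 2,  # semi-interquartile range
        "variance": variance(data),           # sample variance (n - 1)
        "std_dev": s,
        "cv_percent": s / mean(data) * 100,   # undefined when the mean is 0
    }


spread_out = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data
no_variation = [5, 5, 5, 5]

print(variability_summary(spread_out))
print(variability_summary(no_variation))  # every measure is zero
```

The second call illustrates the summary characteristics: with identical values every measure is zero, and none of the measures can be negative.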
● Statistic: Mean
  Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Excel Formula: =AVERAGE(Data)
  Pro: Familiar and uses all the sample information
  Con: Influenced by extreme values
● Statistic: Median
  Formula: Middle value in sorted array
  Excel Formula: =MEDIAN(Data)

GROWTH RATES
● A variation on the geometric mean used to find the average growth rate for a time series.
● Given by taking the geometric mean of the ratios of each year’s revenue to the preceding year.

MEDIAN
● In an ordered array, the median is the “middle” number (50% above, 50% below).
● Not affected by extreme values.

LOCATING THE MEDIAN
● The median of an ordered set of data is located at the $(n+1)/2$ ranked value.
● If the number of values is odd, the median is the middle number.
● If the number of values is even, the median is the average of the two middle numbers.
● Note that $(n+1)/2$ is NOT the value of the median, only the position of the median in the ranked data.

MODE
● Value that occurs most often
● Not affected by extreme values
● Used for categorical data
● Used for numerical data primarily when grouped
● There may be no mode
● There may be several modes

TRIMMED MEAN
● To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
● To determine how many observations to trim, multiply k by n and round off the result.

QUARTILE MEASURES
● Quartiles split the ranked data into 4 segments with an equal number of values per segment.
  ➔ The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
  ➔ Q2 is the same as the median (50% are smaller, 50% are larger)
  ➔ Only 25% of the values are greater than the third quartile
● GUIDELINES
  ➔ Rule 1: If the result is a whole number, then the quartile is equal to that ranked value.
  ➔ Rule 2: If the result is a fractional half (2.5, 3.5, etc.), then the quartile is equal to the average of the corresponding ranked values.
  ➔ Rule 3: If the result is neither a whole number nor a fractional half, round the result to the nearest integer and select that ranked value.

DISPERSION
● Variation - “spread” of data points about the center of the distribution in a sample.
● MEASURES OF VARIATION
  ➔ Statistic: Range
    Formula: Xmax - Xmin
    Excel: =MAX(Data)-MIN(Data)
    Pro: Easy to calculate
    Con: Sensitive to extreme data values
  ➔ Statistic: Variance (s²)
    Formula: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
    Excel: =VAR(Data)
    Pro: Plays a key role in mathematical statistics.
    Con: Nonintuitive meaning.
  ➔ Statistic: Standard deviation (s)
    Formula: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
    Excel: =STDEV(Data)
    Pro: Most common measure. Uses same units as the raw data ($, £, ¥, etc.).
    Con: Nonintuitive meaning.
  ➔ Statistic: Coefficient of variation (CV)
    Formula: $CV = 100 \times \frac{s}{\bar{x}}$

LOCATING EXTREME OUTLIERS
Z-SCORE
● To compute the Z-score of a data value, subtract the mean and divide by the standard deviation.
● Z-score - number of standard deviations a data value is from the mean.
● A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
● The larger the absolute value of the Z-score, the farther the data value is from the mean.
● Formula: $Z = \frac{X - \bar{X}}{S}$
  where X represents the data value,
  $\bar{X}$ is the sample mean, and
  S is the sample standard deviation.
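The Z-score rule for extreme outliers can be sketched in Python as follows. The function names and the sample data are hypothetical; the sketch assumes the sample standard deviation (n - 1 denominator) and the cutoff of 3.0 stated in these notes.

```python
from statistics import mean, stdev


def z_scores(data):
    """Z = (X - sample mean) / sample standard deviation, for every value."""
    m, s = mean(data), stdev(data)
    return [(x - m) / s for x in data]


def extreme_outliers(data, threshold=3.0):
    """Values whose Z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


sample = [10] * 19 + [100]        # hypothetical data with one very large value
print(extreme_outliers(sample))   # the value 100 has a Z-score above +3
print(extreme_outliers([1, 2, 3, 4, 5]))
```

One caveat worth knowing: with the sample standard deviation, the largest possible |Z| in a sample of size n is (n - 1)/√n, so in samples of 10 or fewer values no observation can exceed the ±3.0 cutoff.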
NORMAL DISTRIBUTION
● It is the most common continuous distribution.
● Also known as the Gaussian distribution or the bell curve.
● In this distribution, the probability that various values occur within certain ranges or intervals can be calculated.
● ‘Bell Shaped’
● Symmetrical
● Mean, Median and Mode are equal
● Location is characterized by the mean, μ
● Spread is characterized by the standard deviation, σ
● The random variable has an infinite theoretical range: -∞ to +∞

BIVARIATE DATA
● Sample Covariance - measures the strength of the linear relationship between two numerical variables.
● The covariance is only concerned with the strength of the relationship.
● No causal effect is implied.
● Covariance between two random variables:
● Covariance is available as a statistical function and also in Data Analysis.
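The covariance formula itself did not survive in these notes, so the sketch below assumes the usual sample covariance, cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1). The function name `sample_covariance` and the data are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance with an n - 1 denominator (assumed convention)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross_products / (n - 1)


x = [1, 2, 3, 4]                       # hypothetical paired data
print(sample_covariance(x, [2, 4, 6, 8]))   # positive: y rises with x
print(sample_covariance(x, [8, 6, 4, 2]))   # negative: y falls as x rises
```

The sign shows whether the two variables tend to move together or in opposite directions; as the notes say, no causal effect is implied either way.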
ASSESSING NORMALITY
● It is important to evaluate how well the data set is approximated by a normal distribution.
● Normally distributed data should approximate the theoretical normal distribution:
  ➔ The normal distribution is bell shaped (symmetrical) where the mean is equal to the median.
  ➔ The empirical rule applies to the normal distribution.
  ➔ The interquartile range of a normal distribution is about 1.35 standard deviations (some textbooks quote this as 1.33).

THE EMPIRICAL RULE AS APPLIED TO THE NORMAL DISTRIBUTION
● This rule states that for symmetrical bell-shaped data sets, one can find that roughly two out of every three observations are contained within a distance of 1 standard deviation around the mean and roughly ...
● Values above the mean have positive Z-values, values below the mean have negative Z-values.
● Note that the distribution is the same, only the scale has changed. We can express the problem in original units (X) or in standardized units (Z).

NORMAL PROBABILITIES
● Probability is measured by the area under the curve.
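As an illustration of probability as area under the normal curve, the sketch below uses `statistics.NormalDist` from the Python standard library. The helper names and the example parameters (mean 100, standard deviation 15) are hypothetical.

```python
from statistics import NormalDist


def prob_between(lo, hi, mu=0.0, sigma=1.0):
    """Area under the normal curve between lo and hi."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(hi) - dist.cdf(lo)


def iqr_in_sd_units():
    """Interquartile range of the standard normal, in standard deviations."""
    z = NormalDist()
    return z.inv_cdf(0.75) - z.inv_cdf(0.25)


# Share of observations within 1 standard deviation of the mean
print(round(prob_between(-1, 1), 4))

# Same area whether expressed in original units (X) or standardized units (Z)
print(round(prob_between(85, 115, mu=100, sigma=15), 4))

# Interquartile range of a normal distribution, in standard deviations
print(round(iqr_in_sd_units(), 3))
```

The first two results agree, which is the point of the note that only the scale changes between X and Z; the area of roughly 0.68 corresponds to the "two out of every three observations" in the empirical rule.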
DATA MEASUREMENT