EM-104-Module
I. INTRODUCTION
The Role of Statistics in Research. Research is a critical study into the nature of, reasons
for, and consequences of a set of conditions. Research is “re-search,” meaning a voyage of
discovery. “Re” means again and again, and “search” means a voyage of knowledge. It leads to
the enrichment of knowledge.
In research, Statistics functions as a tool for data collection, analysis, and interpretation of
results. Statistics is the tool of all sciences.
The objectives of research are the discovery of new facts and the revision of accepted
theories or laws based on newly discovered evidence.
Statistics – the science which deals with the collection, presentation, analysis and interpretation
of quantitative data, as well as the theories which are used as bases of the analysis of such data.
Statistics is the science of making sense of data.
Variable – is a characteristic of interest measurable on each and every individual or object in the
group.
TYPES OF DATA
1. Qualitative data (or categorical or attribute data) can be separated into different
categories that are distinguished by some nonnumeric characteristics.
2. Quantitative data consist of numbers representing counts or measurements. They may be
either discrete or continuous.
Discrete data result from either a finite or a countable number of possible
values.
Continuous data result from infinitely many possible values that can be
associated with points on a continuous scale in such a way that there are
no gaps or interruptions.
TYPES OF DATA ACCORDING TO LEVELS OF MEASUREMENT
1. Nominal – when assigned numerical values, the values do not contain quantifiable
information, so that mathematical operations cannot be performed on these values.
Ex: name of school, province, brand of cellphone, sex, course in college, and subject in
school. All these variables can be assigned numbers. For example, for sex, 0 for female
and 1 for male can be used.
2. Ordinal – the values of the variable can be arranged in order of magnitude. Ex: year level
(1st, 2nd, …), military rank (sergeant, lieutenant, colonel, …). The values of these variables
can be assigned numbers 1, 2, …, but the assigned numbers are quantifiable only to the
extent that they can be arranged in order of magnitude.
3. Interval – the values can be regarded as points on a number line. The variable is
quantifiable. The difference between two values of an interval variable provides a
numerical measure of the amount by which the values differ. It incorporates the concept
of equality of intervals, but has an arbitrary zero point. Ex: temperature (°C) and time of
arrival of the airplane (2:30 PM).
4. Ratio – represents the actual amount of a variable. It has an absolute zero or origin.
Ex: amount of money in the bank, number of trips, and length of time in minutes late.
There are concepts that need to be well defined and established. These concepts are the
population and the sample. In statistics, population refers to the set of all observations made on
all objects under study for a given characteristic of interest or variable. The number of
observations in a given population is referred to as the population size and is designated as N.
A population may be so large that it may be impossible or impractical for the researcher
to study all its elements. In such a case, the study of a sample from the given population would
be more appropriate.
Basically, there are two broad classifications of sampling, namely: probability and non-probability
sampling.
1. Simple Random Sampling. Basic to all sampling designs, this procedure is suitable when the
population being studied is homogeneous with respect to the characteristic under investigation.
A sample is a simple random sample if all members of the population have equal chance of being
included in the sample. This is usually done by drawing lots, by using a table of random
numbers, or by using a hand-held calculator.
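As a minimal sketch of drawing lots in software, the Python snippet below uses the standard library's `random.sample`, which selects without replacement so that every member has the same chance of inclusion. The population labelled 1 to 100, the sample size of 10, and the seed are illustrative assumptions, not values from this module.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

population = list(range(1, 101))       # hypothetical population labelled 1..100
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

Fixing the seed is only a convenience for reproducing a draw; in an actual survey the seed would normally be left unset or recorded.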
3. Stratified Sampling. This procedure applies when the population of size N is heterogeneous and
can be subdivided into L non-overlapping homogeneous subpopulations, called strata, of sizes
N1, N2, …, NL respectively, such that
N1 + N2 + … + NL = N.
A stratified sample of size n consists of samples of sizes n1, n2, …, nL, drawn independently
from one stratum to another, where
n1 + n2 + … + nL = n.
Allocation of samples. Depending on the size of the sample taken from the strata, stratified
sampling can be categorized as one with equal allocation or proportional allocation.
Equal stratified sampling involves drawing samples of the same size from each stratum. The total
sample size n is divided equally among the L strata.
ni = n/L
Proportionate stratified sampling involves drawing a sample from each stratum in proportion to
the stratum’s share in the total population.
ni = n(Ni/N)
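The two allocation rules, ni = n/L and ni = n(Ni/N), can be sketched in Python as follows. The stratum sizes (500, 300, 200) and the total sample size of 100 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def equal_allocation(n, L):
    """Equal allocation: each of the L strata gets n / L units."""
    return n / L

def proportional_allocation(n, stratum_sizes):
    """Proportional allocation: stratum i gets n * Ni / N units."""
    N = sum(stratum_sizes)
    return [n * Ni / N for Ni in stratum_sizes]

strata = [500, 300, 200]   # hypothetical stratum sizes N1, N2, N3
n = 100                    # hypothetical total sample size

print(equal_allocation(n, len(strata)))
print(proportional_allocation(n, strata))   # [50.0, 30.0, 20.0]
```

In practice the allocated sizes are rounded to whole numbers, so the rounded values may need a small adjustment to sum exactly to n.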
5. Multi-Stage Sampling. This sampling procedure is done in stages. For example, in medical
studies, the provinces selected in the first stage of the analysis may be partitioned into
municipalities, and then the selected municipalities can be partitioned into barangays. Then, from
the selected barangays, the researcher could select a random sample of families wherein data or
information on medical expenditures can be elicited.
where:
𝜎 – standard deviation of the population (or its estimate S)
e – maximum error deemed acceptable
Z – standard normal value for the specified degree of confidence
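Assuming the usual large-population formula n = (Z𝜎/e)², which uses exactly the symbols listed above, the required sample size can be sketched as below. The inputs (Z = 1.96 for 95% confidence, 𝜎 = 15, e = 2) are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_mean(z, sigma, e):
    """Sample size for estimating a mean, assuming n = (z * sigma / e)^2."""
    return math.ceil((z * sigma / e) ** 2)

# Hypothetical inputs: 95% confidence, sigma = 15, acceptable error e = 2.
print(sample_size_mean(1.96, 15, 2))   # 217
```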
2. Determining the required sample size for estimating the mean (N < 100,000)
where:
N – population size
S – standard deviation of the population
3. Determining the required sample size for estimating the proportion (N > 100,000)
where:
𝑃 – initial estimate of the population proportion
Remark: If an initial estimate of P is not possible, then it should be estimated as being 0.50.
Such an estimate is conservative.
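To illustrate why P = 0.50 is the conservative choice, the sketch below assumes the usual large-population formula n = Z²P(1 − P)/e². The product P(1 − P) is largest at P = 0.50, so that value gives the largest sample size. The inputs (Z = 1.96, e = 0.05, and the alternative P = 0.30) are hypothetical.

```python
import math

def sample_size_proportion(z, p, e):
    """Sample size for estimating a proportion, assuming n = z^2 * p * (1 - p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# Hypothetical inputs: 95% confidence and a 5-percentage-point acceptable error.
n_conservative = sample_size_proportion(1.96, 0.50, 0.05)
n_with_estimate = sample_size_proportion(1.96, 0.30, 0.05)
print(n_conservative, n_with_estimate)   # 385 323
```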
4. Determining the required sample size for estimating the proportion (N < 100,000)
1. If there are continuous and categorical variables in the survey, determine the most
important ones. Compute the required sample size for each of these variables.
2. If they are equally important, use the largest sample size. However, consider the
practicality and the cost of using the different sample sizes.
a) If the largest sample size is too costly, adjust or relax the precision to make the sample
size smaller.
b) If the sample sizes are very different from each other, drop some items; these may require
another sampling approach. Combine the variables with similar sample sizes and separate
those variables that may need special methods.
where, 𝑋𝑖 is the value of the ith observation, N is the number of observations, and 𝑖 is
the index of summation whose value ranges from 1 to N.
2. The Median
The median is the middle value in a set of data, where the observations are arranged from
highest to lowest. It is a single value that divides the array of observations into two equal parts,
such that half of the observations are above it and half are below it.
3. The Mode
The mode is the value which occurs most frequently in a given data set. The mode of a
given set of data is determined by inspection.
4. The Midrange
The midrange is the average of the maximum (highest) and minimum (lowest)
observations in the data set.
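The four measures of central tendency described in this part can be computed with Python's standard `statistics` module, as in the sketch below. The eight data values are hypothetical.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 15, 30]   # hypothetical observations

mean = statistics.mean(data)               # sum of the values divided by N
median = statistics.median(data)           # middle value of the ordered data
mode = statistics.mode(data)               # most frequent value
midrange = (max(data) + min(data)) / 2     # average of highest and lowest

print(mean, median, mode, midrange)        # 18.375 16.5 15 21.0
```

With an even number of observations, as here, the median is the average of the two middle values (15 and 18).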
1. The Range
The Range, denoted by R, is defined as the difference between the highest value (HV) and the
lowest value (LV) in the data set. In symbols, R = HV − LV.
2. The Variance
The variance of a set of data, denoted by σ², is the mean of the squared deviations of the
observations from the mean.
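A short Python sketch of the range and the variance, following the definitions given here (difference between highest and lowest value; mean of the squared deviations from the mean). The data values are hypothetical, and `statistics.pvariance` is used only as a cross-check of the hand computation.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Variance: mean of the squared deviations from the mean (divisor N).
N = len(data)
mu = sum(data) / N
variance = sum((x - mu) ** 2 for x in data) / N

print(data_range, variance)                  # 7 4.0
print(statistics.pvariance(data) == variance)   # True
```

Note that the divisor is N, matching the definition above; a sample variance with divisor n − 1 would give a different value.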
1. Combination is a collection or group formed by taking all or part of a given set of objects
WITHOUT regard to the order in which the objects are selected.
\[nCr = \binom{n}{r} = \frac{n!}{(n-r)!\,r!}\]
where: nCr – number of combinations of n objects taken r at a time
Ex. On a farm, 50 plots are available for soil analysis, but the budget covers only 4 plots. How
many possible sets of 4 plots are there for analysis?
\[50C4 = \binom{50}{4} = \frac{50!}{46!\,4!} = 230300\]
2. Permutation is an ordered collection of all or part of a given set of objects.
\[nPr = \frac{n!}{(n-r)!}\]
where: nPr – number of permutations of n distinct items taken r at a time
Ex. Ten available tractors for inspection are to be lined up in a shop. How many ways can the
tractors be arranged?
\[10P10 = \frac{10!}{(10-10)!} = 10! = 3628800\]
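Both worked examples can be checked with `math.comb` and `math.perm` from Python's standard library (available from Python 3.8). The helper built from factorials is included only to mirror the combination formula term by term.

```python
import math

def combinations_by_formula(n, r):
    """nCr = n! / ((n - r)! * r!), written out from the definition."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))

# 50 plots, 4 chosen, order ignored.
print(math.comb(50, 4))                 # 230300
print(combinations_by_formula(50, 4))   # 230300

# 10 tractors lined up, order matters.
print(math.perm(10, 10))                # 3628800
```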
B. DEFINITION OF TERMS
Random Experiment – any process that can be repeated under basically the same conditions,
and which yields well-defined outcomes.
Sample Space – the set of all possible outcomes of a random experiment. It is usually denoted by
the capital letter S.
Sample Point – an element of the sample space.
n(S) = number of sample points in the sample space S
Event – a subset of the sample space. It is usually represented by capital letters of the English
alphabet.
n(A) = number of sample points in event A
Elementary Event – an event that contains only one sample point
Compound Event – an event containing more than one sample point
A discrete random variable is one that may assume a finite or countably infinite number of
numerical values.
A continuous random variable is a random variable that is not discrete; that is, it is one that
may assume an uncountably infinite number of possible values. It may assume any value in a
given interval.
Probability Distribution of a Random Variable. A listing of all possible values that a random
variable can take on together with their corresponding probabilities is called a probability
distribution. The values of the random variable correspond to events that are mutually exclusive.
This is because each outcome or sample point corresponds to exactly one value of the random
variable. The probability distribution then of a random variable, say X, provides a probability for
each possible value x. These probabilities must sum to 1.
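As a small illustration of these definitions, the sketch below lists the sample space for a hypothetical experiment (two tosses of a fair coin), maps each sample point to the value of X = number of heads, and builds the probability distribution. Exact fractions are used so that the sum-to-1 check is not affected by rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S for two tosses of a fair coin (hypothetical experiment).
sample_space = list(product("HT", repeat=2))

# X = number of heads; each sample point corresponds to exactly one value of X.
counts = Counter(outcome.count("H") for outcome in sample_space)

# Probability distribution: each value x with its probability.
distribution = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
print(sum(distribution.values()))   # 1
```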
The notation f(x; N) indicates that the uniform distribution depends on the parameter N. The
graphical representation of the uniform distribution by means of a histogram always turns out to
be a set of rectangles with equal heights.
A special case of the uniform probability distribution is when the values of the uniform random
variable correspond to the natural numbers 1 to N, that is,
\[f(x; N) = \frac{1}{N}, \quad x = 1, 2, \ldots, N.\]
In this case, the mean and variance of a uniformly distributed random variable are given
respectively by
\[\mu = \frac{N+1}{2} \quad \text{and} \quad \sigma^2 = \frac{N^2-1}{12}.\]
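As a quick check of the 1-to-N case, the sketch below computes the mean and variance directly from the definition, with each value having probability 1/N, and compares them with the standard closed forms (N + 1)/2 and (N² − 1)/12. The choice N = 6, a fair die, is only an illustration.

```python
from fractions import Fraction

def uniform_mean_variance(N):
    """Mean and variance of a uniform variable on 1..N, from the definition."""
    values = range(1, N + 1)
    p = Fraction(1, N)                                  # f(x; N) = 1/N
    mean = sum(x * p for x in values)
    variance = sum((x - mean) ** 2 * p for x in values)
    return mean, variance

N = 6   # hypothetical example: a fair six-sided die
mean, variance = uniform_mean_variance(N)
print(mean, variance)   # 7/2 35/12
```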
2. Binomial Distribution
A binomial experiment is one that possesses the following properties:
a) The experiment consists of n repeated trials.
b) Each trial results in an outcome that may be classified as a success or a failure.
c) The probability of success, denoted by p, remains constant from trial to trial.
d) The repeated trials are independent.
The binomial random variable X is defined as the number of successes in n trials. Since it depends
on the number of trials and the probability of a success on a given trial, then the probability
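distribution of this discrete variable is called the binomial distribution. The probability function or
formula of the binomial distribution is
\[P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.\]
The mean and variance of the binomial random variable are given respectively by
\[\mu = np \quad \text{and} \quad \sigma^2 = np(1-p).\]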
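A minimal Python sketch of the binomial probability function, using the standard form C(n, x) p^x (1 − p)^(n − x), together with a numerical check that the mean and variance equal np and np(1 − p). The values n = 10 and p = 0.5 are hypothetical.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5   # hypothetical number of trials and success probability
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(x * px for x, px in enumerate(probs))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))

print(binomial_pmf(3, n, p))   # probability of exactly 3 successes
print(mean, n * p)             # both should agree
print(variance, n * p * (1 - p))
```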
3. Poisson Distribution
A Poisson experiment is one that possesses the following properties:
a. The number of outcomes occurring in one time interval or specified region is independent
of the number that occur in any other disjoint time interval or region of space.
b. The probability that a single outcome will occur during a very short time interval or in a
small region is proportional to the length of the time interval or the size of the region and
does not depend on the number of outcomes occurring outside this time interval or region.
c. The probability that more than one outcome will occur in such a short time interval or fall
in such a small region is negligible.
The number X of outcomes occurring in a Poisson experiment is called a Poisson random
variable. The probability function or formula of the Poisson distribution is
\[P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots\]
where: λ – the average number of outcomes occurring in the given time interval or
specified region
e – 2.71828…
If n is large and p is small, the binomial probabilities are often approximated by means of the
Poisson formula with λ = np:
\[P(X = x) \approx \frac{e^{-np} (np)^{x}}{x!}\]
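The sketch below compares an exact binomial probability with its Poisson approximation, taking λ = np as is standard for this approximation. The values n = 1000, p = 0.002 and x = 3 are hypothetical, chosen so that n is large and p is small.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with average number of outcomes lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def binomial_pmf(x, n, p):
    """Exact binomial probability of x successes in n trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p, x = 1000, 0.002, 3   # hypothetical: large n, small p
lam = n * p                # Poisson parameter used for the approximation

exact = binomial_pmf(x, n, p)
approx = poisson_pmf(x, lam)
print(exact, approx, abs(exact - approx))
```

For these inputs the two probabilities agree to about three decimal places; the agreement weakens as p grows or n shrinks.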
Estimation refers to estimating the value of the parameter of interest (point estimation and
confidence interval estimation). The objective of estimation is to determine the approximate value
of a population parameter on the basis of a sample statistic.
For example, suppose that in the above example, for n = 25 individuals, the sample mean 𝑋̅ of the
time for the headache to be gone is 13 minutes. 𝑋̅ is called the point estimator, while its specific
value, 13 minutes, is called the point estimate.
INTERVAL ESTIMATION
Interval estimation is based on sample data. Two numbers are calculated to form an interval,
consisting of a lower limit and an upper limit. This interval is expected to contain the parameter
with (1 − α)100 percent confidence. The resulting pair of numbers is called an interval estimate or a
confidence interval.
An alternative statement for the example on the mean effectiveness of paracetamol in treating
headache is “The length of time until the headache is gone after paracetamol intake is between 10.5
and 15.5 minutes.” Here, (10.5 minutes, 15.5 minutes) is called the interval estimate.
A confidence interval (or interval estimate) is a range (or an interval) of values that is likely to
contain the true value of the population parameter with some degree of confidence.
The degree of confidence is the probability 1-𝛼 that the confidence interval contains the
population parameter. This probability is often expressed as the equivalent percentage value. The
degree of confidence is also referred to as the level of confidence or the confidence level.
Margin of Error – is the difference between the observed sample statistic and the value of the
population parameter.
When sample data are used to estimate a population mean, the margin of error, denoted by E, is
the maximum likely (with probability 1 − 𝛼) difference between the observed statistic 𝜃̂ and the
population parameter 𝜃. The margin of error E is also called the maximum error of the estimate
and can be found by multiplying the standard normal critical value Z_{α/2} by the standard
deviation of the sample statistic. The standard deviation of the sample statistic is called the
standard error SE. In symbols,
\[E = Z_{\alpha/2} \cdot SE\]
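The product of the critical value and the standard error can be sketched with `statistics.NormalDist` from Python's standard library, whose `inv_cdf` method returns the standard normal value with a given area to its left. The alpha values and the standard error of 1.25 are hypothetical inputs used only for illustration.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-sided standard normal critical value Z_{alpha/2}."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def margin_of_error(alpha, standard_error):
    """E = Z_{alpha/2} * SE."""
    return z_critical(alpha) * standard_error

# Critical values for a few illustrative alphas (assumed, not taken from the module).
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(z_critical(alpha), 3))

# Hypothetical standard error of 1.25 at 95% confidence.
print(round(margin_of_error(0.05, 1.25), 2))   # 2.45
```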
Interpretation of confidence interval. The limits 𝜃̂ – E and 𝜃̂ + E either enclose the population
parameter 𝜃 or they do not, and it is not possible to determine which is the case without knowing
the true value of 𝜃. It is incorrect to state that the parameter 𝜃 has a 95% chance of falling within
the specific limits obtained, because 𝜃 is a constant, not a random variable: either it falls
within these limits or it does not. No probability is involved once the limits have been computed.
It is correct to say that, in the long run, these methods will result in confidence intervals that
contain 𝜃 in 95% of the cases.
Four commonly used alpha:
\[\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < P < \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
The sample size should be large enough such that there are at least 5 successes and at least 5
failures in the sample.
Example
The National Student Org A (NSOA) is considering a proposal to merge with another National
Student Org B (NSOB). According to the NSOA bylaws, at least three fourths of the organization
membership must approve any merger. A random sample of 2000 current NSOA members
reveals 1600 plan to vote for the merger proposal. What is the estimate of the population
proportion? Develop a 95 percent confidence interval for the population proportion. Basing your
decision on this sample information, can you conclude that the necessary proportion of NSOA
members favor the merger? Why?
Solution
Given: n1 = 1600, n = 2000, 95% CI; therefore α = 0.05 and α/2 = 0.025
First, calculate the sample proportion: 𝑝̂ = n1/n = 1600/2000 = 0.80
Thus, we estimate that 80 percent of the population favor the merger proposal.
We determine the 95% CI using the formula
\[\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
The endpoints of the confidence interval are 0.782 and 0.818. The lower endpoint is greater than
0.75. Hence, we conclude that the merger proposal will likely pass, because the entire interval
estimate lies above the required 75 percent of the organization membership.
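The arithmetic of this example can be reproduced with the short Python sketch below. It takes Z = 1.96 as the table value for 95% confidence; the function name and structure are illustrative only.

```python
import math

def proportion_ci(successes, n, z):
    """Confidence interval p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# NSOA example: 1600 of 2000 sampled members favor the merger.
lower, upper = proportion_ci(1600, 2000, 1.96)
print(round(lower, 3), round(upper, 3))   # 0.782 0.818

# The bylaws require at least three fourths; check the whole interval.
print(lower > 0.75)                        # True
```

The sample has 1600 successes and 400 failures, so the requirement of at least 5 of each is easily met.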
Example
A random sample of 25 fish caught at Taal Lake has a mean length of 35.5 cm with a
standard deviation of 5 cm. Construct a 95% confidence interval for the variability (SD) of the
length of fish in Taal Lake.
Solution
Given: n = 25, 100(1 − 𝛼)% = 95%, 𝛼 = 5% or 0.05,
𝑋̅ = 35.5 cm, S = 5 cm, k = n − 1 = 24
Required: Construct a 95% confidence interval for the variability (SD) of the length of fish in Taal
Lake.