Module 4.1.1 Data Management
- Practice of organizing and maintaining data processes to meet ongoing information lifecycle needs
Measures of central tendency

1. Mean
- The sum of all values in a dataset divided by the total number of values
pros: Can be used with both discrete and continuous data
cons: It is susceptible to the influence of outliers

1.1 Arithmetic Mean (population mean)
- Average of a complete set of data (population)
1.2 Sample Mean
- The average of a subset of data taken from a larger population (sample)
1.3 Weighted Mean
- An average computed by giving different weights to some of the individual values
Legend:
x = repeating value
w = number of occurrences (the weight)
x̄ (x with a line on top) = mean

2. Median
- Middle score for a set of data that has been arranged in order of magnitude
- Formula (location of median): position of the median = (n + 1)/2
Legend:
n = total number of data values in the sample
Note:
* if n is odd, the median is the middle value
* if n is even, the median is the average of the 2 middle values

3. Mode
- The most frequent score in the data (what value pops up the most)

Type of variable (based on level of measurement) | Best measure
nominal | mode
ordinal | median
interval/ratio (not skewed) | mean
interval/ratio (skewed) | median

Legend:
Nominal = data can be categorized
Ordinal = can be categorized and ranked
Interval = can be categorized, ranked, and evenly spaced
Ratio = can be categorized, ranked, evenly spaced, and has a natural zero
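The three measures (and the weighted mean) can be sketched with Python's standard-library statistics module; the dataset here is invented for illustration:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 4, 8]

# Mean: sum of all values divided by the number of values
mean = statistics.mean(data)        # sum(data) / len(data)

# Median: middle value of the sorted data (position (n+1)/2)
median = statistics.median(data)

# Mode: the value that occurs most often
mode = statistics.mode(data)

print(mean, median, mode)  # 5.7 5.5 8

# Weighted mean: each value contributes in proportion to its weight w
values  = [90, 80, 70]
weights = [3, 2, 1]   # e.g., units per subject (invented)
weighted_mean = sum(x * w for x, w in zip(values, weights)) / sum(weights)
print(weighted_mean)  # 500 / 6 = 83.33...
```

Note that the mode (8) is untouched by the outlier-prone mean, which matches the pros/cons above.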
Examples w/ explanation:
nominal: Ethnicity
- Can't be ranked
ordinal: Top 5 Olympic medalists
- Does not tell you how close or far they are in terms of number of wins
interval: Temperature in Celsius
- There are equal intervals of one degree, but the zero point is not a true zero since the measurement can reach negative degrees Celsius
ratio: Height
- It is non-negative and has a true zero

Legend:
xi = each individual value
μ = population mean
x̄ = sample mean
N = number of values in the population
n = number of values in the sample

Measures of variation
- Gives information on the spread or variability of the data values
- Although 2 datasets could have the same center, their variation could be very different

Steps on how to calculate (variance):
1. Calculate the mean
2. Subtract the mean from each value
3. Square each deviation
4. Add up all the squared deviations
5. Divide by the number of values
   5.1 if population, divide by N as is
   5.2 for sample, divide by n - 1
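The five steps, written out explicitly rather than calling statistics.pvariance / statistics.variance (the dataset is invented):

```python
def variance(data, sample=False):
    """Variance via the steps: mean -> deviations -> squares -> sum -> divide."""
    mean = sum(data) / len(data)                      # step 1
    squared_devs = [(x - mean) ** 2 for x in data]    # steps 2-3
    total = sum(squared_devs)                         # step 4
    # step 5: divide by N for a population, by n - 1 for a sample
    divisor = len(data) - 1 if sample else len(data)
    return total / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))               # population variance: 4.0
print(variance(data, sample=True))  # sample variance: 32/7 ≈ 4.571
```

The n - 1 divisor (Bessel's correction) makes the sample variance slightly larger, compensating for estimating the mean from the same sample.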
1. Range
- Used for a quick, obvious comparison of data sets
- Formula: Range = highest value - lowest value
- Considered the weakest measure of spread
Note: heavily influenced by extreme values (outliers) and only compares 2 values

2. Variance (σ²)
- A measure of how far a set of data are dispersed from the mean
- It is non-negative since each term in the variance is squared
- All units are squared, ex: a set of weights in kg will be given in kg squared

3. Standard deviation (σ)
- Measures the deviation of data from its mean
- Step 6 of the calculation: take the square root of the variance

4. Coefficient of variation (CV)
- This kind of measure allows two or more distributions measured in the same or different units to be compared
- Shows how big the standard deviation is compared to the mean
- Formula: CV = σ / x̄ × 100%
Legend:
σ = standard deviation
x̄ = mean
Note: the lower the CV, the lesser the dispersion of the data values

5. Interquartile Range (IQR)
- Defines the difference between the third and the first quartile
- Formula: IQR = upper quartile - lower quartile = Q3 - Q1
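Range, standard deviation, and CV side by side, using the statistics module (the weights are invented example data):

```python
import statistics

data = [60, 62, 65, 70, 73]  # e.g., weights in kg

rng = max(data) - min(data)      # range: highest value - lowest value
sigma = statistics.pstdev(data)  # population standard deviation
mean = statistics.mean(data)
cv = sigma / mean * 100          # CV expressed as a percentage

print(rng)           # 13
print(round(cv, 1))  # ≈ 7.4 (a small CV: low dispersion relative to the mean)
```

Because CV is unitless (kg divided by kg), it can be compared against a CV computed on, say, heights in cm, which the raw standard deviations cannot.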
How to calculate the quartiles:
1. Arrange the data (smallest to largest)
2. Find Q2 (the median)
3. Find Q1 (median of the lower half of the data)
4. Find Q3 (median of the upper half of the data)
Note:
Q1 = 25% of the data falls below this value
Q2 = 50% of the data falls below this value
Q3 = 75% of the data falls below this value
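The median-of-halves procedure above can be sketched directly; conventions differ on whether the middle value joins the halves when n is odd, and this sketch excludes it (an assumption, since the notes don't specify):

```python
import statistics

def quartiles(data):
    """Q1, Q2, Q3 via the median-of-halves method."""
    s = sorted(data)                  # step 1: arrange the data
    n = len(s)
    q2 = statistics.median(s)         # step 2: the median
    lower = s[:n // 2]                # lower half (middle value excluded if n is odd)
    upper = s[(n + 1) // 2:]          # upper half
    q1 = statistics.median(lower)     # step 3
    q3 = statistics.median(upper)     # step 4
    return q1, q2, q3

q1, q2, q3 = quartiles([1, 3, 4, 7, 8, 9, 11])
print(q1, q2, q3, q3 - q1)  # 3 7 9 and IQR = 6
```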
Correlation (Pearson)
- Measure describing the way two variables vary together
- Statistic that measures the strength and direction of a linear relationship between two quantitative variables
- r represents the correlation coefficient

Size of Correlation | Interpretation
.90 to 1 (-.90 to -1) | Very high positive (negative) correlation
.70 to .90 (-.70 to -.90) | High positive (negative) correlation
.50 to .70 (-.50 to -.70) | Moderate positive (negative) correlation
.30 to .50 (-.30 to -.50) | Low positive (negative) correlation
.00 to .30 (.00 to -.30) | Negligible correlation

Module 4.4 Simple Regression
- A statistical tool used to quantify the relationship between a single independent variable and a single dependent variable, based on observations that have been carried out in the past
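Pearson's r can be computed from deviations about the two means; the study-hours data below is invented to illustrate the interpretation table:

```python
def pearson_r(x, y):
    """r = Σ(xi - x̄)(yi - ȳ) / sqrt(Σ(xi - x̄)² · Σ(yi - ȳ)²)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

hours  = [1, 2, 3, 4, 5]        # invented: hours studied
scores = [52, 60, 63, 70, 79]   # invented: exam scores
r = pearson_r(hours, scores)
print(round(r, 2))  # 0.99 -> very high positive correlation
```

On Python 3.10+ the same value is available as statistics.correlation(hours, scores).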
Formula for regression: y = a + bx
Legend:
y = dependent variable
x = independent variable
a = intercept (value of y when x = 0)
b = slope (change in y for every 1 unit increase in x)
----------------------------------------------------
Formula for slope b:
b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
n = number of data pairs
----------------------------------------------------
Formula for intercept a:
a = ȳ - b·x̄
x̄ and ȳ (x and y with a line on top) represent the means of x and y
*Means must be calculated first before you can get the intercept
----------------------------------------------------
Interpretation of Regression
“For every 1 extra unit of the independent variable (x), the dependent variable (y) will increase by (b). If the independent variable (x) stays at 0, the predicted value will be (a).”

Hypothesis Testing
- A method of making statistical decisions using experimental data
- Used to test assumptions (claims) about a population parameter based on sample data

Alternative Hypothesis (H₁ or Ha)
- Represents what you aim to support or prove
- Indicates the presence of an effect, difference, or relationship
- Always contains inequality ( ≠, >, < )
- Purpose: proposed if H₀ is rejected
Examples:
● H₁: μ ≠ 50 (the mean is not 50)
● H₁: p < 0.7 (the proportion is less than 70%)
Note: Direction depends on the research question
● One-tailed: H₁ uses < or >
● Two-tailed: H₁ uses ≠

Steps in Hypothesis Testing
1. State H₀ and H₁
2. Choose a significance level (α, usually 0.05)
3. Collect and analyze sample data
4. Compute test statistic (e.g., z, t)
5. Compare with critical value or use p-value
6. Make a decision:
   ○ If p-value ≤ α → Reject H₀
   ○ If p-value > α → Fail to reject H₀
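The slope and intercept formulas for simple regression can be sketched as follows (the x/y pairs are invented, chosen to lie exactly on a line so the result is easy to check):

```python
def fit_line(x, y):
    """Least-squares slope b and intercept a for y = a + bx."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = ȳ - b·x̄ (means computed first)
    return a, b

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]   # exactly y = 1 + 2x
a, b = fit_line(x, y)
print(a, b)  # 1.0 2.0
```

Reading the result with the interpretation above: every extra unit of x raises the prediction by b = 2, and at x = 0 the predicted value is a = 1.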
Null Hypothesis (H₀)
- Represents the default or status quo assumption
- Assumes no effect, no difference, or no relationship between variables
- Always contains equality ( =, ≥, ≤ )
- Purpose: to test whether there's enough evidence against it
Examples:
● H₀: μ = 50 (the population mean is 50)
● H₀: p ≥ 0.7 (the population proportion is at least 70%)
Note: If evidence is strong, we reject H₀

Type I and Type II Errors
- Types of incorrect decisions that may occur in hypothesis testing
- Related to the truth or falsity of the null hypothesis (H₀) and what decision is made based on the data

Type I Error (α)
- Occurs when we reject the null hypothesis (H₀) even though it is actually true
- Also called a false positive
- We detect an effect that isn’t really there
Example:
- A person is diagnosed with a disease (reject H₀) but is actually healthy (H₀ is true)
Controlled by: Significance level (α)
– Common value: α = 0.05

Type II Error (β)
- Occurs when we fail to reject the null hypothesis (H₀) even though it is actually false
- Also called a false negative
- We fail to detect an effect that is really there
Example:
- A person is told they’re healthy (fail to reject H₀) but actually has the disease (H₀ is false)
Related to: Power of the test
– Power = 1 - β

Error Summary Table
Decision Made:
– Reject H₀ → may lead to a Type I Error (α) if H₀ is true
– Fail to Reject H₀ → may lead to a Type II Error (β) if H₀ is false
Legend:
– H₀ = Null Hypothesis
– α = Probability of Type I Error
– β = Probability of Type II Error
– Power = Probability of detecting a true effect (1 − β)
Note:
– Lowering α reduces Type I errors but increases the chance of Type II errors
– Increasing sample size helps reduce both errors

Binomial Distribution
– Used when there are repeated trials, each with two possible outcomes: success or failure
– The probability of success stays the same for each trial
– Each trial is independent (one doesn’t affect the others)
– Think: “yes or no” situations, repeated several times
Example:
– Flipping a coin 10 times and counting how many heads
– Guessing on a multiple-choice quiz with 5 questions
Conditions to use binomial:
– Fixed number of trials (n)
– Two outcomes: success/failure
– Constant probability (p)
– Trials are independent
Formula:
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where C(n, k) = n! / (k!(n − k)!)

Poisson Distribution
– Used to count how often an event happens over time or space
– You don’t know the number of trials, but you know the average rate
– Events happen randomly and independently
– Best for rare events
Example:
– Number of emails received in an hour
– Cars arriving at a toll booth in a minute
– Number of errors in a book
Conditions to use Poisson:
– Events occur randomly and independently
– Happen at a constant average rate (λ)
– Based on time, area, volume, or distance

Z-Score
– A Z-score tells you how far a data point is from the mean, in terms of standard deviations
– Helps standardize different data values for comparison
– A positive Z-score means the value is above the mean
– A negative Z-score means the value is below the mean
– Formula: z = (x − μ) / σ
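A small sketch of the z-score calculation, following the four calculation steps given in these notes (the dataset is invented):

```python
import statistics

data = [50, 60, 70, 80, 90]
mu = statistics.mean(data)       # step 1: find the mean (70)
sigma = statistics.pstdev(data)  # step 2: find the standard deviation

def z_score(x):
    # steps 3-4: subtract the mean, then divide by the standard deviation
    return (x - mu) / sigma

print(z_score(90))  # positive: the value is above the mean
print(z_score(50))  # negative: the value is below the mean
```

Because both 90 and 50 sit the same distance from the mean, their z-scores are equal in size and opposite in sign.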
Steps to Calculate Z-score:
1. Find the mean of the data
2. Find the standard deviation
3. Subtract the mean from the data point
4. Divide the result by the standard deviation

Hypergeometric Distribution
– Used when sampling without replacement from a group
– The probability of success changes with each draw
– The trials are dependent on each other
– Useful when the population is small and known
Example:
– Drawing 5 cards from a deck without replacement and counting the number of kings
– Selecting students from a class and counting how many are seniors
– Picking colored balls from a bag and not putting them back
Conditions to use hypergeometric:
– Population is finite (N)
– Known number of successes in the population (K)
– Drawing a sample of size n without replacement
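The three distributions can be sketched with the standard library's math module (math.comb needs Python 3.8+). The notes only give the binomial formula explicitly; the Poisson and hypergeometric pmfs below are the standard ones, added here for completeness:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials, success prob p each."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k): k events given an average rate lam (λ)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes when drawing n items without replacement
    from a population of N that contains K successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# Flipping a fair coin 10 times: probability of exactly 5 heads
print(binomial_pmf(5, 10, 0.5))     # 0.24609375

# Averaging 4 emails per hour: probability of exactly 2 in an hour
print(round(poisson_pmf(2, 4), 4))  # 0.1465

# Drawing 5 cards from 52 (4 kings): probability of exactly 1 king
print(round(hypergeom_pmf(1, 52, 4, 5), 4))  # ≈ 0.2995
```

Note how the hypergeometric pmf draws from the population directly (counts of successes and failures), reflecting that each draw changes the remaining probabilities, while the binomial keeps p fixed.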