AP Statistics Study Guide
Distribution tells us what values a variable takes and how frequently it takes these values
● Ex: Histograms, box plots, dot plots, scatter plots, stem and leaf plots, and line graphs for
quantitative data
● Ex: Bar graphs, two-way tables, and pie charts for categorical data
A Two-way Table describes two categorical variables, organizing counts according to a row
variable and a column variable
Source:
https://www.statology.org/conditional-relative-frequency-two-way-table/
The Marginal Distribution of one of the categorical variables is the distribution of values of
that variable among all individuals described by the table
● Ex: Marginal distribution of gender: Male: 48/100 = 48% Female: 52/100 = 52%
● The marginal distributions should total to 100%
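As a rough illustration of the marginal distribution described above, the sketch below sums the rows of a two-way table in Python. The column variable (favorite sport) and every cell count are made up for this example; only the row totals of 48 and 52 match the gender example.

```python
# Hypothetical two-way table: rows are gender, columns are an invented
# second categorical variable. Only the row totals (48 and 52) come from
# the example above; the individual cell counts are assumptions.
table = {
    "Male":   {"Baseball": 13, "Basketball": 15, "Football": 20},
    "Female": {"Baseball": 23, "Basketball": 16, "Football": 13},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of gender: each row total divided by the grand total.
marginal_gender = {
    gender: sum(row.values()) / grand_total for gender, row in table.items()
}

for gender, proportion in marginal_gender.items():
    print(f"{gender}: {proportion:.0%}")
```

The proportions add to 1, which matches the rule that a marginal distribution totals 100%.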
Source: https://www.khanacademy.org/math/ap-statistics/quantitative-data-ap/describing-comparing-distributions/v/classifying-distributions
● Outliers
● Context: What does the distribution represent?
● Center: The median or mean (depending on distribution)
● Spread: The range (most of the time) or the standard deviation
Stem-and-Leaf Plots are a simple graphical display for small sets of data
● They give us a visual of the distribution while including the actual numerical values
Source:
https://en.wikipedia.org/wiki/Stem-and-leaf_display
Histograms are graphs that display the distribution of a quantitative variable by showing each
interval of the values as a bar
● The heights of the bars show the frequencies of values in each interval
● Histograms show off distributions very clearly
● Histograms are the most common graph of distribution
Source: https://online.stat.psu.edu/stat500/book/export/html/539
The standard deviation measures the typical distance between each value and the mean
● The “average” squared deviation is called the variance
● The standard deviation is susceptible to outliers
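A minimal Python sketch of these ideas follows. It assumes the sample versions of the formulas (dividing by n - 1), and the data values are invented for illustration. Adding one extreme value shows how strongly the standard deviation reacts to an outlier.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

def sample_sd(values):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
with_outlier = data + [40]        # same data plus one extreme value

print("variance:", sample_variance(data))
print("standard deviation:", sample_sd(data))
print("standard deviation with outlier:", sample_sd(with_outlier))
```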
Source: https://www.simplypsychology.org/boxplots.html
Percentile: The nth percentile of a distribution is the value with n percent of the observations
less than it
● Ex: 60th percentile of data is 50. This means that 60% of the data is less than 50 and 40%
of the data is 50 or above
The z-score tells us how many standard deviations away from the mean an observation falls, and
what direction it falls in
● A positive z-score is above the mean, a negative z-score is below the mean
● Z-scores have no units
● It is also called a standardized value of x, and the formula is $z = \frac{x - \text{mean}}{\text{standard deviation}}$
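The z-score calculation can be sketched in a few lines of Python. The function name and the numbers (a mean of 70 and a standard deviation of 10) are assumptions chosen for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical distribution with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))   # 1.5  (above the mean)
print(z_score(55, 70, 10))   # -1.5 (below the mean)
```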
When data has a regular overall pattern, we can use a simplified model called a density curve to
describe it
● Always on or above the horizontal axis
● It has an area of exactly 1 underneath it
Source:
http://www.stat.yale.edu/Courses/1997-98/101/normal.htm
The Empirical Rule: In the normal distribution with mean m and standard deviation s:
● Approximately 68% of observations fall within one s of m
● Approximately 95% of observations fall within 2s of m
● Approximately 99.7% of observations fall within 3s of m
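The three percentages can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module and works with the standard normal distribution, since the areas are the same for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```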
Source: http://stevegallik.org/cellbiologyolm_statistics.html
The Standard Normal Distribution is the normal distribution with mean 0 and standard
deviation 1
● We obtain this by converting every value into its z-score and representing each data point
as its z-score in the distribution
● This gives us the standard Normal distribution, N(0, 1)
Source: https://statistics-made-easy.com/standard-normal-distribution/
We use Table A to find the proportion of observations in a standard normal distribution that
fall below a given z-score:
● Ex: for z < -1.52, find the intersection of row -1.5 and column 0.02, which gives 0.0643
We can also use the calculator to find the proportion of observations in a normal
distribution that fall within a given interval:
● normalcdf(lower bound, upper bound, mean, standard deviation)
● If we are given an area and need to find the matching value (or z-score), we use
invNorm(area to the left of the value, mean, standard deviation)
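For readers without the calculator at hand, the two commands can be approximated in Python. This is only a sketch: the function names `normalcdf` and `inv_norm` are chosen here to mirror the calculator and are not part of any library, and -1e99 stands in for negative infinity the way it usually does on the calculator.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Value that has the given area to its left under the normal curve."""
    return NormalDist(mean, sd).inv_cdf(area_left)

# Proportion of observations with z < -1.52 (compare with Table A).
print(round(normalcdf(-1e99, -1.52), 4))

# Going the other way: the z-score with an area of 0.0643 to its left.
print(round(inv_norm(0.0643), 2))
```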
A normal probability plot provides a good assessment of the adequacy of the normal model for
a set of data
● We are looking for a linear model to be present to conclude that the distribution is
approximately normal.
Source:
https://mathcracker.com/normal-probability-plot-maker
When analyzing two or more variables, there are two types you should keep in mind:
● Response Variable: Measures the outcome of a study (dependent variable)
● Explanatory Variable: Attempts to explain the observed outcomes (independent
variable)
When examining the relationship between variables, these steps should be taken:
● Plot the data and examine any numerical summaries (five number summary, mean,
standard deviation)
● Describe the scatter plot
○ Direction: positive association, negative association, no association
○ Form: Linear or nonlinear
○ Strength: Weak, moderate, strong
○ Unusual Features: Outliers and clusters
○ Context of the problem
Source:
https://www.mathsisfun.com/data/scatter-xy-plots.html
For a linear association between two quantitative variables, the correlation (r) measures both the
direction and strength of the association
● + means positive direction, - means negative direction
● The closer to 1 or -1, the stronger the association
○ The closer to 0, the weaker the association
● Correlation is NOT resistant to outliers
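The sketch below computes r directly from its definition. The data (hours studied and test scores) are invented, and the extra point added at the end is a deliberate outlier to show how much a single unusual value can change r.

```python
import math

def correlation(xs, ys):
    """Correlation r between two quantitative variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]            # made-up explanatory values
scores = [52, 60, 61, 70, 75]      # made-up response values

r = correlation(hours, scores)
r_with_outlier = correlation(hours + [6], scores + [20])

print("r:", round(r, 3))
print("r with one outlier added:", round(r_with_outlier, 3))
```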
A regression line displays the relationship between two variables, but only when one of the
variables helps explain or predict the other
● It is a model for the data; the equation gives us a compact mathematical description of
what this model tells us about the relationship between y and x
Source: https://learningstatisticswithr.com/book/regression.html
The Coefficient of Determination (r²) measures the percent of the variability in the response
variable that is accounted for by the least-squares regression line
● It does not measure the percent of data values that fall on the line; it describes how much of
the variation in y is explained by the linear relationship with x
● We can find the linear regression line and the correlation coefficient by using LinReg on
our calculator
A residual is the difference between the actual value of y and the predicted value of y by the
regression line
● Residual = y - ŷ
● Least-Squares Regression Line: The line that makes the sum of the squared residuals as
small as possible
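A small Python sketch can tie the regression line, the residuals, and the coefficient of determination together. The data are invented, and the slope and intercept formulas used here are the standard least-squares ones rather than anything specific to this guide.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]            # made-up explanatory values
ys = [52, 60, 61, 70, 75]       # made-up response values

intercept, slope = least_squares_line(xs, ys)
predicted = [intercept + slope * x for x in xs]

# Residual = actual y minus predicted y.
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# Coefficient of determination: share of the variation in y explained by the line.
mean_y = sum(ys) / len(ys)
sum_squared_residuals = sum(e ** 2 for e in residuals)
total_variation = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - sum_squared_residuals / total_variation

print(f"y-hat = {intercept:.1f} + {slope:.1f}x")
print("residuals:", [round(e, 2) for e in residuals])
print("r squared:", round(r_squared, 4))
```

The residuals of a least-squares line always sum to zero (up to rounding), which is a quick check that the fit was computed correctly.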
Source:
https://www.statisticshowto.com/least-squares-regression-line/
Residual Plot: A scatter plot that displays the residuals on the vertical axis and the explanatory
variable on the horizontal axis
● If there is no leftover pattern, the regression model is appropriate
● If there is a leftover pattern in the residual plot, consider using a regression model with a
different form.
Source: https://opexresources.com/analysis-residuals-explained/
Observational studies of the effect of one variable on another often fail because of these reasons:
● Lurking Variable: A variable that is not among the explanatory or response variables in
a study but that may influence the response variable
● Confounding: Occurs when two variables are associated in such a way that their effects
on a response variable cannot be distinguished from each other
Probability: The probability of any outcome of a chance process is a number between 0 and 1 that
describes the proportion of times the outcome would occur in a very long series of repetitions
● outcomes that never occur have a probability of 0
● an outcome that happens on every repetition has a probability of 1
● an outcome that happens half the time has a probability of .5
Law of Large numbers: If we observe more and more repetitions of any chance process, the
proportion of times that a specific outcome occurs approaches its probability
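A quick simulation can make the Law of Large Numbers concrete. The sketch below flips a simulated fair coin; the seed value and the list of repetition counts are arbitrary choices made for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run can be repeated

def proportion_of_heads(num_flips):
    """Simulate num_flips fair coin flips and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The proportion tends to settle near the true probability of 0.5 as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```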
Probability Model: A description of some chance process that consists of two parts: a list of all
possible outcomes and the probability for each outcome.
● Sample Space: A list of all the possible outcomes
● Event: any collection of outcomes from some chance process
If all outcomes in the sample space are equally likely, the probability that event A occurs can be
found using this formula:
● P(A) = number of outcomes in event A / total number of outcomes in the sample space
Two events are mutually exclusive if they have no outcomes in common and can never occur
together
● P(A or B) = P(A) + P(B)
If A and B are any two events resulting from some chance process, the general addition rule says
that:
● P(A or B) = P(A) + P(B) - P(A and B)
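Both addition rules can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die, so all outcomes are equally likely; the events are chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def prob(event):
    """P(event) when every outcome in the sample space is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
greater_than_4 = {5, 6}
one = {1}

# General addition rule: subtract the overlap so it is not counted twice.
p_even_or_big = prob(even) + prob(greater_than_4) - prob(even & greater_than_4)

# Mutually exclusive events share no outcomes, so there is nothing to subtract.
p_even_or_one = prob(even) + prob(one)

print(p_even_or_big)   # 2/3
print(p_even_or_one)   # 2/3
```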
Conditional Probability: The probability that one event happens given that another event is
known to have happened is called a conditional probability
● The conditional probability that B happens given that A has happened is P(B|A)
● To find the conditional probability P(A|B), use this formula:
○ P(A|B) = P(A and B) / P(B), where B is the given event
Independent: Two events are independent if the occurrence of one event has no effect on the
chance that the other will happen
● They are independent if P(A|B) = P(A) and P(B|A) = P(B)
General Multiplication Rule: For any chance process, the probability that events A and B both
occur can be found using the general multiplication rule:
● P(A and B) = P(A) x P(B|A) or P(A and B) = P(B) x P(A|B)
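The conditional probability formula, the independence check, and the general multiplication rule can all be tried on one example. The sketch below again assumes a single roll of a fair die, with events chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def prob(event):
    """P(event) when every outcome in the sample space is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

a = {2, 4, 6}   # event A: the roll is even
b = {4, 5, 6}   # event B: the roll is greater than 3

# Conditional probability: P(A|B) = P(A and B) / P(B).
p_a_given_b = prob(a & b) / prob(b)
p_b_given_a = prob(a & b) / prob(a)

# A and B are independent only if knowing B leaves the probability of A unchanged.
independent = p_a_given_b == prob(a)

# General multiplication rule: P(A and B) = P(A) x P(B|A).
p_a_and_b = prob(a) * p_b_given_a

print(p_a_given_b, independent, p_a_and_b)
```

Here P(A|B) = 2/3 while P(A) = 1/2, so these two events are not independent.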
Tree Diagram: Shows the sample space of a chance process involving multiple stages
Source:
https://www.onlinemathlearning.com/probability-tree-diagrams.html
If A and B are independent events, the probability that A and B both occur is:
● P(A and B) = P(A) x P(B)
Random Variable: a numerical outcome of some chance process
● The probability distribution of a random variable gives its possible values and their
probabilities
Discrete Random Variable: Takes a fixed set of possible values with gaps between them
● Has a countable number of possible values (often finite)
● To find the mean (expected value) of X, multiply each possible value of X by its
probability, then add all of the products
● To find the variance, subtract the mean from each possible value, square the difference,
multiply it by that value's probability, and add the results
○ The square root of this is the standard deviation
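The mean, variance, and standard deviation of a discrete random variable can be sketched as follows. The probability distribution is invented for illustration.

```python
import math

# Hypothetical probability distribution: value of X -> probability.
distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Mean (expected value): multiply each value by its probability, then add.
mean = sum(x * p for x, p in distribution.items())

# Variance: squared distance from the mean, weighted by probability, then add.
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

print("mean:", round(mean, 4))
print("variance:", round(variance, 4))
print("standard deviation:", round(sd, 4))
```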
Continuous Random Variable: Can take any value in an interval on the number line
● Probabilities are areas under a density curve; for a normally distributed variable, use normalcdf
For any two independent random variables X and Y, if S = X + Y, the variance of S is:
● Variance of S = (SD of x)^2 + (SD of y)^2
○ To get the standard deviation of S, take the square root of the variance
For any two independent random variables X and Y, if D = X - Y, the variance of D is:
● Variance of D = (SD of x)^2 + (SD of y)^2
○ It’s the same as adding them!!!
○ To get the standard deviation of D, take the square root of the variance
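A short numeric check shows why variances add while standard deviations do not. The standard deviations of 3 and 4 are made-up values, and X and Y are assumed to be independent.

```python
import math

sd_x = 3.0  # hypothetical standard deviation of X
sd_y = 4.0  # hypothetical standard deviation of Y

# For independent X and Y, variances add for both the sum and the difference.
variance_of_sum = sd_x ** 2 + sd_y ** 2
variance_of_difference = sd_x ** 2 + sd_y ** 2

sd_of_sum = math.sqrt(variance_of_sum)
sd_of_difference = math.sqrt(variance_of_difference)

print(sd_of_sum, sd_of_difference)   # 5.0 5.0
print(sd_x + sd_y)                   # 7.0, which is NOT the SD of the sum
```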
A binomial setting arises when we perform n independent trials of the same chance process and
count the number of times that a particular outcome (a success) occurs.
It must pass these conditions:
● Binary = The possible outcomes of each trial are classified as success or failure
● Independent = Trials must be independent
● Number = The number of trials of the chance process must be fixed in advance
● Same probability = There is the same probability of success p on each trial
If a count of X successes has a binomial distribution with n number of trials and p probability of
success:
● Mean of X = np
● Standard deviation of X = $\sqrt{np(1-p)}$
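The sketch below computes the mean np and the standard deviation sqrt(np(1 - p)) of a binomial count, along with a few binomial probabilities. The values n = 20 and p = 0.3 are assumptions for illustration, and `binomial_pmf` is a helper name chosen here, not a library function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count X with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3   # hypothetical binomial setting

mean = n * p
sd = math.sqrt(n * p * (1 - p))

p_exactly_6 = binomial_pmf(6, n, p)
p_at_most_6 = sum(binomial_pmf(k, n, p) for k in range(0, 7))
p_at_least_7 = 1 - p_at_most_6   # complement of "at most 6"

print("mean:", mean, "sd:", round(sd, 3))
print("P(X = 6):", round(p_exactly_6, 4))
print("P(X <= 6):", round(p_at_most_6, 4))
print("P(X >= 7):", round(p_at_least_7, 4))
```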
When taking an SRS of size n from a population of size N, we can use a binomial distribution to
model the count of success in the sample as long as:
● n < 0.10(N)
As the number of trials increases, the binomial distribution gets closer to a normal one
● Large Counts Condition: approximately normal if np ≥ 10 and n(1-p) ≥ 10
A geometric setting arises when we perform independent trials of the same chance process and
record the number of trials it takes to get one success
It must pass these conditions:
● Binary = The possible outcomes of each trial are classified as success or failure
● Independent = Trials must be independent
● Trials = The variable of interest is the number of trials to obtain the first success
● Same probability = There is the same probability of success p on each trial
The variable Y = the number of trials it takes to get a success in a geometric setting
● To find the probability that the first success happens on the nth trial: geometpdf(p, n)
○ To find the probability that the first success happens on or before the nth trial:
geometcdf(p, n)
● The at most/at least rules are the same as for binomial distributions
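The two calculator commands correspond to simple formulas, sketched below in Python. The function names mirror the calculator for readability and are not library functions; p = 0.2 is an assumed probability of success.

```python
def geometpdf(p, n):
    """P(first success happens on exactly the nth trial)."""
    return (1 - p) ** (n - 1) * p

def geometcdf(p, n):
    """P(first success happens on or before the nth trial)."""
    return 1 - (1 - p) ** n

p = 0.2  # hypothetical probability of success on each trial

print(geometpdf(p, 3))       # two failures, then a success
print(geometcdf(p, 3))       # at most 3 trials
print(1 - geometcdf(p, 3))   # more than 3 trials (complement rule)
```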
The sampling distribution of the sample proportion describes the distribution of values taken
by the sample proportion in ALL POSSIBLE samples of the same size from the same population.
● SD = square root((p(1-p)) / n) *All conditions must be met*
○ Conditions: SRS, Independent, Large Counts
The sampling distribution of the sample mean describes the distribution of values taken by the
sample mean in ALL POSSIBLE samples of the same size from the same population.
● SD = population sd / square root (sample size)
○ Conditions: SRS, Independent, Central Limit Theorem
The Central Limit Theorem states that when n is large (n ≥ 30), the sampling distribution of the
sample mean is approximately normal, regardless of the shape of the population distribution
A Confidence Interval gives an interval of plausible values for a parameter based on sample
data
● The Margin of Error of an estimate describes how far, at most, we expect that estimate
to vary from the true population value.
A Confidence Level gives the overall success rate of the method used to calculate the
confidence interval
A Critical Value is a multiplier that makes the interval wide enough to have the stated capture
rate
When the conditions are met, a C% confidence interval for the unknown proportion p is
$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
● z* is the critical value for the standard Normal curve with C% of its area between -z* and
z*
○ When sampling without replacement, the 10% condition must be met (n < 0.10N)
To summarize, these are the conditions for constructing a confidence interval about a proportion:
● Random
● 10% Condition
● Large Counts Condition
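A one-proportion z interval can be sketched in Python as shown below. The sample (120 successes out of 200) is invented, the function name is a hypothetical one chosen for this example, and the Random and 10% conditions are assumed to hold since they cannot be checked from the numbers alone.

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for a proportion: p-hat +/- z* x standard error."""
    p_hat = successes / n
    # z* leaves the chosen confidence level as the area between -z* and z*.
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_star * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error

successes, n = 120, 200   # made-up sample

# Large Counts check: at least 10 successes and at least 10 failures.
large_counts_ok = successes >= 10 and (n - successes) >= 10

low, high = one_prop_z_interval(successes, n, confidence=0.95)
print(large_counts_ok, round(low, 4), round(high, 4))
```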
When the standard deviation of a statistic is estimated from data, the result is called the standard
error of the statistic
● For a sample proportion, the standard error is $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
These are the four steps you MUST take when constructing a confidence interval:
● State: State the parameter you want to estimate and the confidence level
● Plan: Identify the appropriate inference method and check all three conditions
● Do: If the conditions are met, perform calculations
● Conclude: Interpret your interval in the context of the problem
We can also construct a confidence interval for an unknown population proportion on our
calculator by using Stat > Tests > 1-PropZInt
● We need to input x (the number of successes in the sample, which is the sample size times the
sample proportion), n (the sample size), and the confidence level
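To determine the sample size n that will give us a C% confidence interval for a population
proportion with a maximum margin of error ME, solve the following inequality for n:
$z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le ME$
● Here p̂ is a guessed value for the sample proportion; using p̂ = 0.5 gives the most
conservative (largest) n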
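Solving the margin-of-error inequality for n can be sketched as follows. This assumes the margin of error is z* times the standard error of the sample proportion, uses a guessed proportion of 0.5 when no better guess is available, and rounds up because a sample size must be a whole number. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(margin_of_error, confidence=0.95, p_guess=0.5):
    """Smallest n with z* x sqrt(p_guess(1 - p_guess)/n) <= margin_of_error."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    n = (z_star / margin_of_error) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n)  # always round up to keep the margin within bounds

# Example: 95% confidence with a margin of error of at most 3 percentage points.
print(required_sample_size(0.03, confidence=0.95))
```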
Source: http://www.real-statistics.com/students-t-distribution/t-distribution-basic-concepts/
There is a different t distribution for each sample size, specified by its degrees of freedom
● df = n - 1
● As the degrees of freedom increase, the density curve approaches the standard normal
distribution more closely
When the conditions are met, a C% confidence interval for the unknown mean μ is $\bar{x} \pm t^* \frac{s_x}{\sqrt{n}}$
● t* is the critical value for the t distribution with n - 1 degrees of freedom and C% of its
area between -t* and t*
Null Hypothesis (Ho): The claim we weigh evidence against in a significance test
● The hypothesis that says there is no effect or no change in the population
● Ex: p = 0.8, σ = 2
Alternative Hypothesis (Ha): The claim that we are trying to find evidence for
● The effect that we suspect is true
● The alternative hypothesis is one-sided if it states that a parameter is greater than or less
than the null value
○ Ex: p > 0.8, σ < 2
● The alternative hypothesis is two-sided if it states that a parameter could be either greater
than or less than the null value
○ Ex: p ≠ 0.8, σ ≠ 2
The significance level (α) is the value that we use as a boundary for deciding whether an
observed result is unlikely to happen by chance alone when the null hypothesis is true
● We need to include the significance level in the “State” portion of a significance test
● If a problem does not give us a significance level, use 0.05
The p-value of a test is the probability of getting evidence for the alternative hypothesis as
strong or stronger than the observed evidence when the null hypothesis is true.
● If the p-value is small (less than α), we reject the null hypothesis
○ We conclude that there is convincing evidence for the alternative hypothesis
(include context)
● If the p-value is large (greater than or equal to α), we fail to reject the null hypothesis
○ We conclude that there is not convincing evidence for the alternative hypothesis
(include context)
This is the formula to use when asked to interpret a p-value for a one-tailed test:
● Assuming that the (null hypothesis in context), there is a (p-value) probability of getting a
(sample statistic) of (statistic value) or less in a (sample in context)
● Ex: Assuming that the true proportion of students who turn their homework in time is 0.8,
there is a 0.09 probability of getting a sample proportion of 110/160 or less in a random
sample of 160 students in Ivy’s school
This is the formula to use when asked to interpret a p-value for a two-tailed test:
● Assuming that the (null hypothesis in context), there is a (p-value) probability of getting a
(sample statistic) at least as far from (po) as (statistic value) in either direction in (sample
in context)
● Ex: Assuming that the true proportion of students who turn in their homework in time is
0.8, there is a 0.09 probability of getting a sample proportion at least as far from 0.8 as
0.7 in either direction from a random sample of 160 students in Ivy’s school
This must be included in the conclusion for a significance test:
● State the decision about the null hypothesis (reject Ho or fail to reject Ho), based on the
relationship between the p-value and the significance level
● State whether or not there is convincing evidence for the alternative hypothesis in context
of the problem
To summarize, here is everything you should include in a significance test:
● State: Explain what the experiment is testing
○ State the null and alternative hypotheses you want to test
○ Define the parameter in context
○ Include the significance level
● Plan: Check conditions
○ Name of procedure (what kind of significance test, are you testing mean or
proportion, etc).
○ Random Condition
○ 10% Condition
○ Large Counts Condition
● Do: Perform calculations if conditions are met
○ State the sample statistic in context
○ Show general formula and input numbers
○ State procedure name, test statistic, and p-value
● Conclude: Formula included above
When drawing conclusions from a significance test, there are two types of mistakes we can
make:
● Type I Error: Occurs if a test rejects the null hypothesis when the null hypothesis is
actually true
○ The test finds convincing evidence that the alternative hypothesis is true when it
really isn’t
● Type II Error: Occurs if a test fails to reject the null hypothesis when the alternative
hypothesis is actually true
○ The test does not find convincing evidence that the alternative hypothesis is true
when it really is
Source:
https://www.researchgate.net/figure/Graphical-representation-of-type-1-and-type-2-errors_fig1_268035363
The probability of making a Type I error in a significance test is equal to the significance level
● So, if we decrease the significance level, we also decrease the probability of making a
Type I error
● However, this then increases the probability of making a Type II Error
○ It is important to consider the consequences of each error before deciding on a
significance level
Standardized Test Statistic: Measures how far a sample statistic is from what we would
expect if the null hypothesis were true in standard deviation units
● Standardized test statistic = (statistic - parameter)/standard deviation of statistic
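○ For a population proportion: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$
○ For a population mean: $t = \frac{\bar{x} - \mu_0}{\frac{s_x}{\sqrt{n}}}$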
These are the conditions for using a standardized test statistic (proportion):
● Data must come from a random sample
○ This helps us ensure that 𝑝̂ − 𝑝0 is a good estimate for the difference between the
true value of p and the null value 𝑝0
● The sampling distribution of p̂ must be approximately normal
○ When the large counts condition is met and Ho is true, the standardized test
statistic z has approximately the standard normal distribution
● Individual observations must be independent
○ This allows us to calculate the standard deviation $\sqrt{\frac{p_0(1-p_0)}{n}}$
One Proportion Z-Test: To perform a test of Ho: 𝑝 = 𝑝0, compute the standardized test
statistic
● Find the p-value by calculating the probability of getting a z statistic this large or larger in
the direction specified by the alternative hypothesis
○ We compute this by using the standard normal distribution
● We can also perform one by going to Stat > Tests > 1-PropZTest on the calculator
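The test can be sketched in Python as shown below. The function name, the `alternative` labels, and the sample (72 successes in 100 trials against a null value of 0.8) are all assumptions made for illustration; the conditions for inference are taken as already checked.

```python
import math
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for a test of Ho: p = p0."""
    p_hat = successes / n
    # The standard deviation uses the null value p0, not p-hat.
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    standard_normal = NormalDist()
    if alternative == "less":          # Ha: p < p0
        p_value = standard_normal.cdf(z)
    elif alternative == "greater":     # Ha: p > p0
        p_value = 1 - standard_normal.cdf(z)
    else:                              # Ha: p != p0, both tails
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, p_value

z, p_value = one_prop_z_test(successes=72, n=100, p0=0.8, alternative="less")
alpha = 0.05
print(round(z, 2), round(p_value, 4))
print("reject Ho" if p_value < alpha else "fail to reject Ho")
```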
One Sample t Test for a Mean: To perform a test of Ho: μ = μ0, compute the standardized
test statistic
● Find the p-value by calculating the probability of getting a t statistic this large or larger in
the direction specified by the alternative hypothesis
○ We can run this on our calculator using Stat > Tests > T-Test
There is a link between two-sided tests and confidence intervals for a population mean:
● If a 95% confidence interval for μ does not capture the null value μ0, we can reject the
null hypothesis in a two-sided test at the 0.05 significance level
● If a 95% confidence interval for μ captures the null value μ0, we fail to reject the null
hypothesis in a two-sided test at the 0.05 significance level
The power of a test is the probability that the test will find convincing evidence for Ha when a
specific alternative value of the parameter is true
● Power = 1 - P(Type II error)
● P(Type II Error) = 1 - Power
These are some things you can do to increase the power of a significance test:
● Increase the sample size
● Increase the significance level
● Make the null and alternative parameter values farther apart
Sampling Distribution of p̂1 - p̂2: Choose a simple random sample of size n1 from
population 1 with proportion of successes p1 and an independent simple random sample of size
n2 from population 2 with proportion of successes p2
● The mean of the sampling distribution of p̂1 - p̂2 = p1 - p2
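● The standard deviation of the sampling distribution of p̂1 - p̂2 = $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
○ The confidence interval is therefore $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
■ We can do this on our calculator through Stat > Tests > 2-PropZInt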
○ The 10% condition must be met for both samples
● The sampling distribution of p̂1 - p̂2 is approximately normal if the large counts condition
is met for both samples
In a significance test when comparing two proportions, the null hypothesis has this form:
● p1 - p2 = hypothesized value
○ The hypothesized difference is often 0
○ We then find the p-value by calculating the probability of getting a z statistic this
large or larger in the direction specified by Ha
○ We can do this on our calculator by using Stat > Tests > 2-PropZTest
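A two-proportion z test for a hypothesized difference of 0 can be sketched as follows. One assumption goes beyond the notes above: when Ho says p1 = p2, the standard error is commonly computed from the pooled (combined) sample proportion, and that is what this sketch does. The function name and the sample counts are invented, and the test shown is two-sided.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p_value) for Ho: p1 - p2 = 0."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Assumption: pool the two samples, since Ho claims a common proportion.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up samples: 60 successes out of 100 versus 45 successes out of 100.
z, p_value = two_prop_z_test(60, 100, 45, 100)
print(round(z, 3), round(p_value, 4))
```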
Sampling Distribution of x̅1 - x̅2: Choose a simple random sample of size n1 from
population 1 with mean μ1 and standard deviation σ1 and an independent simple random sample
of size n2 from population 2 with mean μ2 and standard deviation σ2
● The mean of the sampling distribution of x̅1 - x̅2 = μ1 - μ2
● The standard deviation of the sampling distribution of x̅1 - x̅2 = $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
○ The confidence interval is therefore $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
■ We can use this through Stat > Tests > 2-SampTInt on the calculator
○ The 10% condition must be met for both samples
● The sampling distribution of x̅1 - x̅2 is approximately normal if both sample sizes are
large (≥ 30) or if one population is normally distributed and the other sample size is large
In a significance test when comparing two means, the null hypothesis has this form:
● μ1 - μ2 = hypothesized value
○ The hypothesized difference is often 0
○ We then find the p-value by calculating the probability of getting a t statistic this
large or larger in the direction specified by Ha
○ We can do this on our calculator by using Stat > Tests > 2-SampTTest
Source: https://apcentral.collegeboard.org/pdf/ap-statistics-course-and-exam-description.pdf