ST Formula Sheet Midterm

The document is a midterm formula sheet covering statistical methods, sampling techniques, types of data, key measures, data display, relationships between variables, probability, expected value, and distributions. It outlines various statistical concepts such as descriptive and inferential statistics, measures of central tendency, variability, and probability rules. Additionally, it includes formulas for binomial and Poisson distributions, as well as methods for analyzing data relationships and distributions.


summaryking.escp.b1 Midterm Formula Sheet

Statistical approach

Statistical methods
§ Descriptive: Description of the properties of the sample data
§ Inferential: Using data from a sample to make forecasts about a larger group

Ways of obtaining data
§ Published source
§ Designed experiment
§ Survey
§ Observational study

Steps in a statistical study
1. Identify goals.
2. Draw a sample from a population.
3. Collect raw data and summarise.
4. Make inferences about the population.
5. Draw conclusions.

Sampling techniques
§ Random sampling: Selection from the population in such a way that every different sample of the same size has an equal chance of selection
§ Systematic sampling: Selection of every kth experimental unit from a list of all experimental units
§ Stratified sampling: Identification of subgroups, selection of a random sample within each subgroup, and putting them together
§ Cluster sampling: Division of a population into clusters and random selection of some of these clusters
§ Convenience sampling: Selection of experimental units that are convenient to reach

Types of data
§ Qualitative (Categorical): Description of attributes
  - Nominal: No order (e.g. hair colour)
  - Ordinal: Order on a scale (e.g. ranking)
§ Quantitative (Numerical): Measures or counts
  - Discrete: Integers (e.g. number of people)
  - Continuous: Decimals (e.g. speed)
§ Identifier variable = Categorical variable with the special property that there is only one case in each category (e.g. ID number)

Key measures

Example data set (grades on a test, n = 18):
3, 5, 7, 7, 8, 8, 9, 10, 11, 11, 11, 12, 12, 14, 15, 16, 18, 18

Mean = Average of a data set (affected by extreme values)
x̄ = (Σ xᵢ) / n = (x₁ + x₂ + … + xₙ) / n
Example: x̄ = (3 + 5 + 7 + … + 18 + 18) / 18 = 10.83

Median = Middle of a data set (not affected by extreme values)
Position of the median in an ordered sequence = (n + 1) / 2
Example: Position of Md = (18 + 1) / 2 = 9.5 (between the 9th and 10th value) → Md = 11

Mode = Most common value in a data set (not affected by extreme values)
Example: Mo = 11
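The three centre measures above can be checked with Python's standard-library `statistics` module, using the sheet's grades data set:

```python
# Mean, median, and mode of the grades example from the sheet,
# computed with the standard-library statistics module.
import statistics

grades = [3, 5, 7, 7, 8, 8, 9, 10, 11, 11, 11, 12, 12, 14, 15, 16, 18, 18]

mean = statistics.mean(grades)      # sum / n
median = statistics.median(grades)  # average of the 9th and 10th values here
mode = statistics.mode(grades)      # most common value

print(round(mean, 2), median, mode)  # 10.83 11.0 11
```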
Range = Difference between the largest and the smallest value
Range = x_largest − x_smallest
Example: R = 18 − 3 = 15

Variance = Degree of dispersion
s² = Σ (xᵢ − x̄)² / (n − 1) = ((x₁ − x̄)² + (x₂ − x̄)² + … + (xₙ − x̄)²) / (n − 1)
(Denominator is n − 1 when using a sample and n when using a population)
Example: s² = ((3 − 10.83)² + (5 − 10.83)² + … + (18 − 10.83)²) / (18 − 1) = 17.91

Standard deviation = Degree of dispersion
s = √s² = √(Σ (xᵢ − x̄)² / (n − 1))
Example: s = √17.91 = 4.23
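The spread measures can be verified the same way; note how `variance`/`stdev` use the n − 1 denominator (sample) while `pvariance` uses n (population):

```python
# Range, sample variance, and sample standard deviation of the
# grades example, via the standard-library statistics module.
import statistics

grades = [3, 5, 7, 7, 8, 8, 9, 10, 11, 11, 11, 12, 12, 14, 15, 16, 18, 18]

r = max(grades) - min(grades)      # range
s2 = statistics.variance(grades)   # sample variance, denominator n - 1
s = statistics.stdev(grades)       # sample standard deviation
p2 = statistics.pvariance(grades)  # population variance, denominator n

print(r, round(s2, 2), round(s, 2))  # 15 17.91 4.23
```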

Data display

Qualitative data
§ Table of counts
§ Pie chart
§ Bar graph
§ Pareto diagram (bars arranged in descending order)

Quantitative data
§ Stem-and-leaf plot
§ Dot plot
§ Absolute frequency histogram (absolute number of a specific event)
§ Relative frequency histogram (proportion of a specific event within the total number)

Stem-and-leaf example (Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41):
2 | 144677
3 | 028
4 | 1

Examining a distribution

Aspects to examine: Mode, Skewness, Unusual features, Variation

Unusual features
§ Outlier
§ Cluster
§ Gap
Whenever one of those is present, it is better to use the median instead of the mean.

Variation
Coefficient of variation: CV = s / x̄
Interpretation:
CV < 1 → Low variability
CV > 1 → High variability
The higher the CV, the greater the level of dispersion around the mean.

Relative standing
Z-score: z = (x − x̄) / s
Interpretation:
z > 0 → Data value is above the mean
z < 0 → Data value is below the mean
z close to 0 → Data value is not unusual
|z| > 2 → Data value is unusual
|z| > 3 → Data value is very unusual
pth percentile = Number such that p% of the data fall below it

Box plot: Each of the four sections (lower whisker to Q1, Q1 to Md, Md to Q3, Q3 to upper whisker) covers 25% of the data; the box spans the interquartile range.

Empirical Rule (only normal (bell-shaped) distributions) vs. Chebyshev's Rule (any distribution)
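The coefficient of variation and z-score can be sketched in a few lines, reusing the sheet's grades example (x̄ = 10.83, s = 4.23):

```python
# Relative-standing sketch: coefficient of variation and z-score,
# with the mean and standard deviation from the grades example.
def z_score(x, mean, sd):
    """z = (x - mean) / sd: how many standard deviations x is from the mean."""
    return (x - mean) / sd

mean, sd = 10.83, 4.23
cv = sd / mean  # coefficient of variation

print(round(cv, 2))                     # 0.39 -> CV < 1: low variability
print(round(z_score(3, mean, sd), 2))   # -1.85 -> below the mean, |z| < 2: not unusual
```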

Relationship between quantitative variables

Scatterplot
Each case in the data set is assigned to a dot of the form (Xᵢ, Yᵢ).
Y-axis: Response / Predicted / Dependent variable
X-axis: Explanatory / Predictor / Independent variable

Elements to describe a scatterplot:

Direction
§ Positive slope: As X increases, Y increases.
§ Negative slope: As X increases, Y decreases.

Shape
§ Linear
§ Curve (→ straighten with transformation)

Strength
§ Strong relationship
§ Moderate relationship
§ No relationship

Unusual features
§ Outliers
§ Clusters / Subgroups

Regression line (Line of best fit)
Equation: ŷ = b₀ + b₁x with b₁ = r · (s_y / s_x) and b₀ = ȳ − b₁ · x̄
Interpolation: Using values within the domain
Extrapolation: Using values outside the domain

Correlation conditions
1. Both variables must be quantitative.
2. The shape of the scatterplot must be linear (a straight line).
3. Outliers do not distort the results.

Correlation coefficient r
Interpretation: r lies between −1 and +1; the sign gives the direction, and the closer |r| is to 1, the stronger the linear relationship.

Residual
Residual = Observed − Predicted = y − ŷ
Interpretation:
Negative residual: Overestimate
Positive residual: Underestimate

Residual scatterplot
If a regression model was well done, the obtained residual scatterplot (plotting residuals against x-values or predicted values) stretches horizontally, has no bends, no or very few outliers, and an approximate bell shape.

Coefficient of determination
r² = Percentage of the variation in Y which has been accounted for by the model

Standard deviation of residuals s_e
= Level of variation of the y-values around the fitted line

Relationship between categorical variables

Contingency table
§ (Absolute) Count
§ Relative frequency

Independence
Two events A and B are independent if: P(A∩B) = P(A) · P(B)

Simpson's Paradox
Statistical situation in which a trend or relationship that is observed between two variables within multiple groups disappears when the groups are combined according to a third variable (lurking variable).

Chi-squared statistic
Difference between the observed counts and the counts that would be expected if there were no relationship between the variables at all:
X² = Σ (Observed − Expected)² / Expected
X² = 0 → Total independence (never in real life)

Cramer's V
V = √(X² / (n · (k − 1))) with k = smaller of the number of rows and columns
The higher V, the stronger the association between the variables.
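A small sketch of the chi-squared statistic and Cramer's V for a 2×2 contingency table; the counts are made up for illustration:

```python
# Chi-squared statistic and Cramer's V for an illustrative 2x2 table.
import math

observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi2 += (o - e) ** 2 / e

k = min(len(observed), len(observed[0]))       # smaller of rows, columns
v = math.sqrt(chi2 / (n * (k - 1)))            # Cramer's V

print(round(chi2, 2), round(v, 2))  # 16.67 0.41
```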

Probability

Fundamental conditions
1. For any event A: 0 ≤ P(A) ≤ 1
2. P(S) = 1 (with S representing the set of all possible outcomes)

Complement
The set of outcomes that are not in the event A is called the complement of A, denoted Aᶜ.

Disjoint
Events that have no outcomes in common are called disjoint or mutually exclusive.

Independent
If the outcome of one event does not influence the outcome of another event, those events are independent.
Independent if: P(A∩B) = P(A) · P(B)

Calculation rules
Complement Rule: P(Aᶜ) = 1 − P(A)
Addition Rule (for disjoint events): P(A∪B) = P(A or B) = P(A) + P(B)
Multiplication Rule (for independent events): P(A∩B) = P(A and B) = P(A) · P(B)
General Addition Rule: P(A∪B) = P(A or B) = P(A) + P(B) − P(A∩B)
General Multiplication Rule: P(A∩B) = P(A and B) = P(A) · P(B|A) = P(B) · P(A|B)

Law of large numbers
As a random trial is repeated over and over again, the proportion of times that an event occurs gets closer and closer to a single value (empirical probability).
Empirical probability (in the long run): P(A) = (Number of times A occurs) / (Number of trials)
Requirements:
1. The probability for each event remains the same for each trial.
2. The outcome of a trial is not influenced by the outcome of previous trials.
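The law of large numbers can be sketched with a quick simulation; a fair coin (P(heads) = 0.5) is an assumed example, not from the sheet:

```python
# Law of large numbers sketch: the empirical probability of heads
# approaches the true probability 0.5 as the number of trials grows.
import random

random.seed(42)  # fixed seed so the run is reproducible

for trials in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(trials))
    print(trials, heads / trials)  # proportion drifts toward 0.5
```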
Conditional probability
P(B|A) is the probability of event B occurring, given that event A occurs.
P(B|A) = P(B given A) = P(A∩B) / P(A)

Sampling without replacement (the drawn individual does not return to the pool) is an instance of working with conditional probability. When dealing with a large population, sampling without replacement does not really matter. However, in a small population, probabilities need to be adjusted accordingly.

Circle notation (Venn diagrams)
§ Whole circles: P(A) = whole left circle, P(B) = whole right circle, so P(A∪B) = P(A) + P(B) − P(A∩B)
§ Non-overlapping regions: P(A only) = left circle without the intersection, P(B only) = right circle without the intersection, so P(A∪B) = P(A only) + P(B only) + P(A∩B)

Tree diagram
All final outcomes are disjoint, and their probabilities must add up to 1. To calculate the probability of a final outcome, all probabilities of the branches leading towards that outcome are multiplied together.

Bayes' Rule
P(A|B) = P(B|A) · P(A) / P(B)

Example:
Given: P(Cancer) = 0.05, P(Smoker) = 0.10, P(Smoker|Cancer) = 0.20
→ P(Cancer|Smoker) = (0.20 · 0.05) / 0.10 = 0.1
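The sheet's Bayes example can be checked numerically:

```python
# Bayes' Rule: P(A|B) = P(B|A) * P(A) / P(B), applied to the
# sheet's cancer/smoker example.
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A), and P(B)."""
    return p_b_given_a * p_a / p_b

# P(Cancer) = 0.05, P(Smoker) = 0.10, P(Smoker|Cancer) = 0.20
p = bayes(p_b_given_a=0.20, p_a=0.05, p_b=0.10)
print(round(p, 2))  # 0.1
```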

Expected value
= Long-run average value of repeated trials of a statistical experiment
General formula: μ = E(X) = P(X = x₁) · x₁ + … + P(X = xₙ) · xₙ = Σ P(X = xᵢ) · xᵢ

Measures of variability
Variance: σ² = Var(X) = Σ (xᵢ − μ)² · P(X = xᵢ)
Standard deviation: σ = √Var(X)
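The formulas above can be sketched for a discrete random variable; a fair die roll is an assumed example, not from the sheet:

```python
# Expected value, variance, and standard deviation of a discrete
# random variable (fair six-sided die as an example distribution).
import math

outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mu = sum(x * p for x, p in zip(outcomes, probs))               # E(X)
var = sum((x - mu) ** 2 * p for x, p in zip(outcomes, probs))  # Var(X)
sd = math.sqrt(var)

print(round(mu, 2), round(var, 2), round(sd, 2))  # 3.5 2.92 1.71
```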

Binomial model

Bernoulli trials
Trials with only two possible outcomes (success and failure), where the probabilities are p for success and q = 1 − p for failure, and for which successive trials are independent.
→ The binomial model examines the number of successful trials out of a total of n Bernoulli trials.

Probability
If there are n Bernoulli trials with a probability of success p, the probability of having k successful trials can be calculated like this:
P(X = k) = B(n, p, k) = (n choose k) · pᵏ · qⁿ⁻ᵏ with (n choose k) = n! / (k! · (n − k)!)

Expected value: E(X) = n · p
Variance: Var(X) = n · p · q
Standard deviation: σ = √(n · p · q)
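The binomial probability formula maps directly onto `math.comb`; n = 10 coin flips with p = 0.5 is an assumed example:

```python
# Binomial pmf: P(X = k) = C(n, k) * p**k * q**(n - k), with q = 1 - p.
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
print(round(binom_pmf(n, p, 5), 4))  # 0.2461 = P(exactly 5 heads)
print(n * p)                         # E(X) = 5.0
print(n * p * (1 - p))               # Var(X) = 2.5
```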

Poisson distribution
Limiting case of the binomial distribution with a large number of trials (n → ∞) and a small probability of success (p → 0).
If a random variable X follows a Poisson distribution, the probability of having x events per unit of measurement is given by:
P(X = x) = (λˣ · e⁻λ) / x! with λ = Mean number of events per unit of measurement

E(X) = Var(X) = λ
σ = √λ
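The Poisson pmf is likewise a one-liner; λ = 2 events per unit is an assumed example:

```python
# Poisson pmf: P(X = x) = lam**x * exp(-lam) / x!.
import math

def poisson_pmf(lam, x):
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 2.0  # mean number of events per unit of measurement
print(round(poisson_pmf(lam, 0), 4))  # 0.1353 = P(no events)
print(round(poisson_pmf(lam, 2), 4))  # 0.2707 = P(exactly 2 events)
print(round(math.sqrt(lam), 2))       # 1.41 = standard deviation sqrt(lam)
```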
