
Statistical Methods I Outline of Topics: (Tjones@cog - Ufl.edu)

This document outlines topics in statistical methods including descriptive statistics, hypothesis testing, and parametric and nonparametric statistical tests. Descriptive statistics are used to summarize nominal, ordinal, and interval medical data using measures of central tendency like the mean, median, and mode, and measures of dispersion like the range and standard deviation. Key concepts in descriptive statistics include symmetric and skewed data distributions and choosing the appropriate central tendency and dispersion measures based on the measurement scale and shape of the data.


Statistical Methods I

Tamekia L. Jones, Ph.D.
([email protected])
Research Assistant Professor
Children’s Oncology Group Statistics & Data Center
Department of Biostatistics
Colleges of Medicine and Public Health & Health Professions

Outline of Topics

I. Descriptive Statistics
II. Hypothesis Testing
III. Parametric Statistical Tests
IV. Nonparametric Statistical Tests
V. Correlation and Regression

Types of Data

• Nominal Data
– Gender: Male, Female
• Ordinal Data
– Strongly disagree, Disagree, Slightly disagree, Neutral, Slightly agree, Agree, Strongly agree
• Interval Data
– Numeric data: Birth weight

Descriptive Statistics

• Descriptive statistical measurements are used in medical literature to summarize data or describe the attributes of a set of data
• Nominal data – summarize using rates/proportions.
– e.g. % males, % females on a clinical study
– Can also be used for Ordinal data

Descriptive Statistics (contd)

• Two parameters used most frequently in clinical medicine
– Measures of Central Tendency
– Measures of Dispersion

Measures of Central Tendency

• Summary statistics that describe the location of the center of a distribution of numerical or ordinal measurements, where
– A distribution consists of values of a characteristic and the frequency of their occurrence
– Example: Serum Cholesterol levels (mmol/L)
6.8 5.1 6.1 4.4 5.0 7.1 5.5 3.8 4.4

Measures of Central Tendency (contd)

Mean – used for numerical data and for symmetric distributions

Median – used for ordinal data or for numerical data where the distribution is skewed

Mode – used primarily for multimodal distributions

Measures of Central Tendency (contd)

Mean (Arithmetic Average)
• Sensitive to extreme observations
− Replace 5.5 with, say, 12.0
The new mean = 54.7 / 9 = 6.08
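The effect of one extreme observation on the mean can be checked directly. The following is a minimal Python sketch using the serum cholesterol values from the example; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean

# Serum cholesterol levels (mmol/L) from the slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

original_mean = mean(cholesterol)

# Replace the observation 5.5 with an extreme value of 12.0
with_outlier = [12.0 if x == 5.5 else x for x in cholesterol]
new_mean = mean(with_outlier)

print(f"original mean           = {original_mean:.2f}")
print(f"mean with 12.0 for 5.5  = {new_mean:.2f}")
```

A single replaced value moves the mean from about 5.36 (quoted as 5.35 in the slides) to 6.08.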

Measures of Central Tendency (contd)

Median (Positional Average)
• Middle observation: half the values are less than and half the values are greater than this observation
• Order the observations from smallest to largest
3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.8 7.1
• Median = middle observation = 5.1
• Less sensitive to extreme observations
• Replace 5.5 with, say, 12.0
• New Median = 5.1

Measures of Central Tendency (contd)

Mode
• The observation that occurs most frequently in the data
• Example: 3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.8 7.1
Mode = 4.4
• Example: 3.8 4.4 4.4 5.0 5.1 5.5 6.1 6.1 7.1
Mode = 4.4; 6.1
• Two modes – Bimodal distribution
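The median and mode examples can be reproduced with Python's standard library. This is a small sketch, not part of the original slides; `multimode` (Python 3.8 or later) is used because it returns every most-frequent value, which suits the bimodal case.

```python
from statistics import median, multimode

# Serum cholesterol levels (mmol/L) from the slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]
with_outlier = [12.0 if x == 5.5 else x for x in cholesterol]

# The median is the middle value after sorting, so one extreme value
# on the high side does not move it.
median_original = median(cholesterol)
median_outlier = median(with_outlier)

# Modes: the first data set has one mode, the second has two.
modes_original = multimode(cholesterol)
bimodal = [3.8, 4.4, 4.4, 5.0, 5.1, 5.5, 6.1, 6.1, 7.1]
modes_bimodal = multimode(bimodal)

print(median_original, median_outlier)
print(modes_original, modes_bimodal)
```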

Measures of Central Tendency (contd)

Which measure do I use?

Depends on two factors:
1. Scale of measurement (ordinal or numerical) and
2. Shape of the Distribution of Observations

Measures of Central Tendency (contd)

Shape of the distribution
• Symmetric
• Skewed to the Left (Negative)
• Skewed to the Right (Positive)

Measures of Dispersion

• Measures that describe the spread or variation in the observations
• Common measures of dispersion
– Range
– Standard Deviation
– Coefficient of Variation
– Percentiles
– Inter-quartile Range

Measures of Dispersion (contd)

Range = difference between the largest and the smallest observation
• Used with numerical data to emphasize extreme values
• Serum cholesterol example
Minimum = 3.8, Maximum = 7.1
Range = 7.1 – 3.8 = 3.3
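Because the range depends only on the two most extreme observations, it reacts strongly to a single outlier. A brief sketch with the cholesterol data follows; the helper name `value_range` is illustrative, and the 12.0 replacement is borrowed from the earlier mean and median slides.

```python
# Serum cholesterol levels (mmol/L) from the slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

def value_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)

range_original = value_range(cholesterol)

# Same data with 5.5 replaced by the extreme value 12.0
with_outlier = [12.0 if x == 5.5 else x for x in cholesterol]
range_outlier = value_range(with_outlier)

print(f"range = {range_original:.1f}, with outlier = {range_outlier:.1f}")
```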

Measures of Dispersion (contd)

Standard Deviation
– Measure of the spread of the observations about the mean
– Used as a measure of dispersion when the mean is used to measure central tendency for symmetric numerical data
– Standard deviation, like the mean, requires numerical data
– Essential part of many statistical tests
– Variance = s²

Measures of Dispersion (contd)

Standard Deviation
6.8 5.1 6.1 4.4 5.0 7.1 5.5 3.8 4.4
Mean = 5.35, n = 9
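The standard deviation for the cholesterol data can be computed as a check. This sketch assumes the sample formula with an n – 1 denominator, which is what `statistics.stdev` implements and which reproduces the value s = 1.126 quoted later in the slides.

```python
from statistics import stdev, variance

# Serum cholesterol levels (mmol/L) from the slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

s = stdev(cholesterol)       # sample standard deviation (n - 1 denominator)
s2 = variance(cholesterol)   # sample variance, equal to s squared

print(f"s = {s:.3f}, variance = {s2:.3f}")
```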

Measures of Dispersion (contd)

If the observations have a Bell-Shaped Distribution, then the following is approximately true:

About 68% of the observations lie between X̄ – 1s and X̄ + 1s
About 95% of the observations lie between X̄ – 2s and X̄ + 2s
About 99.7% of the observations lie between X̄ – 3s and X̄ + 3s

The Normal (Gaussian) Distribution
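The coverage figures for a bell-shaped distribution follow from the normal cumulative distribution function. As a quick check, here is a sketch using `statistics.NormalDist` from the Python standard library; it applies to an exactly normal distribution, so real data will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of falling within k standard deviations of the mean
coverage = {}
for k in (1, 2, 3):
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} s of the mean: {coverage[k]:.1%}")
```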

Measures of Dispersion (contd)

Coefficient of Variation
• Measure of the relative spread in data
• Used to compare variability between two sets of numerical data measured on different scales
• Coefficient of Variation (C of V) = (s / mean) x 100%
• Example:

                                  Mean    Std Dev (s)   C of V
Serum Cholesterol (mmol/L)        5.35    1.126         21%
Change in vessel diameter (mm)    0.12    0.29          241.7%

• Relative variation in Change in Vessel Diameter is more than 10 times greater than that for Serum Cholesterol

Measures of Dispersion (contd)

e.g. DiMaio et al evaluated the use of the test measuring maternal serum alphafetoprotein (for screening neural tube defects), in a prospective study of 34,000 women.

Reproducibility of the test procedure was determined by repeating the assay 10 times in each of four pools of serum. Mean and s of the 10 assays were calculated in each of the 4 pools.

Coeffs of Variation were computed for each pool: 7.4%, 5.8%, 2.7%, and 2.4%. These values indicate relatively good reproducibility of the assay, because the variation, as measured by the std deviation, is small relative to the mean. Hence readers of their article can be confident that the assay results were consistent.
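The two coefficients of variation in the table can be recomputed from the quoted means and standard deviations. A minimal sketch follows; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Relative spread, expressed as a percentage of the mean."""
    return std_dev / mean_value * 100.0

# Means and standard deviations quoted in the slide table
cv_cholesterol = coefficient_of_variation(1.126, 5.35)   # mmol/L
cv_vessel = coefficient_of_variation(0.29, 0.12)         # mm

# How many times larger the relative variation in vessel diameter is
ratio = cv_vessel / cv_cholesterol

print(f"cholesterol: {cv_cholesterol:.1f}%  vessel diameter: {cv_vessel:.1f}%")
print(f"ratio: {ratio:.1f}")
```

Because the measure is unit-free, the comparison is meaningful even though one variable is in mmol/L and the other in mm.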

Measures of Dispersion (contd)

Percentile
• A number that indicates the percentage of the distribution of data that is equal to or below that number
• Used to compare an individual value with a set of norms
• Example - Standard physical growth chart for girls from birth to 36 months of age
• For girls 21 months of age, the 95th percentile of weight is 13.4 kg. That is, among 21 month old girls, 95% weigh 13.4 kg or less, and only 5% weigh more than 13.4 kg.
• 50th percentile is the Median

Measures of Dispersion (contd)

Interquartile Range (IQR)
• Measure of variation that makes use of percentiles
• Difference between the 25th and 75th percentiles
• Contains the middle 50% of the observations (independent of shape of the distribution)
• Example –
• IQR for weights of 12 month old girls is the difference between 10.2 kg (75th percentile) and 8.8 kg (25th percentile);
• i.e., 50% of infant girls at 12 months weigh between 8.8 and 10.2 kg.
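Quartiles and the IQR can be sketched for the cholesterol data used earlier in the slides. Note the assumption: several conventions exist for interpolating percentiles in small samples, and `statistics.quantiles` with its default "exclusive" method is only one of them, so other software may report slightly different quartiles.

```python
from statistics import quantiles

# Serum cholesterol levels (mmol/L) from the earlier slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

# n=4 splits the data into quarters: 25th, 50th and 75th percentiles
q1, q2, q3 = quantiles(cholesterol, n=4)
iqr = q3 - q1

print(f"25th = {q1:.2f}, 50th (median) = {q2:.2f}, 75th = {q3:.2f}")
print(f"IQR = {iqr:.2f}")

# Growth-chart example from the slide: 12-month-old girls
iqr_weight_kg = 10.2 - 8.8
print(f"IQR of weight = {iqr_weight_kg:.1f} kg")
```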

Hypothesis Testing

• Permits medical researchers to make generalizations about a population based on results obtained from a study
• Confirms (or refutes) the assertion that the observed findings did not occur by chance alone but due to a true association between the dependent and independent variable
• The aim of the researcher is to demonstrate that the observed findings from a study are statistically significant.

Hypothesis Testing (contd)

• Statistical Hypothesis – a statement about the value of a population parameter
• Null Hypothesis (Ho)
– Usually the hypothesis that the researcher wants to gather evidence against
• Alternative (or Research) Hypothesis (Ha)
– Usually the hypothesis for which the researcher wants to gather supporting evidence

Hypothesis Testing (contd)

Example: A researcher studied the relationship between Smoking and Lung cancer.

                  Lung Cancer
               Present   Absent
Smoker            A         B
Non-Smoker        C         D

Hypothesis Testing (contd)

Ho : There is no difference between smokers and nonsmokers with respect to the risk of developing lung cancer. That is, the observed difference (in the sample), if any, is by chance alone.

Ha : There is a difference between smokers and nonsmokers with respect to the risk of developing lung cancer and that the observed difference (in the sample) is not by chance alone.

Conclusion: If the findings of the study are statistically significant, then reject Ho and fail to reject the alternative hypothesis Ha.

Hypothesis Testing (contd)

Test Statistic
• Statistics whose primary use is in testing hypotheses are called test statistics
• Hypothesis testing, thus, involves determining the value the test statistic must attain in order for the test to be declared significant.
• The test statistic is computed from the data of the sample.

Hypothesis Testing (contd)

Types of Errors

                         Truth
Decision        Ho True         Ho False
Accept Ho       Correct         Type II error
Reject Ho       Type I error    Correct

Hypothesis Testing (contd)

• Type I Error
− Rejecting the null hypothesis when it is true
− If Ho is true in reality and the observed finding of a study is statistically significant, the decision to reject Ho is incorrect and an error has been made.
• Type II Error
− Failing to reject the null hypothesis when it is false.
− If in reality Ho is false and the observed finding of a study is statistically not significant, the decision to accept Ho is incorrect and an error has been made.

Hypothesis Testing (contd)

Alpha (α) = Probability of Type I error (significance level of the test)

Beta (β) = Probability of Type II error

Power of a test = 1 – β; probability that a test detects differences that actually exist; typically use 80%

Level of Significance (p-value) in a study:
– Probability of obtaining a result as extreme as or more extreme than the one observed, if the null hypothesis is true
– Often described loosely as the probability that a result this extreme would arise by chance alone.
– Most researchers use p ≤ 0.05 to reject Ho, and p > 0.05 to accept the null hypothesis Ho and reject the alternative hypothesis Ha.
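The meaning of α as the Type I error rate can be illustrated by simulation: when Ho is true and the rule "reject if p ≤ 0.05" is applied many times, about 5% of the studies reject Ho. The sketch below is illustrative only. It assumes a two-sided z-test with known standard deviation, and every constant (sample size, number of trials, seed) is hypothetical rather than taken from the slides.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
MU0, SIGMA, N = 0.0, 1.0, 25   # hypothetical null mean, known SD, sample size
TRIALS = 4000                  # hypothetical number of simulated studies

rng = random.Random(2024)      # fixed seed so the run is repeatable
std_normal = NormalDist()

def two_sided_p_value(sample, mu0, sigma):
    """p-value of a z-test for the mean with known sigma."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - mu0) / (sigma / len(sample) ** 0.5)
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))

# Every sample is drawn with Ho true, so each rejection is a Type I error.
rejections = 0
for _ in range(TRIALS):
    sample = [rng.gauss(MU0, SIGMA) for _ in range(N)]
    if two_sided_p_value(sample, MU0, SIGMA) <= ALPHA:
        rejections += 1

type_i_rate = rejections / TRIALS
print(f"observed Type I error rate: {type_i_rate:.3f}")
```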

Hypothesis Testing (contd)

One-Sided Test of Hypothesis is one in which the alternative hypothesis is directional (typically includes the ‘<’ symbol or the ‘>’ symbol).

Two-Sided Test of Hypothesis is one in which the alternative hypothesis does not specify departure from the null in a particular direction (typically will be written with the ‘≠’ symbol).

One-Sided Test

e.g. Incidence of tuberculosis among Dade county (Miami) residents is known to be no more than 0.0002 (2 cases per 10,000 people). After conducting medical checks, a medical researcher believes that Haitian refugees arriving in Miami have a much higher incidence of tuberculosis. To check this belief, he will test the null hypothesis

Ho : π ≤ 0.0002

where π is the proportion of Haitians in Miami who contract TB, versus the alternative hypothesis

Ha : π > 0.0002

because he is interested in detecting whether the true incidence of TB in the Haitian population in Miami is larger than 0.0002.
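A one-sided test of a proportion such as the TB example can be sketched with an exact binomial tail probability. The slides give no sample data, so the counts below (`n_screened`, `cases`) are purely hypothetical, and the exact binomial test is one possible choice of method rather than the one used in the lecture.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X < k)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below

P0 = 0.0002          # incidence under Ho, from the slide
n_screened = 5000    # hypothetical number of people screened
cases = 4            # hypothetical number of TB cases found

# One-sided p-value: chance of seeing this many cases or more if Ho is true
p_value = binomial_upper_tail(cases, n_screened, P0)

print(f"one-sided p-value = {p_value:.4f}")
print("reject Ho" if p_value <= 0.05 else "do not reject Ho")
```

Only the upper tail is used because Ha is directional (incidence larger than 0.0002).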

Two-Sided Test

e.g. A researcher would like to determine whether mean age of onset of heart disease in males differs from the mean age for females. The null hypothesis of interest is

Ho : μM = μF

where μM is the mean age of onset of heart disease for males and μF is the mean age of onset for females, versus the alternative hypothesis

Ha : μM ≠ μF

since she has no reason to believe that one could be higher than the other.

One-Sample Tests

One-Sample hypothesis tests involve inferences about a single population parameter – based on data from a single sample.

The parameter (mean, proportion) is compared to a single numeric value.

e.g. Hypothesis to test whether the incidence of TB among Haitian refugees is 0.0002.

Two-Sample Tests

Two-Sample hypothesis tests involve comparisons of the parameter values between two independent groups. The parameter (mean, proportion) value is compared between two groups.

Example: Does mean age of onset of heart disease in males differ from the mean age for females?
− Groups: Males versus Females
− Age of onset measured in both groups
− The observations from the sample of males and the sample of females are used to conduct the test

Parametric Tests

• Parametric tests are based on assumptions about the distribution of the observed data. (E.g., Normal distribution)
• Hypotheses are formulated in terms of the Mean or the Standard Deviation. Some examples of tests include:

1. Z-test (when sample sizes are large, or the population standard deviation is known) used to make inferences about means or proportions
− Example: Observe serum cholesterol levels among 150 Native Americans in Arizona to study the association with coronary artery disease

Parametric Tests (contd)

2. T-test (when sample sizes are small, n < 30, or the population standard deviation is not known and the sample standard deviation is used) to make inferences about means.
− Example: Temperatures of 26 patients were recorded 48 hours after surgery. A researcher is interested in determining if the mean temperatures of the surgical patients are significantly different from the standard normal temperature of 98.6°F.

Parametric Tests (contd)

3. F-test is used to
– Test hypotheses about a single population standard deviation, or to compare two standard deviations.
– Compare three or more group means: Analysis of Variance (ANOVA)
– Example: The four blood groups A, B, O, and AB were studied to compare the quantitative serologic differences among their antigenic structures. Use ANOVA to compare the 4 group means.
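For the temperature example, the one-sample t statistic is the distance between the sample mean and 98.6°F measured in standard errors. The sketch below shows the calculation only. The slides do not list the 26 readings, so the eight temperatures here are hypothetical, and the resulting statistic would still need to be compared against a t distribution with n – 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for Ho: population mean = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # uses the sample SD (n - 1)
    return (mean(sample) - mu0) / standard_error, n - 1

# Hypothetical post-surgery temperatures in °F (NOT the study's 26 readings)
temps = [99.1, 98.9, 99.4, 98.6, 99.0, 99.3, 98.8, 99.2]

t_stat, df = one_sample_t(temps, 98.6)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```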

Parametric Tests (contd)

4. Chi-Square Test
− Used for comparing two or more independent proportions within two or more groups – for example when the data are arranged in a 2-by-2 table.
− Example: To assess the possible association between 100% oxygen therapy and the subsequent development of retrolental fibroplasia (RF), a total of 135 premature infants in the intensive care unit were studied.

Parametric Tests (contd)

                        RF
                  Present   Absent
100% Oxygen   +      36        31
              -      22        46

Ho: Proportion of infants developing RF is independent of whether they got 100% Oxygen therapy (or proportion developing RF in the +oxygen and –oxygen group are the same)

Ha: Proportion of infants developing RF is dependent on whether they got 100% Oxygen therapy (or proportion developing RF in the +oxygen and –oxygen group are not the same)
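The chi-square statistic for the 2-by-2 RF table can be worked through from the observed counts. This sketch uses the Pearson statistic without a continuity correction, which is an assumption since the slides do not say which version was used. For one degree of freedom the p-value can be written with the complementary error function.

```python
from math import erfc, sqrt

# Observed counts from the slide: rows = 100% oxygen (+, -), cols = RF (present, absent)
observed = [[36, 31],
            [22, 46]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over the four cells,
# where expected counts assume RF is independent of oxygen therapy.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (count - expected) ** 2 / expected

# With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi_square / 2.0))

print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.3f}")
```

With these counts the statistic is about 6.29, above the 3.84 cut-off for α = 0.05 with one degree of freedom, so Ho would be rejected under these assumptions.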

Nonparametric Tests

Nonparametric tests
• Based on weaker assumptions
• Do not assume a normal distribution
• Called Distribution-Free Tests
• Used when the assumption of normality (required for parametric tests) of the data is not met
• Hypotheses may be framed in terms of the Median, quartiles, etc. instead of the Mean scores

Nonparametric Tests (contd)

• Sign Test
– Used to test hypotheses about the Median of a population
– One-Sample Test
• Wilcoxon Rank-Sum Test
– Used to compare Medians of two groups
– Two-Sample Test
• Fisher’s Exact test
– Nonparametric equivalent of the Chi-square test
– Used when the expected frequencies in a table are small (<5)
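The sign test only counts how many observations fall above and below the hypothesized median, then asks how unusual that split is for a fair coin. A minimal sketch follows, applied to the cholesterol data from the earlier slides. The hypothesized median of 6.0 mmol/L is an illustrative value, not one taken from the lecture.

```python
from math import comb

def sign_test_p_value(sample, median0):
    """Two-sided exact sign test for Ho: population median = median0."""
    above = sum(1 for x in sample if x > median0)
    below = sum(1 for x in sample if x < median0)
    n = above + below              # observations equal to median0 are dropped
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Serum cholesterol levels (mmol/L) from the earlier slide example
cholesterol = [6.8, 5.1, 6.1, 4.4, 5.0, 7.1, 5.5, 3.8, 4.4]

p_value = sign_test_p_value(cholesterol, 6.0)   # 6.0 is a hypothetical median
print(f"sign test p-value = {p_value:.3f}")
```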

Paired Data

• Paired T-test
– Tests hypotheses about Mean change in a population
– Example:
• Test hypotheses about the mean reduction in weight for a group of subjects in a new weight loss program
• Test whether the mean difference (change in weight) is greater than zero
• Nonparametric test for paired data – Wilcoxon Signed-Rank Test
– To test the hypothesis that the medians, rather than the means, are equal in the two paired samples.

Association and Prediction

• Correlation Coefficient (r)
– Measure of the strength of the linear relationship between two variables measured on a numerical scale
– Ranges from -1.0 to +1.0
– Positive and negative correlation.
– Example: Interested in the correlation between cholesterol and triglyceride levels (r = +1, +0.8, 0, -0.8, -1)

Association and Prediction

• Pearson’s Product Moment Correlation Coefficient (parametric)
• Spearman’s Rank Correlation Coefficient (nonparametric)
• Coefficient of Determination (r²)
r = 0.8, r² = (0.8)(0.8) = 0.64
or 64% of the variation in the values for cholesterol are explained by changes in triglyceride levels

Prediction

Regression Analysis
• Models the relationship between two or more variables such that one can be expressed in terms of the other variables (mathematical equation)
Y = aX + b
Z = aX + bY + c
• Dependent variable / Response variable
• Independent variable / Explanatory variable
• Linear Regression
– Example: MCAT (Science) and ACT scores (Y = aX + b)
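Pearson's r and the coefficient of determination can be sketched in a few lines. The paired triglyceride and cholesterol values below are hypothetical, chosen only to illustrate the calculation; the slides do not provide the underlying measurements.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements (mmol/L), for illustration only
triglycerides = [1.0, 1.4, 1.9, 2.3, 2.8]
cholesterol = [4.2, 5.0, 4.9, 5.8, 6.1]

r = pearson_r(triglycerides, cholesterol)
r_squared = r * r   # share of variation in one variable explained by the other

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```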

Prediction

• Simple Linear Regression – only one explanatory variable
Example: MCAT (Science) and ACT scores for 42 medical school applicants (Y = aX + b)
• Least squares method used to estimate the regression coefficients ‘a’ and ‘b’
• Regression equation:
• Multiple Regression – two or more explanatory variables
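The least squares estimates of ‘a’ and ‘b’ in Y = aX + b have a closed form for simple linear regression. The sketch below shows that calculation. The ACT and MCAT scores are hypothetical stand-ins, since the data for the 42 applicants are not reproduced in the slides.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b minimizing the squared errors of Y = aX + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx               # slope
    b = mean_y - a * mean_x     # intercept: the line passes through the means
    return a, b

# Hypothetical scores for five applicants (illustration only)
act_scores = [22, 25, 27, 30, 33]     # explanatory variable X
mcat_science = [7, 8, 9, 10, 12]      # response variable Y

a, b = least_squares_fit(act_scores, mcat_science)
predicted = a * 28 + b                # predicted MCAT score for an ACT score of 28

print(f"Y = {a:.3f} X + {b:.3f}")
print(f"predicted MCAT (Science) at ACT 28: {predicted:.1f}")
```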

Conclusions

• Summary of some Statistical Methods used in Medical research presented
• The objective is for you to be able to recognize the various methods and understand/interpret the statistical analyses/results presented in published articles – NOT for you to conduct the statistical analyses
• It is highly recommended & encouraged that you work with a trained Biostatistician on your research projects!
