Lecture 8
Univariate Statistics
Univariate Statistics
Empirical distributions
Histograms
Mean, Median, Quartiles, Variance, Skewness, Kurtosis
Boxplot
Statistical distributions
Discrete distributions
Continuous distributions
• Gaussian distribution and the central limit theorem
• Chi-squared, F, and Student’s t-distributions
Univariate Statistics
Statistical testing
Chi-squared test
F-test
Student’s t-test
Extreme value distributions
Generalized extreme value distribution
• Return period
Extreme threshold distributions
• Weibull distribution
Empirical distribution: Histogram
A histogram shows the number of data points in each data bin.
Syntax:
[n,xout]=hist(data)
%n: row vector with the number of data points in each bin
%xout: bin locations (bin centers)
hist(data)
hist(data, number of bins)
hist(data, vector of bin centers)
Updated functions:
hist -> histogram
[n,edges]=histcounts(data)
center=edges(1:end-1)+diff(edges)/2
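A minimal sketch (assuming a numeric data vector) of re-plotting histcounts output with bar:
x=randn(1000,1);
[n,edges]=histcounts(x);             %counts and bin edges
center=edges(1:end-1)+diff(edges)/2; %convert edges to centers
bar(center,n)                        %similar picture to histogram(x)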
Empirical distribution: Histogram
x=randn(1000,1);
histogram(x)
hist(x,22) %gives similar results
y=-2:0.1:2;
hist(x,y) %y is interpreted as bin centers; not pretty
histogram(x,y) %y is interpreted as bin edges; much better
Empirical distributions
How do we describe a dataset?
Discrete parameters
Min, max, mean
Median, quartiles
Standard deviation
Variance
Skewness
Kurtosis
Mean: Why different definitions?
Arithmetic mean
Geometric mean
Harmonic mean
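For reference, the three definitions in LaTeX (the geometric mean suits multiplicative quantities such as growth rates; the harmonic mean suits rates such as speeds):

$$\bar{x}_{\mathrm{arith}}=\frac{1}{n}\sum_{i=1}^{n}x_i, \qquad \bar{x}_{\mathrm{geom}}=\left(\prod_{i=1}^{n}x_i\right)^{1/n}, \qquad \bar{x}_{\mathrm{harm}}=\frac{n}{\sum_{i=1}^{n}1/x_i}$$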
Median: write a median function
function m=mymedian(x)
a=sort(x);
b=length(x);
b2=floor(b/2);
if (b/2 > b2) %odd number of elements, i.e., mod(b,2)==1
    m=a(b2+1); %middle element
else
    m=0.5*(a(b2)+a(b2+1)); %average of the two middle elements
end
end
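A quick check against the built-in median (hypothetical test vector):
x=[3 1 4 1 5];
mymedian(x) %returns 3
median(x)   %same result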
Quantiles
4-quantiles: quartiles
100-quantiles: percentiles
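A minimal sketch with the Statistics Toolbox functions quantile and prctile (example data assumed):
data=randn(100,1);
quantile(data,[0.25 0.50 0.75]) %the three quartiles
prctile(data,[5 95])            %5th and 95th percentiles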
Moment statistics
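For reference, the standard sample-moment definitions behind mean, var, skewness, and kurtosis (MATLAB's skewness and kurtosis use the biased forms below by default; the kurtosis of a Gaussian is 3):

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i, \qquad s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$\mathrm{skewness}=\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}, \qquad \mathrm{kurtosis}=\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{2}}$$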
Dealing with NaN
x=[1:120, NaN];
mean(x), var(x) %return NaN because of the NaN entry
nanmean(x), nanvar(x) %ignore NaN entries
skewness(x) %also returns NaN here
kurtosis(x)
org=load('organicmatter_one.txt');
%check out the data
plot(org,'o-'), ylabel('wt %')
%histogram: the square root of the number of data points is often a
%good first guess for the number of bins
hist(org,8)
Histogram
org=load('organicmatter_one.txt');
[n,xout]=hist(org,8);
%n: row vector with the number of data points in each bin
%xout: bin locations
bar(xout,n,'r') %red bars
%3-D bars
bar3(xout,n,'b')
Sensitivity to outliers
sodium = load('sodiumcontent.txt');
whos sodium
hist(sodium,11)
%add an outlier
sodium2=sodium;
sodium2(121,1)=0.1;
%equivalently: sodium2=[sodium;0.1];
hist(sodium2,11) %compare with the original histogram
Boxplot
boxplot(org)
load carsmall
boxplot(MPG,Origin)
%MPG is a vector of numbers, Origin a vector of strings that defines the groups
Boxplot: group assignment with cell arrays {}
sodium = load('sodiumcontent.txt');
sodium2=[sodium;0.1];
data=[sodium; sodium2];
name(1:length(sodium))={'original'};
ed=length(sodium);
name(ed+1:ed+length(sodium2))={'outlier'};
boxplot(data, name)
Statistical distributions
Discrete distribution: Poisson
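For reference, the Poisson probability mass function with rate parameter λ:

$$P(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}, \qquad k=0,1,2,\ldots$$

In MATLAB: poisspdf(k,lambda).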
Continuous PDF: Maxwell–Boltzmann distribution
https://en.wikipedia.org/wiki/Maxwell%E2%80%93Boltzmann_distribution#/media/File:MaxwellBoltzmann-en.svg
Gaussian (normal) distributions
Syntax:
Y=pdf(name,p1,...)
Y=cdf(name,p1,...)
%name: distribution name
%p1,...: parameters of the distribution
Gaussian:
Y=pdf('norm',data vector,mean,std)
Y=cdf('norm',data vector,mean,std)
or
Y=normpdf(data vector,mean,std)
Y=normcdf(data vector,mean,std)
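A minimal plotting sketch with assumed parameters (mean 0, std 1):
x=-4:0.1:4;
plot(x,normpdf(x,0,1)) %bell-shaped PDF
figure
plot(x,normcdf(x,0,1)) %S-shaped CDF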
Gaussian distribution
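For reference, the normal probability density function with mean μ and standard deviation σ:

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$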
Central limit theorem
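The theorem: the sum (or mean) of many independent, identically distributed random variables tends toward a Gaussian distribution, whatever the distribution of the individual variables. A minimal demonstration with uniform draws (sizes chosen arbitrarily):
m=mean(rand(100,10000)); %10000 means, each over 100 uniform samples
histogram(m)             %approximately Gaussian, centered at 0.5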
Flipping a rigged coin
%y: vector of simulated outcomes from the rigged-coin experiment
ymean=mean(y)
ystd=std(y)
cdf('norm',-50,ymean,ystd)
cdf('norm',-100,ymean,ystd)
Estimate of the errors
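A standard result relevant here, and the one used in the t-distribution code later (xstd/sqrt(n)): the standard error of the mean of n samples with standard deviation s is

$$\sigma_{\bar{x}}=\frac{s}{\sqrt{n}}$$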
Propagation of error (normal distribution)
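For reference, the first-order propagation formula for f(x, y, ...) with independent, normally distributed errors:

$$\sigma_f^2=\left(\frac{\partial f}{\partial x}\right)^2\sigma_x^2+\left(\frac{\partial f}{\partial y}\right)^2\sigma_y^2+\cdots$$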
Log-normal distribution
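For reference: if ln x is normally distributed with parameters μ and σ, then x is log-normal with density

$$f(x)=\frac{1}{x\sigma\sqrt{2\pi}}\,e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}, \qquad x>0$$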
Chi-squared distribution
https://en.wikipedia.org/wiki/Chi-squared_distribution
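For reference, the standard definition: if Z_1, ..., Z_k are independent standard normal variables, then

$$Q=\sum_{i=1}^{k}Z_i^2 \;\sim\; \chi^2(k)$$

has a chi-squared distribution with k degrees of freedom.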
F-distribution
If U1 and U2 have chi-squared distributions with d1 and d2 degrees of
freedom respectively, and U1 and U2 are independent, then
F = (U1/d1) / (U2/d2)
follows an F-distribution with (d1, d2) degrees of freedom.
https://en.wikipedia.org/wiki/F-distribution
Using the t-distribution: CDF
t-distribution
The probability that x lies outside the interval [-zα/2, zα/2] is α
(α/2 in each tail).
https://www.studypug.com/statistics-help/confidence-intervals-with-t-distribution
t-distribution
x=[30.02, 29.99, 30.11, 29.97, 30.01, 29.99];
xmean=mean(x)
xstd=std(x)
n=length(x)
%t-value at 2.5% (5%/2), DOF=n-1
%the t-distribution is symmetrical, so abs() gives the upper critical value
tvalue=abs(tinv(0.025,n-1))
%lower/upper bounds of the 95% confidence interval
low=xmean-tvalue*xstd/sqrt(n)
high=xmean+tvalue*xstd/sqrt(n)
Comparison to normal distribution
As the sample size increases, the t-distribution value for the 95%
confidence interval approaches that of the Gaussian.
Statistical testing
The null hypothesis is used to test differences between treatment and
control groups; the assumption at the outset of the experiment is that
no difference exists between the two groups for the variable being
compared.
"A statistically significant difference" simply means there is
statistical evidence that a difference exists; it does not mean the
difference is necessarily large, important, or significant in the
common meaning of the word.
Confidence level: 1 - α, where α is the significance level of the test.
Statistical testing
The null hypothesis must be stated in mathematical/statistical terms
that make it possible to calculate the probability of possible samples
assuming the hypothesis is correct.
A test statistic must be chosen that summarizes the information in the
sample that is relevant to the hypothesis. In the example given above,
it might be the numerical difference between the two sample means,
m1 - m2.
The distribution of the test statistic is used to calculate the
probability of sets of possible values (usually an interval or union of
intervals).
Among all the sets of possible values, we must choose one that we think
represents the most extreme evidence against the hypothesis. That is
called the critical region of the test statistic. The probability of
the test statistic falling in the critical region when the null
hypothesis is correct is called the p-value ("surprise" value) of the
test.
Probability
Frequency probability (frequentist) is the interpretation of
probability that defines an event's probability as the limit of its
relative frequency in a large number of trials. The problems and
paradoxes of the classical interpretation motivated the development of
the relative-frequency concept of probability.
Bayesian probability is an interpretation of the probability calculus
which holds that the concept of probability can be defined as the
degree to which a person (or community) believes that a proposition is
true. The posterior is a function of the prior and the observations.
The two groups agree that Bayesian and frequentist analyses answer
genuinely different questions, but disagree about which class of
question is more important to answer in scientific and engineering
contexts.
Pearson's Chi-squared test
DOF = n - 1
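The test statistic compares observed counts O_i against expected counts E_i over n categories (the standard Pearson formula, matching the code below):

$$\chi^2=\sum_{i=1}^{n}\frac{(O_i-E_i)^2}{E_i}$$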
Pearson's Chi-squared test
A random sample of 100 people has been drawn from a population in which
men and women are equal in frequency. There were 45 men and 55 women in
the sample; what is the chi-squared value?
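With expected counts of 50 men and 50 women, the formula above gives

$$\chi^2=\frac{(45-50)^2}{50}+\frac{(55-50)^2}{50}=0.5+0.5=1$$

and with DOF = 2 - 1 = 1 this value is unremarkable: the plot below shows cdf('chi2',1,1) = 0.68, well below the 0.95 level.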
x=linspace(0,8,100);         %x: chi-squared value
v=cdf('chi2',x,ones(1,100)); %v: probability (CDF with 1 DOF)
plot(x,v)
%cdf('chi2',1,1) = 0.68
chi2inv
%chi2inv(probability, DOF) returns the chi-squared value at a given
%cumulative probability, e.g., chi2inv(0.95,1) is about 3.84
Pearson's Chi-squared test
corg = load('organicmatter_one.txt');
% 60 data points, define 8 bins
[n_exp,v] = hist(corg,8);
%expected frequencies from a normal distribution fitted to the data
%(assumed definition of n_syn; this line is omitted on the slide)
n_syn = normpdf(v,mean(corg),std(corg));
%rescale so the synthetic counts sum to the observed total
n_syn = n_syn.*sum(n_exp)/sum(n_syn);
subplot(1,2,1), bar(v,n_syn,'r')
subplot(1,2,2), bar(v,n_exp,'b')
Pearson's Chi-squared test
%test statistic
chi2 = sum((n_exp - n_syn).^2./n_syn)
%critical value at the 0.05 significance level; with 8 bins and 2
%fitted parameters, dof = 8-(2+1) = 5 (assumed, as the slide does not
%define dof)
dof = 5;
chi2inv(0.95,dof)
F-test
F_crit=finv(0.95, DOF1,DOF2)
F-test
load('organicmatter_four.mat');
%compare standard deviations
s1 = std(corg1)
s2 = std(corg2)
%degrees of freedom
df1 = length(corg1) - 1;
df2 = length(corg2) - 1;
F-test
if s1>s2
    Freal=(s1/s2)^2
else
    Freal=(s2/s1)^2
end
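To complete the test, a sketch comparing Freal against the critical value (variable names from the previous slides; strictly, the DOF order should match whichever sample has the larger variance):
Fcrit=finv(0.95,df1,df2)
if Freal>Fcrit
    disp('variances differ significantly at the 95% level')
else
    disp('cannot reject equal variances')
end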
Student’s t-Test
Assumptions
Normal distribution of data (what test to use?)
Equality of variances (what test to use?)
Student’s t-Test
MATLAB syntax:
[h,s,ci] = ttest2(x,y,alpha)
%h: 1 rejects the null hypothesis; 0 cannot reject it
%s: significance (p-value) for the difference of the means of x and y
%ci: confidence interval for the difference of the means
%alpha: significance level, e.g., alpha = 0.05
Student’s t-Test
load('organicmatter_two.mat');
[n1,x1] = hist(corg1);
[n2,x2] = hist(corg2);
h1 = bar(x1,n1);
hold on
h2 = bar(x2,n2);
set(h1,'FaceColor','none','EdgeColor','r')
set(h2,'FaceColor','none','EdgeColor','b')
hold off
Student’s t-Test
%difference of the means
mean(corg1)-mean(corg2)
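The test itself, using the syntax shown above at the 5% significance level:
[h,s,ci]=ttest2(corg1,corg2,0.05)
%h=1 rejects equal means at the 95% confidence level; s is the p-value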
There are more!
(e.g., extreme value distributions)
Tutorial
We compare observed and simulated ozone mixing ratios (ppbv, parts per
billion by volume) for July and August. Negative values indicate
missing measurements.
Data: Atlanta_O3.txt