
Data Analytics

for Civil & Environmental Engineers

Lecture 8
Univariate Statistics

1
Univariate Statistics

• Empirical distributions
  - Histograms
  - Mean, median, quartiles, variance, skewness, kurtosis
  - Boxplot
• Statistical distributions
  - Discrete distributions
  - Continuous distributions
    • Gaussian distribution and the central limit theorem
    • Chi-squared, F, and Student's t-distributions
2
Univariate Statistics

• Statistical testing
  - Chi-squared test
  - F-test
  - Student's t-test
• Extreme value distributions
  - Generalized extreme value distribution
    • Return period
  - Extreme threshold distributions
    • Weibull distribution
3
Empirical distribution: Histogram

• A histogram shows the number of data points in each data bin.

Syntax:
[n,xout] = hist(data)
% n: row vector of the number of data points in each bin
% xout: bin center locations
hist(data)
hist(data, number_of_bins)
hist(data, vector_of_bin_centers)

Updated functions: hist -> histogram
[n,edges] = histcounts(data)
centers = edges(1:end-1) + diff(edges)/2
4
Empirical distribution: Histogram

x = randn(1000,1);
histogram(x)
hist(x,22)        % gives similar results

histogram(x,50)   % 50 bins

y = -2:0.1:2;
hist(x,y)         % not pretty
histogram(x,y)    % much better
5
Empirical distributions

How do we describe a dataset?

• Discrete parameters
  - min, max, mean
  - median, quartiles
  - standard deviation
  - variance
  - skewness
  - kurtosis

6
Mean: Why different definitions?

• Arithmetic mean
• Geometric mean
• Harmonic mean
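The formulas on this slide did not survive extraction; for reference, the standard definitions are:

$$\bar{x}_{\text{arith}} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x}_{\text{geom}} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}, \qquad \bar{x}_{\text{harm}} = \frac{n}{\sum_{i=1}^{n} 1/x_i}$$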

7
Median: write a median function

function m = mymedian(x)
  a = sort(x);
  b = length(x);
  b2 = floor(b/2);
  if (b/2 > b2)               % true when b is odd, i.e., mod(b,2)==1
    m = a(b2+1);              % middle element
  else
    m = 0.5*(a(b2)+a(b2+1));  % average of the two middle elements
  end
end
8
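A quick check of the function (assuming it is saved as mymedian.m on the MATLAB path):

mymedian([3 1 2])     % returns 2
mymedian([1 3 2 5])   % returns 2.5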
Quantiles

• Divide ordered data into (approximately) equal-sized subsets of data.
• 4-quantiles: quartiles
• 100-quantiles: percentiles
• 1st quartile: 25th percentile
• 2nd quartile: median: 50th percentile

9
Quantiles

x = 1:15. What is the 3rd quartile?

1. Use the median to divide the data into 2 subsets (do not include the median value).
2. The upper quartile is the median of the upper half.

The 3rd quartile is 12.

MATLAB uses linear interpolation:

prctile(x,[25 50 75])
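With this interpolation the answer differs slightly from the hand method above: MATLAB assigns the percentile 100(i-0.5)/n to the i-th sorted value and interpolates linearly between those points, so (assuming current prctile behavior):

prctile(1:15,[25 50 75])   % returns 4.25  8  11.75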
10
Dispersion of the data: Central moments
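The formulas on this slide were lost in extraction; the standard central-moment definitions (a reconstruction, not necessarily the slide's exact notation) are:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \;\;\text{(variance)}, \qquad \text{skewness} = \frac{\tfrac{1}{n}\sum_i (x_i-\bar{x})^3}{s^3}, \qquad \text{kurtosis} = \frac{\tfrac{1}{n}\sum_i (x_i-\bar{x})^4}{s^4}$$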

11
Moment statistics

12
Moment statistics

Which variable is most sensitive to the difference between the mean and the median values?
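The tables on these slides were images and did not survive extraction. A minimal sketch of computing the moment statistics in MATLAB (skewness and kurtosis require the Statistics Toolbox); the sample data here are illustrative, not the slide's:

x = [1 2 2 3 3 3 4 4 5 20];   % hypothetical, right-skewed sample
mean(x)        % pulled toward the outlying value
median(x)      % robust to the tail
std(x), var(x)
skewness(x)    % > 0 for a right-skewed sample
kurtosis(x)    % equals 3 for a normal distribution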
13
Moment statistics

14
Moment statistics

15
Dealing with NaN

x = [1:120, NaN];
mean(x), var(x)          % return NaN
nanmean(x), nanvar(x)    % ignore the NaN
skewness(x)
kurtosis(x)

How do we remove the NaN values?

x(isnan(x)) = []
x = x(~isnan(x))

NaN==NaN always returns 0 (false); you must use isnan.
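A side note, assuming a reasonably recent MATLAB release (R2014b or later): the 'omitnan' flag gives the same behavior without the nan* functions, e.g. mean(x,'omitnan') and var(x,'omitnan').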
16
Organic matter data

org = load('organicmatter_one.txt');
% check out the data
plot(org,'o-'), ylabel('wt %')
% histogram: the square root of the number of data points
% is often a good first guess for the number of bins
hist(org,8)

17
Statistics:
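The summary numbers on this slide were an image; a sketch that reproduces such a summary for the organic matter data (the exact statistics listed on the slide are an assumption):

org = load('organicmatter_one.txt');
[min(org), max(org)]
mean(org), median(org)
std(org), var(org)
skewness(org), kurtosis(org)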

18
Histogram

org = load('organicmatter_one.txt');
[n,xout] = hist(org,8);
% n: row vector with the number of data points in each bin
% xout: bin center locations
bar(xout,n,'r')   % red bars

% 3-D bars
bar3(xout,n,'b')

19
Sensitivity to outliers

sodium = load('sodiumcontent.txt');
whos sodium
hist(sodium,11)

% add an outlier
sodium2 = sodium;
sodium2(121,1) = 0.1;
% or: sodium2 = [sodium; 0.1];

Which statistic is most sensitive to the outlier?


20
Sensitivity to outliers
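The comparison shown on this slide was an image; a sketch of how one might reproduce it (variables follow the previous slide; skewness assumes the Statistics Toolbox):

% compare statistics with and without the outlier
[mean(sodium)     mean(sodium2);
 median(sodium)   median(sodium2);
 std(sodium)      std(sodium2);
 skewness(sodium) skewness(sodium2)]
% the mean, std, and especially skewness shift; the median barely moves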

21
Boxplot

boxplot(org)

• The box shows the lower quartile, median, and upper quartile values.
• Whiskers show the most extreme data within 1.5 times the interquartile range (the 25th-75th percentile span) from the ends of the box.
• Red + signs: outliers.

load carsmall
boxplot(MPG,Origin)
% MPG is a vector of numbers; Origin is a vector of strings that defines the groups
22
Boxplot

23
Box plot: group assignment {}

sodium = load('sodiumcontent.txt');
sodium2 = [sodium; 0.1];

data = [sodium; sodium2];
name(1:length(sodium)) = {'original'};
ed = length(sodium);
name(ed+1:ed+length(sodium2)) = {'outlier'};
boxplot(data,name)

24
Statistical distribution

25
Discrete distribution: Poisson
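The slide's formula was an image; the standard Poisson probability mass function, for reference:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$

In MATLAB this is available as poisspdf(k,lambda) or pdf('Poisson',k,lambda).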

26
Continuous PDF: Boltzmann
Maxwell–Boltzmann distribution

27
https://en.wikipedia.org/wiki/Maxwell%E2%80%93Boltzmann_distribution#/media/File:MaxwellBoltzmann-en.svg
Gaussian (normal) distributions

Syntax:
Y = pdf(name, p1, ...)
Y = cdf(name, p1, ...)
% name: distribution name
% p1, ...: parameters for the distribution

Gaussian:
Y = pdf('norm', data_vector, mean, std)
Y = cdf('norm', data_vector, mean, std)

or

Y = normpdf(data_vector, mean, std)
Y = normcdf(data_vector, mean, std)

28
Gaussian distribution

mu = [0, 2, -2, 0]; sig = [0.2, 1, 0.5, 3];
x = linspace(-5,5,100);
for i = 1:4
    xpdf(:,i) = pdf('norm',x,mu(i),sig(i));
    xcdf(:,i) = cdf('norm',x,mu(i),sig(i));
end
subplot(2,1,1), plot(x,xpdf)
subplot(2,1,2), plot(x,xcdf)
29
Gaussian distribution

30
Central limit theorem

• The sum of a large number of independent and identically distributed random variables, each with finite mean and variance, is approximately normally distributed.

(Sometimes called the second fundamental theorem of probability.)

31
Central limit theorem

• To simulate flipping a coin, we use a random number generator. If the random number is less than 0.5, it is considered heads (represented by +1); if it is greater than or equal to 0.5, it is considered tails (represented by -1). The sum of 1000 random flips in each of 2000 games is computed by generating 1000 random numbers per game and summing the results (+1 or -1).

for i = 1:2000
    x = rand(1000,1) < 0.5;   % logical 1 = heads
    heads = sum(x);
    tails = 1000 - heads;
    y(i) = heads - tails;     % net outcome of one game
end
hist(y)

32
Central limit theorem

33
Flipping a rigged coin

• What are the probabilities of losing more than $50 and more than $100 if the winning odds are 50%?

ymean = mean(y)
ystd = std(y)
cdf('norm',-50,ymean,ystd)
cdf('norm',-100,ymean,ystd)

34
Central limit theorem

• Let X1, X2, X3, ... be a set of n independent and identically distributed (not necessarily normal) random variables having finite mean μ and variance σ². As the sample size n increases, the distribution of the sample average approaches the normal distribution with mean μ and variance σ²/n, irrespective of the shape of the original distribution.

• The PDF of the sum of two or more independent variables is the convolution of their densities (if these densities exist). The convolution of a number of density functions tends to the normal density as the number of density functions increases without bound, under the conditions stated above.

35
Gaussian distribution

36
Estimate of the errors
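The formula on this slide was an image; the relevant standard result (a reconstruction, consistent with the σ²/n variance on the previous slide) is the standard error of the mean:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$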

37
Propagation of error (normal distribution)
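This slide's formula was also lost; the standard first-order propagation rule for independent, normally distributed errors (a reconstruction), for f(x, y):

$$\sigma_f^2 = \left(\frac{\partial f}{\partial x}\right)^{\!2} \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^{\!2} \sigma_y^2$$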

38
Central limit theorem

• The log of a product of random variables that take only positive values tends to have a normal distribution, which makes the product itself have a log-normal distribution.

39
Log-normal distribution

• If Y is a random variable with a normal distribution, then X = exp(Y) has a log-normal distribution.

40
Log-normal distribution
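The PDF shown on this slide did not survive extraction; the standard log-normal density, where μ and σ are the mean and standard deviation of ln X:

$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \qquad x > 0$$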

41
Log-normal distribution

mu = [0, 0, 1, 1]; sig = [1/4, 1/2, 1, 2];   % parameters of log(X)
x = linspace(0,3,100);
for i = 1:4
    xpdf(:,i) = pdf('logn',x,mu(i),sig(i));
    xcdf(:,i) = cdf('logn',x,mu(i),sig(i));
end
subplot(2,1,1), plot(x,xpdf)
subplot(2,1,2), plot(x,xcdf)
42
Chi-squared distribution

• One of the most widely used distributions in statistical significance tests.
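The definition itself was an image; the standard statement: if Z1, ..., Zk are independent standard normal random variables, then

$$Q = \sum_{i=1}^{k} Z_i^2$$

follows a chi-squared distribution with k degrees of freedom.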

43
Chi-squared distribution

https://en.wikipedia.org/wiki/Chi-squared_distribution
44
F-distribution

• If U1 and U2 have chi-squared distributions with d1 and d2 degrees of freedom respectively, and U1 and U2 are independent, then

F = (U1/d1) / (U2/d2)

follows an F-distribution with (d1, d2) degrees of freedom.

https://en.wikipedia.org/wiki/F-distribution

• The F-distribution arises frequently as the null distribution in the analysis of variance.
45
Student's t-distribution

• The t-distribution is often used as an alternative to the normal distribution as a model for data. It is frequently the case that real data have heavier tails than the normal distribution allows for. The classical approach was to identify outliers and exclude or downweight them in some way.
46
https://en.wikipedia.org/wiki/Student%27s_t-distribution
Student's t-distribution

• Arises in the problem of estimating the mean of a normally distributed population when the sample size is small (hence the standard deviation is not well known).

• Suppose X1, ..., Xn are independent random variables that are normally distributed with expected value μ and variance σ². Let X̄ be the sample mean and S the sample standard deviation; then (with DOF = n - 1)

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

follows a t-distribution with n - 1 degrees of freedom.
47
Using the t-distribution

• The t statistic is a scaled quantity with a mean of 0 and a standard deviation of 1.
• If T were a Gaussian distribution, its PDF and CDF would be those of the standard normal. (The slide's PDF and CDF formulas were images.)
48
t-distribution

• A random sampling of screws gives weights 30.02, 29.99, 30.11, 29.97, 30.01, 29.99.
• Calculate a 95% confidence interval for the population's mean weight.

49
t-distribution

• The probability that x falls beyond the critical value z_(α/2) in either tail is α (α/2 per tail).

50
https://www.studypug.com/statistics-help/confidence-intervals-with-t-distribution
t-distribution

• This gives the confidence interval on μ at the 1-α confidence level. This is a commonly used statistic for estimating the population mean. Often the level used is 95%, roughly the ±2σ range.
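For completeness, the interval (a standard formula, consistent with the code two slides ahead):

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$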
51
t-distribution

Find the data value corresponding to a probability:

α/2 = cdf('t', z_(α/2), DOF)     (forward)
z_(α/2) = tinv(α/2, DOF)         (inverse)

For a Gaussian distribution:

z_(α/2) = norminv(α/2, mean, std/sqrt(n))
52
t-distribution

x = [30.02, 29.99, 30.11, 29.97, 30.01, 29.99];
xmean = mean(x)
xstd = std(x)
n = length(x)
% t-value at 2.5% (5%/2), DOF = n-1
% the t-distribution CDF is symmetric
tvalue = abs(tinv(0.025,n-1))
% lower/upper bounds
low = xmean - tvalue*xstd/sqrt(n)
high = xmean + tvalue*xstd/sqrt(n)
53
Comparison to normal distribution

% if the sample size is large
xmean = mean(x);
xsig = xstd/sqrt(n);
tvalue = abs(norminv(0.025,0,1))   % 1.96
xmean + tvalue*xsig
xmean - tvalue*xsig
% ...... or ......
norminv(0.975,xmean,xsig)
norminv(0.025,xmean,xsig)

Results:
t-distribution (tvalue = 2.57):      [29.963, 30.067]
normal distribution (tvalue = 1.96): [29.975, 30.055]
54
Student's t and Gaussian distributions

As the number of samples increases, the t-distribution value for the 95% confidence interval approaches that of the Gaussian.

55
Statistical testing

• The null hypothesis is used to test differences between treatment and control groups; the assumption at the outset of the experiment is that no difference exists between the two groups for the variable being compared.
• "A statistically significant difference" simply means there is statistical evidence that there is a difference; it does not mean the difference is necessarily large, important, or significant in the common meaning of the word.
• Confidence level: 1 - α.

56
Statistical testing

• The null hypothesis must be stated in mathematical/statistical terms that make it possible to calculate the probability of possible samples assuming the hypothesis is correct.
• A test statistic must be chosen that summarizes the information in the sample that is relevant to the hypothesis. In the example given above, it might be the numerical difference between the two sample means, m1 - m2.
• The distribution of the test statistic is used to calculate the probability of sets of possible values (usually an interval or union of intervals).
• Among all the sets of possible values, we must choose one that we think represents the most extreme evidence against the hypothesis. That is called the critical region of the test statistic. The probability of the test statistic falling in the critical region when the null hypothesis is correct is called the p-value ("surprise" value) of the test. 57
Probability

• Frequency probability (frequentist) is the interpretation of probability that defines an event's probability as the limit of its relative frequency in a large number of trials. The problems and paradoxes of the classical interpretation motivated the development of the relative frequency concept of probability.
• Bayesian probability is an interpretation of the probability calculus which holds that the concept of probability can be defined as the degree to which a person (or community) believes that a proposition is true. The posterior is a function of the prior and the observations.
• The two groups agree that Bayesian and frequentist analyses answer genuinely different questions, but disagree about which class of question is more important to answer in scientific and engineering contexts.
58
Pearson's chi-squared test

• The null hypothesis: the relative frequencies of occurrence of observed events follow a specified frequency distribution.
• Pearson's chi-squared is the original and most widely used chi-squared test.
• The test compares the difference between each observed and theoretical frequency for each possible outcome.
59
Pearson's chi-squared test

The test statistic (the slide's formula, reconstructed):

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}, \qquad \text{DOF} = n - 1$$

where O_i is the observed and E_i the expected (theoretical) frequency of outcome i, and n is the number of possible outcomes.
60
Pearson's chi-squared test

• A chi-squared probability of 0.05 or less (alpha value, α = 0.05) is commonly used for rejecting the null hypothesis.
• Critics of α values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05).
61
Pearson's chi-squared test

• A random sample of 100 people has been drawn from a population in which men and women are equal in frequency. There were 45 men and 55 women in the sample. What is the chi-squared value?
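Worked out with the expected frequency E = 50 for each group:

$$\chi^2 = \frac{(45-50)^2}{50} + \frac{(55-50)^2}{50} = 0.5 + 0.5 = 1$$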

Probability in the chi-squared distribution: cdf('chi2', value, DOF)

Is this chi-squared value within 95% of the distribution?

cdf('chi2',1,1)
62
Pearson's chi-squared test

• What should we use?

x = linspace(0,8,100);
v = cdf('chi2',x,ones(1,100));
plot(x,v)   % x-axis: chi-squared value, y-axis: probability
% cdf('chi2',1,1) = 0.68
63
chi2inv

% chi2inv(probability, DOF)

% the value that exceeds 95% of the data with DOF = 1
chi2inv(0.95,1)   % = 3.84

If the chi-squared value < chi2inv(0.95,1), the hypothesis cannot be rejected.

64
Pearson’s Chi-squared test

65
Pearson's chi-squared test

corg = load('organicmatter_one.txt');
% 60 data points, define 8 bins
[n_exp,v] = hist(corg,8);

% generate a synthetic (theoretical) distribution
n_syn = pdf('norm',v,mean(corg),std(corg));

% rescale so the synthetic counts sum to the observed total
n_syn = n_syn .* sum(n_exp)/sum(n_syn);

subplot(1,2,1), bar(v,n_syn,'r')
subplot(1,2,2), bar(v,n_exp,'b')

66
Pearson's chi-squared test

% test statistic
chi2 = sum((n_exp - n_syn).^2 ./ n_syn)

% DOF = (# of bins - # of fitted parameters - 1)
% for a Gaussian, # of parameters is 2: mean & std
dof = 8-2-1;

% critical value at the 0.05 level
chi2inv(0.95,dof)
67
F-test

• The null hypothesis: the standard deviations of two normally distributed populations are equal.

F_crit = finv(0.95, DOF1, DOF2)

68
F-test

load('organicmatter_four.mat');
% compare standard deviations
s1 = std(corg1)
s2 = std(corg2)

% DOF
df1 = length(corg1) - 1;
df2 = length(corg2) - 1;

69
F-test

if s1 > s2
    Freal = (s1/s2)^2
else
    Freal = (s2/s1)^2
end

% find the table value for the 5% extreme, inverse F CDF
Ftable = finv(0.95,df1,df2)
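If Freal > Ftable, the null hypothesis of equal standard deviations is rejected at the 5% level; otherwise it cannot be rejected. (One aside worth noting: with the larger variance placed in the numerator, the matching degrees of freedom should be passed to finv in that same order.)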

70
Student's t-test

• The null hypothesis: the means of two distributions are equal.

• Assumptions:
  - Normal distribution of data (what test can check this?)
  - Equality of variances (what test can check this?)

71
Student's t-test

MATLAB syntax:
[h,s,ci] = ttest2(x,y,alpha)
% h: 1 rejects the null hypothesis; 0 cannot reject
% s: p-value for the observed difference of the means between x and y
% ci: confidence interval
% alpha (significance level) = 0.05

72
Student's t-test
load('organicmatter_two.mat');
[n1,x1] = hist(corg1);
[n2,x2] = hist(corg2);

h1 = bar(x1,n1);
hold on
h2 = bar(x2,n2);

set(h1,'FaceColor','none','EdgeColor','r')
set(h2,'FaceColor','none','EdgeColor','b')

hold off
73
Student's t-test

% difference of the means
mean(corg1)-mean(corg2)

% h = 1 rejects the null hypothesis
% significance: p-value associated with the t-statistic
% ci: 95% confidence interval on the difference of the means
%     (if there is no statistically significant difference)

[h, significance, ci] = ttest2(corg1,corg2,0.05)

74
There are more!
(e.g., Extremal type distribution)

75
Tutorial

We compare the observed and simulated ozone mixing ratios (ppbv, parts per billion by volume) in July and August. Negative values represent missing measurements.
Data: Atlanta_O3.txt

1. Apply a statistical test to show whether the two datasets are normally distributed.

2. Assuming that both are normal distributions, apply statistical tests to show whether the two datasets have equivalent variances and whether they have equivalent means.
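A possible starting point, with caveats: the column layout of Atlanta_O3.txt is assumed here (observed in column 1, simulated in column 2), and the tests follow the slides above.

d = load('Atlanta_O3.txt');
obs = d(:,1);  sim = d(:,2);   % assumed column order
obs = obs(obs >= 0);           % drop missing (negative) values
sim = sim(sim >= 0);
% 1. chi-squared goodness-of-fit against a normal distribution (slides 66-67)
% 2. F-test on the variances (slides 69-70), then ttest2 on the means (slide 74)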

76
