Coding Final Study Guide Notes
Lecture 5: Stats & Probability

Population vs Sample
- population: all possible values that could have been collected
- sample: each singular data point actually collected
- random number generator: population = the range of values that could have been generated, sample = the values actually generated

Calculate Stats & Discuss Their Meaning
- if np.mean & np.median are similar → the distribution is not skewed
- np.std(name, ddof=1): measurements lie +/- one std away from the mean
- range: np.max() - np.min(); if large relative to the mean → outliers
- scipy.stats.mode: helpful if data are discrete values, unhelpful if data are decimal-valued
- scipy.stats.skew: negative means tail to the left, positive means tail to the right
- scipy.stats.kurtosis(name, fisher=False): 3 = normal, <3 = flatter (platykurtic), >3 = peaked (leptokurtic)

Plotting a Histogram w/ Correct Bins:
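A minimal sketch of the stats calls above plus a histogram with an explicit bin count (the array `data` and its values are placeholders, not from the lecture):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=6.5, scale=1.0, size=200)   # placeholder data

print(np.mean(data), np.median(data))             # similar -> not skewed
print(np.std(data, ddof=1))                       # sample standard deviation
print(np.max(data) - np.min(data))                # range
print(stats.skew(data))                           # <0 tail left, >0 tail right
print(stats.kurtosis(data, fisher=False))         # 3 = normal (Pearson definition)

plt.hist(data, bins=20)                           # explicit bin count
plt.show()
```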
Lecture 7: Hypothesis Testing

Central Limit Theorem
- distribution of the sample mean as sample size increases → approaches normal
- small N: the sampling distribution resembles the original population distribution
- moderate N (8): the distribution smooths and clusters toward the true population mean (bell shape)
- large N (>30): the distribution approaches normal
- distribution of the raw data → approaches the original population distribution

Drawing Random Samples

Manipulating Random Samples
- np.random.rand(N): draws from a uniform distribution with default interval [0, 1]
- 0.5 * np.random.rand(N): multiplying by a decimal makes the interval smaller, [0, 0.5]
- 6.0 + np.random.rand(N): adding a number shifts the interval, [6, 7]

Calculate Bounds for a 99% Confidence Interval:
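A sketch of one way to get the 99% confidence-interval bounds with scipy.stats.t.interval (the array `data` is a placeholder; the lecture's exact method may differ):

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=6.5, scale=1.0, size=200)  # placeholder

n = len(data)
mean = np.mean(data)
sem = np.std(data, ddof=1) / np.sqrt(n)            # SEM = sigma / sqrt(n)
lower, upper = stats.t.interval(0.99, df=n - 1, loc=mean, scale=sem)
print(lower, upper)
```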

Occurrence Probability for Theoretical Distros:


Probability that a sample from a normal distro with mean 6.5 will be greater than 5.5:
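A sketch using the normal survival function for this kind of question; the standard deviation here (1.0) is an assumed value, since it is not stated above:

```python
from scipy import stats

# P(X > 5.5) for X ~ Normal(mean=6.5, std=1.0); the std is assumed for illustration
p = stats.norm.sf(5.5, loc=6.5, scale=1.0)   # survival function = 1 - cdf
print(p)                                     # ~0.84 with these numbers
```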

Performing a Hypothesis Test for 2 Samples: comparing 2 slices within a dataset
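A sketch of a two-sample test with scipy.stats.ttest_ind (illustrative pH-like numbers; the one-sided `alternative` keyword needs scipy >= 1.6):

```python
import numpy as np
from scipy import stats

morning = np.array([7.9, 8.0, 8.1, 8.0, 7.8])   # illustrative slice 1
noon    = np.array([8.2, 8.3, 8.1, 8.4, 8.2])   # illustrative slice 2

# One-sided test: is the noon mean greater than the morning mean?
t_stat, p_value = stats.ttest_ind(noon, morning, alternative="greater")
print(t_stat, p_value)   # reject H0 if p_value < the chosen alpha
```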

Sampling Distribution, Sample Size & Number of Samples:
- population distribution: the total set of measurements
- sampling distribution of the sample mean: the distribution of means collected from different samples
- number of samples = # of sets of data → increasing it makes the distribution converge to normal; no effect on the mean
- sample size = # of measurements within each set → increasing it makes the sampling distribution narrower & decreases the uncertainty of the mean; SEM = sigma/sqrt(n)
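A sketch illustrating these points (and the "subplot of the sample-mean distribution at several sample sizes" idea below): draw repeated samples from an assumed non-normal population and histogram the sample means.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.uniform(0, 10, 100_000)        # assumed, non-normal population
sigma = population.std()

sample_sizes = [2, 8, 30]
fig, axes = plt.subplots(1, len(sample_sizes), figsize=(12, 3))
for ax, n in zip(axes, sample_sizes):
    means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    ax.hist(means, bins=30)                     # narrows as n grows
    ax.set_title(f"N={n}, SEM~{sigma / np.sqrt(n):.2f}")
plt.show()
```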
Practice Problems:
- open a dataset: ds = xr.open_dataset("path")
- select data along specific coordinate values → sel()
- average over lon & lat to get a time series: timeseries = temp.mean(dim=('lon','lat'))
- best way to select data at a specific lon & lat: ds.temperature.sel(lat=34.05, lon=-118.25, method="nearest")
- plot a time-averaged spatial heatmap using the temp variable from ds: ds.temperature.mean(dim="time").plot()

Interpreting a t-test result (example wording): "The t-stat x > the critical value y at a 90% significance level. At this significance level, we reject the null hypothesis that the noon mean pH is similar to or less than in the morning and adopt the alternative hypothesis that pH is greater in the afternoon."
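A sketch tying the xarray practice-problem calls above together (the "path" and the variable name temperature are the placeholders from the notes):

```python
import xarray as xr

ds = xr.open_dataset("path")                       # placeholder path

# Nearest grid point to a specific lon & lat
point = ds.temperature.sel(lat=34.05, lon=-118.25, method="nearest")

# Area-mean time series and a time-averaged spatial heatmap
timeseries = ds.temperature.mean(dim=("lon", "lat"))
ds.temperature.mean(dim="time").plot()
```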
Lecture 7: Hypothesis Testing Continued

Subplot of the Sampling Distribution of the Sample Mean at Several Sample Sizes (see the sampling-distribution sketch above):

Lecture 6: Time Series Analysis

Fitting Polynomial Functions to Data:


- overfitting: the model is too complex & captures noise → poor generalization to new data
- underfitting: the model is too simple & fails to capture the true pattern
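A sketch of fitting polynomials of different degrees with np.polyfit/np.polyval on synthetic data; the high degree is chosen only to illustrate overfitting:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, x.size)    # noisy linear data

xs = np.linspace(0, 10, 200)
for degree in (1, 9):                             # simple fit vs. likely overfit
    coeffs = np.polyfit(x, y, degree)
    plt.plot(xs, np.polyval(coeffs, xs), label=f"degree {degree}")
plt.scatter(x, y, color="k")
plt.legend()
plt.show()
```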

Linear Interpolation:


easy to implement & no extreme oscillations; use on sparse data points
Spline Interpolation:

- same as linear, but add the cubic argument to the 3rd line of the code
- use when the data has natural continuous variation & a smooth curve is needed
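A sketch of linear vs. cubic-spline interpolation with scipy.interpolate.interp1d (the data points are placeholders; the "3rd code line" above refers to the lecture's own snippet, which is not reproduced here):

```python
import numpy as np
from scipy import interpolate

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

f_linear = interpolate.interp1d(x, y)                # default kind="linear"
f_cubic = interpolate.interp1d(x, y, kind="cubic")   # the cubic argument

x_new = np.linspace(0, 4, 50)
print(f_linear(x_new)[:5])
print(f_cubic(x_new)[:5])
```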

Global Fit & Applying It to a Value:


Extrapolation:
interp.interp1d(x, y, bounds_error=False, fill_value="extrapolate")  (interp = scipy.interpolate)
How Polynomial Functions Are Fit to Data (least-squares regression, LSR):
1. specify the functional form (polynomial, exponential, constant)
2. guess initial values for the constants in the function
3. define a squared-error residual metric quantifying the mismatch between the observed data & the current function values
4. use an algorithm to change the coefficient values to minimize the error metric → finds the least-squares solution best fitting the data
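One way to carry out these steps is scipy.optimize.curve_fit; this sketch is illustrative and may differ from the routine used in lecture (the model form, data, and initial guesses are made up):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):              # 1) specify the functional form
    return a * x + b

x = np.linspace(0, 10, 30)
y = 3.0 * x + 2.0 + np.random.default_rng(2).normal(0, 1.0, x.size)

p0 = [1.0, 0.0]                  # 2) initial guesses for the constants
popt, pcov = curve_fit(model, x, y, p0=p0)   # 3)+4) minimize the squared residuals
print(popt)                      # best-fit coefficients
print(model(5.0, *popt))         # apply the global fit to a value
```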
Quality of the Functional Fit:
- improves when the quantity of data points increases or the noise decreases
- higher-order fits have extreme oscillations between data points, even if the data seem perfectly matched by a higher-order fit → the default is to choose the SIMPLEST fit matching the data → less prone to high-frequency oscillations
Calculate the Correlation Coefficient between Datasets:

- measures a linear relationship only; >0.7 strong, 0.3-0.7 moderate, <0.3 weak
- 2 independent datasets can still have a strong correlation, indicating they are impacted by a common 3rd variable
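A sketch of computing a Pearson correlation coefficient between two arrays (synthetic data used for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(size=100)
b = 0.8 * a + rng.normal(scale=0.5, size=100)   # constructed to correlate with a

r = np.corrcoef(a, b)[0, 1]       # off-diagonal entry of the correlation matrix
r2, p = stats.pearsonr(a, b)      # same coefficient plus a p-value
print(r, r2, p)
```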
Other
- ddof in np.std: population std → ddof=0 (divides by n); sample std → ddof=1 (divides by n-1)
- matrices are in the format (#rows, #columns)
Calculating Degrees of Freedom
- for a confidence interval → dof = n - 1
- for a 2-sample t-test → dof = n1 + n2 - 2

Lecture 8: Multi-Dimensional Data Analysis

Using xarray's .plot(), .contour(), etc.
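A sketch of the common xarray plotting calls (the dataset path and the variable name temperature are placeholders; a 2-D lat/lon field is assumed):

```python
import xarray as xr

ds = xr.open_dataset("path")                 # placeholder path
da = ds.temperature.mean(dim="time")         # 2-D (lat, lon) field

da.plot()              # default plot; pcolormesh-style map for 2-D data
da.plot.contour()      # contour lines
da.plot.contourf()     # filled contours
```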