Session 01 (Introduction)

Data Analysis, Statistics, Machine Learning

Leland Wilkinson

Adjunct Professor
UIC Computer Science
Chief Scientist
H2O.ai

[email protected]
Data Analysis
o What is data analysis?
o Summaries of batches of data
o Methods for discovering patterns in data
o Methods for visualizing data
o Benefits
o Data analysis helps us support suppositions
o Data analysis helps us discredit false explanations
o Data analysis helps us generate new ideas to investigate

http://blog.martinbellander.com/post/115411125748/the-colors-of-paintings-blue-is-the-new-orange



Statistics
o What is (are) statistics?
o Summaries of samples from populations
o Methods for analyzing samples
o Making inferences based on samples
o Benefits
o Statistics help us avoid false conclusions when evaluating evidence
o Statistics protect us from being fooled by randomness
o Statistics help us find patterns in nonrandom events
o Statistics quantify risk
o Statistics counteract ingrained bias in human judgment
o Statistical models are understandable by humans

http://www.bmj.com/content/342/bmj.d671



Machine Learning
o What is machine learning?
o Data mining systems
o Discover patterns in data
o Learning systems
o Adapt models over time
o Benefits
o ML helps to predict outcomes
o ML often outperforms traditional statistical prediction methods
o ML models do not need to be understood by humans
o Most ML results are unintelligible (the exceptions prove the rule)
o ML people care about the quality of a prediction, not the meaning of the result
o ML is hot (Deep Learning!, Big Data!)

http://swift.cmbi.ru.nl/teach/B2/bioinf_24.html



Course Outline
1. Introduction
2. Data
3. Visualizing
4. Exploring
5. Summarizing
6. Distributions
7. Inference
8. Predicting
9. Smoothing
10. Time Series
11. Comparing
12. Reducing
13. Grouping
14. Learning
15. Anomalies
16. Analyzing



Data
o What is (are) data?
o A datum is a given (as in French donnée)
o data is the plural of datum
o Data may have many different forms
o Set, Bag, List, Table, etc.
o Many of these forms are amenable to data analysis
o None of these forms is suitable for statistical analysis
o Statistics operate on variables, not data (see the sketch below)
o A variable is a function mapping data objects to values
o A random variable is a variable whose values are each associated with a probability p (0 ≤ p ≤ 1)
o Visualizations operate on data or variables
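A minimal sketch (Python; the records and the `age` function are hypothetical illustrations, not course material) of the distinction: a variable is a function from data objects to values, and statistics operate on those values:

```python
# Data in one of its many forms: a list (table) of data objects.
patients = [                       # hypothetical example records
    {"name": "a", "age": 34, "weight": 70.0},
    {"name": "b", "age": 51, "weight": 82.5},
    {"name": "c", "age": 29, "weight": 61.2},
]

# A variable is a function mapping data objects to values.
def age(obj):
    return obj["age"]

# Statistics operate on the variable's values, not on the raw objects.
ages = [age(p) for p in patients]
mean_age = sum(ages) / len(ages)
print(mean_age)                    # 38.0
```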



Visualizing
o Visualizations represent data
o Tallies, stem-and-leaf plots, histograms, pie charts, bar charts, …
o Statistical visualizations represent variables
o Probability plots, density plots, …
o Statistical visualizations aid diagnosis of models
o Does a variable derive from a given distribution?
o Are there outliers and other anomalies?
o Are there trends (or periodicity, etc.) across time?
o Are there relationships between variables?
o Are there clusters of points (cases)?
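A minimal sketch (Python with Matplotlib/SciPy; hypothetical data, not from the slides) of the two kinds of plots: a histogram representing the data themselves, next to a probability plot diagnosing whether the variable derives from a Normal distribution:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)   # hypothetical skewed sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram represents the data.
ax1.hist(x, bins=30)
ax1.set_title("Histogram")

# A probability plot asks: does this variable derive
# from a given (here Normal) distribution?
stats.probplot(x, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```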



Exploring
o Exploratory Data Analysis (John W. Tukey, EDA)
o Summaries
o Transformations
o Smoothing
o Robustness
o Interactivity
o What EDA is not …
o Letting the data speak for itself
o Fishing expeditions
o Null hypothesis testing
o Qualitative Data Analysis
o Mixed methods
o Old wine in new bottles



Summarizing
o We summarize to remove irrelevant detail
o We summarize batches of data in a few numbers
o We summarize variables through their distributions
o The best summaries preserve important information
o All summaries sacrifice information (lossy)
o Summaries (see the sketch below)
o Location
o Popular: mean, median, mode
o Others: weighted mean, trimmed mean, …
o Spread
o Popular: sd, range
o Others: Interquartile Range, Median Absolute Deviation, …
o Shape
o Skewness
o Kurtosis
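A minimal sketch (NumPy/SciPy; the batch is hypothetical) computing popular and robust summaries of each kind:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # hypothetical skewed batch

# Location
print("mean        ", np.mean(x))
print("median      ", np.median(x))
print("trimmed mean", stats.trim_mean(x, proportiontocut=0.1))

# Spread
print("sd          ", np.std(x, ddof=1))
print("IQR         ", stats.iqr(x))
print("MAD         ", stats.median_abs_deviation(x))

# Shape
print("skewness    ", stats.skew(x))
print("kurtosis    ", stats.kurtosis(x))   # excess kurtosis (Normal = 0)
```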
 
Distributions
o A probability function is a nonnegative function
o Its area (or mass) is 1
o Distributions are families of probability functions
o Most statistical methods depend on distributions
o Nonparametric methods are distribution-free
o The Normal (Gaussian) distribution is most popular
o Other distributions (Binomial, Poisson, …) are often used
o We use the Normal because of the Central Limit Theorem
o Variables based on real data are rarely normally distributed
o But sums or means of random variables tend to be
o So if we are drawing inferences about means, Normal is usually OK
o This involves a leap of faith
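A minimal simulation sketch (Python; the distribution choice is a hypothetical illustration) of the Central Limit Theorem at work: raw exponential values are strongly skewed, but means of samples of size 50 are close to Normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Raw variable: exponential, strongly skewed (not Normal at all).
raw = rng.exponential(scale=1.0, size=100_000)

# Means of samples of size 50 drawn from the same distribution.
means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

print("skewness of raw values:  ", stats.skew(raw))    # ~2: far from Normal
print("skewness of sample means:", stats.skew(means))  # near 0: roughly Normal
```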


 
Inference
o Inference involves drawing conclusions from evidence
o In logic, the evidence is a set of premises
o In data analysis, the evidence is a set of data
o In statistics, the evidence is a sample from a population
o A population is assumed to have a distribution
o The sample is assumed to be random (sometimes there are ways around that)
o The population may be the same size as the sample (not usually a good idea)
o There are two historical approaches to statistical inference
o Frequentist
o Bayesian
o There are many widespread abuses of statistical inference (see the sketch below)
o We cherry-pick our results (scientists, journals, reporters, …)
o We didn't have a big enough sample to detect a real difference
o We think a large sample guarantees accuracy (the bigger the better)
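A minimal simulation sketch (Python; a hypothetical setup, not from the slides) of the cherry-picking abuse: run 20 studies of a true null effect, report only the "significant" ones, and randomness alone supplies about one finding:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 20 studies where the true effect is zero (both groups Normal(0, 1)).
false_positives = 0
for _ in range(20):
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05:                    # "significant" purely by chance
        false_positives += 1

print(false_positives, "of 20 null studies look significant")  # on average ~1
```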



 
Predicting
o Most statistical prediction models take one of two forms
o y = Σj(βjxj) + ε   (additive function)
o y = f(xj, ε)   (nonlinear function)
o The distinction is important
o The first form is called an additive model
o The second form is called a nonlinear model
o Additive models can be curvilinear (if terms are nonlinear)
o Nonlinear models cannot be transformed to linear
o Examples of linear or linearizable models are
o y = β0 + β1x1 + … + βpxp + ε
o y = αe^(βx) + ε
o Examples of nonlinear models are
o y = β1x1 / β2x2 + ε
o y = log(β1x1)ε
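A minimal sketch (NumPy/SciPy; hypothetical data and parameter values) fitting one model of each kind: the additive model by ordinary least squares, and an exponential model by an explicit nonlinear fit:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 100)

# Additive (linear) model: y = b0 + b1*x + error.
y_lin = 2.0 + 3.0 * x + rng.normal(0, 0.5, x.size)
b1, b0 = np.polyfit(x, y_lin, deg=1)       # ordinary least squares
print("linear fit:     ", b0, b1)          # near 2, 3

# Exponential model: y = a * exp(b*x) + error, fit by nonlinear least squares.
y_exp = 1.5 * np.exp(0.4 * x) + rng.normal(0, 0.2, x.size)
(a, b), _ = curve_fit(lambda x, a, b: a * np.exp(b * x), x, y_exp, p0=(1, 0.1))
print("exponential fit:", a, b)            # near 1.5, 0.4
```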


 
Smoothing
o Sometimes we want to smooth variables or relations
o Tukey phrased this as
o data = smooth + rough
o The smoothed version should show patterns not evident in raw data
o Many of these methods are nonparametric
o Some are parametric
o But we use them to discover, not to confirm
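A minimal sketch (Python; the running-median smoother is one common nonparametric choice, and the signal is hypothetical) of Tukey's decomposition data = smooth + rough:

```python
import numpy as np

def running_median(y, window=5):
    """Smooth a series with a centered running median (edges repeat the ends)."""
    half = window // 2
    padded = np.concatenate([np.repeat(y[0], half), y, np.repeat(y[-1], half)])
    return np.array([np.median(padded[i:i + window]) for i in range(len(y))])

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
data = np.sin(t) + rng.normal(0, 0.4, t.size)   # hypothetical noisy signal

smooth = running_median(data)
rough = data - smooth                           # data = smooth + rough
print(np.std(rough))                            # the rough part is mostly noise
```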

 
 



Time Series
o Time series statistics involve random processes over time
o Spatial statistics involve random processes over space
o Both involve similar mathematical models
o When there is no temporal or spatial influence, these boil down to ordinary statistical methods
o DO NOT USE i.i.d. methods on temporal/spatial data
o These require stochastic models, not "trend lines"
o Measurements at each time/space point are not independent
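A minimal sketch (Python; hypothetical series) of why the warning matters: the lag-1 autocorrelation of a random walk is near 1, so successive measurements are far from independent:

```python
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation of a series at the given lag."""
    y = y - y.mean()
    return np.dot(y[:-lag], y[lag:]) / np.dot(y, y)

rng = np.random.default_rng(0)
iid = rng.normal(size=500)          # independent draws
walk = np.cumsum(iid)               # a random walk: strongly dependent

print("lag-1 autocorr, i.i.d.:     ", autocorr(iid, 1))   # near 0
print("lag-1 autocorr, random walk:", autocorr(walk, 1))  # near 1
```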
[Figure: Quarterly US Ecommerce Retail Sales, Seasonally Adjusted. Left panel: sales by year, 1998-2016. Right panel: autocorrelation plot (correlation vs. lag, 0-60).]


Comparing
o Statistical methods exist for comparing 2 or more groups
o The classical approach is Analysis of Variance (ANOVA)
o This method was invented by Sir Ronald Fisher
o It revolutionized industrial/scientific experiments
o The researcher was able to examine more than one treatment at a time
o With only two groups, results of Student's t-test and F-test are equivalent
o Multivariate Analysis of Variance (MANOVA)
o This is ANOVA for more than one dependent variable (outcome)
o Hierarchical modeling is for nested data
o There are several forms of this multilevel modeling
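A minimal sketch (SciPy; hypothetical groups) of the two-group equivalence: the one-way ANOVA F statistic equals the square of Student's t:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=25)
group_b = rng.normal(11.0, 2.0, size=25)

t, p_t = stats.ttest_ind(group_a, group_b)
f, p_f = stats.f_oneway(group_a, group_b)

print(t**2, f)     # identical: with two groups, F = t^2
print(p_t, p_f)    # identical p-values
```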

 
 
Reducing
o Reducing takes many variables and reduces them to a smaller number of variables
o There are many ways to do this
o Principal components (PC) constructs orthogonal weighted composites based on correlations (covariances) among variables
o Multidimensional Scaling (MDS) embeds them in a low-dimensional space based on distances between variables
o Manifold learning projects them onto a low-dimensional nonlinear manifold
o Random projection is like principal components, except the weights are random
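A minimal sketch (NumPy; hypothetical data) of the two weighted-composite methods: principal components take their orthogonal weights from the SVD of the centered data, while a random projection draws its weights at random:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))       # hypothetical: 100 cases, 10 variables
Xc = X - X.mean(axis=0)              # center each variable

# Principal components: orthogonal weights from the SVD.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = Xc @ Vt[:2].T            # first two components

# Random projection: same idea, but the weights are random.
R = rng.normal(size=(10, 2)) / np.sqrt(2)
rp_scores = Xc @ R

print(pc_scores.shape, rp_scores.shape)   # (100, 2) (100, 2)
```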



Grouping
o We can create groups of variables or groups of cases
o These methods involve what we call Cluster Analysis
o Hierarchical methods make trees of nested clusters
o Non-hierarchical methods group cases into k clusters
o These k clusters may be discrete or overlapping
o Two considerations are especially important
o Distance/Dissimilarity measure
o Agglomeration or splitting rule
o The collection of clustering methods is huge
o Early applications were for numerical taxonomy in biology
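A minimal sketch (SciPy; hypothetical cases) of hierarchical clustering with the two key choices made explicit: a Euclidean distance measure and an average-linkage agglomeration rule:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Hypothetical cases: two well-separated blobs in 2-D.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# The two key choices: distance measure and agglomeration rule.
tree = linkage(X, method="average", metric="euclidean")

# Cut the tree of nested clusters into k = 2 discrete clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)    # cases 1-20 in one cluster, 21-40 in the other
```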
 

 
Learning
o Machine Learning (ML) methods look for patterns that persist across a large collection of data objects
o ML learns from new data
o Key concepts
o Curse of dimensionality
o Random projections
o Regularization
o Kernels
o Bootstrap aggregation
o Boosting
o Ensembles
o Validation
o Methods (see the sketch below)
o Supervised
o Classification (Discriminant Analysis, Support Vector Machines, Trees, Set Covers)
o Prediction (Regression, Trees, Neural Networks)
o Unsupervised
o Neural Networks
o Clustering
o Projections (PC, MDS, Manifold Learning)
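A minimal sketch (scikit-learn; hypothetical data) of a supervised method with validation: a bagged tree ensemble judged, in ML fashion, purely by prediction quality on held-out data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))               # hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical labels

# Validation: hold out data the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A tree ensemble built by bootstrap aggregation (bagging).
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# ML cares about the quality of prediction on new data.
print("held-out accuracy:", model.score(X_test, y_test))
```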


 
Anomalies
o Anomalies are, literally, lack of a law (nomos)
o The best-known anomaly is an outlier
o This presumes a distribution with tail(s)
o All outliers are anomalies, but not all anomalies are outliers
o Identifying outliers is not simple
o Almost every software system and statistics text gets it wrong
o Other anomalies don't involve distributions
o Coding errors in data
o Misspellings
o Singular events
o Often anomalies in residuals are more interesting than the estimated values
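A minimal sketch (Python; the cutoffs are conventional choices, not from the slides) of why identifying outliers is not simple: a rule based on the mean and sd is distorted by the very outlier it seeks, while a median/MAD rule is not:

```python
import numpy as np
from scipy import stats

x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 50.0])  # one gross outlier

# Naive rule: flag points with |z| > 3 using the mean and sd.
# The outlier inflates both, so its own z-score never reaches 3.
z = (x - x.mean()) / x.std(ddof=1)
print("naive |z| of outlier:", round(abs(z[-1]), 2))          # ~2.47: not flagged

# Robust rule: the median and MAD are barely affected by the outlier.
robust_z = (x - np.median(x)) / stats.median_abs_deviation(x, scale="normal")
print("robust |z| of outlier:", round(abs(robust_z[-1]), 1))  # ~134.7: flagged
```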

 
Analyzing
o What Statistics is not
o mathematics
o machine learning
o computer science
o probability theory
o Statistical reasoning is rational
o Statistics conditions conclusions
o Statistics factors out randomness
o Wise words
o David Moore
o Stephen Stigler
o TFSI



References
o Statistics
o andrewgelman.com
o statsblogs.com
o jerrydallal.com
o Visualization
o flowingdata.com
o eagereyes.org
o Machine Learning
o hunch.net
o nlpers.blogspot.com
o Math
o quomodocumque.wordpress.com
o terrytao.wordpress.com



References
o Abelson, R.P. (2005). Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum.
o DeVeaux, R.D., Velleman, P., and Bock, D.E. (2013). Intro Stats (4th ed.). New York: Pearson.
o Freedman, D.A., Pisani, R., and Purves, R.A. (1978). Statistics. New York: W.W. Norton.

