Session 01 (Introduction)

Leland Wilkinson
Adjunct Professor, UIC Computer Science
Chief Scientist, H2O.ai
[email protected]
Data Analysis
o What is data analysis?
  o Summaries of batches of data
  o Methods for discovering patterns in data
  o Methods for visualizing data
o Benefits
  o Data analysis helps us support suppositions
  o Data analysis helps us discredit false explanations
  o Data analysis helps us generate new ideas to investigate
http://blog.martinbellander.com/post/115411125748/the-colors-of-paintings-blue-is-the-new-orange
http://www.bmj.com/content/342/bmj.d671
Machine Learning
o Benefits
  o ML helps to predict outcomes
  o ML often outperforms traditional statistical prediction methods
  o ML models do not need to be understood by humans
    o Most ML results are unintelligible (the exceptions prove the rule)
    o ML people care about the quality of a prediction, not the meaning of the result
  o ML is hot (Deep Learning!, Big Data!)
http://swift.cmbi.ru.nl/teach/B2/bioinf_24.html
Inference
o Inference involves drawing conclusions from evidence
  o In logic, the evidence is a set of premises
  o In data analysis, the evidence is a set of data
  o In statistics, the evidence is a sample from a population
    o A population is assumed to have a distribution
    o The sample is assumed to be random (sometimes there are ways around that)
    o The population may be the same size as the sample (not usually a good idea)
o There are two historical approaches to statistical inference
  o Frequentist
  o Bayesian
o There are many widespread abuses of statistical inference
  o We cherry-pick our results (scientists, journals, reporters, …)
  o We didn't have a big enough sample to detect a real difference
  o We think a large sample guarantees accuracy (the bigger the better; see the simulation below)
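To make the last abuse concrete, a minimal simulation sketch (Python/NumPy; the population, the bias mechanism, and the sample sizes are invented for illustration): a huge sample drawn through a biased mechanism estimates the wrong quantity very precisely, while a modest random sample centers on the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: one million values with true mean 0
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Biased sampling mechanism: low values are never observed
# (think of nonresponse that hides part of the population)
observable = population[population > -0.5]

small_random = rng.choice(population, size=100, replace=False)
huge_biased = rng.choice(observable, size=100_000, replace=False)

print(f"true mean:           {population.mean():+.3f}")
print(f"small random sample: {small_random.mean():+.3f}")  # noisy but unbiased
print(f"huge biased sample:  {huge_biased.mean():+.3f}")   # precise but wrong
```

Increasing the biased sample's size only tightens its estimate around the wrong value.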
Smoothing
o Sometimes we want to smooth variables or relations
o Tukey phrased this as
  o data = smooth + rough (see the sketch below)
o The smoothed version should show patterns not evident in raw data
o Many of these methods are nonparametric
o Some are parametric
o But we use them to discover, not to confirm
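A minimal sketch of Tukey's decomposition, assuming a running-median smoother (one nonparametric choice among many) and synthetic data:

```python
import numpy as np

def running_median(y, window=5):
    """Smooth a series with a running median, a simple nonparametric smoother."""
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    return np.array([np.median(padded[i:i + window]) for i in range(len(y))])

rng = np.random.default_rng(1)
x = np.linspace(0, 4 * np.pi, 200)
data = np.sin(x) + rng.normal(scale=0.3, size=x.size)  # signal plus noise

smooth = running_median(data, window=9)
rough = data - smooth  # Tukey: data = smooth + rough

# If the smooth captured the pattern, the rough part should look structureless
print("sd of data: ", data.std().round(3))
print("sd of rough:", rough.std().round(3))
```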
[Figure: Quarterly US Ecommerce Retail Sales, Seasonally Adjusted. Left panel: Sales vs. Year time series; right panel: autocorrelation (Correlation vs. Lag).]
Reducing
o Reducing takes many variables and reduces them to a smaller number of variables
o There are many ways to do this
  o Principal components (PC) constructs orthogonal weighted composites based on correlations (covariances) among variables
  o Multidimensional Scaling (MDS) embeds them in a low-dimensional space based on distances between variables
  o Manifold learning projects them onto a low-dimensional nonlinear manifold
  o Random projection is like principal components except the weights are random (both are sketched below)
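A minimal sketch contrasting two of these reducers, principal components and random projection, on synthetic data (NumPy only; the sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))   # 500 cases, 20 variables (arbitrary)
Xc = X - X.mean(axis=0)          # center before computing components

# Principal components: weights are eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc_weights = eigvecs[:, ::-1][:, :2]     # keep the top 2 components
pc_scores = Xc @ pc_weights

# Random projection: the same linear form, but the weights are random,
# scaled by 1/sqrt(k) to roughly preserve distances (k = 2 here)
rp_weights = rng.normal(size=(20, 2)) / np.sqrt(2)
rp_scores = Xc @ rp_weights

print("PC scores shape:", pc_scores.shape)   # (500, 2)
print("RP scores shape:", rp_scores.shape)   # (500, 2)
```

Both reducers are weighted composites of the original variables; only the source of the weights differs.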
Learning
o Machine Learning (ML) methods look for patterns that persist across a large collection of data objects
o ML learns from new data
o Key concepts
  o Curse of dimensionality
  o Random projections
  o Regularization
  o Kernels
  o Bootstrap aggregation (sketched after this list)
  o Boosting
  o Ensembles
  o Validation
o Methods
  o Supervised
    o Classification (Discriminant Analysis, Support Vector Machines, Trees, Set Covers)
    o Prediction (Regression, Trees, Neural Networks)
  o Unsupervised
    o Neural Networks
    o Clustering
    o Projections (PC, MDS, Manifold Learning)
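Of the key concepts above, bootstrap aggregation (bagging) is easy to sketch: fit the same learner on bootstrap resamples and average the predictions. A minimal illustration, assuming a hand-rolled 1-nearest-neighbor regressor as the (hypothetical) base learner and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_nn_predict(X_train, y_train, X_test):
    """Hypothetical base learner: 1-nearest-neighbor regression."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# Synthetic regression data (arbitrary)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.4, size=200)
X_test = np.linspace(-3, 3, 50)[:, None]

# Bagging: average the predictions of learners fit on bootstrap resamples
n_bags = 50
preds = np.zeros((n_bags, len(X_test)))
for b in range(n_bags):
    idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
    preds[b] = one_nn_predict(X[idx], y[idx], X_test)
bagged = preds.mean(axis=0)

single = one_nn_predict(X, y, X_test)
truth = np.sin(X_test[:, 0])
print("single 1-NN RMSE:", np.sqrt(((single - truth) ** 2).mean()).round(3))
print("bagged 1-NN RMSE:", np.sqrt(((bagged - truth) ** 2).mean()).round(3))
```

Averaging over resamples reduces the variance of an unstable learner, which is the point of bagging.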
Anomalies
o Anomalies are, literally, lack of a law (nomos)
o The best-known anomaly is an outlier
  o This presumes a distribution with tail(s)
  o All outliers are anomalies, but not all anomalies are outliers
o Identifying outliers is not simple
  o Almost every software system and statistics text gets it wrong (see the masking sketch below)
o Other anomalies don't involve distributions
  o Coding errors in data
  o Misspellings
  o Singular events
o Often anomalies in residuals are more interesting than the estimated values
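One way the common recipes go wrong is masking: extreme values inflate the mean and standard deviation that are then used to flag them. A minimal sketch, assuming the conventional |z| > 3 cutoff versus a median/MAD alternative (neither rule is from the slides; the planted outliers are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(size=20), np.full(5, 10.0)])  # five planted outliers

# Naive rule: |z| > 3 using the sample mean and standard deviation.
# Here the five outliers inflate both enough to mask themselves.
z = (x - x.mean()) / x.std()
print("z-score rule flags:   ", np.sort(x[np.abs(z) > 3]))

# Robust rule: distance from the median in MAD units
mad = np.median(np.abs(x - np.median(x)))
robust_z = (x - np.median(x)) / (1.4826 * mad)  # 1.4826 makes MAD ~ sd for normals
print("median/MAD rule flags:", np.sort(x[np.abs(robust_z) > 3]))
```

The robust rule uses estimates the outliers cannot corrupt, so it flags the planted values that the z-score rule misses.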
Analyzing
o What Statistics is not
  o mathematics
  o machine learning
  o computer science
  o probability theory
o Statistical reasoning is rational
  o Statistics conditions conclusions
  o Statistics factors out randomness
o Wise words
  o David Moore
  o Stephen Stigler
o TFSI