Ict515 Lec1

The document provides an overview of statistical learning, defining it as a set of tools for understanding data, which can be categorized into supervised and unsupervised learning. It discusses various applications, including predicting wages and stock market movements, and emphasizes the importance of understanding simpler methods before moving to more complex ones. Additionally, the document distinguishes between statistical learning and machine learning, highlighting their overlapping yet distinct focuses.


ICT 515

Foundations of Data Science

TOPIC 1: The Data Science
Process/Definition and Concepts of Statistical
Learning
· Some of the figures in this presentation are
taken from "An Introduction to Statistical
Learning, with Applications in R" (Springer,
2013) with permission from the authors: G.
James, D. Witten, T. Hastie and R. Tibshirani,
and from the publicly available slides by Prof.
Abbass Al-Sharif, University of Southern
California.
An overview of Statistical
Learning
· Statistical Learning refers to a vast set of tools
for understanding data. These tools can be
classified as supervised or unsupervised
· Supervised statistical learning involves building
a statistical model for predicting or estimating
an output based on one or more inputs
· With unsupervised statistical learning, there
are inputs but no supervising output – still, we
can learn relationships and structure from such
data
Objectives

On the basis of the training data we would like to:


· Accurately predict unseen test cases
· Understand which inputs affect the outcome,
and how
· Assess the quality of our predictions and
inferences
Simple is often efficient

It is important to understand the ideas behind the


various techniques, in order to know how and
when to use them
One has to understand the simpler methods first,
in order to grasp the more sophisticated ones
It is important to accurately assess the
performance of a method, to know how well or
how badly it is working - simpler methods often
perform as well as fancier ones
This is an exciting research area, having important
applications in science, industry and finance
The Netflix prize

Competition started in October 2006. Training data is ratings for
18,000 movies by 400,000 Netflix customers, each rating between
1 and 5
The training data is very sparse => about 98% of the ratings are missing
The objective is to predict the ratings for a set of 1 million
customer-movie pairs that are missing from the training data
Netflix's original algorithm achieved a root mean squared error (RMSE)
of 0.953 (on the 1-5 scale)
The first team to achieve a 10% improvement wins one million
dollars (it took 3 years!)
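The competition's scoring rule can be illustrated with a short sketch. The ratings below are invented; only the RMSE computation itself mirrors how entries were scored:

```python
import numpy as np

# Hypothetical ratings on the 1-5 scale; the numbers are made up,
# only the RMSE formula matches the Netflix prize's scoring rule.
actual = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
predicted = np.array([3.8, 3.4, 4.2, 2.5, 4.1])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 3))  # 0.469
```

A 10% improvement over 0.953 meant driving this number below roughly 0.857 on the held-out pairs.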
Statistical Learning versus
Machine Learning (1/2)

Machine learning arose as a subfield of Artificial Intelligence.


Statistical learning arose as a subfield of Statistics.
Machine learning: an algorithm that can learn from data
without relying on rules-based programming.
Statistical modeling: the formalization of relationships
between variables in the form of mathematical equations.
Statistical Learning versus
Machine Learning (2/2)
There is much overlap - both fields focus on supervised and
unsupervised problems:
Machine learning has a greater emphasis on large scale
applications and prediction accuracy. Statistical learning
emphasizes models and their interpretability, precision
and uncertainty.
But the distinction has become more and more blurred
First example (1/3)
· Let’s examine a number of factors that relate to wages for a group of
males from the Atlantic region of the United States
· We wish to understand the association between an employee’s age
and education, as well as the calendar year, on his wage
· Let’s consider the left-hand panel of the figure in the next slide. It
displays wage versus age for each of the individuals in the dataset
· There is evidence that wage increases with age but then decreases
again after roughly age 60. The blue line, which provides an estimate
of the average wage for a given age, makes this trend clearer
Figure for the Wage dataset
First example (2/3)
· Given an employee’s age, we can use this curve to predict his wage
· However, there is a significant amount of variability associated with
this average value => age alone is unlikely to provide an accurate
prediction of a particular man's wage
· We also have information, in the right-hand and center panels of the
figure, regarding each employee’s education level and the year in
which the wage was earned. From the panels it is strongly indicated
that both factors (year, education) are associated with wage
First example (3/3)
· Wages increase by approximately $10,000 in a roughly linear fashion
between 2003 and 2009, and they are typically greater for individuals with
higher education levels
· We would like to find a way to accurately predict a given man's wage and,
based on the figure, we suspect the most accurate prediction will be
obtained by combining his age, his education and the year
· One of the topics of the unit is linear regression, which can be used to
predict wage from this dataset
· However, ideally, we should predict wage in a way that accounts for the
non-linear relationship between wage and age. We will also discuss,
later in the unit, a class of approaches for addressing this problem
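As a hedged sketch of the idea, the code below fits a smooth curve to synthetic data shaped like the wage-age relationship. The hump and the decline after roughly age 60 are built in by assumption; these are not the real Wage data:

```python
import numpy as np

# Synthetic stand-in for the Wage data: mean wage rises with age, then
# declines. The functional form and all numbers are assumptions.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 300)
mean_wage = 50 + 2.0 * age - 0.018 * age**2      # peaks around age 55
wage = mean_wage + rng.normal(0, 10, age.size)

# A cubic polynomial plays the role of the smooth blue curve
curve = np.poly1d(np.polyfit(age, wage, deg=3))
print(curve(55) > curve(25))  # mid-career average wage estimated higher
```

The fitted curve estimates the *average* wage at each age; the residual spread around it is exactly the variability that makes age alone a weak predictor for any one individual.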
Second Example (1/4)
· The Wage dataset involves predicting a continuous or quantitative
output value
· This is known as a regression problem
· However, in certain cases we may instead wish to predict a non-
numerical value, i.e., a categorical or qualitative output
· For example, let’s assume that we have a stock market dataset
that contains the daily movements in the Standard & Poor’s 500
stock index over a 5-year period between 2001 and 2005
Second Example (2/4)
· The goal is to predict whether the index will increase or decrease
on a given day using the past 5 days’ percentage changes in the
index
· Here the statistical learning problem does not involve predicting a
numerical value – instead, it involves predicting whether a given
day’s stock market performance will fall into the Up bucket or the
Down bucket in the figure
· This is known as a classification problem
Figure for the Stock Market
dataset
Second Example (3/4)
· The left-hand panel of the figure for the Stock Market dataset
displays two boxplots of the previous day’s percentage changes in the
stock index:
· One for the 648 days for which the market increased
on the subsequent day, and
· One for the 602 days for which the market decreased
· The two plots look almost identical, suggesting that there is no simple
strategy for using yesterday’s movement in the S&P to predict today’s
returns
· The center and right-hand panels of the figure display boxplots for
the percentage changes in the stock index when taking into account
two days or three days prior to today
Second Example (4/4)
· Again, the plots indicate little association between past and present
returns
· If only things were that simple!
· The lack of pattern is to be expected – in the presence of strong
correlations between successive days’ returns, one could adopt a simple
trading strategy to generate profits from the market
· Still, we will see later in the unit how different statistical methods can
exploit certain weak trends in the data and lead to 60% accuracy in
predicting the direction of the movement in the market
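As an illustrative sketch (not the unit's actual method, and with synthetic data standing in for the real S&P record), a simple logistic regression can squeeze a deliberately weak signal out of a lagged return:

```python
import numpy as np

# Synthetic Up/Down days with a weak dependence on yesterday's return,
# echoing the weak trends in the real data. All numbers are invented.
rng = np.random.default_rng(1)
lag1 = rng.normal(0, 1, 2000)                 # yesterday's % change
prob_up = 1 / (1 + np.exp(-0.3 * lag1))       # weak positive dependence
up = (rng.random(2000) < prob_up).astype(float)

# Fit a one-feature logistic regression by gradient descent
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * lag1 + b)))
    w -= 0.1 * np.mean((p - up) * lag1)       # gradient of mean log-loss
    b -= 0.1 * np.mean(p - up)

pred = (1 / (1 + np.exp(-(w * lag1 + b)))) > 0.5
acc = np.mean(pred == up)
print(round(acc, 2))  # modestly above 50%, never close to perfect
```

Even with the true signal known and recovered, accuracy stays only modestly above coin-flipping, which is the realistic ceiling for this kind of problem.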
Third Example (1/2)
· In the previous two examples, our datasets had both input and output
variables
· However, another important class of problems involves situations in
which we only observe input variables, with no corresponding output
· For example, in a marketing setting, we might have demographic
information for a number of current or potential customers. We may
wish to understand which types of customers are similar to each other,
by grouping individuals according to their observed characteristics
· This is a third type of problem, known as a clustering problem. In this
case, we are not trying to predict an output variable
Third Example (2/2)
· Another similar problem can be the clustering of gene expression
measurements in a two-dimensional space, in order to determine
whether specific measurements correspond to specific types of cancer
· We will focus, later in the unit, on unsupervised learning
approaches in order to handle problems like the clustering of customers
or genes’ measurements
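A minimal sketch of the clustering idea: a bare-bones k-means run on synthetic two-dimensional "customer" data with two obvious groups. All values are invented, and real market segmentation would use many more variables:

```python
import numpy as np

# Two synthetic groups of "customers" in two dimensions
rng = np.random.default_rng(2)
group_a = rng.normal([0.0, 0.0], 0.5, (50, 2))
group_b = rng.normal([5.0, 5.0], 0.5, (50, 2))
X = np.vstack([group_a, group_b])

# Bare-bones k-means with k = 2; start one center in each region
centers = X[[0, 50]].copy()
for _ in range(10):
    # assign each point to its nearest center, then recompute the centers
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centers).astype(int))  # recovers the two group centers
```

Note that no output variable is used anywhere: the groups emerge from the input measurements alone, which is what makes this unsupervised.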
Other Statistical Learning
Problems
· Predicting whether someone will have a heart attack on the basis of
demographic, diet and clinical measurements
· Customizing an email spam detection system
· Identifying the numbers in a handwritten zip code
· Classifying the pixels in a LANDSAT image by usage (type of
soil, type of vegetation)
Notation (1/3)
· We denote as m the number of distinct data points (observations) in our
sample and n the number of variables (e.g., in the Wage dataset, there
are 3000 people => m=3000, and 12 variables, such as year, age, wage,
etc. => n=12)
· In some examples, n might be quite large, such as on the order of
thousands or even millions.
· This is often the case in the analysis of modern biological
data or web-based advertising data
Notation (2/3)

· In general, we let x_ij represent the value of
the j-th variable for the i-th observation,
where i = 1, 2, 3, …, m and j = 1, 2, …, n
· So, i indexes the observations and j indexes the
variables
Notation (3/3)

· We let D denote an m×n matrix whose
(i,j)-th element is x_ij. If you are unfamiliar
with matrices, it is useful to visualize D as a
spreadsheet of numbers with m rows and n
columns.
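The notation can be made concrete with a tiny sketch; the numbers are invented, loosely echoing the Wage variables:

```python
import numpy as np

# m = 4 observations (rows), n = 3 variables (columns): year, age, wage.
# All values are made up for illustration.
D = np.array([[2003, 18,  75.0],
              [2004, 24,  70.9],
              [2003, 45, 131.0],
              [2005, 50, 154.7]])

m, n = D.shape
# The slides count from 1, NumPy from 0: element x_32
# (3rd observation, 2nd variable) is D[2, 1]
x_32 = D[2, 1]
print(m, n, x_32)  # 4 3 45.0
```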
What is Statistical Learning
(1/4)
· Suppose we are statistical consultants hired by
a client to provide advice on how to improve
sales of a particular product
· The dataset consists of the sales of that product
in 200 different markets, along with advertising
budgets for the product in each of those
markets for three different media: TV, radio and
newspapers
· The data is shown in the next Figure
What is Statistical Learning
(2/4)
· It is not possible for our client to directly
increase sales of the product
· BUT: the client can control the advertising
expenditure in each of the three media
· So, if we determine that there is an association
between advertising and sales, then we can
instruct our client to adjust advertising budgets,
thereby increasing sales indirectly
· Therefore, our goal is to develop an accurate
model that can be used to predict sales on the
basis of the three media budgets
What is Statistical Learning
(3/4)

[Figure: sales plotted against the TV, radio and newspaper advertising budgets for the 200 markets]

What is Statistical Learning
(4/4)

 Suppose we observe Y_i and X_i = (X_i1, X_i2, …, X_in) for i = 1, …, m
 We believe that there is a relationship between Y and
at least one of the X's.
 We can model the relationship as

Y_i = f(X_i) + ε_i

 Where f is an unknown function and ε is a
random error term, independent of X, with mean
zero (unbiased estimation)

A Simple Example (1/2)

[Figure: a scatterplot of simulated (x, y) points, with x ranging over 0 to 1 and y over roughly -0.10 to 0.10]

A Simple Example (2/2)

[Figure: the same scatterplot with the true curve f drawn through the points and one error ε_i marked as a vertical deviation from f]

Different Standard
Deviations

• The difficulty of estimating f will depend on the
standard deviation of the ε's

[Figure: four panels of simulated data generated from the same f with error standard deviations sd = 0.001, 0.005, 0.01 and 0.03]

Different Estimates For f

[Figure: the four panels again, now with an estimate of f overlaid on the data for each standard deviation sd = 0.001, 0.005, 0.01 and 0.03]
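The simulation behind these figures can be re-created in spirit. The sine curve standing in for the true f and the polynomial estimator are assumptions, not the slides' originals, but the message carries over: the noisier the data, the harder f is to recover:

```python
import numpy as np

# Observe y = f(x) + noise at several noise levels and estimate f each
# time. The sine f and the degree-5 polynomial fit are assumptions.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 100)
f = 0.1 * np.sin(2 * np.pi * x)

errs = {}
for sd in (0.001, 0.005, 0.01, 0.03):
    y = f + rng.normal(0, sd, x.size)
    fhat = np.poly1d(np.polyfit(x, y, deg=5))(x)
    errs[sd] = np.sqrt(np.mean((fhat - f) ** 2))  # distance from the true f
    print(f"sd={sd}: estimation error {errs[sd]:.4f}")
```

Running this shows the estimation error growing with the standard deviation of the ε's, exactly the pattern the four panels illustrate.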

More than one input
variable

 In most of the previous examples, the function f
involved one input variable (X)
 In general, f may involve more than one input
variable
 In the next Figure, income is plotted as a function of
years of education and of seniority
 Here f is a two-dimensional surface that must be
estimated based on the observed data
 The blue surface represents the true underlying
relationship between income and years of education
and seniority. The red dots indicate the observed
values of these quantities for 30 individuals

Income vs. Education and


Seniority

Why estimate f?
 Statistical Learning, and this unit, are all
about how to estimate f.
 The term statistical learning refers to using the
data to “learn” f.
 Why do we care about estimating f?
 There are 2 reasons for estimating f,
 Prediction and
 Inference.

1. Prediction
 If we can produce a good estimate for f (and the
variance of ε is not too large) we can make
accurate predictions for the response, Y, based
on a new value of X.

Example: Direct Mailing


Prediction
 Interested in predicting how much money an
individual will donate based on observations
from 90,000 people on which we have recorded
over 400 different characteristics.
 Don’t care too much about each individual
characteristic.
 Just want to know: For a given individual should I
send out a mailing?

2. Inference
 Alternatively, we may be interested in the type
of relationship between Y and the X's.
 For example,
Which particular predictors actually affect the
response?
Is the relationship positive or negative?
Is the relationship a simple linear one or is it more
complicated?

Example: Housing Inference


 Wish to predict median house price based on 14
variables.
 Probably want to understand which factors have
the biggest effect on the response and how big
the effect is.
For example, how much impact does a river view
have on the house value?

How Do We Estimate f?
 We will assume we have observed a set of
training data

{(X_1, Y_1), (X_2, Y_2), …, (X_m, Y_m)}
 We must then use the training data and a
statistical method to estimate f.
 Statistical Learning Methods:
Parametric Methods
Non-parametric Methods

Parametric Methods (1/3)


 The use of parametric methods reduces the problem
of estimating f down to one of estimating a set of
parameters.
 They involve a two-step model based approach

STEP 1:
Make some assumption about the functional form of f,
i.e. come up with a model. The most common
example is a linear model, i.e.

f(X_i) = β_0 + β_1 X_i1 + β_2 X_i2 + … + β_n X_in

Parametric Methods (2/3)


 Although it is almost never correct, a linear
model often serves as a good and
interpretable approximation to the unknown
true function f(X)
 We will also examine more complicated, and
flexible, models for f. In a sense the more
flexible the model the more realistic it is.

Parametric Methods (3/3)


STEP 2:
Use the training data to fit the model, i.e. estimate f
or equivalently the unknown parameters β_0,
β_1, β_2, …, β_n.

The most common approach for estimating


the parameters in a linear model is ordinary
least squares (OLS).
But there are often superior approaches.
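A minimal sketch of Step 2 via OLS, fitting a linear model to synthetic data generated from known coefficients so the recovery can be checked. Everything below is invented for illustration:

```python
import numpy as np

# Synthetic data from a known linear model, so we can verify that OLS
# recovers beta_0 = 1, beta_1 = 2, beta_2 = -3. All values invented.
rng = np.random.default_rng(4)
X = rng.normal(0, 1, (200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)

A = np.column_stack([np.ones(len(X)), X])     # prepend a column for beta_0
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 1))                      # approximately [1, 2, -3]
```

The column of ones is what carries the intercept β_0; the rest of the fit is a single linear-algebra call.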

Revisit the Income vs.


Education and Seniority
Example

Example: A Linear Regression


Estimate

• Even if the
standard
deviation is low
we will still get a
bad answer if we
use the wrong
model
• The true f has
some curvature
that is not
captured in the
linear fit

Non-parametric Methods
 They do not make explicit assumptions about the
functional form of f.
 Instead, they seek an estimate of f that gets as
close to the data points as possible without being
too rough or wiggly.

Advantages of non-
parametric Methods
 They accurately fit a wider range of possible
shapes of f
 Any parametric approach brings with it the
possibility that the functional form used to
estimate f is very different from the true f => the
resulting model will not fit the data well
 In contrast, non-parametric approaches
completely avoid this danger, since essentially
no assumption is made about the form of f

Disadvantage of non-
parametric Methods
 Since they do not reduce the problem of
estimating f to a small number of parameters, a
very large number of observations is required to
obtain an accurate estimate of f

Example of a non-parametric
approach: A Thin-Plate Spline
Estimate
• Non-linear
regression
methods are
more flexible
and can
potentially
provide more
accurate
estimates.
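One non-parametric estimator that assumes no functional form for f is k-nearest-neighbours regression. The figure's thin-plate spline is a different method, but the spirit is the same; here is a sketch on synthetic one-dimensional data:

```python
import numpy as np

# k-NN regression: estimate f(x0) by averaging the responses of the k
# training points closest to x0. No functional form for f is assumed.
# The sine f and all numbers below are invented for illustration.
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)

def knn_predict(x0, k=15):
    nearest = np.argsort(np.abs(x - x0))[:k]  # indices of the k closest points
    return y[nearest].mean()

# The true value at x = 0.25 is sin(pi/2) = 1; the estimate should be close
print(knn_predict(0.25))
```

The choice of k controls how rough or wiggly the estimate is, which previews the flexibility tradeoff discussed next.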

Tradeoff Between Prediction


Accuracy and Model
Interpretability (1/2)
 Why not just use a more flexible method if it is
more realistic?
 There are two reasons
Reason 1:
A simple method such as linear regression produces
a model which is much easier to interpret (the
Inference part is better). For example, in a linear
model, β_j is the average increase in Y for a one-unit
increase in X_j, holding all other variables constant.

Tradeoff Between Prediction


Accuracy and Model
Interpretability (2/2)

Reason 2:
Even if you are only interested in prediction (so the
first reason is not relevant) it is often possible to get
more accurate predictions with a simple, instead of
a complicated, model. This seems counter intuitive
but has to do with the fact that it is harder to fit a
more flexible model.

Flexibility vs.
Interpretability

A Poor Estimate

• Non-linear
regression
methods can
also be too
flexible and
produce poor
estimates for f.

Supervised vs. Unsupervised Learning


(1/3)
 We can divide all learning problems into
Supervised and Unsupervised situations
 Supervised Learning:
Supervised Learning is where both the predictors,
Xi, and the response, Yi, are observed => for each
observation of the predictor measurements there
is an associated response measurement
Most of this unit will deal with supervised
learning

Supervised vs. Unsupervised


Learning (2/3)
 Unsupervised Learning:
In this situation only the Xi’s are observed
It is not possible to fit a linear regression model,
since there is no response variable to predict
In this setting we are in some sense working blind
The situation is referred to as unsupervised
because we lack a response variable that can
supervise our analysis

Supervised vs. Unsupervised


Learning (3/3)
 Unsupervised Learning:
We can seek to understand the relationships
between the variables or between the
observations
We need to use the Xi’s to guess what Y would
have been and build a model from there
A common example is market segmentation
where we try to divide potential customers into
groups based on their characteristics
A common approach is clustering
We will consider unsupervised learning towards
the end of the unit

A Simple Clustering
Example

Regression vs.
Classification
 Supervised learning problems can be further
divided into regression and classification problems
 Regression covers situations where Y is
continuous/numerical, e.g.
 Predicting the value of the Dow Jones stock market
index in 6 months
 Predicting the value of a given house based on
various inputs
 Classification covers situations where Y is
categorical, e.g.
 Will the Dow be up (U) or down (D) in 6 months?
 Is this email SPAM or not?

Different Approaches
 We will deal with both types of problems in this
unit
 Some methods work well on both types of
problems, e.g. Neural Networks
 Other methods work best on Regression, e.g.
Linear Regression, or on Classification, e.g. k-
Nearest Neighbors.

Short Assignments for next


week

(Not to be submitted through


LMS, but to be discussed in
class)

Short Assignment 1 (1/3)


• Explain whether each scenario, below, is a
classification or regression problem, and indicate
whether we are most interested in inference or
prediction. Finally, provide for each scenario the
number of observations m and the number and
type of predictors n.
• (a) We collect a set of data on the top 500 firms
in the US. For each firm we record profit, number
of employees, industry and the CEO salary. We
are interested in understanding which factors
affect CEO salary.

Short Assignment 1 (2/3)


• (b) We are considering launching a new product
and wish to know whether it will be a success or
a failure. We collect data on 20 similar products
that were previously launched. For each product
we have recorded whether it was a success or
failure, price charged for the product, marketing
budget, competition price, and ten other
variables.

Short Assignment 1 (3/3)


• (c) We are interested in predicting the % change
in the US dollar in relation to the weekly changes
in the world stock markets. Hence we collect
weekly data for all of 2020. For each week we
record the % change in the dollar, the % change
in the US market, the % change in the British
market, and the % change in the German
market.

Short Assignment 2
• Find on the web a paper/article/blog post
discussing the differences between statistical
learning and machine learning.
• Share the link to the paper with the class, in the
Discussion Forum, and be prepared to discuss
the differences between statistical learning and
machine learning in class next week
