Lecture 4

The document discusses machine learning concepts, focusing on KNN, linear regression, model evaluation, and techniques to avoid overfitting. It includes practical examples using Python's sklearn library, such as implementing KNN and linear regression, as well as methods for cross-validation and feature engineering. Additionally, it covers regularization techniques like Ridge and Lasso regression to improve model performance.


Machine learning with Python

ESCP-Paris 2021
Slides (or images, contents) adapted from D. Dligach, C. Müller, E. Duchesnay, M. Defferrard, E. Eaton, S. Sankararaman, and many others who made their course materials freely available online.

Anh-Phuong TA
Chief data scientist at Le Figaro CCM-Benchmark group
[email protected]


Exercise
Testing kNN with boston dataset
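A possible sketch for this exercise (an assumption about what is asked, not the official solution): the Boston data set is a regression problem, so the regressor variant of kNN is used; note that load_boston has been removed from scikit-learn 1.2 and later.

from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_boston          # removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

# Predict a price as the average target of the k nearest neighbors (n_neighbors=3 is illustrative)
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test R^2: {:.2f}".format(knn.score(X_test, y_test)))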
A little bit …
Today’s lecture
• KNN (last time)
• Linear regression
• Model evaluation
• How to avoid overfitting
KNN
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Two equivalent ways to compute accuracy on the test set
print("Score: {:.2f}".format(np.mean(y_pred == y_test)))
print("Score: {:.2f}".format(knn.score(X_test, y_test)))
Overfitting & Underfitting
Overfitting & Underfitting
Avoid overfitting
• Reduce the number of features manually or do feature selection.
• Do model selection.
• Use regularization (keep the features but reduce their importance by setting small parameter values).
• Do cross-validation to estimate the test error.

Cross-validation

pro: more stable, more data
con: slower
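As a minimal illustration (reusing the iris data and the kNN classifier from the earlier slide; cv=5 is an arbitrary choice), a full cross-validation is one call to cross_val_score:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Train and evaluate on 5 different train/validation splits
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # more stable estimate than a single split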

Cross-validation
Cross-validation
Cross-validation
GridSearchCV
GridSearchCV results
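A short sketch of GridSearchCV with the kNN classifier from earlier (the grid over n_neighbors is illustrative):

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try several values of n_neighbors, each evaluated with 5-fold CV on the training set
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print(grid.score(X_test, y_test))  # final evaluation on the held-out test set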
CV strategies

Stratified: Ensure relative class frequencies in each fold reflect relative class frequencies on the whole dataset.
Repeated KFold and LeaveOneOut
• LeaveOneOut: KFold(n_splits=n_samples). High variance, takes a long time.
• Better: RepeatedKFold. Apply KFold or StratifiedKFold multiple times with shuffled data. Reduces variance!
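A minimal sketch of passing such a strategy as the cv argument (the iris data and the numbers of splits/repeats are illustrative):

from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# 5-fold stratified CV repeated 10 times with different shuffles => 50 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)
print(scores.mean(), scores.std())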
StratifiedShuffleSplit
Using Cross-Validation Generators
cross_validate Function
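A minimal sketch of the cross_validate function, which unlike cross_val_score also returns fit/score times and, optionally, training scores; the StratifiedShuffleSplit generator from the previous slide is used as the cv argument (the settings are illustrative):

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
# Any CV generator can be passed via the cv argument
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
res = cross_validate(KNeighborsClassifier(), X, y, cv=cv, return_train_score=True)
print(res['test_score'].mean(), res['train_score'].mean())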
Feature engineering: scaling
Standard Scaler Example
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer
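A minimal sketch of how such a scaler is used (the breast cancer data set is just an example): fit it on the training data only, then apply the learned statistics to both splits.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
scaler.fit(X_train)                       # learn mean and std on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics on the test set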

Standard Scaler + pipeline


Pipeline
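A small sketch of chaining the scaler and a model in a pipeline, so that the scaling is re-learned inside each cross-validation fold and no information leaks from the validation data (the kNN model and data set are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# The pipeline refits the scaler on each training fold, avoiding data leakage
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(cross_val_score(pipe, X, y, cv=5).mean())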
Categorical variables (credit: C. Müller)
Ordinal encoding
One-Hot (Dummy) Encoding
One-Hot (Dummy) Encoding
One-Hot (Dummy) Encoding
Categorical columns with Pandas

Or you can use:


from sklearn.preprocessing import OneHotEncoder
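A tiny sketch contrasting the two options (the toy DataFrame is made up for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'city': ['Paris', 'London', 'Paris'], 'salary': [50, 60, 55]})

# Option 1: pandas one-hot encodes the non-numeric columns
print(pd.get_dummies(df))

# Option 2: scikit-learn encoder, usable inside a Pipeline / ColumnTransformer
enc = OneHotEncoder(handle_unknown='ignore')
print(enc.fit_transform(df[['city']]).toarray())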
Dealing with Missing Values
Among others, imputation methods:
• Mean / Median
• kNN
• Regression model
• Matrix factorization
Baseline: Dropping Columns

import numpy as np

# Find the columns that contain at least one NaN and drop them
nan_columns = np.any(np.isnan(X_train), axis=0)
X_drop_columns = X_train[:, ~nan_columns]

And then, use X_drop_columns to train your model.

Imputation: Median, Mean


Imputation: Median, Mean
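A minimal sketch of mean/median imputation with scikit-learn's SimpleImputer (the tiny array is made up for illustration); for the kNN option, scikit-learn also provides KNNImputer in the same module.

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
# Replace each NaN with the median of its column (strategy='mean' also works)
imp = SimpleImputer(strategy='median')
X_imputed = imp.fit_transform(X_train)
print(X_imputed)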
Linear regression
• If your data:

Good to use linear regression, and our goal is to find:
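The regression line itself is not reproduced in the extracted slide; in the usual notation (consistent with the h(x_i) used later in this lecture), the model we are looking for is

h(x) = w_0 + w_1 x                          (one input variable)
h(x) = w_0 + w_1 x_1 + … + w_p x_p          (multiple linear regression with p input variables)

where the weights w_j are the parameters to be learned.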

Note that: if there is more than one input variable => multiple linear regression

Linear regression
• If your data:

Humm!!!
Or do some data transformations first
Loss/Cost function

It compares all the predictions against their actual values and provides us with a score value.
Training and Testing

=> f(x_i) is used interchangeably with h(x_i)


Training and Testing: linear regression

=> z is used interchangeably with phi


Linear regression: loss functions
Loss/Cost function
ML algorithms often define an objective function.
This function is optimized during learning.
It is often a cost function we want to minimize.
The function J below defines the cost as the sum of squared errors (SSE); the weights are learned by minimizing it:
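The formula on the slide is not reproduced in the extracted text; a standard way to write the SSE cost over n training examples, using the h(x_i) notation from above, is

J(w) = (1/2) * sum_{i=1..n} (h(x_i) - y_i)^2

where the factor 1/2 is a common convention that simplifies the derivative.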

Learning as optimization
The fundamental difficulty of machine learning

Picture was taken from some ML courses at Stanford.
How to optimize?
What is the gradient?
Gradient descent?
Gradient descent: an intuition
Gradient descent: an intuition
Gradient descent: an intuition
Gradient descent
Gradient computation

We update all weights simultaneously:
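The update rule itself is missing from the extracted text; in the usual notation, with step size (learning rate) alpha, each weight is updated as

w_j := w_j - alpha * ∂J(w)/∂w_j        (for every j, using the old values of all weights)

so that every weight moves a small step in the direction that decreases the cost J.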


Partial derivatives

Whiteboard!!!!
What should the step size be?

Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD)
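A compact NumPy sketch of SGD for simple linear regression; the learning rate, number of epochs and synthetic data are illustrative choices, not values from the slides:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.randn(100)   # noisy line, roughly y = 3x + 2

w, b = 0.0, 0.0
alpha = 0.1                                      # step size
for epoch in range(50):
    for i in rng.permutation(len(X)):            # one example at a time, in random order
        pred = w * X[i, 0] + b
        error = pred - y[i]
        # gradient of (1/2) * (pred - y_i)^2 with respect to w and b
        w -= alpha * error * X[i, 0]
        b -= alpha * error
print(w, b)   # should end up close to 3 and 2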

Avoid overfitting
• Reduce the number of features manually or do feature selection.
• Do model selection.
• Use regularization (keep the features but reduce their importance by setting small parameter values).
• Do cross-validation to estimate the test error.

Avoid overfitting: Regularization
Idea: regularized Empirical Risk Minimization

Ridge Regression (L2)


Lasso (least absolute shrinkage and selection operator) (L1)
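The penalized objectives from these slides are not reproduced in the extracted text; in the usual form, with regularization strength alpha (the alpha parameter in sklearn), they are

Ridge (L2):  J(w) = SSE(w) + alpha * sum_j w_j^2
Lasso (L1):  J(w) = SSE(w) + alpha * sum_j |w_j|

A key difference: the L1 penalty tends to drive some weights exactly to zero (built-in feature selection), while the L2 penalty only shrinks them towards zero.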
Understanding L1 and L2 Penalties
Understanding L1 and L2 Penalties
Example
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression, Lasso
from sklearn.datasets import load_boston   # note: removed in scikit-learn >= 1.2
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: plain linear regression, evaluated with 10-fold cross-validation
score = cross_val_score(LinearRegression(), X_train, y_train, cv=10)
print(np.mean(score))

# Grid-search the regularization strength alpha for Ridge
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)
grid = GridSearchCV(Ridge(), param_grid, cv=10, return_train_score=True)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
L1 + L2 = Elastic Net

In sklearn
Grid-searching ElasticNet
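A small sketch of grid-searching ElasticNet over both its regularization strength and the L1/L2 mixing ratio, reusing the X_train/y_train split from the Example slide above (the grid values are illustrative):

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# ElasticNet combines both penalties: l1_ratio=1 is pure Lasso, l1_ratio close to 0 is mostly Ridge
param_grid = {'alpha': np.logspace(-3, 1, 9),
              'l1_ratio': [0.1, 0.5, 0.9, 1.0]}
grid = GridSearchCV(ElasticNet(max_iter=10000), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)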
Assignment 3
