Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
What is machine learning?
Machine learning is the process whereby:
Computers are given the ability to learn to make decisions from data
Aim: Predict the target values of unseen data, given the features
array([0, 0, 0, 0, 1, 0])
We import a model, which is a type of algorithm for our supervised learning problem, from a scikit-learn module. For
example, the k-Nearest Neighbors model uses the distance between observations to predict labels or values. We
instantiate the model and assign it to a variable named model. The model is then fit to the data, where it learns patterns
about the features and the target variable. We fit the model to X, an array of our features, and y, an array of our
target variable values. We then call the model's .predict method, passing six new observations, X_new. For
example, if we feed the features of six emails to a spam classification model, an array of six values is returned: a
one indicates the model predicts that email is spam, and a zero represents a prediction of not spam.
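A minimal sketch of this workflow, assuming X, y, and X_new are NumPy arrays that already exist (X_new holding the six new emails' features):

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()       # instantiate the model
model.fit(X, y)                      # learn patterns from the labeled data
predictions = model.predict(X_new)   # predict labels for the six new observations
print(predictions)                   # e.g. array([0, 0, 0, 0, 1, 0])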
Classifying labels of unseen data
1. Build a model
2. Model learns from the labeled data we pass to it
3. Pass unlabeled data to the model as input
4. Model predicts the labels of the unseen data
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)
To fit a KNN model using scikit-learn, we import KNeighborsClassifier from sklearn.neighbors. We split our data into X, a 2D array of our
features, and y, a 1D array of the target values - in this case, churn status. scikit-learn requires that the features are in an array where each
column is a feature and each row is a different observation. Similarly, the target needs to be a single column with the same number of
observations as the feature data. We use the .values attribute to convert X and y to NumPy arrays. Printing the shapes of X and y, we see
there are 3333 observations of two features, and 3333 observations of the target variable. We then instantiate our KNeighborsClassifier, setting
n_neighbors equal to 15, and assign it to the variable knn. We can then fit this classifier to our labeled data by calling its .fit
method with two arguments: the feature values, X, and the target values, y.
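A sketch of the data preparation described above; the file name and feature column names are hypothetical, chosen only to illustrate the steps:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

churn_df = pd.read_csv("telecom_churn.csv")                            # hypothetical file name
X = churn_df[["account_length", "customer_service_calls"]].values     # hypothetical feature columns
y = churn_df["churn"].values                                           # target column
print(X.shape, y.shape)                                                # (3333, 2) (3333,)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)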
X_new has shape (3, 2): three new observations, each with the same two features.
predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))
Predictions: [1 0 0]
Measuring model performance
In classification, accuracy is a commonly used metric
Accuracy: 0.8800599700149925
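One common way to compute this accuracy is to hold out a test set and call the classifier's .score method; the split proportions and random seed below are assumptions for illustration:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)   # hold out 30% of the data for testing
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))                         # fraction of correctly classified test samples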
Predicting blood glucose levels
import pandas as pd
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())
The shapes of the target y and the single BMI feature, X_bmi: (752,) (752,)
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape)
(752, 1)
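A sketch of how X_bmi and y might be built and used with a single-feature linear regression; the target column name ("glucose") and the column index of the BMI feature are assumptions for illustration:

from sklearn.linear_model import LinearRegression

X = diabetes_df.drop("glucose", axis=1).values   # all feature columns (assumed target column name)
y = diabetes_df["glucose"].values                # blood glucose target
X_bmi = X[:, 3]                                  # assumed position of the BMI column
X_bmi = X_bmi.reshape(-1, 1)                     # scikit-learn expects a 2D feature array

reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)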
Regression mechanics
y = ax + b
Simple linear regression uses one feature
y = target
x = single feature
a, b = parameters/coefficients of the model - slope, intercept
How do we choose a and b?
Define an error function for any given line, then choose the line that minimizes it (sketched below)
y = a1 x1 + a2 x2 + a3 x3 + ... + an xn + b
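For the one-feature line y = ax + b, the standard error function is the residual sum of squares (RSS): the sum of the squared differences between each observed target value and the value the line predicts. A small sketch of scoring a candidate line this way, assuming a, b, x, and y are already defined as NumPy-compatible values:

import numpy as np

def rss(a, b, x, y):
    # residual sum of squares for the candidate line y_hat = a * x + b
    residuals = y - (a * x + b)
    return np.sum(residuals ** 2)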
[Scatter plots with fitted lines illustrating a high R-squared fit versus a low R-squared fit]
R-squared: 0.356302876407827
RMSE = √MSE
RMSE: 24.028109426907236
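These two values can be reproduced with the regression's .score method (R-squared) and the square root of the mean squared error; the fitted reg and the test split are assumed to exist:

import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = reg.predict(X_test)
r_squared = reg.score(X_test, y_test)                 # coefficient of determination
rmse = np.sqrt(mean_squared_error(y_test, y_pred))    # RMSE is the square root of MSE
print(r_squared, rmse)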
Cross-validation motivation
Model performance is dependent on the way we split up the data
A single split's score may not be representative of the model's ability to generalize to unseen data
Solution: Cross-validation!
10 folds = 10-fold CV
k folds = k-fold CV
print(np.mean(cv_results), np.std(cv_results))
0.7418682216666667 0.023330243960652888
95% confidence interval of the cross-validation scores: array([0.7054865, 0.76874702])
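A sketch of k-fold cross-validation producing scores like those above; the number of folds, the random seed, and the use of quantiles for the confidence interval are assumptions:

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=6, shuffle=True, random_state=42)   # assumed 6 folds and seed
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)          # one R-squared score per fold
print(np.mean(cv_results), np.std(cv_results))
print(np.quantile(cv_results, [0.025, 0.975]))          # 95% confidence interval of the scores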
Why regularize?
Recall: Linear regression minimizes a loss function
It chooses a coefficient, a, for each feature variable, plus the intercept, b
Large coefficients can lead to overfitting, so regularization penalizes large coefficients
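A minimal sketch of regularized regression using ridge, which adds a penalty on large coefficients controlled by alpha; the alpha value and the existing train/test split are assumptions:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)             # larger alpha means a stronger penalty on coefficients
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))   # R-squared on the test set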
Classification metrics
Measuring model performance with accuracy:
Fraction of correctly classified samples
Example: in a fraud-detection dataset where only 1% of transactions are fraudulent, we could build a classifier that predicts NONE of the transactions are fraudulent
It would be 99% accurate!
But terrible at actually predicting fraudulent transactions
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Confusion matrix (rows: actual class, columns: predicted class):
[[1106   11]
 [ 183   34]]
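A sketch of producing the confusion matrix, plus per-class precision and recall, for a fitted classifier; knn and the test split are assumed to exist:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # counts of true/false positives and negatives
print(classification_report(y_test, y_pred))   # precision, recall, and F1 per class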
Logistic regression for binary classification
Logistic regression is used for classification problems
Logistic regression outputs probabilities
Predicted probability of the positive class for one observation: [0.08961376]
ROC AUC: 0.6700964152663693
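A sketch of fitting logistic regression, extracting the predicted probabilities of the positive class, and computing ROC AUC; the train/test split is assumed to exist:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_probs = logreg.predict_proba(X_test)[:, 1]   # probability of the positive class
print(y_pred_probs[:1])                             # e.g. [0.08961376]
print(roc_auc_score(y_test, y_pred_probs))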
Hyperparameter tuning
Ridge/lasso regression: Choosing alpha
KNN: Choosing n_neighbors
We can still split the data and perform cross-validation on the training set
Score of the tuned model: 0.7564731534089224
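A sketch of tuning a hyperparameter with grid search cross-validation on the training set; the parameter grid, model, and fold settings are assumptions for illustration:

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 20)}     # candidate alpha values to try
ridge_cv = GridSearchCV(Ridge(), param_grid, cv=kf)    # exhaustive search over the grid
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_)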
scikit-learn requirements
Numeric data
No missing values
music_df.info()
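Categorical columns must be converted to numbers before modeling. A sketch of doing this with dummy variables for the dataset's genre column; the exact workflow is an assumption:

import pandas as pd

music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)   # one binary column per genre, minus one
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)                  # drop the original categorical column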
Missing data
No value for a feature in a particular row
This can occur because:
There may have been no observation
Missing values per column:
genre                 8
popularity           31
loudness             44
liveness             46
tempo                46
speechiness          59
duration_ms          91
instrumentalness     91
danceability        143
valence             143
acousticness        200
energy              200
dtype: int64
After dropping rows with missing values in the columns that had relatively few missing (genre, popularity, loudness, liveness, tempo):
popularity            0
liveness              0
loudness              0
tempo                 0
genre                 0
duration_ms          29
instrumentalness     29
speechiness          53
danceability        127
valence             127
acousticness        178
energy              178
dtype: int64
For categorical values, we typically use the most frequent value - the mode
Pipeline accuracy: 0.7593582887700535
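A sketch of dropping the rows with few missing values and imputing the rest inside a pipeline, in line with the counts shown above; the imputation strategy, model, and existing train/test split are assumptions:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# drop rows missing values in the columns that had relatively few missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# numeric columns: impute the mean; for categorical columns, strategy="most_frequent" imputes the mode
steps = [("imputer", SimpleImputer(strategy="mean")),
         ("knn", KNeighborsClassifier(n_neighbors=3))]
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))   # e.g. 0.7593582887700535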
Why scale our data?
print(music_df[["duration_ms", "loudness", "speechiness"]].describe())
Mean and standard deviation of the features before scaling: 19801.42536120538, 71343.52910125865
After standardization: 2.260817795600319e-17, 1.0
KNN accuracy with scaled features: 0.81
KNN accuracy with unscaled features: 0.53
Best cross-validation score after tuning n_neighbors in the scaled pipeline: 0.8199999999999999
print(cv.best_params_)
{'knn__n_neighbors': 12}
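A sketch of scaling inside a pipeline and tuning n_neighbors with grid search, the kind of search that produces the best_params_ shown above; the parameter range and existing train/test split are assumptions:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

steps = [("scaler", StandardScaler()),                 # standardize: mean 0, standard deviation 1
         ("knn", KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {"knn__n_neighbors": np.arange(1, 50)}    # step name, double underscore, parameter name
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
print(cv.best_score_)    # e.g. 0.8199999999999999
print(cv.best_params_)   # e.g. {'knn__n_neighbors': 12}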
Different models for different problems
Some guiding principles
Interpretability
Some models are easier to explain, which can be important for stakeholders
Flexibility
May improve accuracy by making fewer assumptions about the data
Performance metrics for comparing models: R-squared (regression); confusion matrix and ROC AUC (classification)
Models covered, such as logistic regression, can be compared on the same data using these metrics (see the sketch below)
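A sketch of comparing several classifiers on the same data with cross-validation; the choice of models, folds, and accuracy as the default classification score are assumptions:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier()}
kf = KFold(n_splits=6, shuffle=True, random_state=42)
for name, model in models.items():
    cv_scores = cross_val_score(model, X_train, y_train, cv=kf)   # accuracy per fold
    print(name, cv_scores.mean())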
What you've covered
Using supervised learning techniques to build predictive models
For both regression and classification problems
Cross-validation
Hyperparameter tuning
Using pipelines