Scikit Learn
Table of Contents
About
Chapter 1: Getting started with scikit-learn
  Remarks
  Examples
    Installation of scikit-learn
    Creating pipelines
    Sample datasets
Chapter 2: Classification
  Examples
    Using Support Vector Machines
    RandomForestClassifier
    GradientBoostingClassifier
    A Decision Tree
    Classification using Logistic Regression
Chapter 3: Dimensionality reduction (Feature selection)
  Examples
    Reducing The Dimension With Principal Component Analysis
Chapter 4: Feature selection
  Examples
    Low-Variance Feature Removal
Chapter 5: Model selection
  Examples
    Cross-validation
    K-Fold
    ShuffleSplit
Chapter 6: Receiver Operating Characteristic (ROC)
  Examples
    Introduction to ROC and AUC
    ROC-AUC score with overriding and cross-validation
Chapter 7: Regression
  Examples
    Ordinary Least Squares
Credits
About
You can share this PDF with anyone you feel could benefit from it; the latest version can be
downloaded from: scikit-learn
It is an unofficial and free scikit-learn ebook created for educational purposes. All the content is
extracted from Stack Overflow Documentation, which is written by many hardworking individuals at
Stack Overflow. It is neither affiliated with Stack Overflow nor official scikit-learn.
The content is released under Creative Commons BY-SA, and the list of contributors to each
chapter is provided in the credits section at the end of this book. Images may be copyright of
their respective owners unless otherwise specified. All trademarks and registered trademarks are
the property of their respective company owners.
Use the content presented in this book at your own risk; it is not guaranteed to be correct or
accurate. Please send your feedback and corrections to [email protected]
Chapter 1: Getting started with scikit-learn
Remarks
scikit-learn is a general-purpose open-source library for data analysis written in Python. It is
based on other Python libraries: NumPy, SciPy, and matplotlib.
Examples
Installation of scikit-learn
For most installations, the pip Python package manager can install scikit-learn and all of its dependencies:
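For example, a typical invocation (assuming pip is already set up for your Python environment) is:

pip install -U scikit-learn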
However, for Linux systems it is recommended to use the conda package manager instead, to avoid
possible build processes:
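For example (assuming a conda environment is available):

conda install scikit-learn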
Canopy and Anaconda both ship a recent version of scikit-learn, in addition to a large set of
scientific Python libraries, for Windows, Mac OSX and Linux.
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
X, y = iris_dataset['data'], iris_dataset['target']
The data is split into train and test sets. To do this we use the train_test_split utility function to split
both X and y (data and target vectors) randomly, with the option train_size=0.75 (so the training set
contains 75% of the data).
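The split itself is not shown in this extract; a minimal sketch (the random_state value is arbitrary) would be:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)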
The training data is then fed into a k-nearest-neighbors classifier. The classifier's fit method will
fit the model to the data.
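A sketch of this step, using the default number of neighbors:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)   # accuracy on the held-out test set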
By using one pair of train and test sets we might get a biased estimate of the quality of the
classifier, due to the arbitrary choice of the data split. By using cross-validation we can fit the
classifier on different train/test subsets of the data and average over all the accuracy results.
The function cross_val_score fits a classifier to the input data using cross-validation. It can take as
input the number of different splits (folds) to be used (5 in the example below).
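A sketch with 5 folds, reusing the k-nearest-neighbors classifier from above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)
print(scores)          # accuracy for each of the 5 folds
print(scores.mean())   # averaged accuracy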
Creating pipelines
Finding patterns in data often proceeds in a chain of data-processing steps, e.g., feature selection,
normalization, and classification. In sklearn, a pipeline of stages is used for this.
For example, the following code shows a pipeline consisting of two stages. The first scales the
features, and the second trains a classifier on the resulting transformed dataset:
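The original code is not included in this extract; a minimal sketch of such a two-stage pipeline (the choice of scaler and classifier here is illustrative) is:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),          # stage 1: scale the features
    ('classifier', LogisticRegression())   # stage 2: train a classifier
])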
Once the pipeline is created, you can use it like a regular stage (depending on its specific steps).
Here, for example, the pipeline behaves like a classifier. Consequently, we can use it as follows:
# fitting a classifier
pipeline.fit(X_train, y_train)
# getting predictions for the new data sample
pipeline.predict_proba(X_test)
Data is stored in numpy arrays (other array-like objects such as pandas DataFrames are also
accepted if they are convertible to numpy arrays).
Each object in the data is described by a set of features. The general convention is that a data
sample is represented as a 2-D array, where the first dimension is the sample index and the second
dimension is the feature index.
import numpy
data = numpy.arange(10).reshape(5, 2)
print(data)
Output:
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]
Sample datasets
For ease of testing, sklearn provides some built-in datasets in sklearn.datasets module. For
example, let's load Fisher's iris dataset:
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
iris_dataset.keys()
['target_names', 'data', 'target', 'DESCR', 'feature_names']
You can read the full description, the names of the features and the names of the classes
(target_names). Those are stored as strings.
We are interested in the data and the classes, which are stored in the data and target fields. By
convention those are denoted as X and y:
X, y = iris_dataset['data'], iris_dataset['target']
X.shape, y.shape
((150, 4), (150,))
numpy.unique(y)
array([0, 1, 2])
The shapes of X and y say that there are 150 samples with 4 features. Each sample belongs to one of
the following classes: 0, 1 or 2.
X and y can now be used in training a classifier, by calling the classifier's fit() method.
The sklearn.datasets module provides several other small built-in ("toy") datasets of the same kind:
These datasets are useful to quickly illustrate the behavior of the various algorithms
implemented in the scikit. They are however often too small to be representative of real
world machine learning tasks.
In addition to these built-in toy datasets, sklearn.datasets also provides utility functions for
loading external datasets:
• load_mlcomp for loading sample datasets from the mlcomp.org repository (note that the
datasets need to be downloaded beforehand).
• fetch_lfw_pairs and fetch_lfw_people for loading the Labeled Faces in the Wild (LFW) pairs
dataset from http://vis-www.cs.umass.edu/lfw/, used for face verification (resp. face
recognition). This dataset is larger than 200 MB.
Chapter 2: Classification
Examples
Using Support Vector Machines
Example:
import numpy as np
from sklearn import svm

y = [0] * 10 + [1] * 10
Note that x is composed of two Gaussians: one centered around (0, 0), and one centered around
(1, 1).
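The construction of x itself is not included in this extract; one construction consistent with the description (two clouds of 10 two-dimensional points each) might be:

x = np.concatenate((np.random.randn(10, 2),            # points centered around (0, 0)
                    np.random.randn(10, 2) + (1, 1)))  # points centered around (1, 1)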
svm.SVC(kernel='linear').fit(x, y)   # classification (Support Vector Classifier)
svm.SVR(kernel='linear').fit(x, y)   # regression (Support Vector Regressor)
RandomForestClassifier
A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control
over-fitting.
Import:
from sklearn.ensemble import RandomForestClassifier

Train:
train = [[1,2,3],[2,5,1],[2,1,7]]
target = [0,1,0]
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train, target)
Predict:
test = [[2, 2, 3]]   # predict expects a 2-D array (one row per sample)
predicted = rf.predict(test)
classification_report builds a text report showing the main classification metrics, including
precision, recall, f1-score (the harmonic mean of precision and recall) and support (the number of
true instances of each class).
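A sketch of such a report for the toy example above (evaluated on the training data, since no separate test set is defined here):

from sklearn.metrics import classification_report

print(classification_report(target, rf.predict(train)))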
GradientBoostingClassifier
Gradient Boosting for classification. The Gradient Boosting Classifier is an additive ensemble of a
base model whose error is corrected in successive iterations (or stages) by the addition of
Regression Trees which correct the residuals (the error of the previous stage).
Import:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset.data, iris_dataset.target
# a train/test split is needed because X_train / y_train are used below
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
>>> gbc.n_estimators
100
This can be controlled by setting n_estimators to a different value at initialization time.
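For instance, to use 200 boosting stages instead:

gbc = GradientBoostingClassifier(n_estimators=200)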
A Decision Tree
A decision tree is a classifier which uses a sequence of explicit rules (such as a > 7) which can be
easily understood.
The example below trains a decision tree classifier using three feature vectors of length 3, and
then predicts the result for a so-far unknown fourth feature vector, the so-called test vector.
from sklearn.tree import DecisionTreeClassifier

# Toy training data: three feature vectors of length 3 and their labels (sample values)
train = [[1, 2, 3], [2, 5, 1], [2, 1, 7]]
target = [0, 1, 0]

# Initialize classifier.
# The random seed is fixed to 0 so that the results are reproducible.
dectree = DecisionTreeClassifier(random_state=0)
dectree.fit(train, target)

# Predict the class of the so-far unknown test vector
test = [[2, 2, 3]]
predicted = dectree.predict(test)
print(predicted)
The fitted tree can be exported in Graphviz format and rendered to an image or PDF (this requires
the pydot package and Graphviz to be installed):

from sklearn import tree
from io import StringIO
import pydot

dotfile = StringIO()
tree.export_graphviz(dectree, out_file=dotfile)
(graph,) = pydot.graph_from_dot_data(dotfile.getvalue())
graph.write_png("dtree.png")
graph.write_pdf("dtree.pdf")
Classification using Logistic Regression

In the Logistic Regression (LR) classifier, the probabilities describing the possible outcomes of a
single trial are modeled using a logistic function. It is implemented in the linear_model module.
The sklearn LR implementation can fit binary, One-vs-Rest, or multinomial logistic regression with
optional L2 or L1 regularization. For example, let us consider a binary classification on a sample
sklearn dataset:
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=1000)
Use train_test_split to divide the input data into training and test sets (70%/30%):
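A sketch of the split (variable names chosen to match the code below):

from sklearn.model_selection import train_test_split

data_train, data_test, labels_train, labels_test = train_test_split(X, y, test_size=0.3, random_state=0)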
Using the LR classifier is similar to the other examples:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Initialize and fit the classifier
LRC = LogisticRegression()
LRC.fit(data_train, labels_train)

# Predict on the held-out test set and inspect the confusion matrix
predicted = LRC.predict(data_test)
confusion_matrix(predicted, labels_test)
Chapter 3: Dimensionality reduction (Feature
selection)
Examples
Reducing The Dimension With Principal Component Analysis
Principal Component Analysis finds sequences of linear combinations of the features. The first
linear combination maximizes the variance of the features (subject to a unit constraint). Each of
the following linear combinations maximizes the variance of the features in the subspace
orthogonal to that spanned by the previous linear combinations.
A common dimension reduction technique is to use only the first k such linear combinations.
Suppose the features are a matrix X of n rows and m columns. The first k linear combinations form
a matrix β_k of m rows and k columns. The product X β_k has n rows and k columns. Thus, the
resulting matrix X β_k can be considered a reduction from m to k dimensions, retaining the
high-variance parts of the original matrix X.
import numpy as np
np.random.seed(123) # we'll set a random seed so that our results are reproducible
X = np.hstack((np.random.randn(100, 2) + (10, 10), 0.001 * np.random.randn(100, 5)))
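The fit itself is not shown in this extract; a sketch that matches the two components inspected below is:

from sklearn import decomposition

pca = decomposition.PCA(n_components=2)
pca.fit(X)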
Now let's check the results. First, here are the linear combinations:
pca.components_
# array([[ -2.84271217e-01, -9.58743893e-01, -8.25412629e-05,
# 1.96237855e-05, -1.25862328e-05, 8.27127496e-05,
# -9.46906600e-05],
# [ -9.58743890e-01, 2.84271223e-01, -7.33055823e-05,
# -1.23188872e-04, -1.82458739e-05, 5.50383246e-05,
# 1.96503690e-05]])
Note how the first two components in each vector are several orders of magnitude larger than the
others, showing that the PCA recognized that the variance is contained mainly in the first two
columns.
To check the ratio of the variance explained by this PCA, we can examine
pca.explained_variance_ratio_:
pca.explained_variance_ratio_
# array([ 0.57039059, 0.42960728])
Chapter 4: Feature selection
Examples
Low-Variance Feature Removal
The underlying idea of this method is that if a feature is constant (i.e. it has 0 variance), then it
cannot be used for finding any interesting patterns and can be removed from the dataset.
Consequently, a heuristic approach to feature elimination is to first remove all features whose
variance is below some (low) threshold.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
There are 3 boolean features here, each with 6 instances. Suppose we wish to remove those that
are constant in at least 80% of the instances. A boolean feature is a Bernoulli variable with variance
p * (1 - p), so such features will have variance lower than 0.8 * (1 - 0.8) = 0.16. Consequently, we can use:
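A sketch of the corresponding VarianceThreshold call:

from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit_transform(X)

This removes the first feature, which is zero in more than 80% of the samples.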
This method should be used with caution, because a low variance doesn't necessarily mean that a
feature is “uninteresting”. Consider the following example, where we construct a dataset that
contains 3 features, the first two consisting of normally distributed variables and the third of
uniformly distributed variables.
# generate dataset
import numpy as np

np.random.seed(0)
feat1 = np.random.normal(loc=0, scale=.1, size=100)   # normal dist. with mean=0 and std=.1
feat2 = np.random.normal(loc=0, scale=10, size=100)   # normal dist. with mean=0 and std=10
feat3 = np.random.uniform(low=0, high=10, size=100)   # uniform dist. in the interval [0,10)
data = np.column_stack((feat1, feat2, feat3))
data[:5]
# Output:
# array([[ 0.17640523, 18.83150697, 9.61936379],
# [ 0.04001572, -13.47759061, 2.92147527],
# [ 0.0978738 , -12.70484998, 2.4082878 ],
# [ 0.22408932, 9.69396708, 1.00293942],
# [ 0.1867558 , -11.73123405, 0.1642963 ]])
np.var(data, axis=0)
# Output: array([ 1.01582662e-02, 1.07053580e+02, 9.07187722e+00])
sel = VarianceThreshold(threshold=0.1)
sel.fit_transform(data)[:5]
# Output:
# array([[ 18.83150697, 9.61936379],
# [-13.47759061, 2.92147527],
# [-12.70484998, 2.4082878 ],
# [ 9.69396708, 1.00293942],
# [-11.73123405, 0.1642963 ]])
Now the first feature has been removed because of its low variance, while the third feature (which
is the least interesting one) has been kept. In this case it would have been more appropriate to
consider a coefficient of variation, because that is independent of scaling.
Chapter 5: Model selection
Examples
Cross-validation
Learning the parameters of a prediction function and testing it on the same data is a
methodological mistake: a model that would just repeat the labels of the samples that it has just
seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This
situation is called overfitting. To avoid it, it is common practice when performing a (supervised)
machine learning experiment to hold out part of the available data as a test set X_test, y_test.
Note that the word “experiment” is not intended to denote academic use only, because even in
commercial settings machine learning usually starts out experimentally.
In scikit-learn a random split into training and test sets can be quickly computed with the
train_test_split helper function. Let’s load the iris data set to fit a linear support vector machine on
it:
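The loading step is not shown in this extract; a minimal sketch:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
# ((150, 4), (150,))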
We can now quickly sample a training set while holding out 40% of the data for testing (evaluating)
our classifier:
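A sketch of the split:

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape
# ((90, 4), (90,))
X_test.shape, y_test.shape
# ((60, 4), (60,))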
Now, after we have train and test sets, let's use them:
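For example, training a linear SVM on the training part and scoring it on the held-out part:

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)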
K-fold cross-validation is a systematic process for repeating the train/test split procedure multiple
times, in order to reduce the variance associated with a single trial of a train/test split. You
essentially split the entire dataset into K equal-size "folds", and each fold is used once for testing
the model and K-1 times for training the model.
Multiple folding techniques are available in the scikit-learn library. Their usage depends on the
characteristics of the input data. Some examples are:
K-Fold
You essentially split the entire dataset into K equal size "folds", and each fold is used once for
testing the model and K-1 times for training the model.
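The code that produced the split indices below is not included in this extract; one reconstruction that yields exactly this output uses four samples with n_splits=3:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
kf = KFold(n_splits=3)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)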
TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
StratifiedKFold is a variation of K-fold which returns stratified folds: each set contains
approximately the same percentage of samples of each target class as the complete set.
ShuffleSplit
Used to generate a user-defined number of independent train/test dataset splits. Samples are
first shuffled and then split into a pair of train and test sets.
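A sketch, reusing X from the K-Fold example above:

from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)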
StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e. it creates
splits that preserve the same percentage for each target class as in the complete set.
Other folding techniques, such as Leave-One-Out, Leave-P-Out and TimeSeriesSplit (a variation of
K-fold), are available in the sklearn.model_selection module.
Chapter 6: Receiver Operating Characteristic
(ROC)
Examples
Introduction to ROC and AUC
Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality.
ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis.
This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and
a true positive rate of one. This is not very realistic, but it does mean that a larger area under the
curve (AUC) is usually better.
The “steepness” of ROC curves is also important, since it is ideal to maximize the true positive
rate while minimizing the false positive rate.
A simple example:
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
Arbitrary y values; in a real case these would be the predicted target values (model.predict(x_test)):
y = np.array([1,1,2,2,3,3,4,4,2,3])
scores contains a score for each sample; in a real case these would come from the classifier, e.g. from model.predict_proba(X) or model.decision_function(X):
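The scores themselves are not included in this extract; a sketch with hypothetical per-sample scores, treating class 2 as the positive label (the multi-class labels above require choosing one positive class):

# hypothetical scores, for illustration only
scores = np.array([0.3, 0.4, 0.95, 0.78, 0.8, 0.64, 0.86, 0.81, 0.9, 0.8])

fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
roc_auc = metrics.auc(fpr, tpr)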
Plotting:
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
Output: a plot of the ROC curve together with the diagonal chance line, with the AUC value shown in the legend.
ROC-AUC score with overriding and cross-validation

One needs the predicted probabilities in order to calculate the ROC-AUC (area under the curve)
score. cross_val_predict uses the predict method of the classifier. In order to be able to get the
ROC-AUC score, one can simply subclass the classifier, overriding the predict method, so that it
acts like predict_proba.
from sklearn.linear_model import LogisticRegression

class LogisticRegressionWrapper(LogisticRegression):
    def predict(self, X):
        return super(LogisticRegressionWrapper, self).predict_proba(X)
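A sketch of how the wrapper might then be used (the dataset here is illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
proba = cross_val_predict(LogisticRegressionWrapper(), X, y, cv=5)
roc_auc_score(y, proba[:, 1])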
Chapter 7: Regression
Examples
Ordinary Least Squares
Ordinary Least Squares is a method for finding the linear combination of features that best fits the
observed outcome in the following sense.
If the vector of outcomes to be predicted is y, and the explanatory variables form the matrix X,
then OLS will find the vector β solving
min_β ||Xβ - y||₂² ,
where ŷ = Xβ is the vector of predicted outcomes.
Application Context
OLS should only be applied to regression problems; it is generally unsuitable for classification
problems.
Example
Let's generate a linear model with some noise, then see if LinearRegression manages to
reconstruct the linear model.
import numpy as np
X = np.random.randn(100, 3)
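The remainder of the example is missing from this extract; a reconstruction consistent with the fitted coefficients shown below (a model that uses only the first two features, plus a little noise) might be:

from sklearn.linear_model import LinearRegression

y = X[:, 0] + X[:, 1] + 0.01 * np.random.randn(100)

model = LinearRegression().fit(X, y)
model.coef_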
array([ 9.97768469e-01, 9.98237634e-01, 7.55016533e-04])
The first two estimated coefficients are close to 1 and the third is close to 0, consistent with a
model that depends only on the first two features.
Credits

Chapter 1, Getting started with scikit-learn: Alleo, Ami Tavory, Community, Gabe, Gal Dreiman, panty, Sean Easter, user2314737

Chapter 3, Dimensionality reduction (Feature selection): Ami Tavory, DataSwede, Gal Dreiman, Sean Easter, user2314737

Chapter 6, Receiver Operating Characteristic (ROC): Gal Dreiman, Gorkem Ozkaya