|| Jai Sri Gurudev ||
Sri Adichunchanagiri Shikshana Trust ®
SJB Institute of Technology
Autonomous Institute under Visvesvaraya Technological University (VTU)
Approved by AICTE, New Delhi, Recognized by UGC, New Delhi with 2 (f) & 12 (B)
Accredited by NAAC with ‘A+’ Grade, Accredited by National Board of Accreditation
No. 67, BGS Health & Education City, Dr. Vishnuvardhan Road, Kengeri, Bengaluru-560060.
Department of Electronics & Communication Engineering
Machine Learning with Python [21EC744]
MODULE – 1
Introduction and Training Machine Learning Algorithms for
Classification
Notes (as per VTU Syllabus)
VII SEMESTER – B.E.
Academic Year: 2024 – 25 (ODD)
Course Coordinator: Dr. Supreeth H S G
Designation: Associate Professor
Dept: Electronics & Communication Engineering
Syllabus
[As per Choice Based Credit System (CBCS) scheme]
SEMESTER – VII
Subject Code: 21EC744                 IA Marks: 50
Number of Lecture Hours/Week: 3       Exam Marks: 50
Total Number of Lecture Hours: 40     Exam Hours: 03
Credits: 03
Introduction: Introduction to Machine Learning, Building intelligent machines to transform
data into knowledge, The three different types of machine learning, An introduction to the
basic terminology and notations, A roadmap for building machine learning systems, Using
Python for machine learning.
Training Machine Learning Algorithms for Classification: Artificial neurons – a brief glimpse
into the early history of machine learning, Implementing a perceptron learning algorithm in
Python, Adaptive linear neurons and the convergence of learning.
Textbook 1: Chapters 1, 2
Textbook:
Python Machine Learning by Sebastian Raschka, Published by Packt Publishing Ltd.
Reference Books:
1. Machine Learning with Python for Everyone by Mark E Fenner
2. Machine Learning using Python by Manaranjan Pradhan & U Dinesh Kumar
3. Practical Machine Learning with Python by Dipanjan Sarkar, Raghav Bali & Tushar Sharma
INDEX SHEET

VTU Syllabus

MODULE – 1: Introduction and Training Machine Learning Algorithms for Classification
1.1 Introduction to Machine Learning
1.2 Building intelligent machines to transform data into knowledge
1.3 Introduction to the basic terminology and notations
1.4 A roadmap for building machine learning systems
1.5 Using Python for machine learning
1.6 Artificial neurons
1.7 The perceptron learning rule
1.8 Adaptive linear neurons and the convergence of learning
1.1 Introduction to Machine Learning
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable machines to imitate the way that humans
learn, gradually improving their accuracy.
1.2 Building intelligent machines to transform data into knowledge
In this age of modern technology, there is one resource that we have in abundance: a large
amount of structured and unstructured data. In the second half of the twentieth century,
machine learning evolved as a subfield of Artificial Intelligence (AI) that involved self-
learning algorithms that derived knowledge from data in order to make predictions. Instead of
requiring humans to manually derive rules and build models from analyzing large amounts of
data, machine learning offers a more efficient alternative for capturing the knowledge in data
to gradually improve the performance of predictive models and make data-driven decisions.
Not only is machine learning becoming increasingly important in computer science research,
but it also plays an ever-greater role in our everyday lives. Thanks to machine learning, we
enjoy robust email spam filters, convenient text and voice recognition software, reliable web
search engines, challenging chess-playing programs, and, hopefully soon, safe and efficient
self-driving cars.
The three different types of machine learning are supervised learning, unsupervised
learning, and reinforcement learning.
1.2.1 Supervised Learning:
1. Making predictions about the future with supervised learning
The main goal in supervised learning is to learn a model from labeled training data
that allows us to make predictions about unseen or future data. Here, the term
supervised refers to a set of samples where the desired output signals (labels) are
already known.
Considering the example of email spam filtering, we can train a model using a supervised
machine learning algorithm on a corpus of labeled emails, emails that are correctly marked as
spam or not-spam, to predict whether a new email belongs to either of the two categories. A
supervised learning task with discrete class labels, such as in the previous email spam
filtering example, is also called a classification task. Another subcategory of supervised
learning is regression, where the outcome signal is a continuous value:
2. Classification for predicting class labels
Classification is a subcategory of supervised learning where the goal is to predict the
categorical class labels of new instances, based on past observations. Those class
labels are discrete, unordered values that can be understood as the group memberships
of the instances. The previously mentioned example of email spam detection
represents a typical example of a binary classification task, where the machine
learning algorithm learns a set of rules in order to distinguish between two possible
classes: spam and non-spam emails.
However, the set of class labels does not have to be of a binary nature. The predictive
model learned by a supervised learning algorithm can assign any class label that was
presented in the training dataset to a new, unlabeled instance. A typical example of a
multiclass classification task is handwritten character recognition.
The following figure illustrates the concept of a binary classification task given 30
training samples; 15 training samples are labeled as negative class (minus signs) and
15 training samples are labeled as positive class (plus signs). In this scenario, our
dataset is two-dimensional, which means that each sample has two values associated
with it: x1 and x2. Now, we can use a supervised machine learning algorithm to learn
a rule—the decision boundary represented as a dashed line—that can separate those
two classes and classify new data into each of those two categories given its x1 and x2
values:
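As a minimal sketch of such a binary classification task, assuming scikit-learn is available
(the synthetic dataset and parameter choices below are illustrative assumptions, standing in
for the 30 samples in the figure):

from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

# 30 samples with two features (x1, x2), split into two classes
X, y = make_blobs(n_samples=30, centers=2, n_features=2, random_state=1)

clf = Perceptron(random_state=1)
clf.fit(X, y)

# the learned rule: w1*x1 + w2*x2 + b = 0 is the linear decision boundary
print(clf.coef_, clf.intercept_)
print(clf.predict([[0.0, 0.0]]))  # classify a new (x1, x2) point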
3. Regression for predicting continuous outcomes
We learned in the previous section that the task of classification is to assign
categorical, unordered labels to instances. A second type of supervised learning is the
prediction of continuous outcomes, which is also called regression analysis. In
regression analysis, we are given a number of predictor (explanatory) variables and a
continuous response variable (outcome or target), and we try to find a relationship
between those variables that allows us to predict an outcome.
For example, let's assume that we are interested in predicting the math SAT scores of
our students. If there is a relationship between the time spent studying for the test and
the final scores, we could use it as training data to learn a model that uses the study
time to predict the test scores of future students who are planning to take this test.
The following figure illustrates the concept of linear regression. Given a predictor
variable x and a response variable y, we fit a straight line to this data that minimizes
the distance—most commonly the average squared distance—between the sample
points and the fitted line. We can now use the intercept and slope learned from this
data to predict the outcome variable of new data:
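As a minimal sketch of this idea, assuming scikit-learn (the study-time and score values
below are made-up illustrative numbers, not real student data):

import numpy as np
from sklearn.linear_model import LinearRegression

study_hours = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])  # predictor x
sat_scores = np.array([450.0, 520.0, 590.0, 650.0, 700.0])    # response y

model = LinearRegression()           # least-squares fit of a straight line
model.fit(study_hours, sat_scores)

print(model.intercept_, model.coef_[0])  # learned intercept and slope
print(model.predict([[7.0]]))            # predicted score for 7 study hours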
1.2.2 Reinforcement Learning:
In reinforcement learning, the goal is to develop a system (agent) that improves its
performance based on interactions with the environment. Since the information about the
current state of the environment typically also includes a so-called reward signal, we can
think of reinforcement learning as a field related to supervised learning. However, in
reinforcement learning this feedback is not the correct ground truth label or value, but a
measure of how well the action was, as measured by a reward function. Through its interaction
with the environment, an agent can then use reinforcement learning to learn a series of
actions that maximizes this reward via an exploratory trial-and-error approach or deliberative
planning.
A popular example of reinforcement learning is a chess engine. Here, the agent decides upon
a series of moves depending on the state of the board (the environment), and the reward can
be defined as win or lose at the end of the game:
There are many different subtypes of reinforcement learning. However, a general scheme is
that the agent in reinforcement learning tries to maximize the reward by a series of
interactions with the environment. Each state can be associated with a positive or negative
reward, and a reward can be defined as accomplishing an overall goal, such as winning or
losing a game of chess. For instance, in chess the outcome of each move can be thought of as
a different state of the environment.
1.2.3 Unsupervised Learning
In supervised learning, we know the right answer beforehand when we train our model, and
in reinforcement learning, we define a measure of reward for particular actions by the agent.
In unsupervised learning, however, we are dealing with unlabeled data or data of unknown
structure. Using unsupervised learning techniques, we are able to explore the structure of our
data to extract meaningful information without the guidance of a known outcome variable or
reward function.
1. Finding subgroups with clustering
Clustering is an exploratory data analysis technique that allows us to organize a pile
of information into meaningful subgroups (clusters) without having any prior
knowledge of their group memberships. Each cluster that arises during the analysis
defines a group of objects that share a certain degree of similarity but are more
dissimilar to objects in other clusters, which is why clustering is also sometimes
called unsupervised classification. Clustering is a great technique for structuring
information and deriving meaningful relationships from data. For example, it allows
marketers to discover customer groups based on their interests, in order to develop
distinct marketing programs.
The following figure illustrates how clustering can be applied to organizing unlabeled
data into three distinct groups based on the similarity of their features x1 and x2 :
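A minimal sketch of this idea, assuming scikit-learn and using k-means (one of several
clustering algorithms) on synthetic unlabeled data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)      # cluster index (0, 1, or 2) per sample

print(labels[:10])              # group memberships found without any labels
print(km.cluster_centers_)      # coordinates of the three cluster centers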
2. Dimensionality reduction for data compression
Another subfield of unsupervised learning is dimensionality reduction. Often we are
working with data of high dimensionality—each observation comes with a high
number of measurements—that can present a challenge for limited storage space and
the computational performance of machine learning algorithms. Unsupervised
dimensionality reduction is a commonly used approach in feature preprocessing to
remove noise from data, which can also degrade the predictive performance of certain
algorithms, and to compress the data onto a smaller dimensional subspace while retaining
most of the relevant information.
Sometimes, dimensionality reduction can also be useful for visualizing data. The
following figure shows an example where nonlinear dimensionality reduction was
applied to compress a 3D Swiss Roll onto a new 2D feature subspace:
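As a sketch, assuming scikit-learn: locally linear embedding is one of several nonlinear
manifold-learning methods that can perform such a 3D-to-2D compression:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)    # 3D points
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_2d = lle.fit_transform(X)                               # new 2D subspace

print(X.shape, X_2d.shape)   # (1000, 3) -> (1000, 2)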
1.3 Introduction to the basic terminology and notations
The following table depicts an excerpt of the Iris dataset, which is a classic example in the
field of machine learning. The Iris dataset contains the measurements of 150 Iris flowers
from three different species—Setosa, Versicolor, and Virginica. Here, each flower sample
represents one row in our dataset, and the flower measurements in centimeters are stored as
columns, which we also call the features of the dataset:
To keep the notation and implementation simple yet efficient, we will make use of some of
the basics of linear algebra. In the following chapters, we will use a matrix and vector
notation to refer to our data. We will follow the common convention to represent each sample
as a separate row in a feature matrix X, where each feature is stored as a separate column.
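For instance, the 150 x 4 Iris feature matrix can be inspected directly in code (a sketch
assuming scikit-learn's bundled copy of the dataset):

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data     # feature matrix: 150 rows (samples) x 4 columns (features)
y = iris.target   # class label vector: 0=Setosa, 1=Versicolor, 2=Virginica

print(X.shape)              # (150, 4)
print(iris.feature_names)   # the four flower measurements in centimeters
print(X[0])                 # the first sample as a row of the matrix X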
1.4 A roadmap for building machine learning systems
The following diagram shows a typical workflow for using machine learning in predictive
modeling
1. Preprocessing – getting data into shape
Let's begin with discussing the roadmap for building machine learning systems. Raw
data rarely comes in the form and shape that is necessary for the optimal performance
of a learning algorithm. Thus, the preprocessing of the data is one of the most crucial
steps in any machine learning application. If we take the Iris flower dataset from the
previous section as an example, we can think of the raw data as a series of flower
images from which we want to extract meaningful features. Useful features could be
the color, the hue, the intensity of the flowers, the height, and the flower lengths and
widths. Many machine learning algorithms also require that the selected features are
on the same scale for optimal performance, which is often achieved by transforming
the features to the range [0, 1] or to a standard normal distribution with zero mean and
unit variance.
Some of the selected features may be highly correlated and therefore redundant to a
certain degree. In those cases, dimensionality reduction techniques are useful for
compressing the features onto a lower dimensional subspace. Reducing the
dimensionality of our feature space has the advantage that less storage space
is required, and the learning algorithm can run much faster.
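Both transformations mentioned above can be sketched as follows, assuming scikit-learn
(the tiny feature matrix is an illustrative assumption):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each feature scaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~0 and ~1 per feature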
2. Training and selecting a predictive model
Many different machine learning algorithms have been developed to solve different
problem tasks. An important point that can be summarized from David Wolpert's
famous No free lunch theorems is that we can't get learning "for free" (The Lack of A
Priori Distinctions Between Learning Algorithms, D. H. Wolpert, 1996).
For example, each classification algorithm has its inherent biases, and no single
classification model enjoys superiority if we don't make any assumptions about the
task. In practice, it is therefore essential to compare at least a handful of different
algorithms in order to train and select the best performing model. But before we can
compare different models, we first have to decide upon a metric to measure
performance. One commonly used metric is classification accuracy, which is defined
as the proportion of correctly classified instances.
One legitimate question to ask is this: how do we know which model performs well
on the final test dataset and real-world data if we don't use this test set for the model
selection, but keep it for the final model evaluation? In order to address the issue
embedded in this question, different cross-validation techniques can be used where
the training dataset is further divided into training and validation subsets in order to
estimate the generalization performance of the model. Finally, we also cannot expect
that the default parameters of the different learning algorithms provided by software
libraries are optimal for our specific problem task. Therefore, we will make frequent
use of hyperparameter optimization techniques that help us to fine-tune the
performance of our model.
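A minimal sketch of this model-selection step, assuming scikit-learn (the choice of the two
candidate classifiers is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# compare a handful of candidate models by cross-validated accuracy;
# the final test set (held out elsewhere) is never touched here
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
    print(type(model).__name__, scores.mean())    # mean classification accuracy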
3. Evaluating models and predicting unseen data instances
After we have selected a model that has been fitted on the training dataset, we can use
the test dataset to estimate how well it performs on this unseen data, which gives us an
estimate of the generalization error. If we are satisfied with its performance, we can now
use this model to predict new, future data.
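A sketch of this final step, assuming scikit-learn, where a held-out test set is used
exactly once for the final accuracy estimate:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# generalization estimate: accuracy on data the model has never seen
print(model.score(X_test, y_test))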
1.5 Using Python for machine learning
Python is one of the most popular programming languages for data science and therefore
enjoys a large number of useful add-on libraries developed by its great developer and
open-source community.
Although the performance of interpreted languages, such as Python, for computation-
intensive tasks is inferior to lower-level programming languages, extension libraries such as
NumPy and SciPy have been developed that build upon lower-layer Fortran and C
implementations for fast and vectorized operations on multidimensional arrays.
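As a small sketch of why this matters, a vectorized NumPy operation replaces an explicit
Python loop over array elements:

import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# explicit Python loop: interpreted element by element (slow)
total = 0.0
for value in x:
    total += value * 2.0

# vectorized equivalent: executed in compiled C/Fortran code (fast)
total_vec = (x * 2.0).sum()

print(np.isclose(total, total_vec))   # same result, very different speed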
For machine learning programming tasks, we will mostly refer to the scikit-learn library,
which is currently one of the most popular and accessible open source machine learning
libraries.
Training Simple Machine Learning Algorithms for Classification
1.6 Artificial neurons
In order to design AI, Warren McCulloch and Walter Pitts published the first concept of a
simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, in 1943. Neurons are
interconnected nerve cells in the brain that are involved in the processing and transmitting of
chemical and electrical signals, which is illustrated in the following figure:
McCulloch and Pitts described such a nerve cell as a simple logic gate with binary outputs;
multiple signals arrive at the dendrites, are then integrated into the cell body, and, if the
accumulated signal exceeds a certain threshold, an output signal is generated that will be
passed on by the axon.
Only a few years later, Frank Rosenblatt published the first concept of the perceptron
learning rule based on the MCP neuron model (The Perceptron: A Perceiving and
Recognizing Automaton, F. Rosenblatt, Cornell Aeronautical Laboratory, 1957). With his
perceptron rule, Rosenblatt proposed an algorithm that would automatically learn the optimal
weight coefficients that are then multiplied with the input features in order to make the
decision of whether a neuron fires or not. In the context of supervised learning and
classification, such an algorithm could then be used to predict if a sample belongs to one
class or the other.
The formal definition of an artificial neuron

More formally, we can put the idea behind artificial neurons into the context of a binary
classification task with two classes, 1 (positive class) and -1 (negative class). We define
a decision function φ(z) that takes a linear combination of input values x and a
corresponding weight vector w, where z is the so-called net input:

z = w1x1 + w2x2 + ... + wmxm

If the net input of a particular sample x(i) is greater than or equal to a defined
threshold θ, we predict class 1, and class -1 otherwise:

φ(z) = 1 if z ≥ θ, and -1 otherwise

For simplicity, the threshold θ can be brought to the left side of the equation and defined
as a bias weight w0 = -θ with x0 = 1, so that the net input becomes
z = w0x0 + w1x1 + ... + wmxm = wᵀx, and the decision function becomes
φ(z) = 1 if z ≥ 0, and -1 otherwise.
1.7 The perceptron learning rule
The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to
use a reductionist approach to mimic how a single neuron in the brain works: it either fires or
it doesn't. Thus, Rosenblatt's initial perceptron rule is fairly simple and can be summarized by
the following steps:
1. Initialize the weights to 0 or small random numbers.
2. For each training sample x(i):
   a. Compute the output value ŷ (the class label predicted by the unit step function).
   b. Update the weights.

Each weight wj in the weight vector w is updated simultaneously as

wj := wj + Δwj, where Δwj = η(y(i) − ŷ(i))xj(i)

Here, η is the learning rate (typically a constant between 0.0 and 1.0), y(i) is the true
class label of the ith training sample, and ŷ(i) is the predicted class label. If the
prediction is correct, the weights stay unchanged; if it is wrong, the update pushes the
weights towards the positive or negative target class. Note that the convergence of the
perceptron is only guaranteed if the two classes are linearly separable and the learning
rate is sufficiently small.
The diagram illustrates how the perceptron receives the inputs of a sample x and combines
them with the weights w to compute the net input. The net input is then passed on to the
threshold function, which generates a binary output of -1 or +1, the predicted class label of the
sample. During the learning phase, this output is used to calculate the error of the prediction
and update the weights.
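Putting the rule and the diagram together, the following is a sketch of a perceptron
implementation in Python along the lines of the textbook's approach (NumPy assumed; the
parameter names eta and n_iter follow the textbook's convention):

import numpy as np

class Perceptron:
    """Perceptron classifier.

    eta    : learning rate (between 0.0 and 1.0)
    n_iter : number of passes (epochs) over the training dataset
    """
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        """Fit to training data X (n_samples, n_features), labels y in {-1, 1}."""
        rgen = np.random.RandomState(self.random_state)
        # small random initial weights; w_[0] is the bias unit
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                # perceptron rule: delta_w = eta * (y - y_hat) * x
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)   # misclassifications per epoch
        return self

    def net_input(self, X):
        """Calculate the net input z = w^T x plus bias."""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        """Return the class label (+1 or -1) after the unit step function."""
        return np.where(self.net_input(X) >= 0.0, 1, -1)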
1.8 Adaptive linear neurons and the convergence of learning
The Adaline algorithm is particularly interesting because it illustrates the key concepts of
defining and minimizing continuous cost functions. This lays the groundwork for
understanding more advanced machine learning algorithms for classification, such as logistic
regression, support vector machines, and regression models.
The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and
Rosenblatt's perceptron is that the weights are updated based on a linear activation function
rather than a unit step function like in the perceptron. In Adaline, this linear activation
function φ(z) is simply the identity function of the net input, so that:
φ(wᵀx) = wᵀx
The illustration shows that the Adaline algorithm compares the true class labels with the
linear activation function's continuous valued output to compute the model error and update
the weights. In contrast, the perceptron compares the true class labels to the predicted class
labels.
Minimizing cost functions with gradient descent
In the case of Adaline, we can define the cost function J to learn the weights as the Sum of
Squared Errors (SSE) between the calculated outcome and the true class label:
J(w) = ½ Σi (y(i) − φ(z(i)))²

The term ½ is just added for convenience, to make it easier to derive the gradient. Since
this cost function is differentiable and convex, we can use gradient descent to find the
weights that minimize it, taking a step in the opposite direction of the gradient ∇J(w) in
each iteration:

w := w + Δw, where Δw = −η∇J(w)
The following figure illustrates what might happen if we change the value of a particular
weight parameter to minimize the cost function J. The left subfigure illustrates the case of a
well-chosen learning rate, where the cost decreases gradually, moving in the direction of the
global minimum. The subfigure on the right, however, illustrates what happens if we choose
a learning rate that is too large—we overshoot the global minimum:
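The following sketch, along the lines of the textbook's Adaline implementation (NumPy
assumed), performs exactly this batch gradient descent on the SSE cost; eta is the learning
rate discussed above:

import numpy as np

class AdalineGD:
    """ADAptive LInear NEuron trained with batch gradient descent."""
    def __init__(self, eta=0.01, n_iter=50, random_state=1):
        self.eta = eta
        self.n_iter = n_iter
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        self.cost_ = []
        for _ in range(self.n_iter):
            output = self.activation(self.net_input(X))  # phi(z) = z
            errors = y - output
            # batch update: gradient of the SSE cost over the whole dataset
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            self.cost_.append((errors ** 2).sum() / 2.0)  # SSE cost J(w)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        return X   # linear (identity) activation

    def predict(self, X):
        return np.where(self.activation(self.net_input(X)) >= 0.0, 1, -1)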
Improving gradient descent through feature scaling
Gradient descent is one of the many algorithms that benefit from feature scaling; here we
use a scaling method called standardization, which gives our data the properties of a
standard normal distribution and helps gradient descent learning to converge more
quickly. Standardization shifts the mean of
each feature so that it is centered at zero and each feature has a standard deviation of 1. For
instance, to standardize the jth feature, we can simply subtract the sample mean 𝜇𝑗 from
every training sample and divide it by its standard deviation 𝜎𝑗 :
x′j = (xj − μj) / σj
One of the reasons why standardization helps with gradient descent learning is that the
optimizer has to go through fewer steps to find a good or optimal solution (the global cost
minimum), as illustrated in the following figure, where the subfigures represent the cost
surface as a function of two model weights in a two-dimensional classification problem:
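In NumPy, the formula above vectorizes to a single line per dataset (a sketch; the small
matrix X is an illustrative assumption, with samples as rows and features as columns):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# standardize every feature (column): subtract mu_j, divide by sigma_j
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0), X_std.std(axis=0))   # ~0 and ~1 per feature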
Large-scale machine learning and stochastic gradient descent
So far, we have minimized a cost function by taking a step in the opposite direction of a cost gradient that is
calculated from the whole training set; this is why this approach is sometimes also referred to
as batch gradient descent. Now imagine we have a very large dataset with millions of data
points, which is not uncommon in many machine learning applications. Running batch
gradient descent can be computationally quite costly in such scenarios since we need to
reevaluate the whole training dataset each time we take one step towards the global
minimum.
A popular alternative to the batch gradient descent algorithm is stochastic gradient descent,
sometimes also called iterative or online gradient descent. Instead of updating the weights
based on the sum of the accumulated errors over all samples x(i), we update the weights
incrementally for each training sample:

Δw = η(y(i) − φ(z(i)))x(i)
Although stochastic gradient descent can be considered as an approximation of gradient
descent, it typically reaches convergence much faster because of the more frequent weight
updates. Since each gradient is calculated based on a single training example, the error
surface is noisier than in gradient descent, which can also have the advantage that stochastic
gradient descent can escape shallow local minima more readily if we are working with
nonlinear cost functions.
Another advantage of stochastic gradient descent is that we can use it for online learning. In
online learning, our model is trained on the fly as new training data arrives. This is especially
useful if we are accumulating large amounts of data, for example, customer data in web
applications. Using online learning, the system can immediately adapt to changes and the
training data can be discarded after updating the model if storage space is an issue.
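A sketch of the per-sample update at the heart of stochastic gradient descent (NumPy
assumed; the Adaline-style linear activation follows the previous section, and the function
name sgd_epoch is a hypothetical helper for illustration):

import numpy as np

def sgd_epoch(w, X, y, eta=0.01, seed=None):
    """One pass of Adaline-style stochastic gradient descent: the weights
    are updated after every single training sample."""
    rgen = np.random.RandomState(seed)
    for i in rgen.permutation(len(y)):           # shuffle to avoid cycles
        xi, target = X[i], y[i]
        error = target - (np.dot(xi, w[1:]) + w[0])  # phi(z) = z
        w[1:] += eta * xi * error                # incremental weight update
        w[0] += eta * error
    return w

Because each update depends only on a single sample, the same update can be applied to
each new sample as it arrives, which is exactly the online learning scenario described
above; the sample can then be discarded after the update.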