
SPCA 201N

MASTER OF
COMPUTER APPLICATIONS

SECOND YEAR
THIRD SEMESTER

PAPER - XI

MACHINE LEARNING

INSTITUTE OF DISTANCE EDUCATION


UNIVERSITY OF MADRAS

WELCOME
Warm Greetings.

It is with great pleasure that we welcome you as a student of the Institute of Distance Education, University of Madras. It is a proud moment for the Institute of Distance Education as you enter the cafeteria system of learning envisaged by the University Grants Commission. We have framed and introduced the Choice Based Credit System (CBCS) in the semester pattern, as per AICTE norms, from the academic year 2020-21. You are free to choose courses, as per the Regulations, to attain the total number of credits set for each course and for each degree programme. What is a credit? To earn one credit in a semester you have to spend 30 hours in the learning process. Each course has a weightage in terms of credits, assigned by taking into account the level of its subject content. For instance, if a particular course or paper carries 4 credits, then you have to spend 120 hours of self-learning in a semester. You are advised to plan your strategy for devoting hours of self-study to the learning process. You will be assessed periodically by means of tests, assignments and quizzes, either in the classroom, the laboratory or field work. For PG (UG) programmes, Continuous Internal Assessment carries 20 (25) per cent and the End Semester University Examination 80 (75) per cent of the maximum score for a course/paper. The theory paper in the end semester examination will bring out your various skills, namely basic knowledge of the subject, memory recall, application, analysis, comprehension and descriptive writing. While training you to conduct experiments, analyse performance during laboratory work and observe outcomes that bring out the truth of the experiment, we keep these skills in mind, and we measure them in the end semester examination. You will be guided by well experienced faculty.

I invite you to join the CBCS in the Semester System to gain rich knowledge at your own will and wish. Choose the right courses at the right times so as to raise your flag of success. We always encourage and enlighten you to excel and empower. We are the cross bearers who will make you a torch bearer with a bright future.

With best wishes from mind and heart,

DIRECTOR


COURSE WRITER

Ms. D. Renuka Devi
Assistant Professor
Department of Computer Science
Stella Maris College (Autonomous)
Chennai.

EDITING AND CO-ORDINATION

Dr. S. Sasikala
Associate Professor in Computer Science
Institute of Distance Education
University of Madras
Chepauk, Chennai - 600 005.

© UNIVERSITY OF MADRAS, CHENNAI 600 005.

MASTER OF COMPUTER APPLICATIONS

SECOND YEAR

THIRD SEMESTER

PAPER - XI

MACHINE LEARNING

SYLLABUS

Objective of the course : To introduce the basic concepts and techniques of Machine
Learning and to develop skills of using recent machine learning software for solving practical
problems.

Course Outcomes: After successful completion of this course, the students should be able to recognize the characteristics of machine learning that make it useful for real-world problems and to understand the foundations of generative models.

Unit 1: The Fundamentals of Machine Learning: The Machine Learning Landscape - Types
of Machine Learning Systems - Main Challenges of Machine Learning - Testing and
Validating. End-to-End Machine Learning Project - Look at the Big Picture - Get the Data -
Discover and Visualize the Data to Gain Insights - Prepare the Data for Machine Learning
Algorithms - Select and Train a Model - Fine-Tune Your Model - Launch, Monitor, and Maintain
Your System.

Unit 2: Ingredients of machine learning: Tasks – Models – Features. Supervised Learning: Classification – Binary classification and related tasks – Scoring and ranking – class probability estimation – Multi-class classification. Unsupervised Learning: Regression – Unsupervised and descriptive learning. Concept Learning: The hypothesis space – paths through the hypothesis space – beyond conjunctive concepts – learnability.

Unit 3: Tree Models: Decision trees – Ranking and probability estimation trees – tree
learning as variance reduction. Rule Models: Learning ordered rule lists – learning unordered
rule sets – descriptive rule learning – first–order rule learning. Linear Models: The least-
squares method – The perceptron – Support vector machines.

Unit 4: Distance-based Models: Neighbours and exemplars – Nearest-neighbour
classification – Distance-based clustering – K-Means algorithm – Hierarchical clustering.
Probabilistic Models: The normal distribution and its geometric interpretations – probabilistic
models for categorical data – Naïve Bayes model for classification – probabilistic models
with hidden values – Expectation-Maximization.

Unit 5: Features: Kinds of features – Feature transformations – Feature construction and selection. Model ensembles: Bagging and random forests – Boosting – Mapping the ensemble landscape. Machine Learning experiments: What to measure – How to measure it – How to interpret it.

Text Books:

1. Flach, P, “Machine Learning: The Art and Science of Algorithms that Make Sense of
Data”, Cambridge University Press, 2012

2. Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems”, First Edition, 2017
(Chapters 1 and 2)

References

1. John D. Kelleher, Brian Mac Namee, Aoife D’Arcy, “Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies”, The MIT Press, First Edition, 2012

2. Kevin P. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press, 2012

3. Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, Third Edition, 2014

4. Tom Mitchell, “Machine Learning”, McGraw-Hill, 1997

5. Stephen Marsland, “Machine Learning - An Algorithmic Perspective”, Chapman and Hall/CRC Press, Second Edition, 2014.

MASTER OF COMPUTER APPLICATIONS

SECOND YEAR

THIRD SEMESTER

PAPER - XI

MACHINE LEARNING

SCHEME OF LESSONS

Sl.No. Title Page

1 The Machine Learning Landscape 001

2 End-to-End Machine Learning Project 019

3 The ingredients of Machine Learning 037

4 Binary classification and related tasks 056

5 Concept Learning 080

6 Tree models 088

7 Rule Models 102

8 Linear Models 110

9 Distance-based models 128

10 Probabilistic models 149

11 Features 162

12 Model ensembles 172

13 Machine Learning Experiments 182


LESSON -1

THE MACHINE LEARNING LANDSCAPE


Structure

1.1 Introduction

1.2 Learning Objectives

1.3 Types of Machine Learning Systems

1.4 Supervised Learning

1.5 Unsupervised Learning

1.6 Semi-Supervised Learning

1.7 Reinforcement Learning

1.8 Main Challenges of Machine Learning

1.9 Testing and Validating Machine Learning Model

1.10 Summary

1.11 Keywords

1.12 Model Questions

1.1 Introduction

The goal of Machine Learning (ML) is to construct computer programs that can learn from data.
Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify
patterns and make decisions with minimal human intervention. Machine learning is a subfield of
computer science that evolved from the study of pattern recognition and computational learning
theory in artificial intelligence. Machine learning explores the construction and study of algorithms
that can learn from and make predictions on data.

Here is a slightly more general definition:

“[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.” (Arthur Samuel, 1959)

And a more engineering-oriented one:

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” (Tom Mitchell, 1997)

Finally, Machine Learning can help humans learn (Figure 1.1): ML algorithms can be inspected to see what they have learned.

Figure 1.1 Machine Learning

To summarize, Machine Learning is great for:

• Problems for which existing solutions require a lot of hand-tuning or long lists of rules:
one Machine Learning algorithm can often simplify code and perform better.

• Complex problems for which there is no good solution at all using a traditional approach:
the best Machine Learning techniques can find a solution.

• Fluctuating environments: a Machine Learning system can adapt to new data.

• Getting insights about complex problems and large amounts of data.



1.2 Learning Objectives

• To explore the basics of machine learning

• To understand the various types of machine learning mechanisms

• To apply the models to real-time problems

• To understand the overall life cycle of the machine learning process

1.3 TYPES OF MACHINE LEARNING SYSTEMS

There are so many different types of Machine Learning systems (Figure 1.2) that it is useful to
classify them in broad categories based on:

• Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and reinforcement learning)

• Whether or not they can learn incrementally on the fly (online versus batch learning)

• Whether they work by simply comparing new data points to known data points, or instead
detect patterns in the training data and build a predictive model, much like scientists do
(instance-based versus model-based learning)

Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Figure 1.2 Types of Machine Learning



1.4 Supervised Learning

Supervised learning is learning in which the model is trained on a labelled dataset. A labelled dataset is one that contains both input and output parameters. In this type of learning, both the training and validation datasets are labelled, as shown in Figure 1.3.

Figure 1.3 Supervised Learning

Both of the figures above show labelled datasets:

• Figure A: a dataset from a shopping store, used to predict whether a customer will purchase a particular product based on his or her gender, age, and salary.

Input: Gender, Age, Salary

Output: Purchased, i.e. 0 or 1; 1 means the customer will purchase the product and 0 means the customer will not.

• Figure B: a meteorological dataset used to predict wind speed from several parameters.

Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction

Output: Wind Speed



Training the system

While training the model, the data is usually split in the ratio 80:20, i.e. 80% as training data and the rest as testing data. For the training data, we feed both the input and the output for that 80% of the data. The model learns from the training data only. We use different machine learning algorithms (discussed in detail in later lessons) to build the model; by learning, we mean that the model builds some logic of its own. Once the model is ready, it can be tested. At testing time, the input is fed from the remaining 20% of the data, which the model has never seen before; the model predicts some value, and we compare it with the actual output to calculate the accuracy. The supervised algorithms are categorized as shown in Figure 1.4, and a short code sketch of this workflow follows the figure.

Figure 1.4 Classification and Regression
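As a minimal sketch of the 80:20 workflow described above, the following Python snippet uses scikit-learn (the library covered by the prescribed text book by Géron). The data is synthetic and stands in for the shopping-store example of Figure A; the column ranges and the toy purchase rule are assumptions made purely for illustration.

# Minimal sketch of the 80:20 supervised-learning workflow (assumptions: synthetic data,
# a decision tree as the classifier; any supervised algorithm could be substituted).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
# Columns: gender (0/1), age, salary (invented stand-ins for Figure A).
X = rng.integers(low=[0, 18, 10000], high=[2, 61, 100001], size=(500, 3))
y = (X[:, 2] > 50000).astype(int)   # 1 = will purchase, 0 = won't (toy labelling rule)

# 80% of the data for training, the remaining 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                  # the model learns from the training data only
y_pred = model.predict(X_test)               # predictions on data the model has never seen
print("Accuracy:", accuracy_score(y_test, y_pred))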

Types of Supervised Learning

1. Classification: a supervised learning task where the output has defined labels (discrete values). For example, in Figure A above, the output Purchased has defined labels, i.e. 0 or 1; 1 means the customer will purchase and 0 means the customer won't. The goal here is to predict discrete values belonging to a particular class and to evaluate the model on the basis of accuracy.

Classification can be either binary or multi-class. In binary classification, the model predicts either 0 or 1 (yes or no), whereas in multi-class classification the model predicts one of more than two classes. Example: Gmail classifies mail into several classes such as social, promotions, updates, and forums.

2. Regression: a supervised learning task where the output is a continuous value. For example, in Figure B above, the output Wind Speed does not take discrete values but is continuous within a particular range. The goal here is to predict a value as close to the actual output as the model can, and evaluation is done by calculating the error; the smaller the error, the greater the accuracy of the regression model.

Examples of Supervised Learning Algorithms

• Linear Regression

• Nearest Neighbor

• Gaussian Naive Bayes

• Decision Trees

• Support Vector Machine (SVM)

• Random Forest

Applications

• Advertisement Popularity: Selecting advertisements that will perform well is often a supervised learning task. Many of the ads you see as you browse the internet are placed there because a learning algorithm judged them to be reasonably popular (and clickable). Furthermore, their placement on a certain site or with a certain query (if you find yourself using a search engine) is largely due to a learned algorithm judging that the match between ad and placement will be effective.

• Spam Classification: If you use a modern email system, chances are you’ve
encountered a spam filter. That spam filter is a supervised learning system. Fed email
examples and labels (spam/not spam), these systems learn how to preemptively filter
out malicious emails so that their user is not harassed by them. Many of these also
behave in such a way that a user can provide new labels to the system and it can learn
user preference.

• Face Recognition: Do you use Facebook? Most likely your face has been used in a supervised learning algorithm trained to recognize it. A system that takes a photo, finds faces, and guesses who is in the photo (suggesting a tag) is a supervised process. It has multiple layers to it, finding faces and then identifying them, but it is still supervised nonetheless.

1.5 Unsupervised Learning

Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition. Unsupervised learning is very much the opposite of supervised learning: it features no labels. Instead, the algorithm is fed a lot of data and given the tools to understand the properties of the data. From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or another intelligent algorithm) can come in and make sense of the newly organized data.

No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means
towards an end (feature learning).

Applications

Recommender Systems: If you’ve ever used YouTube or Netflix, you’ve most likely encountered
a video recommendation system. These systems are often placed in the unsupervised
domain. We know things about videos, maybe their length, their genre, etc. We also know the
watch history of many users. Taking into account users that have watched similar videos as
you and then enjoyed other videos that you have yet to see, a recommender system can see
this relationship in the data and prompt you with such a suggestion.

Buying Habits: It is likely that your buying habits are contained in a database somewhere and
that data is being bought and sold actively at this time. These buying habits can be used in
unsupervised learning algorithms to group customers into similar purchasing segments. This
helps companies market to these grouped segments and can even resemble recommender
systems.

Grouping User Logs: Less user facing, but still very relevant, we can use unsupervised learning
to group user logs and issues. This can help companies identify central themes to issues their
customers face and rectify these issues, through improving a product or designing an FAQ to
handle common issues. Either way, it is something that is actively done and if you’ve ever
submitted an issue with a product or submitted a bug report, it is likely that it was fed to an
unsupervised learning algorithm to cluster it with other similar issues.

News Sections: Google News uses unsupervised learning to categorize articles on the same
story from various online news outlets. For example, the results of a presidential election could
be categorized under their label for “US” news.

Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such
as object recognition.

Medical imaging: Unsupervised machine learning provides essential features to medical imaging devices, such as image detection, classification and segmentation, used in radiology and pathology to diagnose patients quickly and accurately.

Anomaly detection: Unsupervised learning models can comb through large amounts of data
and discover atypical data points within a dataset. These anomalies can raise awareness around
faulty equipment, human error, or breaches in security.

Customer personas: Defining customer personas makes it easier to understand common traits and business clients’ purchasing habits. Unsupervised learning allows businesses to build better buyer persona profiles, enabling organizations to align their product messaging more appropriately.

Figure 1.5 Unsupervised learning



Here we have taken unlabeled input data, which means it is not categorized and the corresponding outputs are not given. This unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find hidden patterns and then applies a suitable algorithm, such as k-means clustering. Once a suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithm

The unsupervised learning algorithm can be further categorized (Figure 1.6) into two types of
problems:

Figure 1.6 Types of Unsupervised Learning

Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence or absence of those commonalities.

Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines sets of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule mining is Market Basket Analysis.

Examples of Unsupervised Learning Algorithms

o K-means clustering

o KNN (k-nearest neighbors)

o Hierarchical clustering

o Anomaly detection

o Neural networks

o Principal Component Analysis

o Independent Component Analysis

o Apriori algorithm

o Singular Value Decomposition
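As a hedged illustration of the clustering approach described above, the sketch below groups synthetic, unlabelled points with scikit-learn's KMeans; the blob generator and the choice of three clusters are assumptions made only for the example.

# Minimal clustering sketch: k-means groups unlabelled points into three clusters.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # unlabelled input data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])        # cluster assignment for the first ten points
print(kmeans.cluster_centers_)    # the three learned cluster centres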

1.6 Semi-Supervised Learning

The most basic disadvantage of any supervised learning algorithm is that the dataset has to be hand-labelled, either by a machine learning engineer or a data scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of unsupervised learning is that its application spectrum is limited.

To counter these disadvantages, the concept of semi-supervised learning was introduced. In this type of learning, the algorithm is trained on a combination of labelled and unlabelled data. Typically, this combination contains a very small amount of labelled data and a very large amount of unlabelled data. The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labelled data to label the rest of the unlabelled data. The typical use cases of this type of algorithm have a common property: the acquisition of unlabelled data is relatively cheap, while labelling that data is very expensive.

Intuitively, one may imagine the three types of learning as follows: supervised learning is where a student is under the supervision of a teacher both at home and at school; unsupervised learning is where a student has to figure out a concept by himself; and semi-supervised learning is where a teacher teaches a few concepts in class and gives questions as homework that are based on similar concepts.

Applications

1. Speech Analysis: Since labelling audio files is a very labour-intensive task, semi-supervised learning is a natural approach to this problem.

2. Internet Content Classification: Labelling each webpage is an impractical and unfeasible process, so semi-supervised learning algorithms are used instead. Even the Google search algorithm uses a variant of semi-supervised learning to rank the relevance of a webpage for a given query.

3. Protein Sequence Classification: Since DNA strands are typically very large, semi-supervised learning has become prominent in this field.

1.7 Reinforcement Learning (RL)

Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions, and learn through trial and error. Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. It is employed by various software systems and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that supervised training data comes with the answer key, so the model is trained with the correct answers, whereas in reinforcement learning there is no answer: the reinforcement agent decides what to do to perform the given task.

Example: The problem is as follows (Figure 1.7). We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward; the figure illustrates this.

Figure 1.7 Reinforcement Learning

The image above shows a robot, a diamond, and fire. The goal of the robot is to get the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that gives it the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.

Input: The input is an initial state from which the model starts.

Output: There are many possible outputs, as there is a variety of solutions to a particular problem.

Training: The training is based on the input; the model returns a state, and the user decides whether to reward or punish the model based on its output. The model keeps learning, and the best solution is decided based on the maximum reward. A small code sketch of this idea is given below.
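The robot-and-diamond idea can be sketched as tabular Q-learning on a tiny one-dimensional corridor. The corridor layout, reward values, and hyper-parameters below are assumptions chosen only to keep the example small; they are not taken from the lesson.

# Minimal tabular Q-learning sketch: an agent on a 5-cell corridor learns to walk right
# towards a reward (the "diamond") in the last cell; each step carries a small penalty.
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))        # table of state-action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy choice: mostly exploit the best known action, sometimes explore.
        action = rng.integers(2) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else -0.01
        # Q-learning update: move the estimate towards reward + discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.argmax(axis=1))   # learned policy: "right" (1) is preferred in every non-terminal cell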

Types of Reinforcement: There are two types of Reinforcement.

1. Positive Reinforcement

Positive reinforcement occurs when an event, occurring because of a particular behavior, increases the strength and the frequency of that behavior. In other words, it has a positive effect on behavior.

Advantages of positive reinforcement:

• Maximizes performance

• Sustains change for a long period of time

Disadvantages of positive reinforcement:

• Too much reinforcement can lead to an overload of states, which can diminish the results

2. Negative Reinforcement

Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

• Increases behavior

• Provides defiance to a minimum standard of performance

Disadvantages of negative reinforcement:

• It only provides enough to meet the minimum behavior

Various Practical applications of Reinforcement Learning

• RL can be used in robotics for industrial automation.

• RL can be used in machine learning and data processing

• RL can be used to create training systems that provide custom instruction and materials
according to the requirement of students.

RL can be used in large environments in the following situations

1. A model of the environment is known, but an analytic solution is not available;

2. Only a simulation model of the environment is given (the subject of simulation-based optimization);

3. The only way to collect information about the environment is to interact with it.

1.8 MAIN CHALLENGES OF MACHINE LEARNING

The key challenges are described below,

Insufficient Quantity of Training Data

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms
to work properly. Even for very simple problems you typically need thousands of examples, and
for complex problems such as image or speech recognition you may need millions of examples
(unless you can reuse parts of an existing model).

Nonrepresentative Training Data

In order to generalize well, it is crucial that the training data be representative of the new cases
we want to generalize to. This is true whether we use instance-based learning or model-based
learning.

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so the system is less likely to perform well. It is often well worth the effort to spend time cleaning up the training data. For example:

• If some instances are clearly outliers, it may help to simply discard them or try to fix the
errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did not specify
their age), we must decide whether we want to ignore this attribute altogether, ignore
these instances, fill in the missing values (e.g., with the median age), or train one model
with the feature and one model without it, and so on.
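For instance, the median-fill option mentioned in the last point can be sketched with scikit-learn's SimpleImputer; the tiny age column below is invented for illustration.

# Minimal sketch: replace missing ages with the median of the observed ages.
import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[25.0], [32.0], [np.nan], [47.0], [np.nan], [29.0]])   # two missing values
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(ages).ravel())   # NaNs become 30.5, the median of the rest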

Irrelevant Features

A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

• Feature selection: selecting the most useful features to train on among existing features.

• Feature extraction: combining existing features to produce a more useful one (as we
saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.

Overfitting the Training Data

Overgeneralizing is something that we humans do all too often, and unfortunately machines
can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it
means that the model performs well on the training data, but it does not generalize well.

Underfitting the Training Data

Underfitting is the opposite of overfitting: it occurs when the model is too simple to learn the
underlying structure of the data. For example, a linear model of life satisfaction is prone to
underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate,
even on the training examples.

The main options to fix this problem are:

• Selecting a more powerful model, with more parameters

• Feeding better features to the learning algorithm (feature engineering)

• Reducing the constraints on the model (e.g., reducing the regularization hyper parameter)

1.9 TESTING AND VALIDATING MACHINE LEARNING MODEL

Effective machine learning (ML) algorithms require quality training and testing data —
and often lots of it — to make accurate predictions. Different datasets serve different
purposes in preparing an algorithm to make predictions and decisions based on real-
world data.

Training Dataset: The sample of data used to fit the model. The actual dataset that we use to
train the model. The model sees and learns from this data. This type of data builds up the machine
learning algorithm. The data scientist feeds the algorithm input data, which corresponds to an
expected output. The model evaluates the data repeatedly to learn more about the data’s behavior
and then adjusts itself to serve its intended purpose.

Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit
on the training dataset while tuning model hyper parameters. During training, validation data
infuses new data into the model that it hasn’t evaluated before. Validation data provides the first
test against unseen data, allowing data scientists to evaluate how well the model makes
predictions based on the new data. Not all data scientists use validation data, but it can provide
some helpful information to optimize hyper parameters, which influence how the model assesses
data. The evaluation becomes more biased as skill on the validation dataset is incorporated
into the model configuration. The validation set is used to evaluate a given model, but this is for
frequent evaluation. Hence the model occasionally sees this data, but never does it “Learn”
from this. So the validation set affects a model, but only indirectly. The validation set is also
known as the Dev set or the Development set. This makes sense since this dataset helps
during the “development” stage of the model.

Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset. The Test dataset provides the gold standard used to evaluate the model. It
is only used once a model is completely trained (using the train and validation sets). The test
set is generally what is used to evaluate competing models. After the model is built, testing data
once again validates that it can make accurate predictions. If training and validation data include
labels to monitor performance metrics of the model, the testing data should be unlabeled. Test
data provides a final, real-world check of an unseen dataset to confirm that the ML algorithm
was trained effectively. Often the validation set is used as the test set, but this is not good practice. The test set is generally well curated. It contains carefully sampled data that spans
the various classes that the model would face, when used in the real world. It is common to use
80% of the data for training and hold out 20% for testing.
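A hedged sketch of that split follows, first holding out 20% for testing and then carving a validation set out of the remaining training data; the 60/20/20 proportions and the iris dataset are illustrative choices, not a rule.

# Minimal sketch: split a dataset into training, validation, and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then take 25% of the remaining 80% (i.e. 20% overall) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 samples of the 150 in iris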

Training data vs. validation data

ML algorithms require training data to achieve an objective. The algorithm will analyze this
training dataset, classify the inputs and outputs, then analyze it again. Trained enough, an
algorithm will essentially memorize all of the inputs and outputs in a training dataset — this
becomes a problem when it needs to consider data from other sources, such as real-world
customers.

Here is where validation data is useful. Validation data provides an initial check that the model
can return useful predictions in a real-world setting, which training data cannot do. The ML
algorithm can assess training data and validation data at the same time. Validation data is an
entirely separate segment of data, though a data scientist might carve out part of the training
dataset for validation — as long as the datasets are kept separate throughout the entirety of
training and testing.

For example, let’s say an ML algorithm is supposed to analyze a picture of a vertebrate and
provide its scientific classification. The training dataset would include lots of pictures of mammals,
but not all pictures of all mammals, let alone all pictures of all vertebrates. So, when the validation
data provides a picture of a squirrel, an animal the model hasn’t seen before, the data scientist
can assess how well the algorithm performs in that task. This is a check against an entirely
different dataset than the one it was trained on. Based on the accuracy of the predictions after
the validation stage, data scientists can adjust hyper parameters such as learning rate, input
features and hidden layers. These adjustments prevent overfitting, in which the algorithm can
make excellent determinations on the training data, but can’t effectively adjust predictions for
additional data. The opposite problem, Underfitting, occurs when the model isn’t complex enough
to make accurate predictions against either training data or new data.

Validation data vs. testing data

Not all data scientists rely on both validation data and testing data. To some degree, both datasets
serve the same purpose: make sure the model works on real data. However, there are some
practical differences between validation data and testing data. Validation data occurs as part of
the model training process. Conversely, the model acts as a black box when we run testing
data through it. Thus, validation data tunes the model, whereas testing data simply confirms
that it works.

1.10 Summary

In this lesson, we have discussed the fundamental notions of machine learning and its process. The various types of learning mechanisms are delineated in detail. The testing, training, and validation of machine learning models are further explained with examples.

1.11 Keywords

Machine Learning, Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Reinforcement Learning, Testing, Validating

1.12 Model Questions

1. Explain in detail the life cycle of Machine Learning.

2. Elaborate on the types of learning.

3. Discuss the post-implementation phase of a Machine Learning model.

4. Differentiate between the test and training datasets.

5. Write short notes on the challenges of Machine Learning.



LESSON – 2

END-TO-END MACHINE LEARNING PROJECT


Structure

2.1 Introduction

2.2 Learning Objectives

2.3 Get the data

2.4 Discover and visualize the data to gain insights

2.5 Types of data visualization charts

2.6 Prepare the data for Machine Learning algorithms

2.7 Select a model and train it

2.8 Considerations for Model Selection

2.9 Model Selection Techniques

2.10 Fine-tune your model

2.11 Summary

2.12 Keywords

2.13 Model Questions

2.1 Introduction

End-to-end machine learning is concerned with preparing data, training a model on it, and then deploying that model. The cycle of a Machine Learning project involves the following steps:

• Look at the big picture.

• Get the data.

• Discover and visualize the data to gain insights.

• Prepare the data for Machine Learning algorithms.

• Select a model and train it.

• Fine-tune your model.

• Present your solution.

• Launch, monitor, and maintain your system.

2.2 Learning Objectives

• To comprehend the several aspects of a machine learning project

• To analyze and visualize data for deriving insights

• To understand the visualization mechanisms

• To explore data preparation

• To select a model and fine-tune it for better performance

2.3 Get the data

In Machine Learning it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places we can look to get data:

• Popular open data repositories:

o UC Irvine Machine Learning Repository

o Kaggle datasets

o Amazon’s AWS datasets

• Meta portals (they list open data repositories):

o http://dataportals.org/

o http://opendatamonitor.eu/

o http://quandl.com/

• Other pages listing many popular open data repositories:

o Wikipedia’s list of Machine Learning datasets

o Quora.com question

o Datasets subreddit

2.4 Discover and visualize the data to gain insights

Visualizing information is a critical part of data analysis. Being able to gain a visual understanding
of information creates a solid foundation from which to base points on. Data that stands alone
and is stored on computers over time becomes invisible. To be able to see information or
interpret it, it needs to be visualized. This turns the once invisible non-actionable data into
understandable pictures and images. When creating visualizations, tables alone are not enough
to be able to correctly and accurately interpret the data available.

Converting the information into a rudimentary graph doesn’t allow us to be able to identify a
pattern immediately. For this reason, it’s good to get creative and think outside the box a little.
For example, when referencing geographic locations by using a map of the area with embedded
visual data, it makes it considerably easier to digest the available statistics.

Where insights are readily available from an array of sources that focus on a wide range of
business issues, opportunities, developments, and risks, the accurate interpretation of data
can pay dividends for companies and secure their long-term future in the process (Figure 2.1).

Figure 2.1. Visualization

Every new visualization is likely to give us some insights into our data. Some of those insights
might be already known (but perhaps not yet proven) while other insights might be completely
new or even surprising to us. Some new insights might mean the beginning of a story, while
others could just be the result of errors in the data, which are most likely to be found by visualizing
the data.

Data visualization is capable of producing significant levels of insight, provided that said data is
extracted in a clear and non-convoluted manner. Such insights can be used to delve deeper
into data sets, and have the ability to extract actionable information for businesses and clients
alike. For example, if an organization has been found to be operating at a loss, it would be wise
to draw on data to identify where changes can be made. This not only makes it easier for
employers to take mitigating measures, but it also makes such measures much easier to
explain to board members and employees.

While insights can be largely generated automatically, it’s important to be mindful when crafting
your visualization. As the infographic above shows, graphs need to not only possess the relevant
data but also be functional. This means that your X and Y axes must not only be well-labelled in order to display information in the right context but also shouldn't take on too much data. Diffuse and convoluted visualizations can ultimately make identified issues appear much more difficult to viewers. The ideal data visualization must be easy to understand, informative, and eye-catching. Charts that are too complex, lacking context, missing vital information, or difficult to interpret due to design flaws can severely undermine the process.

2.5 Types of data visualization charts

Now that we understand how data visualization can be used, let’s apply the different types of
data visualization to their uses. There are numerous tools available to help create data
visualizations. Some are more manual and some are automated, but either way they should
allow you to make any of the following types of visualizations.

Line chart

A line chart illustrates changes over time. The x-axis is usually a period of time, while the y-axis
is quantity. So, this could illustrate a company’s sales for the year broken down by month or
how many units a factory produced each day for the past week.

Area chart

An area chart is an adaptation of a line chart where the area under the line is filled in to emphasize
its significance. The color fill for the area under each line should be somewhat transparent so
that overlapping areas can be discerned.

Bar chart

A bar chart also illustrates changes over time. But if there is more than one variable, a bar chart
can make it easier to compare the data for each variable at each moment in time. For example,
a bar chart could compare the company’s sales from this year to last year.

Histogram

A histogram looks like a bar chart, but measures frequency rather than trends over time. The x-
axis of a histogram lists the “bins” or intervals of the variable, and the y-axis is frequency, so
each bar represents the frequency of that bin. For example, you could measure the frequencies
of each answer to a survey question. The bins would be the answer: “unsatisfactory,” “neutral,”
and “satisfactory.” This would tell you how many people gave each answer.

Scatter plot

Scatter plots are used to find correlations. Each point on a scatter plot means “when x = this,
then y equals this.” That way, if the points trend a certain way (upward to the left, downward to
the right, etc.) there is a relationship between them. If the plot is truly scattered with no trend at
all, then the variables do not affect each other at all.

Bubble chart

A bubble chart is an adaptation of a scatter plot, where each point is illustrated as a bubble
whose area has meaning in addition to its placement on the axes. A pain point associated with
bubble charts is the limitations on sizes of bubbles due to the limited space within the axes. So,
not all data will fit effectively in this type of visualization.

Pie chart

A pie chart is the best option for illustrating percentages, because it shows each element as
part of a whole. So, if your data explains a breakdown in percentages, a pie chart will clearly
present the pieces in the proper proportions.

Gauge

A gauge can be used to illustrate the distance between intervals. This can be presented as a
round clock-like gauge or as a tube type gauge resembling a liquid thermometer. Multiple gauges
can be shown next to each other to illustrate the difference between multiple intervals.

Map

Much of the data dealt with in businesses has a location element, which makes it easy to
illustrate on a map. An example of a map visualization is mapping the number of purchases
customers made in each state in the U.S. In this example, each state would be shaded in, and states with fewer purchases would be a lighter shade, while states with more purchases would be darker shades. Location information can also be very valuable for business leadership to
understand, making this an important data visualization to use.

Heat map

A heat map is basically a color-coded matrix. A formula is used to shade each cell of the matrix to represent the relative value or risk of that cell. Usually heat map colors range from green to red, with green being a better result and red being worse. This type of visualization is helpful because colors are quicker to interpret than numbers.

Frame diagram

Frame diagrams are basically tree maps that clearly show a hierarchical relationship structure. A frame diagram consists of branches, each of which has further branches connecting to it, with each level of the diagram consisting of more and more branches.
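Two of the chart types above, a line chart and a histogram, can be sketched with matplotlib; the sales figures and ages below are invented purely for illustration.

# Minimal sketch: a line chart of monthly sales next to a histogram of customer ages.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 158]                                # invented monthly sales
ages = [22, 25, 27, 31, 31, 34, 38, 41, 45, 47, 52, 58, 61, 64, 70]   # invented ages

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, sales, marker="o")                        # line chart: change over time
ax1.set_xlabel("Month"); ax1.set_ylabel("Sales"); ax1.set_title("Line chart")

ax2.hist(ages, bins=5)                                     # histogram: frequency per bin
ax2.set_xlabel("Age"); ax2.set_ylabel("Frequency"); ax2.set_title("Histogram")

plt.tight_layout()
plt.show()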

2.6 Prepare the data for Machine Learning algorithms

Data preparation may be one of the most difficult steps in any machine learning project.

The reason is that each dataset is different and highly specific to the project. Nevertheless,
there are enough commonalities across predictive modeling projects that we can define a
loose sequence of steps and subtasks that you are likely to perform.

This process provides a context in which we can consider the data preparation required for the
project, informed both by the definition of the project performed before data preparation and the
evaluation of machine learning algorithms performed after.

Common Data Preparation Tasks

We can define data preparation as the transformation of raw data into a form that is more
suitable for modeling. Nevertheless, there are steps in a predictive modeling project before and
after the data preparation step that are important and inform the data preparation that is to be
performed. The process of applied machine learning consists of a sequence of steps; the details vary from project to project, but all projects share the same general steps. They are:

• Step 1: Define Problem.

• Step 2: Prepare Data.

• Step 3: Evaluate Models.

• Step 4: Finalize Model.

We are concerned with the data preparation step (step 2), and there are common or standard
tasks that you may use or explore during the data preparation step in a machine learning project.

These tasks include:

Data Cleaning: Identifying and correcting mistakes or errors in the data.

Feature Selection: Identifying those input variables that are most relevant to the task.

Data Transforms: Changing the scale or distribution of variables.

Feature Engineering: Deriving new variables from available data.

Dimensionality Reduction: Creating compact projections of the data.

Data Cleaning

Data cleaning involves fixing systematic problems or errors in “messy” data. The most useful
data cleaning involves deep domain expertise and could involve identifying and addressing
specific observations that may be incorrect.

There are many reasons data may have incorrect values, such as being mistyped, corrupted,
duplicated, and so on. Domain expertise may allow obviously erroneous observations to be
identified as they are different from what is expected, such as a person’s height of 200 feet.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed.
This might involve removing a row or a column. Alternately, it might involve replacing observations
with new values.

Nevertheless, there are general data cleaning operations that can be performed, such as:

• Using statistics to define normal data and identify outliers.

• Identifying columns that have the same value or no variance and removing them.

• Identifying duplicate rows of data and removing them.

• Marking empty values as missing.

• Imputing missing values using statistics or a learned model.

Data cleaning is an operation that is typically performed first, prior to other data preparation
operations.
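A hedged sketch of a few of the general cleaning operations listed above, using pandas on a small invented table:

# Minimal data-cleaning sketch: mark empty values as missing, drop duplicate rows,
# and drop a column whose value never varies.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 165, 165, 182, ""],
    "country":   ["IN", "IN", "IN", "IN", "IN"],    # same value everywhere: no variance
    "age":       [34, 28, 28, 45, 51],
})

df = df.replace("", np.nan)                   # mark empty values as missing
df = df.drop_duplicates()                     # remove duplicate rows
df = df.loc[:, df.nunique(dropna=True) > 1]   # drop columns with a single distinct value
print(df)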

Feature Selection

Feature selection refers to techniques for selecting a subset of input features that are most
relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or mislead learning
algorithms possibly resulting in lower predictive performance. Additionally, it is desirable to
develop models only using the data that is required to make a prediction, e.g. to favor the
simplest possible well performing model.

Feature selection techniques are generally grouped into those that use the target variable
(supervised) and those that do not (unsupervised). Additionally, the supervised techniques
can be further divided into models that automatically select features as part of fitting the model
(intrinsic), those that explicitly choose features that result in the best performing model
(wrapper) and those that score each input feature and allow a subset to be selected (filter).

Statistical methods, such as correlation, are popular for scoring input features. The features can then be ranked by their scores, and a subset with the largest scores can be used as input to a model. The choice of statistical measure depends on the data types of the input and output variables.
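One way to realize the filter approach described above is scikit-learn's SelectKBest, which scores every input feature with a statistical test and keeps the top-scoring subset; the ANOVA F-test and k=2 below are illustrative choices.

# Minimal filter-style feature selection sketch: score features, keep the best two.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

print(selector.scores_)                     # one score per input feature
print(selector.get_support(indices=True))   # indices of the two selected features
X_reduced = selector.transform(X)           # the dataset restricted to those features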

Data Transforms

Data transforms are used to change the type or distribution of data variables. This is a large
umbrella of different techniques and they may be just as easily applied to input and output
variables. Data may have one of a few types, such as numeric or categorical, with subtypes
for each, such as integer and real-valued for numeric, and nominal, ordinal, and Boolean for
categorical.

Numeric Data Type: Number values.

Integer: Integers with no fractional part.

Real: Floating point values.

Categorical Data Type: Label values.

Ordinal: Labels with a rank ordering.

Nominal: Labels with no rank ordering.

Boolean: Values True and False.

We may wish to convert a numeric variable to an ordinal variable in a process called discretization. Alternatively, we may encode a categorical variable as integer or boolean variables, as is required by most classification algorithms.

Discretization Transform: Encode a numeric variable as an ordinal variable.

Ordinal Transform: Encode a categorical variable into an integer variable.

One-Hot Transform: Encode a categorical variable into binary variables.
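The three transforms just listed can be sketched with scikit-learn's preprocessing module; the toy age and size columns are invented for the example.

# Minimal sketch of the discretization, ordinal, and one-hot transforms listed above.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder, OneHotEncoder

ages = np.array([[23.0], [37.0], [45.0], [62.0]])                # numeric variable
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])  # categorical variable

# Discretization transform: numeric values -> ordered bins.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())

# Ordinal transform: category labels -> integers with a rank ordering.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
print(ordinal.fit_transform(sizes).ravel())

# One-hot transform: category labels -> binary indicator columns.
onehot = OneHotEncoder()
print(onehot.fit_transform(sizes).toarray())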

Feature Engineering

Feature engineering refers to the process of creating new input variables from the available
data. Engineering new features is highly specific to your data and data types. As such, it often
requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data. This specialization makes it a challenging topic to generalize to
general methods.

Nevertheless, there are some techniques that can be reused, such as:

• Adding a boolean flag variable for some state.

• Adding a group or global summary statistic, such as a mean.

• Adding new variables for each component of a compound variable, such as a date-time.

A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables; these are referred to as polynomial features.

• Polynomial Transform: Create copies of numerical input variables that are raised to a power.

The theme of feature engineering is to add broader context to a single observation or decompose
a complex variable, both in an effort to provide a more straightforward perspective on the input
data.
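The polynomial transform mentioned above can be sketched with scikit-learn's PolynomialFeatures, which raises inputs to a power and multiplies them with one another:

# Minimal sketch of the polynomial transform on two input variables.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))          # columns: x0, x1, x0^2, x0*x1, x1^2
print(poly.get_feature_names_out())   # names of the generated features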

Dimensionality Reduction

The number of input features for a dataset may be considered the dimensionality of the data.
For example, two input variables together can define a two-dimensional area where each row
of data defines a point in that space. This idea can then be scaled to any number of input
variables to create large multi-dimensional hyper-volumes.

The problem is, the more dimensions this space has (e.g. the more input variables), the more
likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that
space. This is referred to as the curse of dimensionality. This motivates feature selection,
although an alternative to feature selection is to create a projection of the data into a lower-
dimensional space that still preserves the most important properties of the original data.

This is referred to generally as dimensionality reduction and provides an alternative to feature selection. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret.

The most common approach to dimensionality reduction is to use a matrix factorization technique:

Principal Component Analysis (PCA)

Singular Value Decomposition (SVD)
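A hedged sketch of the PCA option listed above, projecting the four-dimensional iris data onto two principal components with scikit-learn:

# Minimal dimensionality-reduction sketch: project 4-D data onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # the projected data (not directly interpretable)

print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component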

2.7 Select a model and train it

Model selection is the process of choosing one among many candidate models for a predictive
modeling problem. There may be many competing concerns when performing model selection
beyond model performance, such as complexity, maintainability, and available resources. The
two main classes of model selection techniques are probabilistic measures and resampling
methods.

Model selection is the process of selecting one final machine learning model from among a
collection of candidate machine learning models for a training dataset. Model selection is a
process that can be applied both across different types of models (e.g. logistic regression,
SVM, KNN, etc.) and across models of the same type configured with different model hyper
parameters (e.g. different kernels in an SVM). For example, we may have a dataset for which
we are interested in developing a classification or regression predictive model.

We do not know beforehand which model will perform best on this problem, as it is
unknowable. Therefore, we fit and evaluate a suite of different models on the problem. Model
selection is the process of choosing one of the models as the final model that addresses the
problem. Model selection is different from model assessment. For example, we evaluate or
assess candidate models in order to choose the best one, and this is model selection. Whereas
once a model is chosen, it can be evaluated in order to communicate how well it is expected to
perform in general; this is model assessment.

2.8 Considerations for Model Selection

Fitting models is relatively straightforward, although selecting among them is the true challenge
of applied machine learning. Firstly, we need to get over the idea of a “best” model. All models
have some predictive error, given the statistical noise in the data, the incompleteness of the
data sample, and the limitations of each different model type. Therefore, the notion of a perfect
or best model is not useful. Instead, we must seek a model that is “good enough.”

The project stakeholders may have specific requirements, such as maintainability and limited
model complexity. As such, a model that has lower skill but is simpler and easier to understand
may be preferred. Alternately, if model skill is prized above all other concerns, then the ability of
the model to perform well on out-of-sample data will be preferred regardless of the computational
complexity involved.

Therefore, a “good enough” model may refer to many things and is specific to your project,
such as:

• A model that meets the requirements and constraints of project stakeholders.

• A model that is sufficiently skillful given the time and resources available.

• A model that is skillful as compared to naive models.

• A model that is skillful relative to other tested models.

• A model that is skillful relative to the state-of-the-art.

Note that we are not selecting a fit model, as all models will be discarded. This is because once we choose a model, we will fit a new final model on all available data and start using it to make predictions.

2.9 Model Selection Techniques

The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem. In this ideal situation, we would split the data
into training, validation, and test sets, then fit candidate models on the training set, evaluate and
select them on the validation set, and report the performance of the final model on the test set.

Instead, there are two main classes of techniques to approximate the ideal case of model
selection; they are:

Probabilistic Measures: Choose a model via in-sample error and complexity.

Resampling Methods: Choose a model via estimated out-of-sample error.

Probabilistic Measures

Probabilistic measures involve analytically scoring a candidate model using both its performance
on the training dataset and the complexity of the model. It is known that training error is
optimistically biased, and therefore is not a good basis for choosing a model. The performance
can be penalized based on how optimistic the training error is believed to be. This is typically
achieved using algorithm-specific methods, often linear, that penalize the score based on the
complexity of the model.

Four commonly used probabilistic model selection measures include:

Akaike Information Criterion (AIC).

Bayesian Information Criterion (BIC).

Minimum Description Length (MDL).

Structural Risk Minimization (SRM).
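
To make the penalised-scoring idea concrete, here is a small hand-rolled sketch (a simplified illustration, not a library routine) that compares a simple and a complex polynomial regression using the common Gaussian-likelihood forms AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), where RSS is the residual sum of squares and k the number of fitted parameters. Only NumPy is assumed.

    import numpy as np

    rng = np.random.RandomState(0)
    n = 200
    x = rng.uniform(-3, 3, n)
    y = 1.5 * x + rng.normal(scale=1.0, size=n)   # true relationship is linear

    def fit_polynomial(x, y, degree):
        # Least-squares polynomial fit; returns residual sum of squares and k
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
        return rss, degree + 1                    # k = number of coefficients

    for degree in (1, 5):
        rss, k = fit_polynomial(x, y, degree)
        aic = n * np.log(rss / n) + 2 * k
        bic = n * np.log(rss / n) + k * np.log(n)
        print(f"degree={degree}  AIC={aic:.1f}  BIC={bic:.1f}")
    # Both criteria usually favour the simpler degree-1 model on this data.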

Resampling Methods

Resampling methods seek to estimate the performance of a model (or more precisely, the
model development process) on out-of-sample data. This is achieved by splitting the training
dataset into sub train and test sets, fitting a model on the sub train set, and evaluating it on the
test set. This process may then be repeated multiple times and the mean performance across
each trial is reported.
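
A minimal sketch of this resampling procedure, assuming scikit-learn and using its built-in iris dataset, which compares two candidate models by their mean 5-fold cross-validation accuracy:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    }

    # Each model is fitted on four folds and evaluated on the held-out fold, five times
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy = {scores.mean():.3f}")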

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.

Three common resampling model selection methods include:

 Random train/test splits.

 Cross-Validation (k-fold, LOOCV, etc.).

 Bootstrap.

Model training is the phase in the data science development lifecycle where practitioners try to
fit the best combination of weights and bias to a machine learning algorithm to minimize a loss
function over the prediction range. The purpose of model training is to build the best mathematical
representation of the relationship between data features and a target label (in supervised learning)
or among the features themselves (unsupervised learning). Loss functions are a critical aspect
of model training since they define how to optimize the machine learning algorithms. Depending on the objective, the type of data, and the algorithm, data science practitioners use different types of loss functions. One popular example of a loss function is the Mean Squared Error (MSE).
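
As a concrete example, MSE is the average of the squared differences between the predicted and actual values, MSE = (1/n) Σ (y_i − ŷ_i)². A minimal NumPy sketch with made-up illustrative numbers:

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual target values
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

    mse = np.mean((y_true - y_pred) ** 2)
    print(mse)   # 0.375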

Model training is the key step in machine learning that results in a model ready to be validated,
tested, and deployed. The performance of the model determines the quality of the applications
that are built using it. Quality of training data and the training algorithm are both important assets
during the model training phase. Typically, training data is split for training, validation and testing.
The training algorithm is chosen based on the end use case. There are a number of tradeoff
points in deciding the best algorithm–model complexity, interpretability, performance, compute
requirements, etc. All these aspects of model training make it both an involved and important
process in the overall machine learning development cycle.

Determining How Much Training Data You Need

There are a lot of factors in play for deciding how much machine learning training data you
need. First and foremost is how important accuracy is. Say you’re creating a sentiment
analysis algorithm. Your problem is complex, but for many applications a sentiment algorithm that achieves 85 or 90% accuracy is more than enough, and a false positive or negative here or there isn't going to substantively change much of anything.

Of course, more complicated use cases generally require more data than less complex ones. As a rule of thumb, a computer vision model that only needs to identify foods will need less training data than one that is trying to identify a wide range of objects. Note that there's
really no such thing as too much high-quality data. Better training data, and more of it, will
improve your models.

2.10 Fine-tune your model

Fine-tuning takes a model that has already been trained for a particular task and tweaks it to perform a second, similar task. Fine-tuning a machine learning predictive model is a crucial step to improve the accuracy of the forecasted results. Sometimes, we have to explore how model parameters can enhance the forecasting accuracy of our machine learning model.

Tuning is usually a trial-and-error process by which you change some hyper parameters (for
example, the number of trees in a tree-based algorithm or the value of alpha in a linear algorithm),
run the algorithm on the data again, then compare its performance on your validation set in
order to determine which set of hyper parameters results in the most accurate model.

All machine learning algorithms have a "default" set of hyperparameters, where a hyperparameter is "a configuration that is external to the model and whose value cannot be estimated from data." Different algorithms have different hyperparameters. For example, regularized regression models have coefficient penalties, decision trees have a set number of branches, and neural networks have a set number of layers. When building models, analysts and data scientists choose the default configuration of these hyperparameters after running the model on several datasets.

While the generic set of hyper parameters for each algorithm provides a starting point for analysis
and will generally result in a well-performing model, it may not have the optimal configurations
for your particular dataset and business problem. In order to find the best hyper parameters for
your data, you need to tune them. Model tuning allows you to customize your models so they
generate the most accurate outcomes and give you highly valuable insights into your data,
enabling you to make the most effective business decisions.

Bayesian Optimization

Bayesian Optimization has emerged as an efficient tool for hyperparameter tuning of machine
learning algorithms, more specifically, for complex models like deep neural networks. It offers
an efficient framework for optimising the highly expensive black-box functions without knowing
its form. It has been applied in several fields including learning optimal robot mechanics,
sequential experimental design, and synthetic gene design.

Evolutionary Algorithms

Evolutionary algorithms (EA) are optimization algorithms that work by modifying a set of candidate
solutions (population) according to certain rules called operators. One of the main advantages
of the EA is their generality: Meaning EA can be used in a broad range of conditions due to their
simplicity and independence from the underlying problem. In hyperparameter tuning problems,
evolutionary algorithms have proved to perform better than grid search techniques based on an
accuracy-speed ratio.

Gradient-Based Optimization

Gradient-based optimization is a methodology to optimise several hyperparameters, based on the computation of the gradient of a machine learning model selection criterion with respect to the hyperparameters. This hyperparameter tuning methodology can be applied when some differentiability and continuity conditions of the training criterion are satisfied.

Grid Search

Grid search is a basic method for hyperparameter tuning. It performs an exhaustive search on
the hyperparameter set specified by users. This approach is the most straightforward leading
to the most accurate predictions. Using this tuning method, users can find the optimal
combination. Grid search is applicable for several hyper-parameters, however, with limited
search space.
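
A minimal grid-search sketch using scikit-learn's GridSearchCV on its built-in iris data; the grid shown (the C value and kernel of an SVM) is only an illustrative choice of hyperparameters:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Exhaustively try every combination in the grid with 5-fold cross-validation
    param_grid = {
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_)   # best combination found
    print(search.best_score_)    # its cross-validated accuracy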

Keras Tuner

Keras Tuner is a library that allows users to find optimal hyperparameters for machine learning or deep learning models. The library helps to find kernel sizes, the learning rate for optimization, and other hyperparameters. Keras Tuner can be used to obtain the best parameters for various deep learning models for the highest accuracy.

Population-based Optimization

Population-based methods are essentially a series of random search methods based on genetic
algorithms, such as evolutionary algorithms, particle swarm optimization, among others. One
of the most widely used population-based methods is population-based training (PBT), proposed
by DeepMind. PBT is a unique method in two aspects:

 It allows for adaptive hyper-parameters during training

 It combines parallel search and sequential optimization

ParamILS

ParamILS (Iterated Local Search in Parameter Configuration Space) is a versatile stochastic local search approach for automated algorithm configuration. ParamILS is an automated algorithm configuration method that helps develop high-performance algorithms and their applications.

ParamILS uses default and random settings for initialization and employs iterative first improvement as a subsidiary local search procedure. It also uses a fixed number of random moves for perturbation and always accepts better or equally good parameter configurations, but re-initializes the search at random with a certain probability.

Random Search

Random search can be seen as a basic improvement on grid search. The method refers to a
randomised search over hyper-parameters from certain distributions over possible parameter
values. The searching process continues till the desired accuracy is reached. Random search
is similar to grid search but has proven to create better results than the latter. The approach is
often applied as the baseline of HPO to measure the efficiency of newly designed algorithms.
Though random search is more effective than grid search, it is still a computationally intensive
method.
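
For comparison with the grid-search sketch above, the same tuning problem can be approached with scikit-learn's RandomizedSearchCV, which samples a fixed number of configurations from user-specified distributions (the ranges below are illustrative assumptions; SciPy is assumed to be available for the log-uniform distribution):

    from scipy.stats import loguniform
    from sklearn.datasets import load_iris
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # Sample 10 random configurations instead of trying every combination
    param_distributions = {
        "C": loguniform(1e-2, 1e2),
        "kernel": ["linear", "rbf"],
    }
    search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10,
                                cv=5, random_state=0)
    search.fit(X, y)

    print(search.best_params_)
    print(search.best_score_)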

2.11 Summary

Understanding the machine learning project is crucial. The cycle of the entire process is
discussed in a step-by-step manner. Visualizing information is a critical part of data analysis.
There are numerous tools available to help create data visualizations. Model selection is the
process of choosing one among many candidate models for a predictive modeling problem.
There are two main classes of techniques to approximate the ideal case of model selection are
probabilistic and resampling. Fine tuning machine learning predictive model is a crucial step to
improve accuracy of the forecasted results.

2.12 Keywords

Machine Learning Project, Visualization, Feature Selection, Feature Engineering, Optimization

2.13 Model Questions

1. Discuss about the development of Machine Learning Project.

2. Write notes on visualization techniques.

3. How can a machine learning model be fine-tuned?

4. Write in detail model selection techniques.

5. Elaborate on data preparation tasks.



LESSON - 3

THE INGREDIENTS OF MACHINE LEARNING


Structure

3.1 Introduction

3.2 Learning Objectives

3.3 Tasks

3.4 Models

3.5 Features

3.6 Summary

3.7 Keywords

3.8 Model Questions

3.1 Introduction

Machine learning is all about using the right features to build the right models that achieve the
right tasks. In essence, features define a ‘language’ in which we describe the relevant objects in
our domain, be they e-mails or complex organic molecules. We should not normally have to go
back to the domain objects themselves once we have a suitable feature representation, which
is why features play such an important role in machine learning.

A task is an abstract representation of a problem we want to solve regarding those domain objects: the most common form of these is classifying them into two or more classes. Many of
these tasks can be represented as a mapping from data points to outputs. This mapping or
model is itself produced as the output of a machine learning algorithm applied to training data;
there is a wide variety of models to choose from. In a simple way, the terms are defined as
below,

 Tasks: the problems that can be solved with machine learning

 Models: the output of machine learning

 Features: the workhorses of machine learning

3.2 Learning Objectives

 To understand task, model and features.

 To get a deep insight on Logical models, Geometric models, Probabilistic models, and
grouping and grading

 To understand feature selection, creation, and extraction

 To explore Feature engineering

3.3 Tasks: The problems that can be solved with machine learning

The various kind of problems are addressed by the machine learning techniques. In machine
learning, multiclass or multinomial classification is the problem of classifying instances into
one of three or more classes. (Classifying instances into one of two classes is called binary
classification.) While some classification algorithms naturally permit the use of more than two
classes, others are by nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies. Multiclass classification should not be confused with multi-
label classification, where multiple labels are to be predicted for each instance.

Regression analysis is a set of statistical processes for estimating the relationships between a
dependent variable (often called the ‘outcome variable’) and one or more independent variables
(often called ‘predictors’, ‘covariates’, or ‘features’). The most common form of regression
analysis is linear regression, in which a researcher finds the line (or a more complex linear
function) that most closely fits the data according to a specific mathematical criterion.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense) to each other than to
those in other groups (clusters). It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large databases. It is intended to identify
strong rules discovered in databases using some measures of interestingness.

Applications of Machine learning are many, including external (client-centric) applications such
as product recommendation, customer service, and demand forecasts, and internally to help
businesses improve products or speed up manual and time-consuming processes. Machine
learning algorithms are typically used in areas where the solution requires continuous
improvement post-deployment. Adaptable machine learning solutions are incredibly dynamic
and are adopted by companies across verticals. Some of them are described below,

Identifying Spam

Spam e-mail recognition was one example of the task or problem in machine learning. It
constitutes a binary classification task, which is easily the most common task in machine
learning. Spam identification is one of the most basic applications of machine learning. Most of
our email inboxes also have an unsolicited, bulk, or spam inbox, where our email provider
automatically filters unwanted spam emails.

One obvious variation is to consider classification problems with more than two classes. For
instance, we may want to distinguish different kinds of ham e-mails, e.g., work-related e-mails
and private messages. We could approach this as a combination of two binary classification
tasks: the first task is to distinguish between spam and ham, and the second task is, among
ham e-mails, to distinguish between work-related and private ones.

Making Product Recommendations

Recommender systems are one of the most characteristic and ubiquitous machine learning
use cases in day-to-day life. These systems are used everywhere by search engines, e-
commerce websites (Amazon), entertainment platforms (Google Play, Netflix), and multiple
web & mobile apps.

Prominent online retailers like Amazon and eBay often show a list of recommended products
individually for each of their consumers. These recommendations are typically based on behavioral data and parameters such as previous purchases, item views, page views, clicks,
form fill-ins, purchases, item details (price, category), and contextual data (location, language,
device), and browsing history.

These recommender systems allow businesses to drive more traffic, increase customer
engagement, reduce churn rate, deliver relevant content and boost profits. All such recommended
products are based on a machine learning model’s analysis of customer’s behavioral data. It is
an excellent way for online retailers to offer extra value and enjoy various upselling opportunities
using machine learning.

Customer Segmentation

Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the
main challenges faced by any marketer. Businesses have a huge amount of marketing relevant
data from various sources such as email campaigns, website visitors and lead data.

Using data mining and machine learning, an accurate prediction for individual marketing offers
and incentives can be achieved. Using ML, savvy marketers can eliminate guesswork involved
in data-driven marketing. For example, given the pattern of behavior by a user during a trial
period and the past behaviors of all users, the chances of conversion to the paid version can be predicted. A model of this decision problem would allow a program to trigger customer
interventions to persuade the customer to convert early or better engage in the trial.

Image & Video Recognition

Advances in deep learning (a subset of machine learning) have stimulated rapid progress in
image & video recognition techniques over the past few years. They are used for multiple
areas, including object detection, face recognition, text detection, visual search, logo and
landmark detection, and image composition.

Since machines are good at processing images, Machine Learning algorithms can train Deep
Learning frameworks to recognize and classify images in the dataset with much more accuracy
than humans. Similar to image recognition, companies such as Shutterstock, eBay, Salesforce,
Amazon, and Facebook use Machine Learning for video recognition where videos are broken
down frame by frame and classified as individual digital images.

Fraudulent Transactions

Fraudulent banking transactions are quite a common occurrence today. However, it is not feasible
(in terms of cost involved and efficiency) to investigate every transaction for fraud, translating to
a poor customer service experience. Machine Learning in finance can automatically build super-
accurate predictive maintenance models to identify and prioritize all kinds of possible fraudulent
activities. Businesses can then create a data-based queue and investigate the high priority
incidents.

It allows businesses to deploy resources in areas where they will see the greatest return on the investigative
investment. Further, it also helps to optimize customer satisfaction by protecting their accounts
and not challenging valid transactions. Such fraud detection using machine learning can help
banks and financial organizations save money on disputes/chargebacks as one can train Machine
Learning models to flag transactions that appear fraudulent based on specific characteristics.

Demand Forecasting

The concept of demand forecasting is used in multiple industries, from retail and e-commerce
to manufacturing and transportation. It feeds historical data to Machine Learning algorithms
and models to predict the number of products, services, power, and more. It allows businesses
to efficiently collect and process data from the entire supply chain, reducing overheads and
increasing efficiency. ML-powered demand forecasting is very accurate, rapid, and transparent.
Businesses can generate meaningful insights from a constant stream of supply/demand data
and adapt to changes accordingly.

Virtual Personal Assistant

From Alexa and Google Assistant to Cortana and Siri, we have multiple virtual personal assistants
to find accurate information using our voice instruction, such as calling someone, opening an
email, scheduling an appointment, and more. These virtual assistants use Machine Learning
algorithms for recording our voice instructions, sending them over the server to a cloud, followed
by decoding them using Machine Learning algorithms and acting accordingly.

Sentiment Analysis

Sentiment analysis is one of the beneficial and real-time machine learning applications that
help determine the emotion or opinion of the speaker or the writer. For instance, if you’ve written
a review, email, or any other form of a document, a sentiment analyzer will be able to assess
the actual thought and tone of the text. This sentiment analysis application can be used to
analyze decision-making applications, review-based websites, and more.

Customer Service Automation

Managing an increasing number of online customer interactions has become a pain point for
most businesses. It is because they simply don’t have the customer support staff available to
deal with the sheer number of inquiries they receive daily. Machine learning algorithms have
made it possible and super easy for chatbots and other similar automated systems to fill this
gap. This application of machine learning enables companies to automate routine and low
priority tasks, freeing up their employees to manage more high-level customer service tasks.

Further, Machine Learning technology can access the data, interpret behaviors and recognize
the patterns easily. This could also be used for customer support systems that can work identical
to a real human being and solve all of the customers’ unique queries. The Machine Learning
models behind these voice assistants are trained on human languages and variations in the
human voice because it has to efficiently translate the voice to words and then make an on-
topic and intelligent response. If implemented the right way, problems solved by machine learning
can streamline the entire process of customer issue resolution and offer much-needed
assistance along with enhanced customer satisfaction.

3.4 Models

Models form the central concept in machine learning as they are what is being learned from the
data, in order to solve a given task. There is a considerable – not to say bewildering – range of
machine learning models to choose from. The basic idea for creating a taxonomy of algorithms
is that we divide the instance space in one of the following ways:

 Logical models

 Geometric models

 Probabilistic models

 Grouping and grading

Logical models

Logic models are hypothesized descriptions of the chain of causes and effects leading to an
outcome of interest (e.g. prevalence of cardiovascular diseases, annual traffic collision, etc).
While they can be in a narrative form, logic models usually take the form of a graphical depiction of
the “if-then” (causal) relationships between the various elements leading to the outcome. However,
the logic model is more than the graphical depiction: it is also the theories, scientific evidences,
assumptions and beliefs that support it and the various processes behind it.

Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models. A logical expression is an expression that returns a Boolean value,
i.e., a True or False outcome. Once the data is grouped using a logical expression, the data is
divided into homogeneous groupings for the problem we are trying to solve. For example, for a
classification problem, all the instances in the group belong to one class.

There are mainly two kinds of logical models: Tree models and Rule models.

Rule models consist of a collection of implications or IF-THEN rules. For tree-based models,
the 'if-part' defines a segment and the 'then-part' defines the behaviour of the model for this
segment. Rule models follow the same reasoning.

Tree models can be seen as a particular type of rule model where the if-parts of the rules are
organised in a tree structure. Both Tree models and Rule models use the same approach to
supervised learning. The approach can be summarised in two strategies: we could first find
the body of the rule (the concept) that covers a sufficiently homogeneous set of examples and
then find a label to represent the body. Alternately, we could approach it from the other direction,
i.e., first select a class we want to learn and then find rules that cover examples of the class.

A simple tree-based model is shown in Figure 3.1. The tree shows survival numbers of
passengers on the Titanic (“sibsp” is the number of spouses or siblings aboard). The values
under the leaves show the probability of survival and the percentage of observations in the leaf.
The model can be summarised as: The chances of survival were good if you were (i) a female
or (ii) a male younger than 9.5 years with less than 2.5 siblings.

Figure 3.1 Tree Model

To understand logical models further, we need to understand the idea of Concept Learning.
Concept Learning involves learning logical expressions or concepts from examples. The idea
of Concept Learning fits in well with the idea of Machine learning, i.e., inferring a general function
from specific training examples. Concept learning forms the basis of both tree-based and rule-
based models. More formally, Concept Learning involves acquiring the definition of a general
category from a given set of positive and negative training examples of the category. A Formal
Definition for Concept Learning is “The inferring of a Boolean-valued function from training
examples of its input and output.” In concept learning, we only learn a description for the positive
class and label everything that doesn’t satisfy that description as negative.

The following example (Figure 3.2) explains this idea in more detail.

Figure 3.2 Logical Model – Example

A Concept Learning Task called “Enjoy Sport” as shown above is defined by a set of data from
some example days. Each data is described by six attributes. The task is to learn to predict the
value of Enjoy Sport for an arbitrary day based on the values of its attribute values. The problem
can be represented by a series of hypotheses. Each hypothesis is described by a conjunction
of constraints on the attributes. The training data represents a set of positive and negative
examples of the target function. In the example above, each hypothesis is a vector of six
constraints, specifying the values of the six attributes – Sky, AirTemp, Humidity, Wind, Water,
and Forecast. The training phase involves learning the set of days (as a conjunction of attributes)
for which Enjoy Sport = yes.

Thus, the problem can be formulated as:

Given instances X which represent a set of all possible days, each described by the
attributes:

o Sky – (values: Sunny, Cloudy, Rainy),

o AirTemp – (values: Warm, Cold),



o Humidity – (values: Normal, High),

o Wind – (values: Strong, Weak),

o Water – (values: Warm, Cold),

o Forecast – (values: Same, Change).

The goal is then to determine a hypothesis h, expressed as a conjunction of constraints on these attributes, such that h(x) matches the value of EnjoySport for every training example x.

We can also formulate Concept Learning as a search problem. We can think of Concept
learning as searching through a predefined space of potential hypotheses to identify a
hypothesis that best fits the training examples. Concept learning is also an example of Inductive
Learning. Inductive learning, also known as discovery learning, is a process where the learner
discovers rules by observing examples. Inductive learning is different from deductive learning,
where students are given rules that they then need to apply. Inductive learning is based on
the inductive learning hypothesis. The Inductive Learning Hypothesis postulates that: Any
hypothesis found to approximate the target function well over a sufficiently large set of training
examples is expected to approximate the target function well over other unobserved examples.
This idea is the fundamental assumption of inductive learning.

Geometric models

In the previous section, we have seen that with logical models, such as decision trees, a logical
expression is used to partition the instance space. Two instances are similar when they end up
in the same logical segment. In this section, we consider models that define similarity by
considering the geometry of the instance space. In Geometric models, features could be
described as points in two dimensions (x- and y-axis) or a three-dimensional space (x, y, and z).
Even when features are not intrinsically geometric, they could be modelled in a geometric
manner (for example, temperature as a function of time can be modelled in two axes). In
geometric models, there are two ways we could impose similarity.

 We could use geometric concepts like lines or planes to segment (classify) the
instance space. These are called Linear models.

 Alternatively, we can use the geometric notion of distance to represent similarity. In
this case, if two points are close together, they have similar values for features and
thus can be classed as similar. We call such models Distance-based models.

Linear models

Linear models are relatively simple. In this case, the function is represented as a linear
combination of its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension
and a and b are arbitrary scalars, then ax1 + bx2 represents a linear combination of x1 and x2.
In the simplest case where f(x) represents a straight line, we have an equation of the form f (x)
= mx + c where c represents the intercept and m represents the slope.

Linear models are parametric, which means that they have a fixed form with a small number
of numeric parameters that need to be learned from data. For example, in f (x)
= mx + c, m and c are the parameters that we are trying to learn from the data. This technique
is different from tree or rule models, where the structure of the model (e.g., which features to
use in the tree, and where) is not fixed in advance.

Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is
different as well. As a result of having relatively few parameters, Linear models have low
variance and high bias. This implies that Linear models are less likely to overfit the training
data than some other models. However, they are more likely to underfit. For example, if we
want to learn the boundaries between countries based on labelled data, then linear models are
not likely to give a good approximation.
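
As a small illustration (assuming scikit-learn and NumPy, with synthetic data generated for the purpose), the sketch below learns the two parameters m and c of f(x) = mx + c from noisy observations:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    x = rng.uniform(0, 10, size=(50, 1))
    y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=50)   # true m = 2, c = 1

    model = LinearRegression().fit(x, y)
    print(model.coef_[0])     # learned slope m (close to 2)
    print(model.intercept_)   # learned intercept c (close to 1)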

Distance-based models

Distance-based models are the second class of Geometric models. Like Linear models, distance-
based models are based on the geometry of data. As the name implies, distance-based models
work on the concept of distance. In the context of Machine learning, the concept of distance is
not based on merely the physical distance between two points. Instead, we could think of the
distance between two points considering the mode of transport between two points. Travelling
between two cities by plane covers less distance physically than by train because a plane is
unrestricted. Similarly, in chess, the concept of distance depends on the piece used – for
example, a Bishop can move diagonally. Thus, depending on the entity and the mode of travel,
the concept of distance can be experienced differently. The distance metrics commonly used
are Euclidean, Minkowski, Manhattan, and Mahalanobis.

Distance is applied through the concept of neighbours and exemplars. Neighbours are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids that find the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimizes squared Euclidean distance to all other points.

The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position
of all the points in the figure from the centroid point. This definition extends to any object in n-
dimensional space: its centroid is the mean position of all the points. Medoids are similar in
concept to means or centroids. Medoids are most commonly used on data when a mean or
centroid cannot be defined. They are used in contexts where the centroid is not representative
of the dataset, such as in image data. Examples of distance-based models include the nearest-
neighbour models, which use the training data as exemplars – for example, in classification.
The K-means clustering algorithm also uses exemplars to create clusters of similar data points.
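
A brief sketch of two distance-based models mentioned above, using scikit-learn on a handful of made-up two-dimensional points:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[1, 1], [1, 2], [2, 1],     # one group of points
                  [8, 8], [8, 9], [9, 8]])    # another group of points
    y = np.array([0, 0, 0, 1, 1, 1])

    # Nearest-neighbour classification: a new point takes the class of its closest exemplars
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(knn.predict([[2, 2]]))          # expected: class 0

    # K-means clustering: the exemplars are the cluster centroids
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)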

Probabilistic models

The third family of machine learning algorithms is the probabilistic models. A probabilistic classifier
is a classifier that is able to predict, given an observation of an input, a probability distribution
over a set of classes, rather than only outputting the most likely class that the observation
should belong to. Probabilistic classifiers provide classification that can be useful in its own
right or when combining classifiers into ensembles.

We have seen before that the k-nearest neighbour algorithm uses the idea of distance (e.g.,
Euclidian distance) to classify entities, and logical models use a logical expression to partition
the instance space. In this section, we see how the probabilistic models use the idea of
probability to classify new entities.

Probabilistic models see features and target variables as random variables. The process of
modelling represents and manipulates the level of uncertainty with respect to these variables.
There are two types of probabilistic models: Predictive and Generative. Predictive probability
models use the idea of a conditional probability distribution P (Y |X) from which Y can be
predicted from X. Generative models estimate the joint distribution P (Y, X). Once we know
the joint distribution for the generative models, we can derive any conditional or marginal
distribution involving the same variables. Thus, the generative model is capable of creating new data points and their labels, knowing the joint probability distribution. The joint distribution looks
for a relationship between two variables. Once this relationship is inferred, it is possible to infer
new data points.

Naïve Bayes is an example of a probabilistic classifier.

The goal of any probabilistic classifier is, given a set of features (x_0 through x_n) and a set of classes (c_0 through c_k), to determine the probability of the features occurring in each class, and to return the most likely class. Therefore, for each class, we need to calculate P(c_i | x_0, …, x_n). We can do this using the Bayes rule, defined as

P(c_i | x_0, …, x_n) = P(x_0, …, x_n | c_i) P(c_i) / P(x_0, …, x_n)

The Naïve Bayes algorithm is based on the idea of Conditional Probability. Conditional probability
is based on finding the probability that something will happen, given that something else has
already happened. The task of the algorithm then is to look at the evidence and to determine the
likelihood of a specific class and assign a label accordingly to each entity.
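
A minimal Naïve Bayes sketch with scikit-learn's GaussianNB on the built-in iris data, shown only to illustrate a probabilistic classifier returning a probability distribution over classes rather than a single label:

    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    model = GaussianNB().fit(X, y)

    # Probability distribution over the three classes for the first sample
    print(model.predict_proba(X[:1]))
    # Most likely class label for the same sample
    print(model.predict(X[:1]))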

Grouping and grading

Grouping models divide the instance space into groups or segments, the number of which is determined at training time. One could say that grouping models have a fixed and finite 'resolution' and cannot distinguish between individual instances beyond this resolution. Grading models, by contrast, do not work at a fixed resolution: they can, in principle, distinguish between arbitrary instances by assigning each of them a different score, such as the real-valued output of a linear classifier or the probability estimate of a probabilistic model.

3.5 Features

In machine learning, features are individual independent variables that act as inputs to your system. While making predictions, models use such features to produce their outputs, and through the feature engineering process new features can also be obtained from old features. To put it simply, consider one column of your data set to be one feature, also known as a "variable" or "attribute"; the number of features is also referred to as the dimensionality of the data. Depending on what you are trying to analyze, the features you include in your dataset can vary widely.

Feature engineering is the process of using the domain knowledge of the data to create features
that makes machine learning algorithms work properly. If feature engineering is performed
properly, it helps to improve the power of prediction of machine learning algorithms by creating
the features using the raw data that facilitate the machine learning process.

Features are very important in machine learning: being the building blocks of datasets, the quality of the features in your dataset has a major impact on the quality of the insights you will get while using the dataset for machine learning. However, the relevant features are not necessarily the same across different business problems and industries, so we need to clearly understand the business goal of the data science project.

On the other hand, using the "feature selection" and "feature engineering" processes you can improve the quality of your dataset's features, although this can be a very tedious and difficult process. If these techniques work well, you will get an optimal dataset with all of the important features bearing on your specific business problem, which leads to the best possible model development and the most beneficial insights.

Top Methods of Feature Selection in ML:

 Univariate Selection

 Feature Importance

 Correlation Matrix with Heatmap

Feature engineering consists of various processes:

Feature Creation: Creating features involves creating new variables which will be most helpful
for our model. This can be adding or removing some features.

Transformations: Feature transformation is simply a function that transforms features from one representation to another. The goal here is to plot and visualise the data; if something is not adding up with the new features, we can reduce the number of features used, speed up training, or increase the accuracy of a certain model.

Feature Extraction: Feature extraction is the process of extracting features from a data set to
identify useful information. Without distorting the original relationships or significant information,
this compresses the amount of data into manageable quantities for algorithms to process.

Exploratory Data Analysis: Exploratory data analysis (EDA) is a powerful and simple tool that
can be used to improve your understanding of your data by exploring its properties. The technique is often applied when the goal is to create new hypotheses or find patterns in the data. It's often
used on large amounts of qualitative or quantitative data that haven’t been analyzed before.

Benchmark: A benchmark model is the most user-friendly, dependable, transparent, and interpretable model against which you can measure your own. It's a good idea to run test
datasets to see if your new machine learning model outperforms a recognised benchmark.
These benchmarks are often used as measures for comparing the performance between
different machine learning models like neural networks and support vector machines, linear
and non-linear classifiers, or different approaches like bagging and boosting.

Some of the techniques listed may work better with certain algorithms or datasets, while others
may be useful in all situations.

Imputation

When it comes to preparing data for machine learning, missing values are one of the most
typical issues. Human errors, data flow interruptions, privacy concerns, and other factors could
all contribute to missing values. Missing values have an impact on the performance of machine
learning models for whatever cause. The main goal of imputation is to handle these missing
values. There are two types of imputation:

Numerical Imputation: To figure out what numbers should be assigned to people currently in
the population, we usually use data from completed surveys or censuses. These data sets can
include information about how many people eat different types of food, whether they live in a city
or country with a cold climate, and how much they earn every year. That is why numerical
imputation is used to fill gaps in surveys or censuses when certain pieces of information are
missing.

Categorical Imputation: When dealing with categorical columns, replacing missing values
with the most frequently occurring value in the column is a smart solution. However, if you believe the values in the
column are evenly distributed and there is no dominating value, imputing a category like “Other”
would be a better choice, as your imputation is more likely to converge to a random selection in
this scenario.
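
A minimal imputation sketch using pandas and scikit-learn's SimpleImputer on a small made-up table with missing entries (the column names and values are purely illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({
        "age": [25, np.nan, 40, 31],                        # numerical column with a gap
        "city": ["Chennai", "Madurai", np.nan, "Chennai"],  # categorical column with a gap
    })

    # Numerical imputation: fill missing ages with the column mean
    df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

    # Categorical imputation: fill missing cities with the most frequent value
    df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

    print(df)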

Handling Outliers

Outlier handling is a technique for removing outliers from a dataset. This method can be used
on a variety of scales to produce a more accurate data representation. This has an impact on
the model’s performance. Depending on the model, the effect could be large or minimal; for
example, linear regression is particularly susceptible to outliers. This procedure should be
completed prior to model training. The various methods of handling outliers include:

Removal: Outlier-containing entries are deleted from the distribution. However, if there are
outliers across numerous variables, this strategy may result in a big chunk of the datasheet
being missed.

Replacing values: Alternatively, the outliers could be handled as missing values and replaced
with suitable imputation.

Capping: Using an arbitrary value or a value from a variable distribution to replace the maximum
and minimum values.

Discretization: Discretization is the process of converting continuous variables, models, and functions into discrete ones. This is accomplished by constructing a series of continuous intervals
(or bins) that span the range of our desired variable/model/function.

Log Transform

Log Transform is the most used technique among data scientists. It’s mostly used to turn a
skewed distribution into a normal or less-skewed distribution. We take the log of the values in a
column and utilize those values as the column in this transform. It also reduces the effect of extreme values, and the transformed data becomes closer to a normal distribution.
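
A small NumPy sketch of the log transform; log1p is used here (an implementation choice, not from the lesson text) so that zero values remain valid by computing log(1 + x):

    import numpy as np

    income = np.array([20_000, 35_000, 40_000, 1_000_000])   # right-skewed values
    log_income = np.log1p(income)                            # log(1 + x)
    print(log_income)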

One-hot encoding

A one-hot encoding is a type of encoding in which an element of a finite set is represented by the index in that set, where only one element has its index set to "1" and all other elements are
assigned indices within the range [0, n-1]. In contrast to binary encoding schemes, where each
bit can represent 2 values (i.e. 0 and 1), this scheme assigns a unique value for each possible
case.
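
A minimal one-hot encoding sketch using pandas.get_dummies on a made-up categorical column:

    import pandas as pd

    df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

    # Each category becomes its own 0/1 indicator column
    encoded = pd.get_dummies(df, columns=["colour"])
    print(encoded)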

Scaling

Feature scaling is one of the most pervasive and difficult problems in machine learning, yet it’s
one of the most important things to get right. In order to train a predictive model, we need data
with a known set of features that needs to be scaled up or down as appropriate. This section explains how feature scaling works and why it's important, as well as some tips for getting started with feature scaling.

After a scaling operation, the continuous features become similar in terms of range. Although
this step isn’t required for many algorithms, it’s still a good idea to do so. Distance-based
algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as
model input. There are two common ways for scaling:

Normalization: All values are scaled in a specified range between 0 and 1 via normalization (or
min-max normalization). This modification has no influence on the feature’s distribution; however,
it does exacerbate the effects of outliers due to lower standard deviations. As a result, it is
advised that outliers be dealt with prior to normalization.

Standardization: Standardization (also known as z-score normalization) is the process of scaling values while accounting for standard deviation. If the standard deviation of features
differs, the range of those features will likewise differ. The effect of outliers in the characteristics
is reduced as a result. To arrive at a distribution with a 0 mean and 1 variance, the mean is subtracted from all the data points and the result is divided by the distribution's standard deviation.
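
A short sketch contrasting the two approaches with scikit-learn's MinMaxScaler and StandardScaler on a single toy feature containing an outlier:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [2.0], [3.0], [100.0]])   # one feature with an outlier

    print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into the range [0, 1]
    print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance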

Feature tools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning. Feature tools integrates with the machine learning pipeline-building tools you already have. Its features are:

 Easy to get started, good documentation and community support

 It helps you construct meaningful features for machine learning and predictive
modelling by combining your raw data with what you know about your data.

 It provides APIs to verify that only legitimate data is utilised for calculations, preventing
label leakage in your feature vectors.

 Feature tools includes a low-level function library that may be layered to generate
features.

 Its AutoML library (EvalML) helps you build, optimize, and evaluate machine learning
pipelines.

 Good at handling relational databases.

Some of the tools are discussed below,

AutoFeat

AutoFeat helps to perform Linear Prediction Models with Automated Feature Engineering and
Selection. AutoFeat allows you to select the units of the input variables in order to avoid the
construction of physically nonsensical features.

TsFresh

TsFresh is a Python package. It calculates a huge number of time series characteristics, or features, automatically. In addition, the package includes methods for assessing the explanatory
power and significance of such traits in regression and classification tasks.

OneBM

OneBM interacts directly with a database’s raw tables. It slowly joins the tables, taking different
paths on the relational tree. It recognises simple data types (numerical or categorical) and
complicated data types (set of numbers, set of categories, sequences, time series, and texts)
in the joint results and applies pre-defined feature engineering approaches to the supplied types.

ExploreKit

Based on the idea that extremely informative features are typically the consequence of
manipulating basic ones, ExploreKit identifies common operators to alter each feature
independently or combine multiple of them. Instead of running feature selection on all developed
features, which can be quite huge, meta learning is used to rank candidate features.

3.6 Summary

Machine learning is all about using the right features to build the right models that achieve the
right tasks. Features are very important in machine learning, being the building blocks of datasets. Top methods of feature selection in ML are Univariate Selection, Correlation Matrix with
Heatmap, and Feature Importance. Models form the central concept in machine learning as
they are what is being learned from the data, in order to solve a given task.

3.7 Keywords

Tasks, Models, Features, Feature Engineering, Exploratory Data Analysis (EDA)

3.8 Model Questions

1. Discuss tasks in machine learning and its various applications.

2. Elaborate on various models.

3. Write in detail feature selection methods.

4. Write notes on the tools used for feature selection.

5. Differentiate distance based and probabilistic models.



LESSON – 4

BINARY CLASSIFICATION AND RELATED TASKS


Structure

4.1 Introduction

4.2 Learning Objectives

4.2 Classification

4.3 Binary Classification

4.4 Multi-Class Classification

4.5 Multi-Label Classification

4.6 Imbalanced Classification

4.7 Assessing classification performance

4.8 Scoring and ranking

4.9 Summary

4.10 Keywords

4.11 Model Questions

4.1 Introduction

In this lesson and the next we take a bird’s-eye view of the wide range of different tasks that can
be solved with machine learning techniques. 'Task' here refers to whatever it is that machine learning is intended to improve performance at. Since this is a classification task, we need to
learn an appropriate classifier from training data. Many different types of classifiers exist: linear
classifiers, Bayesian classifiers, distance-based classifiers, to name a few. For each of these
tasks we will discuss what it is, what variants exist, how performance at the task could be
assessed, and how it relates to other tasks.

The objects of interest in machine learning are usually referred to as instances. The set of all
possible instances is called the instance space. To illustrate, X could be the set of all possible
e-mails. We furthermore distinguish between the label space L and the output space Y. The
label space is used in supervised learning to label the examples. In order to achieve the task
under consideration we need a model: a mapping from the instance space to the output space.
For instance, in classification the output space is a set of classes, while in regression it is the
set of real numbers. In order to learn such a model we require a training set Tr of labelled instances (x, l(x)), also called examples, where l : X → L is a labelling function.

The most commonly encountered machine learning scenario is where the label space coincides with the output space. That is, Y = L and we are trying to learn an approximation l^ : X → L to the true labelling function l, which is only known through the labels it assigned to the training data.
This scenario covers both classification and regression. In cases where the label space and
the output space differ, this usually serves the purpose of learning a model that outputs more
than just a label – for instance, a score for each possible label. In this case we have Y = R^k, with k = |L| the number of labels.

4.2 Learning Objectives

 To understand different classifiers

 To learn Binary Classification, Multi-Class Classification, and Multi-Label Classification

 To find the imbalance in classification problem and apply techniques

 To Assess classification performance

 To find out Scoring and ranking

4.3 Classification

Classification is the most common task in machine learning. A classifier is a mapping c^ : X → C, where C = {C1, C2, . . . , Ck} is a finite and usually small set of class labels. We will sometimes also use Ci to indicate the set of examples of that class. We use the 'hat' to indicate that c^(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance. Learning a classifier involves constructing the function c^ such that it matches c as closely as possible (and not just
on the training set, but ideally on the entire instance space X).

In the simplest case we have only two classes which are usually referred to as positive and
negative, ⊕ and ⊖, or +1 and −1. Two-class classification is often called binary classification (or
concept learning, if the positive class can be meaningfully called a concept). Spam e-mail
filtering is a good example of binary classification, in which spam is conventionally taken as the
positive class, and ham as the negative class (clearly, positive here doesn’t mean ‘good’!).
Other examples of binary classification include medical diagnosis (the positive class here is
having a particular disease) and credit card fraud detection.

4.4 Binary classification

Binary classification refers to those classification tasks that have two class labels.

Examples include:

Email spam detection (spam or not).

Churn prediction (churn or not).

Conversion prediction (buy or not).

Typically, binary classification tasks involve one class that is the normal state and another
class that is the abnormal state. For example “not spam” is the normal state and “spam” is the
abnormal state. Another example is “cancer not detected” is the normal state of a task that
involves a medical test and “cancer detected” is the abnormal state. The class for the normal
state is assigned the class label 0 and the class with the abnormal state is assigned the class
label 1.
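
As a brief illustration (assuming scikit-learn, with its synthetic make_classification data standing in for a real spam or churn dataset), the sketch below trains one of the popular binary classifiers listed next, logistic regression, and predicts the class of unseen examples:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic two-class data standing in for e.g. spam (1) vs. not spam (0)
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(clf.predict(X_test[:5]))      # predicted labels (0 or 1)
    print(clf.score(X_test, y_test))    # accuracy on held-out data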

Popular algorithms that can be used for binary classification include:

 Logistic Regression

 k-Nearest Neighbors

 Decision Trees

 Support Vector Machine

 Naive Bayes

4.4 Multi-Class Classification

Multi-class classification refers to those classification tasks that have more than two class
labels.

Examples include:

 Face classification.

 Plant species classification.

 Optical character recognition.

Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of
known classes. The number of class labels may be very large on some problems. For example,
a model may predict a photo as belonging to one among thousands or tens of thousands of
faces in a face recognition system.

Problems that involve predicting a sequence of words, such as text translation models, may
also be considered a special type of multi-class classification. Each word in the sequence of
words to be predicted involves a multi-class classification where the size of the vocabulary
defines the number of possible classes that may be predicted and could be tens or hundreds of
thousands of words in size.

It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example. The Multinoulli distribution is a discrete probability
distribution that covers a case where an event will have a categorical outcome, e.g. K in {1, 2,
3, …, K}. For classification, this means that the model predicts the probability of an example
belonging to each class label. Many algorithms used for binary classification can be used for
multi-class classification.

Popular algorithms that can be used for multi-class classification include:

 k-Nearest Neighbors.

 Decision Trees.

 Naive Bayes.

 Random Forest.

 Gradient Boosting.

Algorithms that are designed for binary classification can be adapted for use for multi-class
problems. This involves using a strategy of fitting multiple binary classification models for each
class vs. all other classes (called one-vs-rest) or one model for each pair of classes (called
one-vs-one).
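
A minimal sketch of the one-vs-rest strategy just described, wrapping a binary classifier (logistic regression) so that it can handle the three classes of scikit-learn's iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)   # three classes

    # One binary model is fitted per class (that class vs. all other classes)
    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(len(ovr.estimators_))   # 3 underlying binary classifiers
    print(ovr.predict(X[:5]))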

 One-vs-Rest: Fit one binary classification model for each class vs. all other classes.

 One-vs-One: Fit one binary classification model for each pair of classes.

Binary classification algorithms that can use these strategies for multi-class classification
include:

 Logistic Regression.

 Support Vector Machine.

4.5 Multi-Label Classification

Multi-label classification refers to those classification tasks that have two or more class labels,
where one or more class labels may be predicted for each example. Consider the example
of photo classification, where a given photo may have multiple objects in the scene and a
model may predict the presence of multiple known objects in the photo, such as “bicycle,”
“apple,” “person,” etc.

This is unlike binary classification and multi-class classification, where a single class label is
predicted for each example. It is common to model multi-label classification tasks with a model that predicts multiple outputs, with each output predicted as a Bernoulli probability
distribution. This is essentially a model that makes multiple binary classification predictions for
each example.
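
A small multi-label sketch using scikit-learn's synthetic multi-label generator and a random forest, which can predict several labels per example; the dataset here is artificial and only illustrates the indicator-vector output:

    from sklearn.datasets import make_multilabel_classification
    from sklearn.ensemble import RandomForestClassifier

    # Each row of y is a 0/1 indicator vector: several labels can be "on" at once
    X, y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.predict(X[:3]))   # e.g. [[1 0 1 0], ...] - multiple labels per example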

Classification algorithms used for binary or multi-class classification cannot be used directly
for multi-label classification. Specialized versions of standard classification algorithms can be
used, so-called multi-label versions of the algorithms, including:

 Multi-label Decision Trees

 Multi-label Random Forests

 Multi-label Gradient Boosting

4.6 Imbalanced Classification

Imbalanced classification refers to classification tasks where the number of examples in each
class is unequally distributed. Typically, imbalanced classification tasks are binary classification
tasks where the majority of examples in the training dataset belong to the normal class and a
minority of examples belong to the abnormal class.

Examples include:

 Fraud detection.

 Outlier detection.

 Medical diagnostic tests.

These problems are modeled as binary classification tasks, although they may require specialized techniques. Such techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class.

Examples include:

 Random Undersampling.

 SMOTE Oversampling.

Specialized modeling algorithms may be used that pay more attention to the minority class
when fitting the model on the training dataset, such as cost-sensitive machine learning
algorithms.

Examples include:

 Cost-sensitive Logistic Regression.

 Cost-sensitive Decision Trees.

 Cost-sensitive Support Vector Machines.
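
As a hedged illustration of cost-sensitive learning (assuming scikit-learn is available), the class_weight parameter below makes logistic regression pay more attention to the minority class; the dataset is synthetic:

# A hedged sketch of cost-sensitive logistic regression on an imbalanced
# dataset, assuming scikit-learn is installed. class_weight='balanced'
# re-weights errors inversely to class frequency.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

print("minority predictions (plain)   :", plain.predict(X).sum())
print("minority predictions (weighted):", weighted.predict(X).sum())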

4.7 Assessing classification performance

The performance of such classifiers can be summarised by means of a table known as a contingency table or confusion matrix. A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. For a binary classification problem, we would have a 2 x 2 matrix with 4 values, as shown below (Figure 4.1):

Figure 4.1. Confusion Matrix



Let’s decipher the matrix:

The target variable has two values: Positive or Negative

The columns represent the actual values of the target variable

The rows represent the predicted values of the target variable

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion
Matrix

True Positive (TP)

The predicted value matches the actual value

The actual value was positive and the model predicted a positive value

True Negative (TN)

The predicted value matches the actual value

The actual value was negative and the model predicted a negative value

False Positive (FP) – Type 1 error

The predicted value was falsely predicted

The actual value was negative but the model predicted a positive value

Also known as the Type 1 error

False Negative (FN) – Type 2 error

The predicted value was falsely predicted

The actual value was positive but the model predicted a negative value

Also known as the Type 2 error

For example, suppose we had a classification dataset with 1000 data points. We fit a classifier
on it and get the confusion matrix shown below (Figure 4.2):

Figure 4.2. Example Dataset

The different values of the Confusion matrix would be as follows:

 True Positive (TP) = 560; meaning 560 positive class data points were correctly classified
by the model

 True Negative (TN) = 330; meaning 330 negative class data points were correctly
classified by the model

 False Positive (FP) = 60; meaning 60 negative class data points were incorrectly classified
as belonging to the positive class by the model

 False Negative (FN) = 50; meaning 50 positive class data points were incorrectly classified
as belonging to the negative class by the model
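
The four values can also be counted directly from a vector of actual labels and a vector of predicted labels; the following plain-Python sketch (with made-up labels) shows the idea:

# A minimal sketch that counts TP, TN, FP and FN from actual and predicted
# labels (1 = positive, 0 = negative). No external library is assumed.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))   # (3, 3, 1, 1)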

Why Do We Need a Confusion Matrix?

Before we answer this question, let’s think about a hypothetical classification problem.

Let's say you want to predict how many people are infected with a contagious virus before they show symptoms, and isolate them from the healthy population (ringing any bells yet?). The two values for our target variable would be: Sick and Not Sick.

Our dataset is an example of an imbalanced dataset: in the confusion matrix shown below there are 960 data points for the negative class and only 40 for the positive class. This is how we'll calculate the accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Let's see how our model performed (Figure 4.3):

Figure 4.3 Model Performance

The total outcome values are:

TP = 30, TN = 930, FP = 30, FN = 10

So, the accuracy for our model turns out to be (30 + 930)/1000 = 96%!

Not bad! But it is giving the wrong idea about the result. Our model is saying "I can predict
sick people 96% of the time”. However, it is doing the opposite. It is predicting the people who
will not get sick with 96% accuracy while the sick are spreading the virus! Do you think this is a
correct metric for our model given the seriousness of the issue? Shouldn’t we be measuring
how many positive cases we can predict correctly to arrest the spread of the contagious virus?
Or maybe, out of the correctly predicted cases, how many are positive cases to check the
reliability of our model?

This is where we come across the dual concept of Precision and Recall.

Precision vs. Recall

Precision tells us how many of the cases predicted as positive actually turned out to be positive.

Here's how to calculate Precision:

Precision = TP / (TP + FP)

This would determine whether our model is reliable or not.

Recall tells us how many of the actual positive cases we were able to predict correctly with our
model. And here's how we can calculate Recall (Figure 4.4):

Recall = TP / (TP + FN)

Figure 4.4 Recall and Precision Calculation

We can easily calculate Precision and Recall for our model by plugging the values into the above equations:

Precision = 30 / (30 + 30) = 0.5, Recall = 30 / (30 + 10) = 0.75

50% of the cases predicted as positive turned out to be actual positive cases, whereas 75% of
the positives were successfully predicted by our model. Precision is a useful metric in cases
where False Positive is a higher concern than False Negatives. Precision is important in music
or video recommendation systems, e-commerce websites, etc. Wrong results could lead to
customer churn and be harmful to the business.

Recall is a useful metric in cases where False Negative trumps False Positive. Recall is important
in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive
cases should not go undetected! In our example, Recall would be a better metric because we
don’t want to accidentally discharge an infected person and let them mix with the healthy
population thereby spreading the contagious virus. Now you can understand why accuracy
was a bad metric for our model. But there will be cases where there is no clear distinction
between whether Precision is more important or Recall. What should we do in those cases?
We combine them!

F1-Score

In practice, when we try to increase the precision of our model, the recall goes down, and vice-versa. The F1-score captures both trends in a single value:

F1-score = 2 x (Precision x Recall) / (Precision + Recall)

The F1-score is the harmonic mean of Precision and Recall, and so it gives a combined idea
about these two metrics. It is maximum when Precision is equal to Recall. But there is a catch
here. The interpretability of the F1-score is poor. This means that we don’t know what our
classifier is maximizing – precision or recall? So, we use it in combination with other evaluation
metrics which gives us a complete picture of the result.
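
Plugging the virus example's counts into these formulas takes only a few lines of plain Python:

# Precision, Recall and F1 for the virus example above (TP=30, TN=930, FP=30, FN=10).
TP, TN, FP, FN = 30, 930, 30, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 2))   # 0.96 0.5 0.75 0.6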

The summary is presented below (Figure 4.5),

Figure 4.5 Confusion Matrix Formulas

Example:

Let's start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes). The 165 predictions summarised by the rates computed below form the following matrix:

                    Predicted: NO    Predicted: YES
   Actual: NO          TN = 50          FP = 10
   Actual: YES         FN = 5           TP = 100

What can we learn from this matrix?

 There are two possible predicted classes: “yes” and “no”. If we were predicting
the presence of a disease, for example, “yes” would mean they have the disease,
and “no” would mean they don’t have the disease.

 The classifier made a total of 165 predictions (e.g., 165 patients were being
tested for the presence of that disease).

 Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55
times.

 In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let’s now define the most basic terms, which are whole numbers (not rates):

 true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.

 true negatives (TN): We predicted no, and they don’t have the disease.

 false positives (FP): We predicted yes, but they don’t actually have the disease.
(Also known as a “Type I error.”)

 false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a “Type II error.”)

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

 Accuracy: Overall, how often is the classifier correct?

o (TP+TN)/total = (100+50)/165 = 0.91

 Misclassification Rate: Overall, how often is it wrong?

o (FP+FN)/total = (10+5)/165 = 0.09

o equivalent to 1 minus Accuracy

o also known as “Error Rate”

 True Positive Rate: When it’s actually yes, how often does it predict yes?

o TP/actual yes = 100/105 = 0.95

o also known as “Sensitivity” or “Recall”

 False Positive Rate: When it’s actually no, how often does it predict yes?

o FP/actual no = 10/60 = 0.17

 True Negative Rate: When it’s actually no, how often does it predict no?

o TN/actual no = 50/60 = 0.83

o equivalent to 1 minus False Positive Rate

o also known as “Specificity”

 Precision: When it predicts yes, how often is it correct?

o TP/predicted yes = 100/110 = 0.91

 Prevalence: How often does the yes condition actually occur in our sample?

o actual yes/total = 105/165 = 0.64



A couple other terms are also worth mentioning:

 Null Error Rate: This is how often you would be wrong if you always predicted the
majority class. (In our example, the null error rate would be 60/165=0.36 because if you
always predicted yes, you would only be wrong for the 60 “no” cases.) This can be a
useful baseline metric to compare your classifier against. However, the best classifier
for a particular application will sometimes have a higher error rate than the null error
rate, as demonstrated by the Accuracy Paradox.

 Cohen's Kappa: This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate.

 F Score: This is the harmonic mean of the true positive rate (recall) and precision, i.e. the F1-score discussed above.

 ROC Curve: This is a commonly used graph that summarizes the performance
of a classifier over all possible thresholds. It is generated by plotting the True
Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the
threshold for assigning observations to a given class.

Understanding AUC - ROC Curve

In machine learning, performance measurement is an essential task, so when it comes to a classification problem we can count on the AUC - ROC curve. When we need to check or visualize the performance of a classification model, we use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristics) curve. It is one of the most important evaluation metrics for checking any classification model's performance, and is also written as AUROC (Area Under the Receiver Operating Characteristics). The AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; by analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and those without.

The ROC curve (Figure 4.6) is plotted with TPR against the FPR where TPR is on the y-axis
and FPR is on the x-axis.

Figure 4.6 ROC CURVE

An excellent model has AUC near to the 1 which means it has a good measure of separability.
A poor model has an AUC near 0 which means it has the worst measure of separability. In fact,
it means it is reciprocating the result. It is predicting 0s as 1s and 1s as 0s. And when AUC is
0.5, it means the model has no class separation capacity whatsoever.

In the following figures, the red distribution curve is of the positive class (patients with the disease) and the green distribution curve is of the negative class (patients with no disease).

Figure 4.7 ROC CURVE – Ideal Situation

This is an ideal situation (Figure 4.7). When the two curves do not overlap at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class.

When two distributions overlap (Figure 4.8), we introduce type 1 and type 2 errors. Depending
upon the threshold, we can minimize or maximize them. When AUC is 0.7, it means there is a
70% chance that the model will be able to distinguish between positive class and negative
class.

Figure 4.8 ROC CURVE – Overlap

This is the worst situation (Figure 4.9). When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between the positive class and the negative class.

Figure 4.9 ROC CURVE – Worst Situation



The relation between Sensitivity, Specificity, FPR, and Threshold

Sensitivity and Specificity are inversely proportional to each other: when we increase Sensitivity, Specificity decreases, and vice versa.

When we decrease the threshold, we get more positive predictions, which increases the sensitivity and decreases the specificity. Similarly, when we increase the threshold, we get more negative predictions, so we get higher specificity and lower sensitivity. Since FPR is 1 - Specificity, when we increase TPR, FPR also increases, and vice versa.
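
As variation of the threshold is what traces out the ROC curve, here is a brief hedged sketch (assuming the scikit-learn library is available) that obtains the (FPR, TPR) pairs and the AUC from a model's predicted probabilities:

# A hedged sketch of an ROC curve and AUC computation, assuming scikit-learn
# is installed. The model's predicted probabilities are thresholded at many
# values to obtain the (FPR, TPR) pairs that make up the curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, probs))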

4.8 Scoring and ranking

Scoring is widely used in machine learning to mean the process of generating new values,
given a model and some new input. The generic term “score” is used, rather than “prediction,”
because the scoring process can generate so many different types of values:

 A list of recommended items and a similarity score.

 Numeric values, for time series models and regression models.

 A probability value, indicating the likelihood that a new input belongs to some existing
category.

 The name of a category or cluster to which a new item is most similar.

 A predicted class or outcome, for classification models.
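
For instance, for a classification model the score may be either a predicted class or a class probability. A brief hedged sketch (using the scikit-learn library rather than the Studio environment described below) illustrates the difference:

# A small sketch of scoring new inputs with a trained classifier, assuming
# scikit-learn is installed: the score can be a class label or a probability.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

new_inputs = X[:3]                       # new data must have the same columns as the training data
print(model.predict(new_inputs))         # predicted class for each input
print(model.predict_proba(new_inputs))   # probability of each class for each input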

Data used for scoring

The new data that you provide as input generally needs to have the same columns that were
used to train the model, minus the label, or outcome column. Columns that are used solely as
identifiers are usually excluded when training a model, and thus should be excluded when
scoring as well.

However, identifiers such as primary keys can easily be re-combined with the scoring dataset later, by using the Add Columns module. Before you perform scoring on your dataset, always check for missing values and nulls. When the data used as input for scoring has missing values, those missing values are used as inputs; because nulls are propagated, the result is usually a missing value.

List of scoring modules

Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:

 Apply Transformation: Applies a well-specified data transformation to a dataset.

Use this module to apply a saved process to a set of data.

 Assign Data to Clusters: Assigns data to clusters by using an existing trained clustering
model.

Use this module if you want to cluster new data based on an existing K-Means clustering
model.

This module replaces the Assign to Clusters (deprecated) module, which has been
deprecated but is still available for use in existing experiments.

 Score Matchbox Recommender: Scores predictions for a dataset by using the Matchbox
recommender.

Use this module if you want to generate recommendations, find related items or
users, or predict ratings.

 Score Model: Scores predictions for a trained classification or regression model.

Use this module for all other regression and classification models, as well as
some anomaly detection models.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking
models for information retrieval systems. Training data consists of lists of items with some partial
order specified between items in each list. This order is typically induced by giving a numerical
or ordinal score or a binary judgment (e.g., “relevant” or “not relevant”) for each item.

The purpose of the ranking model is to rank, i.e., to produce a permutation of items in new, unseen lists
in a similar way to rankings in the training data. Ranking is a central part of many information
retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and
online advertising. Training data consists of queries and documents matching them together
with relevance degree of each match. It may be prepared manually by human assessors (or
raters, as Google calls them), who check results for some queries and determine relevance of
each result. It is not feasible to check the relevance of all documents, and so typically a technique
called pooling is used — only the top few documents, retrieved by some existing ranking models
are checked. Alternatively, training data may be derived automatically by analyzing clickthrough
logs (i.e. search results which got clicks from users), query chains, or such search engines’
features as Google’s Search Wiki.

Training data is used by a learning algorithm to produce a ranking model which computes the
relevance of documents for actual queries. Typically, users expect a search query to complete
in a short time (such as a few hundred milliseconds for web search), which makes it impossible
to evaluate a complex ranking model on each document in the corpus, and so a two-phase
scheme is used.

First, a small number of potentially relevant documents are identified using simpler retrieval
models which permit fast query evaluation, such as the vector space model or the Boolean model.
This phase is called top-k document retrieval, and many heuristics have been proposed in the literature
to accelerate it, such as using a document’s static quality score and tiered indexes. In the
second phase, a more accurate but computationally expensive machine-learned model is used
to re-rank these documents.

Learning to rank algorithms have been applied in areas other than information retrieval:

 In machine translation, for ranking a set of hypothesized translations;

 In computational biology, for ranking candidate 3-D structures in the protein structure prediction problem;

 In recommender systems, for identifying a ranked list of related news articles to recommend to a user after he or she has read a current news article;

 In software engineering, learning-to-rank methods have been used for fault localization.

Class probability estimation

Class probability estimation is obviously more difficult than classification. Given a way of
generating class probabilities, classification error is minimized as long as the correct class is
predicted with maximum probability. However, a method for classification does not imply a
method of generating accurate probability estimates: the estimates that yield the correct
classification may be quite poor when assessed according to the quadratic or informational
loss.

Consider the case of probability estimation for a dataset with two classes. If the predicted
probabilities are on the correct side of the 0.5 threshold commonly used for classification, no
classification errors will be made. However, this does not mean that the probability estimates
themselves are accurate. They may be systematically too optimistic—too close to either 0 or
1—or too pessimistic—not close enough to the extremes. This type of bias will increase the
measured quadratic or informational loss, and will cause problems when attempting to minimize
the expected cost of classifications based on a given cost matrix.

Assessing class probability estimates

As with classifiers, we can now ask the question of how good these class probability estimators
are. A slight complication here is that, as already remarked, we do not have access to the true
probabilities. One trick that is often applied is to define a binary vector (I[c(x) = C1], ..., I[c(x) = Ck]), which has the i-th bit set to 1 if x's true class is Ci and all other bits set to 0, and use these as the 'true' probabilities. We can then define the squared error (SE) of the predicted probability vector p̂(x) = (p̂1(x), ..., p̂k(x)) as

SE(x) = (1/2) Σi (p̂i(x) - I[c(x) = Ci])²

and the mean squared error (MSE) as the average squared error over all instances in the test set Te:

MSE(Te) = (1/|Te|) Σx∈Te SE(x)
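
A short plain-Python sketch of these two quantities for a three-class problem, using the 1/2 normalisation in the definition above:

# Squared error (SE) and mean squared error (MSE) of class probability
# estimates, following the definitions above. Classes are indexed 0..k-1.
def squared_error(p_hat, true_class):
    # one-hot "true" probability vector I[c(x) = Ci]
    true_vec = [1 if i == true_class else 0 for i in range(len(p_hat))]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(p_hat, true_vec))

def mean_squared_error(prob_vectors, true_classes):
    return sum(squared_error(p, c) for p, c in zip(prob_vectors, true_classes)) / len(true_classes)

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # predicted probability vectors
trues = [0, 2]                               # true class indices
print(mean_squared_error(probs, trues))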

4.9 Summary

Classification is the most common task in machine learning. Binary classification refers to
those classification tasks that have two class labels. Multi-class classification refers to those
classification tasks that have more than two class labels. Multi-label classification refers to
those classification tasks that have two or more class labels, where one or more class labels
may be predicted for each example. The performance of such classifiers can be summarised
by means of a table known as a contingency table or confusion matrix.

4.10 Keywords

Classifier, Binary Classification, Multi-Class Classification, Multi-Label Classification, Scoring,


Ranking, Confusion Matrix.

4.11 Model Questions

1. Discuss classification problem in machine learning.

2. Write notes on a) Multi Class and b) Binary Classification

3. How to handle imbalanced classification?

4. Discuss about the various performance metrics for classification problems.

5. Elaborate on scoring and ranking.



LESSON – 5

CONCEPT LEARNING
Structure

5.1 Introduction

5.2 Learning Objectives

5.3 The hypothesis space

5.4 Paths through the hypothesis space

5.5 Learnability

5.6 Summary

5.7 Keywords

5.8 Model Questions

5.1 Introduction

In this lesson we consider methods for learning logical expressions or concepts from examples,
which lies at the basis of both tree models and rule models. In concept learning we only learn a
description for the positive class, and label everything that doesn’t satisfy that description as
negative. We will pay particular attention to the generality ordering that plays an important role
in logical models. Inducing general functions from specific training examples is a main issue of
machine learning. Concept learning is acquiring the definition of a general category from given sample positive and negative training examples of the category. Concept learning can be seen as a problem of searching through a predefined space of potential hypotheses for the
hypothesis that best fits the training examples.

5.2 Learning Objectives

 To understand the fundamentals of concept learning

 To learn Hypothesis and Hypothesis space

 To understand learnability and PAC.



5.3 The hypothesis space

The simplest concept learning setting is where we restrict the logical expressions describing

concepts to conjunctions of literals. In most supervised machine learning algorithms, our main
goal is to find out a possible hypothesis from the hypothesis space that could possibly map out
the inputs to the proper outputs. The hypothesis space has a general-to-specific ordering of
hypotheses, and the search can be efficiently organized by taking advantage of a naturally
occurring structure over the hypothesis space.

The following Figure 5.1, shows the common method to find out the possible hypothesis from
the Hypothesis space:

Figure 5.1. Hypothesis Space

Hypothesis Space (H):

Hypothesis space is the set of all the possible legal hypotheses. This is the set from which the machine learning algorithm would determine the best possible hypothesis (only one), which would best describe the target function or the outputs.

Hypothesis (h):

A hypothesis is a function that best describes the target in supervised machine learning. The
hypothesis that an algorithm would come up with depends upon the data and also upon
the restrictions and bias that we have imposed on the data. To better understand the Hypothesis
Space and Hypothesis consider the following coordinate that shows the distribution of some
data:

Figure 5.2(a). Hypothesis – Example

Say suppose we have test data for which we have to determine the outputs or results. The test
data is as shown below:

Figure 5.2(b). Hypothesis – Example

We can predict the outcomes by dividing the coordinate as shown below:

Figure 5.2(c). Hypothesis – Example



The way in which the coordinate would be divided depends on the data, algorithm and constraints.
All these legal possible ways in which we can divide the coordinate plane to predict the outcome
of the test data composes of the Hypothesis Space.

Each individual possible way is known as the hypothesis.

Hence, in this example the hypothesis space would be like:

Figure 5.2(d). Hypothesis – Example

5.4 Paths through the hypothesis space

Consider the following example,

EnjoySport – the hypothesis representation is constructed as follows:

• Each hypothesis consists of a conjunction of constraints on the instance attributes.

• Each hypothesis will be a vector of six constraints, specifying the values of the six attributes

– (Sky, AirTemp, Humidity, Wind, Water, and Forecast).

• Each attribute will be:

? - indicating any value is acceptable for the attribute (don’t care)

single value – specifying a single required value (ex. Warm) (specific)

0 - indicating no value is acceptable for the attribute (no value)



A hypothesis:

Sky AirTemp Humidity Wind Water Forecast

< Sunny, ? , ? , Strong , ? , Same >

• The most general hypothesis – that every day is a positive example

<?, ?, ?, ?, ?, ?>

• The most specific hypothesis – that no day is a positive example

<0, 0, 0, 0, 0, 0>

• The EnjoySport concept learning task requires learning the set of days for which EnjoySport=yes,
describing this set by a conjunction of constraints over the instance attributes.

EnjoySport Concept Learning Task

Given

– Instances X : set of all possible days, each described by the attributes

• Sky – (values: Sunny, Cloudy, Rainy)

• AirTemp – (values: Warm, Cold)

• Humidity – (values: Normal, High)

• Wind – (values: Strong, Weak)

• Water – (values: Warm, Cold)

• Forecast – (values: Same, Change)

– Target Concept (Function) c : EnjoySport : X -> {0,1}

– Hypotheses H : Each hypothesis is described by a conjunction of constraints on the attributes.

– Training Examples D : positive and negative examples of the target function Determine

– A hypothesis h in H such that h(x) = c(x) for all x in D.



General-to-Specific Ordering of Hypotheses

Many algorithms for concept learning organize the search through the hypothesis space by
relying on a general-to-specific ordering of hypotheses. By taking advantage of this naturally
occurring structure over the hypothesis space, we can design learning algorithms that
exhaustively search even infinite hypothesis spaces without explicitly enumerating every
hypothesis.

• Consider two hypotheses

h1 = (Sunny, ?, ?, Strong, ?, ?)

h2 = (Sunny, ?, ?, ?, ?, ?)

• Now consider the sets of instances that are classified positive by h1 and by h2.

– Because h2 imposes fewer constraints on the instance, it classifies more instances as positive.

– In fact, any instance classified positive by h1 will also be classified positive by h2.

– Therefore, we say that h2 is more general than h1.

More-General-Than Relation

For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1.

More-General-Than-Or-Equal Relation:

Let h1 and h2 be two boolean-valued functions defined over X.

Then h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2)

if and only if any instance that satisfies h2 also satisfies h1.

h1 is more-general-than h2 (h1 > h2) if and only if h1 ≥ h2 is true and h2 ≥ h1 is false. We also
say h2 is more-specific-than h1.
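
A small plain-Python sketch of the satisfies and more-general-than-or-equal relations for conjunctive hypotheses in this representation ('?' accepts any value, '0' accepts none):

# Plain-Python sketch of conjunctive hypotheses over the six EnjoySport
# attributes: '?' accepts any value, '0' accepts no value, anything else
# must match exactly.
def satisfies(h, x):
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

def more_general_or_equal(h1, h2):
    # h1 >= h2 if every constraint of h1 is at least as permissive as h2's
    def covers(c1, c2):
        return c1 == '?' or (c1 == c2 and c1 != '0') or c2 == '0'
    return all(covers(c1, c2) for c1, c2 in zip(h1, h2))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
x  = ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')

print(satisfies(h2, x))                 # True
print(more_general_or_equal(h2, h1))    # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))    # False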

5.5 Learnability

The downside of a more expressive concept language is that it may be harder to learn. The field
of computational learning theory studies exactly this question of learnability. To kick things off
we need a learning model: a clear statement of what we mean if we say that a concept language
is learnable. One of the most common learning models is the model of probably approximately
correct (PAC) learning. PAC-learnability means that there exists a learning algorithm that gets
it mostly right, most of the time. The model makes an allowance for mistakes on non-typical
examples: hence the ‘mostly right’ or ‘approximately correct’. The model also makes an
allowance for sometimes getting it completely wrong.

The only realistic expectation of a good learner is that with high probability it will learn a close approximation to the target concept. In Probably Approximately Correct (PAC) learning, one requires that, given small parameters ε and δ, with probability at least 1 - δ a learner produces a hypothesis with error at most ε. The only reason we can hope for this is the consistent distribution assumption: the training and test examples are drawn from the same distribution.

Consider a concept class C defined over an instance space X (containing instances of length n), and a learner L using a hypothesis space H. The concept class C is PAC learnable by L using H if, for all f ∈ C, for all distributions D over X, and for fixed 0 < ε, δ < 1, given m examples sampled independently according to D, the algorithm L produces, with probability at least (1 - δ), a hypothesis h ∈ H that has error at most ε, where m is polynomial in 1/ε, 1/δ, n and size(H).

We impose two limitations:

• Polynomial sample complexity (information-theoretic constraint)

– Is there enough information in the sample to distinguish a hypothesis h that approximates f?

• Polynomial time complexity (computational complexity)

– Is there an efficient algorithm that can process the sample and produce a good hypothesis h?

Worst-case definition: the algorithm must meet its accuracy

– for every distribution (the distribution-free assumption)

– for every target function f in the class C

5.6 Summary

Concept learning is acquiring the definition of a general category from given sample positive
and negative training examples of the category. A hypothesis is a function that best describes
the target in supervised machine learning. The field of computational learning theory studies
exactly this question of learnability. One of the most common learning models is the model of
probably approximately correct (PAC) learning.

5.7 Keywords

Concept Learning, Hypothesis, Hypothesis Space, Probably Approximately Correct(PAC)

5.8 Model Questions

1. Elaborate on Hypothesis Space.

2. Write in detail about concept learning.



LESSON – 6

TREE MODELS
Structure

6.1 Introduction

6.2 Learning Objectives

6.3 Tree Models

6.4 Decision Trees

6.5 Ranking and probability estimation trees

6.6 Tree learning as variance reduction

6.7 Summary

6.8 Keywords

6.9 Model Questions

6.1 Introduction

Tree models are among the most popular models in machine learning. Trees are expressive
and easy to understand, and of particular appeal to computer scientists due to their recursive
‘divide-and-conquer’ nature. A feature tree is a tree such that each internal node (the nodes that
are not leaves) is labelled with a feature, and each edge emanating from an internal node is
labelled with a literal. The set of literals at a node is called a split. Each leaf of the tree represents
a logical expression, which is the conjunction of literals encountered on the path from the root
of the tree to the leaf. The extension of that conjunction (the set of instances covered by it) is
called the instance space segment associated with the leaf. Let us discuss in detail in this
lesson.

6.2 Learning Objectives

 To learn tree induction model

 To understand decision tree model

 To explore ranking and probability estimation in trees

6.3 Tree models

Essentially, a feature tree is a compact way of representing a number of conjunctive concepts


in the hypothesis space. The learning problem is then to decide which of the possible concepts
will be best to solve the given task. It assumes that the following three functions are defined:

Homogeneous(D) returns true if the instances in D are homogeneous enough to be labelled


with a single label, and false otherwise;

Label(D) returns the most appropriate label for a set of instances D;

BestSplit(D,F) returns the best set of literals to be put at the root of the tree.

The generalized algorithm is presented as,

Algorithm:6.1

GrowTree(D,F) – grow a feature tree from training data.

Input : data D; set of features F.

Output : feature tree T with labelled leaves.

Step 1. if Homogeneous(D) then return Label(D);

Step 2. S ← BestSplit(D,F);

Step 3. split D into subsets Di according to the literals in S;

Step 4. for each i do

Step 5. if Di ≠ Ø then Ti ← GrowTree(Di, F) else Ti is a leaf labelled with Label(D);

Step 6. end

Step 7. return a tree whose root is labelled with S and whose children are Ti

The above Algorithm is a divide-and-conquer algorithm: it divides the data into subsets, builds a
tree for each of those and then combines those subtrees into a single tree. Divide-and-conquer
algorithms are a tried-and-tested technique in computer science. They are usually implemented
recursively, because each subproblem (to build a tree for a subset of the data) is of the same
form as the original problem.

This works as long as there is a way to stop the recursion, which is what the first line of the
algorithm does. However, it should be noted that such algorithms are greedy: whenever there is
a choice (such as choosing the best split), the best alternative is selected on the basis of the
information then available, and this choice is never reconsidered. This may lead to sub-optimal
choices. An alternative would be to use a backtracking search algorithm which can return an
optimal solution, at the expense of increased computation time and memory requirements.
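
To make the recursive structure concrete, here is a hedged, runnable Python sketch of the GrowTree recursion; the Homogeneous, Label and BestSplit helpers are deliberately simplistic stand-ins (majority labelling and a fixed choice of feature), not the definitions a real learner would use:

# A runnable sketch of the GrowTree divide-and-conquer recursion from
# Algorithm 6.1. A node is "homogeneous" when all labels agree, Label returns
# the majority label, and BestSplit simply partitions on one chosen feature.
from collections import Counter

def homogeneous(data):
    return len(set(y for _, y in data)) <= 1

def label(data):
    return Counter(y for _, y in data).most_common(1)[0][0]

def best_split(data, features):
    # Stand-in: a real learner would pick the feature with the best
    # impurity measure (e.g. information gain); here we take the first one.
    return features[0]

def grow_tree(data, features):
    if homogeneous(data) or not features:
        return label(data)                      # leaf labelled with Label(D)
    f = best_split(data, features)
    remaining = [g for g in features if g != f]
    children = {}
    for v in set(x[f] for x, _ in data):        # one branch per value of f
        subset = [(x, y) for x, y in data if x[f] == v]
        children[v] = grow_tree(subset, remaining)
    return (f, children)                        # internal node labelled with the split

# Tiny made-up example: each instance is a dict of feature values plus a label.
data = [({"Outlook": "Sunny", "Windy": "No"},  "Play"),
        ({"Outlook": "Sunny", "Windy": "Yes"}, "Stay"),
        ({"Outlook": "Rain",  "Windy": "No"},  "Stay")]
print(grow_tree(data, ["Outlook", "Windy"]))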

6.4 Decision Trees

Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome. In a Decision tree, there are two
nodes, which are the Decision Node and Leaf Node. Decision nodes are used to make any
decision and have multiple branches, whereas Leaf nodes are the output of those decisions
and do not contain any further branches.

The decisions or the test are performed on the basis of features of the given dataset. It is a
graphical representation for getting all the possible solutions to a problem/decision based on
given conditions. It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure. In order to build
a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm. A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.

Below diagram (Figure 6.1) explains the general structure of a decision tree:

Figure 6.1 Decision Tree

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.

Below are the two reasons for using a decision tree:

o Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.

o The logic behind a decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.



Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next
node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches the leaf node of the tree. The complete
process can be better understood using the below Algorithm 6.2 :

Algorithm 6.2

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).

Step-3: Divide the S into subsets that contains possible values for the best attributes.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify the
nodes; the final node is then called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels.

The next decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider
the below figure 6.2:

Figure 6.2 Decision Tree – Example

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for
the root node and for sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. By this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain

o Gini Index

Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a


dataset based on an attribute.

o It calculates how much information a feature provides us about a class.

o According to the value of information gain, we split the node and build the decision tree.

o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

To understand this better let’s consider an example:

Suppose our entire population has a total of 30 instances. The dataset is to predict whether the
person will go to the gym or not. Let’s say 16 people go to the gym and 14 people don’t.

Now we have two features to predict whether he/she will go to the gym or not.

Feature 1 is “Energy” which takes two values “high” and “low”

Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly motivated”.

Let’s see how our decision tree will be made using these 2 features. We’ll use information gain
to decide which feature should be the root node and which feature should be placed after the
split.

Figure 6.3 Information Gain – Example

Let's calculate the entropy of the parent node (refer to Figure 6.3):

E(Parent) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.99

To get E(Parent|Energy), we take the weighted average of the entropy of each child node, weighting each child's entropy by the fraction of instances that reach it. Now that we have the values of E(Parent) and E(Parent|Energy), the information gain is:

Information Gain = E(Parent) - E(Parent|Energy)

Our parent entropy was near 0.99, and looking at this value of information gain, we can say that the entropy of the dataset will decrease by about 0.37 if we make "Energy" our root node.

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:

Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)

Where,

o S= Total number of samples

o P(yes)= probability of yes

o P(no)= probability of no

Now that we know what entropy is and what its formula is, we need to know how exactly it
works in this algorithm. Entropy basically measures the impurity of a node. Impurity is the
degree of randomness; it tells how random our data is. A pure sub-split means that either you
should be getting “yes”, or you should be getting “no”. Suppose a feature has 8 “yes” and 4 “no”
initially, after the first split the left node gets 5 ‘yes’ and 2 ‘no’ whereas right node gets 3 ‘yes’ and
2 ‘no’. We see here the split is not pure, why? Because we can still see some negative classes
in both the nodes. In order to make a decision tree, we need to calculate the impurity of each
split, and when the purity is 100%, we make it as a leaf node.

To check the impurity of feature 2 (Figure 6.4) and feature 3, we will take the help of the entropy formula.

Figure 6.4 Feature -2 Entropy calculation

For feature 3, the entropy of each node is calculated in the same way.

We can clearly see from the tree itself that the left node has lower entropy, or more purity, than the right node, since the left node has a greater number of "yes" and it is easy to decide here. Always remember: the higher the entropy, the lower the purity and the higher the impurity.

Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.

o An attribute with the low Gini index should be preferred as compared to the high Gini
index.

o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.

o Gini index can be calculated using the below formula:

Gini Index = 1 - Σj Pj²
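
To make the two attribute selection measures concrete, here is a small plain-Python sketch computing entropy, information gain and the Gini index from class counts; the per-value counts for the "Energy" feature are hypothetical (only the parent counts of 16 "yes" and 14 "no" come from the example above):

# Plain-Python sketch of the two attribute selection measures discussed above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def information_gain(parent_counts, children_counts):
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [16, 14]                     # 16 go to the gym, 14 do not
energy_children = [[12, 3], [4, 11]]  # hypothetical class counts for Energy = high / low

print(round(entropy(parent), 3))      # about 0.997, i.e. near 0.99 as in the text
print(round(information_gain(parent, energy_children), 3))
print(round(gini(parent), 3))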

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree. A too-large tree increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as pruning. There are mainly two types of
tree pruning techniques used:

o Cost Complexity Pruning

o Reduced Error Pruning.
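
Cost complexity pruning is exposed in common libraries; the hedged sketch below assumes scikit-learn is installed, where the pruning strength is controlled by the ccp_alpha parameter of DecisionTreeClassifier:

# A hedged sketch of cost complexity pruning with scikit-learn's
# DecisionTreeClassifier: larger ccp_alpha values prune more aggressively,
# producing smaller trees.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for alpha in [0.0, 0.01, 0.05]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha}: {tree.get_n_leaves()} leaves")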

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.

o It can be very useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.

o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.

o For more class labels, the computational complexity of the decision tree may increase.

6.5 Ranking and probability estimation trees

Probability estimation trees (PETs) generalize classification trees in that they assign class
probability distributions instead of class labels to examples that are to be classified. It has
further been shown that the use of probability correction improves the performance of PETs.
Tree induction is one of the most effective and widely used methods for building classification
models. However, many applications require cases to be ranked by the probability of class
membership. Probability estimation trees (PETs) have the same attractive features as
classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on
large data sets).

Probability Tree Diagrams

Here is a tree diagram (Figure 6.5) for the toss of a coin:



There are two “branches” (Heads and Tails)

The probability of each branch is written on the branch

The outcome is written at the end of the branch

Figure 6.5(a)Probability in Trees

We can extend the tree diagram to two tosses of a coin:

Figure 6.5(b)Probability in Trees

How do we calculate the overall probabilities?

We multiply probabilities along the branches

We add probabilities down columns



Figure 6.5(c)Probability in Trees

Now we can see such things as:

The probability of “Head, Head” is 0.5×0.5 = 0.25

All probabilities add to 1.0 (which is always a good check). The probability of getting at least
one Head from two tosses is 0.25 + 0.25 + 0.25 = 0.75.
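
The same "multiply along the branches, add down the columns" arithmetic can be written out directly:

# Multiply probabilities along branches, then add the outcomes of interest.
p_head, p_tail = 0.5, 0.5

p_hh = p_head * p_head        # 0.25
p_ht = p_head * p_tail        # 0.25
p_th = p_tail * p_head        # 0.25
p_tt = p_tail * p_tail        # 0.25

print(p_hh + p_ht + p_th + p_tt)   # all branches add to 1.0
print(p_hh + p_ht + p_th)          # at least one Head: 0.75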

6.6 Tree learning as variance reduction

Reduction in Variance is a method for splitting a node that is used when the target variable is continuous, i.e., for regression problems. It is so called because it uses the variance as the measure for deciding which feature a node is split on.

Variance is used for calculating the homogeneity of a node: if a node is entirely homogeneous, then its variance is zero. Here are the steps to split a decision tree using reduction in variance (a small code sketch follows the list):

1. For each split, individually calculate the variance of each child node

2. Calculate the variance of each split as the weighted average variance of child nodes

3. Select the split with the lowest variance

4. Perform steps 1-3 until completely homogeneous nodes are achieved
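
Here is the small plain-Python sketch referred to above; the target values are made up, and the split with the lower weighted child variance would be preferred:

# Reduction in variance for one candidate split: compute each child's
# variance, take the weighted average, and compare with other splits.
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_child_variance(children):
    total = sum(len(c) for c in children)
    return sum(len(c) / total * variance(c) for c in children)

# Target values of the training examples reaching each child node, for two splits:
split_a = [[10, 12, 11], [40, 42, 39, 41]]
split_b = [[10, 40, 11, 42], [12, 39, 41]]

print(weighted_child_variance(split_a))   # low: children are homogeneous
print(weighted_child_variance(split_b))   # high: children are mixed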

6.7 Summary

Tree models are among the most popular models in machine learning. Trees are expressive
and easy to understand, and of particular appeal to computer scientists due to their recursive
‘divide-and-conquer’ nature. Decision Tree is a Supervised learning technique that can be used
for both classification and Regression problems, but mostly it is preferred for solving
Classification problems. There are two popular techniques for ASM, which are Information Gain
and Gini Index. Pruning is a process of deleting the unnecessary nodes from a tree in order to
get the optimal decision tree. Probability estimation trees (PETs) generalize classification trees
in that they assign class probability distributions instead of class labels to examples that are to
be classified. Reduction in Variance is a method for splitting the node used when the target
variable is continuous, i.e., regression problems.

6.8 Keywords

Tree Models, Decision Trees, Ranking, Probability Estimation Trees, Variance Reduction, Attribute
selection measure (ASM)

6.9 Model Questions

1. Write a generalized tree model algorithm.

2. Elaborate on Decision tree with a suitable diagram.

3. Write notes on Attribute selection measure (ASM) in decision tree.

4. Explain Probability estimation trees (PETs).



LESSON – 7

RULE MODELS
Structure

7.1 Introduction

7.2 Learning Objectives

7.3 Learning Ordered rule lists

7.4 Learning unordered rule sets

7.5 First order rule learning

7.6 Summary

7.7 Keywords

7.8 Model Questions

7.1 Introduction

Rule models are the second major type of logical machine learning models. Generally speaking,
they offer more flexibility than tree models: for instance, while decision tree branches are mutually
exclusive, the potential overlap of rules may give additional information. This flexibility comes at
a price, however: while it is very tempting to view a rule as a single, independent piece of
information, this is often not adequate because of the way the rules are learned. Particularly in
supervised learning, a rule model is more than just a set of rules: the specification of how the
rules are to be combined to form predictions is a crucial part of the model.

There are essentially two approaches to supervised rule learning. One is inspired by decision
tree learning: find a combination of literals – the body of the rule, which is what we previously
called a concept – that covers a sufficiently homogeneous set of examples, and find a label to
put in the head of the rule. The second approach goes in the opposite direction: first select a
class you want to learn, and then find rule bodies that cover (large subsets of) the examples of
that class. The first approach naturally leads to a model consisting of an ordered sequence of
rules – a rule list.

7.2 Learning Objectives

 To learn rule models

 To understand rule lists and rule sets

 To understand rule learning

7.3 Learning ordered rule lists

The key idea of this kind of rule learning algorithm is to keep growing a conjunctive rule body by
adding the literal that most improves its homogeneity. A decision rule is a simple IF-THEN
statement consisting of a condition (also called antecedent) and a prediction. For example: IF it
rains today AND if it is April (condition), THEN it will rain tomorrow (prediction). A single decision
rule or a combination of several rules can be used to make predictions.

Decision rules follow a general structure: IF the conditions are met THEN make a certain
prediction. Decision rules are probably the most interpretable prediction models. Their IF-THEN
structure semantically resembles natural language and the way we think, provided that the
condition is built from intelligible features, the length of the condition is short (small number of
features=value pairs combined with an AND) and there are not too many rules. In programming,
it is very natural to write IF-THEN rules. New in machine learning is that the decision rules are
learned through an algorithm.

Imagine using an algorithm to learn decision rules for predicting the value of a house (low,
medium or high). One decision rule learned by this model could be: If a house is bigger than
100 square meters and has a garden, then its value is high. More formally: IF size>100 AND
garden=1 THEN value=high.

Let us break down the decision rule:

size>100 is the first condition in the IF-part.

garden=1 is the second condition in the IF-part.

The two conditions are connected with an ‘AND’ to create a new condition. Both must
be true for the rule to apply.

The predicted outcome (THEN-part) is value=high.

A decision rule uses at least one feature=value statement in the condition, with no upper limit on
how many more can be added with an ‘AND’. An exception is the default rule that has no explicit
IF-part and that applies when no other rule applies, but more about this later. The usefulness of
a decision rule is usually summarized in two numbers: Support and accuracy.

Support or coverage of a rule: The percentage of instances to which the condition of a rule
applies is called the support. Take for example the rule size=big AND location=good THEN
value=high for predicting house values. Suppose 100 of 1000 houses are big and in a good
location, then the support of the rule is 10%. The prediction (THEN-part) is not important for the
calculation of support.

Accuracy or confidence of a rule: The accuracy of a rule is a measure of how accurate the
rule is in predicting the correct class for the instances to which the condition of the rule applies.
For example: Let us say of the 100 houses, where the rule size=big AND location=good THEN
value=high applies, 85 have value=high, 14 have value=medium and 1 has value=low, then the
accuracy of the rule is 85%.
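
Support and accuracy can be computed directly from data; the following plain-Python sketch uses a small, made-up list of houses purely to show the mechanics:

# Support and accuracy of the rule "IF size=big AND location=good THEN value=high",
# computed over a small, made-up list of houses.
houses = [
    {"size": "big",   "location": "good", "value": "high"},
    {"size": "big",   "location": "good", "value": "medium"},
    {"size": "big",   "location": "bad",  "value": "medium"},
    {"size": "small", "location": "good", "value": "low"},
]

def condition(house):
    return house["size"] == "big" and house["location"] == "good"

covered = [h for h in houses if condition(h)]
support = len(covered) / len(houses)
accuracy = sum(1 for h in covered if h["value"] == "high") / len(covered)

print(support, accuracy)   # 0.5 0.5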

Usually there is a trade-off between accuracy and support: By adding more features to the
condition, we can achieve higher accuracy, but lose support. To create a good classifier for
predicting the value of a house you might need to learn not only one rule, but maybe 10 or 20.
Then things can get more complicated and you can run into one of the following problems:

Rules can overlap: What if I want to predict the value of a house and two or more rules apply
and they give me contradictory predictions?

No rule applies: What if I want to predict the value of a house and none of the rules apply?

There are two main strategies for combining multiple rules: Decision lists (ordered) and decision
sets (unordered). Both strategies imply different solutions to the problem of overlapping rules.

A decision list introduces an order to the decision rules. If the condition of the first rule is true for
an instance, we use the prediction of the first rule. If not, we go to the next rule and check if it
applies and so on. Decision lists solve the problem of overlapping rules by only returning the
prediction of the first rule in the list that applies. A decision set resembles a democracy of the

rules, except that some rules might have a higher voting power. In a set, the rules are either
mutually exclusive, or there is a strategy for resolving conflicts, such as majority voting, which
may be weighted by the individual rule accuracies or other quality measures. Interpretability
suffers potentially when several rules apply.

Both decision lists and sets can suffer from the problem that no rule applies to an instance.
This can be resolved by introducing a default rule. The default rule is the rule that applies when
no other rule applies. The prediction of the default rule is often the most frequent class of the
data points which are not covered by other rules. If a set or list of rules covers the entire feature
space, we call it exhaustive. By adding a default rule, a set or list automatically becomes
exhaustive.

The algorithms are chosen to cover a wide range of general ideas for learning rules, so all three
of them represent very different approaches.

1. OneR learns rules from a single feature. OneR is characterized by its simplicity, interpretability
and its use as a benchmark.

2. Sequential covering is a general procedure that iteratively learns rules and removes the
data points that are covered by the new rule. This procedure is used by many rule learning
algorithms.

3. Bayesian Rule Lists combine pre-mined frequent patterns into a decision list using Bayesian
statistics. Using pre-mined patterns is a common approach used by many rule learning
algorithms

Learn Rules from a Single Feature (OneR)

The OneR algorithm suggested by Holte (1993) is one of the simplest rule induction
algorithms. From all the features, OneR selects the one that carries the most information
about the outcome of interest and creates decision rules from this feature.

Despite the name OneR, which stands for “One Rule”, the algorithm generates more than
one rule: It is actually one rule per unique feature value of the selected best feature. A better
name would be OneFeatureRules.

The algorithm is simple and fast:

1. Discretize the continuous features by choosing appropriate intervals.

2. For each feature:

o Create a cross table between the feature values and the (categorical)
outcome.

o For each value of the feature, create a rule which predicts the most frequent
class of the instances that have this particular feature value (can be read
from the cross table).

o Calculate the total error of the rules for the feature.

3. Select the feature with the smallest total error.
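
A compact, hedged Python sketch of the OneR procedure on a tiny made-up dataset (the discretisation step is skipped because the features are already categorical):

# A plain-Python sketch of OneR: for each feature, predict the most frequent
# class per feature value, count the errors, and keep the feature with the
# fewest total errors. The toy data are made up for illustration.
from collections import Counter, defaultdict

data = [({"Outlook": "Sunny", "Windy": "No"},  "Play"),
        ({"Outlook": "Sunny", "Windy": "Yes"}, "Stay"),
        ({"Outlook": "Rain",  "Windy": "No"},  "Stay"),
        ({"Outlook": "Rain",  "Windy": "Yes"}, "Stay")]

def one_r(data, features):
    best_feature, best_rules, best_errors = None, None, None
    for f in features:
        per_value = defaultdict(Counter)
        for x, y in data:
            per_value[x[f]][y] += 1            # cross table: value -> class counts
        rules = {v: counts.most_common(1)[0][0] for v, counts in per_value.items()}
        errors = sum(1 for x, y in data if rules[x[f]] != y)
        if best_errors is None or errors < best_errors:
            best_feature, best_rules, best_errors = f, rules, errors
    return best_feature, best_rules, best_errors

print(one_r(data, ["Outlook", "Windy"]))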

Sequential Covering

Sequential covering is a general procedure that repeatedly learns a single rule to create a
decision list (or set) that covers the entire dataset rule by rule. Many rule-learning algorithms
are variants of the sequential covering algorithm. This section introduces the main recipe
and uses RIPPER, a variant of the sequential covering algorithm, for the examples. The
idea is simple: First, find a good rule that applies to some of the data points. Remove all
data points which are covered by the rule.

A data point is covered when the conditions apply, regardless of whether the points are
classified correctly or not. Repeat the rule-learning and removal of covered points with the
remaining points until no more points are left or another stop condition is met. The result is
a decision list.

This approach of repeated rule-learning and removal of covered data points is called
“separate-and-conquer”. Suppose we already have an algorithm that can create a single
rule that covers part of the data. The sequential covering algorithm for two classes (one
positive, one negative) works like this:

1. Start with an empty list of rules (rlist).

2. Learn a rule r.

3. While the list of rules is below a certain quality threshold (or positive examples
are not yet covered):

Add rule r to rlist.

Remove all data points covered by rule r.

Learn another rule on the remaining data.

4. Return the decision list.
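
The separate-and-conquer loop itself is short. The sketch below is a hedged illustration in plain Python; learn_one_rule is a deliberately crude stand-in (it proposes the single feature=value condition whose covered examples are most often positive), not the rule learner an algorithm such as RIPPER would use:

# A skeleton of sequential covering (separate-and-conquer) for two classes
# (1 = positive, 0 = negative), with a crude stand-in rule learner.
def learn_one_rule(data):
    best = None
    for x, y in data:
        for feature, value in x.items():
            covered = [(x2, y2) for x2, y2 in data if x2[feature] == value]
            purity = sum(1 for _, y2 in covered if y2 == 1) / len(covered)
            if best is None or purity > best[0]:
                best = (purity, feature, value)
    _, feature, value = best
    return feature, value

def sequential_covering(data):
    rule_list = []
    remaining = list(data)
    while any(y == 1 for _, y in remaining):       # positives still uncovered
        feature, value = learn_one_rule(remaining)
        rule_list.append((feature, value, 1))      # IF feature=value THEN positive
        remaining = [(x, y) for x, y in remaining if x[feature] != value]
    return rule_list

data = [({"shape": "round", "colour": "red"},   1),
        ({"shape": "round", "colour": "green"}, 1),
        ({"shape": "long",  "colour": "green"}, 0)]
print(sequential_covering(data))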

Bayesian Rule Lists

The idea is to pre-mine frequent patterns from the data that can be used as conditions for the decision rules, and then to learn a decision list from a selection of the pre-mined rules. A specific approach using this
recipe is called Bayesian Rule Lists or BRL for short. BRL uses Bayesian statistics to learn
decision lists from frequent patterns which are pre-mined with the FP-tree algorithm.

Learning Bayesian Rule Lists

The goal of the BRL algorithm is to learn an accurate decision list using a selection of the
pre-mined conditions, while prioritizing lists with few rules and short conditions. BRL
addresses this goal by defining a distribution of decision lists with prior distributions for the
length of conditions (preferably shorter rules) and the number of rules (preferably a shorter
list).

The posterior probability distribution over lists makes it possible to say how likely a decision list is, given the assumptions of shortness and how well the list fits the data. Our goal is to find the list that maximizes this posterior probability.

Since it is not possible to find the exact best list directly from the distributions of lists, BRL
suggests the following recipe:

1) Generate an initial decision list, which is randomly drawn from the prior distribution.

2) Iteratively modify the list by adding, switching or removing rules, ensuring that the resulting
lists follow the posterior distribution of lists.

3) Select the decision list from the sampled lists with the highest probability according to the posterior distribution.

7.4 First order rule learning

One of the most expressive and human readable representations for learned hypotheses
is sets of production rules (if-then rules). Rules can be derived from other representations
(e.g., decision trees) or they can be learned directly. Here, we are concentrating on the
direct method. An important aspect of direct rule-learning algorithms is that they can learn
sets of first-order rules which have much more representational power than the propositional
rules that can be derived from decision trees. Propositional Logic does not include variables
and thus cannot express general relations among the values of the attributes.

Example 1: in Propositional logic, you can write:

IF (Father1 = Bob) ^ (Name2 = Bob) ^ (Female1 = True) THEN Daughter1,2 = True.

This rule applies only to a specific family!

Example 2: In First-Order logic, you can write:

IF Father(y,x) ^ Female(y), THEN Daughter(x,y)

This rule (which you cannot write in Propositional Logic) applies to any family!

First order logic is much more expressive than propositional logic – i.e. it allows a finer-grain of
specification and reasoning when representing knowledge. In the context of machine learning,
consider learning the relational concept daughter (x, y) defined over pairs of persons x, y, where
– persons are represented by attributes: Name, Mother, Father, Male, Female

Training examples then have the form:

person1, person 2,target attribute value



E.g. < <Name1 = Ann, Mother1 = Sue, Father1 = Bob, Male1 = F, Female1 = T>,

<Name2 = Bob, Mother2 = Gill, Father2 = Joe, Male2 = T, Female2 = F>,

Daughter1,2 = T >

A first order rule learner can then learn the rule: IF Father(y, x) ^ Female(y) THEN Daughter(x, y).

7.6 Summary

Rule models are the second major type of logical machine learning models. The key idea of this
kind of rule learning algorithm is to keep growing a conjunctive rule body by adding the literal
that most improves its homogeneity. Decision lists solve the problem of overlapping rules by
only returning the prediction of the first rule in the list that applies. Sequential covering is a
general procedure that repeatedly learns a single rule to create a decision list (or set) that
covers the entire dataset rule by rule. A specific approach using this recipe is called Bayesian
Rule Lists or BRL for short. BRL uses Bayesian statistics to learn decision lists from frequent
patterns which are pre-mined with the FP-tree algorithm.

7.7 Keywords

Rule models, Decision rules, learning models, Support, Accuracy, Decision lists, OneR

7.8 Model Questions

1. Explain in detail rule models.

2. Write notes on

a) Bayesian rule lists

b) First order rule learning

3. How are ordered rule lists learned?



LESSON – 8

LINEAR MODELS
Structure

8.1 Introduction

8.2 Learning Objectives

8.3 Linear Models – Types

8.4 Least square methods

8.5 The perceptron

8.6 Support Vector Machines

8.7 Summary

8.8 Keywords

8.9 Model Questions

8.1 Introduction

Linearity plays a fundamental role in mathematics and related disciplines, and the mathematics
of linear models is well-understood. In machine learning, linear models are of particular interest
because of their simplicity. Here are a couple of manifestations of this simplicity. Linear models
are parametric, meaning that they have a fixed form with a small number of numeric parameters
that need to be learned from data. This is different from tree or rule models, where the structure
of the model (e.g., which features to use in the tree, and where) is not fixed in advance. Linear
models are stable, which is to say that small variations in the training data have only limited
impact on the learned model. A linear classifier, in particular, assumes that the data is, at least approximately, linearly separable.

Tree models tend to vary more with the training data, as the choice of a different split at the root
of the tree typically means that the rest of the tree is different as well. Linear models are less
likely to overfit the training data than some other models, largely because they have relatively
few parameters. The flipside of this is that they sometimes lead to underfitting: e.g., imagine

you are learning where the border runs between two countries from labelled samples, then a
linear model is unlikely to give a good approximation.

The last two points can be summarised by saying that linear models have low variance but high
bias. Such models are often preferable when you have limited data and want to avoid overfitting.
High variance–low bias models such as decision trees are preferable if data is abundant but
underfitting is a concern.

It is usually a good idea to start with simple, high-bias models such as linear models and only
move on to more elaborate models if the simpler ones appear to be underfitting. Linear models
exist for all predictive tasks, including classification, probability estimation and regression. Linear
regression, in particular, is a well-studied problem that can be solved by the least-squares
method. In the field of machine learning, the goal of statistical classification is to use an object’s
characteristics to identify which class (or group) it belongs to. A linear classifier achieves this
by making a classification decision based on the value of a linear combination of the
characteristics.

8.2 Learning Objectives

 To learn Linear Models and its types

 To study Least square methods

 To comprehend the perceptron algorithm

 To explore Support Vector Machines

8.3 Linear Models – Types

We’ll explore two types of linear models: Linear regression, which is used for regression
(numerical predictions), and Logistic regression, which is used for classification (categorical
predictions).

Linear regression

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. In other words, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model provides
a sloped straight line representing the relationship between the variables. Consider the below
Figure 8.1,

Figure 8.1 Linear Regression Model

Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Y= Dependent Variable (Target Variable)

X= Independent Variable (predictor Variable)

a0= intercept of the line (Gives an additional degree of freedom)

a1 = Linear regression coefficient (scale factor to each input value).

ε = random error

The values for x and y variables are training datasets for Linear Regression model representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

Simple Linear Regression:

If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.

Multiple Linear regression:

If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:

Positive Linear Relationship:

If the dependent variable increases on the Y-axis and independent variable increases on X-axis,
then such a relationship is termed as a Positive linear relationship. It is depicted as below,

Figure 8.2 Positive Linear



Negative Linear Relationship:

If the dependent variable decreases on the Y-axis and independent variable increases on the X-
axis, then such a relationship is called a negative linear relationship. It is depicted as below,

Figure 8.3 Negative Linear

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line, which means the error between the predicted values and the actual values should be minimized. The best fit line will have the least error. Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.

Cost function

o The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best fit line.

o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.

o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²   (the sum runs over i = 1, ..., N)

Where,

N = Total number of observations

Yi = Actual value

(a1xi + a0) = Predicted value.

Gradient Descent:

Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.

A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function. This is done by a random selection of coefficient values, which are then iteratively updated to reach the minimum of the cost function.
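As a small illustration of these two ideas together, the sketch below minimizes the MSE cost function of a simple linear regression by gradient descent; the data points, learning rate and iteration count are hypothetical choices.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # roughly y = 2x

a0, a1 = 0.0, 0.0                            # intercept and slope, initialised at zero
lr = 0.01                                    # learning rate
for _ in range(5000):
    error = (a0 + a1 * x) - y                # predicted minus actual
    grad_a0 = 2 * error.mean()               # gradient of MSE with respect to a0
    grad_a1 = 2 * (error * x).mean()         # gradient of MSE with respect to a1
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(round(a0, 3), round(a1, 3))            # converges towards the least-squares line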

The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be achieved
by the method below:

R-squared method:

o R-squared is a statistical method that determines the goodness of fit.

o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.

o A high value of R-squared indicates a small difference between the predicted values and the actual values, and hence represents a good model.

o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.

o It can be calculated from the below formula:

R-squared = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are some formal checks
while building a Linear Regression model, which ensures to get the best possible result from
the given dataset.

o Linear relationship between the features and target:

Linear regression assumes the linear relationship between the dependent and
independent variables.

o Small or no multicollinearity between the features:

Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. In other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.

o Homoscedasticity Assumption:

Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.

o Normal distribution of error terms:

Linear regression assumes that the error term should follow the normal distribution
pattern. If error terms are not normally distributed, then confidence intervals will become
either too wide or too narrow, which may cause difficulties in finding coefficients.

o No autocorrelations:

The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.

Simple Linear Regression in Machine Learning

Simple Linear Regression is a type of Regression algorithms that models the relationship
between a dependent variable and a single independent variable. The relationship shown by a
Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple
Linear Regression.

The key point in Simple Linear Regression is that the dependent variable must be a continuous/
real value. However, the independent variable can be measured on continuous or categorical
values.

Simple Linear regression algorithm has mainly two objectives:

1. Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.

2. Forecast new observations, such as weather forecasting according to temperature, or the revenue of a company according to the investments in a year.

Simple Linear Regression Model:

The Simple Linear Regression model can be represented using the below equation:

y= a0+a1x+ ε

Where,

a0 = It is the intercept of the Regression line (can be obtained by putting x = 0)

a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing.

ε = The error term. (For a good model it will be negligible)

Multiple Linear Regression

In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be
various cases in which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used. Moreover, Multiple Linear

Regression is an extension of Simple Linear regression as it takes more than one predictor
variable to predict the response variable. We can define it as: Multiple Linear Regression is one
of the important regression algorithms which models the linear relationship between a single
dependent continuous variable and more than one independent variable.

In Multiple Linear Regression, the target variable(Y) is a linear combination of multiple predictor
variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea applies, and the multiple linear regression equation becomes:

Y= b0+b1x1+ b2x2+ b3x3+...... bnxn

Where,

Y= Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model.

x1, x2, x3, x4, ... = Various independent/feature variables

Assumptions for Multiple Linear Regression:

 A linear relationship should exist between the Target and predictor variables.

 The regression residuals must be normally distributed.

 MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
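As a concrete illustration of fitting the multiple linear regression equation above, here is a short sketch using scikit-learn's LinearRegression; the feature matrix and target values are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables (x1, x2) and one response variable Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
Y = np.array([8.1, 7.0, 17.2, 16.1, 26.0])   # roughly Y = 1 + 2*x1 + 2.5*x2

model = LinearRegression().fit(X, Y)
print(model.intercept_, model.coef_)         # estimates of b0 and (b1, b2)
print(model.predict([[6, 5]]))               # prediction for a new observation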

8.4 Least square methods

The least-squares method is a form of mathematical regression analysis used to determine the line of best fit for a set of data, providing a visual demonstration of the relationship between
the data points. Each point of data represents the relationship between a known independent
variable and an unknown dependent variable. The least-squares method is a statistical procedure
to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of
points from the plotted curve. Least squares regression is used to predict the behavior of

dependent variables. The least-squares method provides the overall rationale for the placement
of the line of best fit among the data points being studied.

This method of regression analysis begins with a set of data points to be plotted on an x- and y-
axis graph. An analyst using the least-squares method will generate a line of best fit that explains
the potential relationship between independent and dependent variables. The least-squares
method provides the overall rationale for the placement of the line of best fit among the data
points being studied. The most common application of this method, which is sometimes referred
to as “linear” or “ordinary,” aims to create a straight line that minimizes the sum of the squares
of the errors that are generated by the results of the associated equations, such as the squared
residuals resulting from differences in the observed value, and the value anticipated, based on
that model.

The Line of Best Fit Equation

The line of best fit determined from the least-squares method has an equation that tells the
story of the relationship between the data points. Line of best-fit equations may be determined
by computer software models, which include a summary of outputs for analysis, where the
coefficients and summary outputs explain the dependence of the variables being tested.

Least Squares Regression Line

If the data shows a linear relationship between two variables, the line that best fits this linear
relationship is known as a least-squares regression line, which minimizes the vertical distance
from the data points to the regression line. The term “least squares” is used because the fitted line has the smallest possible sum of squared errors (residuals). In regression analysis,
dependent variables are illustrated on the vertical y-axis, while independent variables are
illustrated on the horizontal x-axis. These designations will form the equation for the line of best
fit, which is determined from the least-squares method.

In contrast to a linear problem, a non-linear least-squares problem has no closed solution and
is generally solved by iteration. The discovery of the least-squares method is attributed to Carl Friedrich Gauss, who developed it in 1795.

Example of the Least Squares Method

An example of the least-squares method is an analyst who wishes to test the relationship
between a company’s stock returns, and the returns of the index for which the stock is a
component. In this example, the analyst seeks to test the dependence of the stock returns on
the index returns. To achieve this, all of the returns are plotted on a chart. The index returns are
then designated as the independent variable, and the stock returns are the dependent variable.
The line of best fit provides the analyst with coefficients explaining the level of dependence. The
least-squares method is a mathematical technique that allows the analyst to determine the
best way of fitting a curve on top of a chart of data points.

It is widely used to make scatter plots easier to interpret and is associated with regression
analysis. These days, the least-squares method can be used as part of most statistical software
programs. The least-squares method is used in a wide variety of fields, including finance and
investing. For financial analysts, the method can help to quantify the relationship between two
or more variables—such as a stock’s share price and its earnings per share (EPS). By
performing this type of analysis, investors may attempt to forecast the future behavior of stock
prices or other factors.

To illustrate, consider the case of an investor deciding whether to invest in a gold mining
company. The investor might wish to know how sensitive the company’s stock price is to changes
in the market price of gold. To study this, the investor could use the least-squares method to
trace the relationship between those two variables over time onto a scatter plot. This analysis
could help the investor predict the degree to which the stock’s price would likely rise or fall for
any given increase or decrease in the price of gold.

Consider an example. Tom who is the owner of a retail shop, found the price of different T-shirts
vs the number of T-shirts sold at his shop over a period of one week.

He tabulated this like shown below:



Table 8.1

Let us use the concept of least squares regression to find the line of best fit for the above data.

Step 1: Calculate the slope ‘m’ by using the standard least-squares formula:

m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

where x denotes the T-shirt prices, y the numbers sold, and N the number of data points. After you substitute the respective values, m = 1.518 approximately.

Step 2: Compute the y-intercept value:

c = (Σy − m Σx) / N

After you substitute the respective values, c = 0.305 approximately.

Step 3: Substitute the values in the final equation y = mx + c.

Once you substitute the values, it should look something like this:

Table 8.2

Let’s construct a graph that represents the y = mx + c line of best fit (Figure 8.4):

Figure 8.4 Line of best fit

Now Tom can use the above equation to estimate how many T-shirts of price $8 can he sell at
the retail shop.

y = 1.518 x 8 + 0.305 = 12.45 T-shirts

This comes down to 13 T-shirts! That’s how simple it is to make predictions using Linear
Regression.
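The same kind of least-squares fit can be reproduced in a few lines of code. The price and sales numbers below are hypothetical stand-ins, since the original table is not reproduced here, so the fitted coefficients will differ from Tom's.

import numpy as np

price = np.array([2, 3, 5, 7, 9], dtype=float)    # T-shirt price (x)
sold = np.array([4, 5, 7, 10, 15], dtype=float)   # number of T-shirts sold (y)

m, c = np.polyfit(price, sold, deg=1)             # slope and intercept of y = mx + c
print(m, c)
print(m * 8 + c)                                  # estimated sales at a price of $8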

8.5 The perceptron

A linear classifier that will achieve perfect separation on linearly separable data is the perceptron,
originally proposed as a simple neural network. The perceptron iterates over the training set,

updating the weight vector every time it encounters an incorrectly classified example. For
example, let xi be a misclassified positive example, then we have yi =+1 and w·xi < t . We
therefore want to find w’ such that w’·xi >w·xi , which moves the decision boundary towards and
hopefully past xi . This can be achieved by calculating the new weight vector as w’ = w+ηxi ,
where 0 < η ≤ 1 is the learning rate.

The perceptron training algorithm is given below,

Perceptron(D, η) – train a perceptron for linear classification.

Input: labelled training data D in homogeneous coordinates; learning rate η.

Output: weight vector w defining classifier ŷ = sign(w·x).

1   w ← 0;   // other initialisations of the weight vector are possible
2   converged ← false;
3   while converged = false do
4       converged ← true;
5       for i = 1 to |D| do
6           if yi (w·xi) ≤ 0   // i.e., ŷi ≠ yi
7           then
8               w ← w + η yi xi;
9               converged ← false;   // we changed w, so we haven't converged yet
10          end
11      end
12  end

The algorithm iterates through the training examples until all examples are correctly classified.
The algorithm can easily be turned into an online algorithm that processes a stream of examples,
updating the weight vector only if the last received example is misclassified. The perceptron is
guaranteed to converge to a solution if the training data is linearly separable, but it won’t converge
otherwise.

The key point of the perceptron algorithm is that, every time an example xi is misclassified, we add yi xi to the weight vector (taking the learning rate η = 1; for other learning rates the weight vector is simply scaled by η). After training has completed, each example has been misclassified zero or more times – denote this number αi for example xi. Using this notation the weight vector can be expressed as

w = Σi αi yi xi

that is, the learned weight vector is a linear combination of the training examples.
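The following is a compact NumPy sketch of the training algorithm above, using homogeneous coordinates (a constant 1 appended to each instance); the toy dataset is hypothetical and linearly separable, so the loop terminates.

import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

Xh = np.hstack([X, np.ones((len(X), 1))])    # homogeneous coordinates
w = np.zeros(Xh.shape[1])
eta = 1.0                                    # learning rate

converged = False
while not converged:
    converged = True
    for xi, yi in zip(Xh, y):
        if yi * np.dot(w, xi) <= 0:          # misclassified (or on the boundary)
            w += eta * yi * xi               # move the boundary towards xi
            converged = False

print(w, np.sign(Xh @ w))                    # learned weights and training predictions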

8.6 Support Vector Machines

“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used
for both classification or regression challenges. However, it is mostly used in classification
problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space
(where n is a number of features you have) with the value of each feature being the value of a
particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates
the two classes very well. Support Vectors are simply the coordinates of individual observation.
The SVM classifier is a frontier that best segregates the two classes (hyper-plane/ line).

Let’s understand:

Identify the right hyper-plane (Scenario-1) in Figure 8.5: Here, we have three hyper-planes (A,
B, and C). Now, identify the right hyper-plane to classify stars and circles.

Figure 8.5 Hyper-plane



We need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane
which segregates the two classes better”. In this scenario, hyper-plane “B” has excellently
performed this job.

Identify the right hyper-plane (Scenario-2) in Figure 8.6: Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?

Figure 8.6 Hyper-plane

Here, maximizing the distances between nearest data point (either class) and hyper-plane will
help us to decide the right hyper-plane. This distance is called as Margin. Let’s look at the below
Figure 8.7:

Figure 8.7 Hyper-plane



Above, you can see that the margin for hyper-plane C is high as compared to both A and B.
Hence, we name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.

Identify the right hyper-plane (Scenario-3) in Figure 8.8, Hint: Use the rules as discussed in
previous section to identify the right hyper-plane.

Figure 8.8 Hyper-plane

But, here is the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.

Can we classify two classes (Scenario-4) in Figure 8.9? Here we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

Figure 8.9 Hyper-plane

The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum
margin. Hence, we can say, SVM classification is robust to outliers.
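In practice an SVM is rarely coded from scratch; the sketch below trains a linear SVM with scikit-learn on hypothetical two-dimensional points, with the regularisation parameter C left at a typical value.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # maximum-margin separating hyper-plane
print(clf.support_vectors_)                   # the support vectors found
print(clf.predict([[4, 4], [7, 7]]))          # classify two new points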

8.7 Summary

Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data. Two important types are linear regression, which is used for regression, and logistic regression, which is used for classification. The least-squares method
is a form of mathematical regression analysis used to determine the line of best fit for a set of
data, providing a visual demonstration of the relationship between the data points. A linear
classifier that will achieve perfect separation on linearly separable data is the perceptron, originally
proposed as a simple neural network. “Support Vector Machine” (SVM) is a supervised machine
learning algorithm that can be used for both classification or regression challenges. However, it
is mostly used in classification problems.

8.8 Keywords

Linear Models, Linear Regression, Logistic Regression, Least square methods, Support Vector Machines.

8.9 Model Questions

1. Identify the basics of choosing between linear regression and logistic regression for solving
machine learning problems.

2. Write notes on R-Squared function.

3. Elaborate on SVM with a neat diagram.

4. Discuss on Least squared method.

5. What are the assumptions of linear regression and multiple linear regression?

LESSON - 9

DISTANCE-BASED MODELS
Structure

9.1 Introduction

9.2 Learning Objectives

9.3 Neighbours and exemplars

9.4 Nearest neighbour classification

9.5 Distance based clustering

9.6 Hierarchical Clustering

9.7 Summary

9.8 Keywords

9.9 Model Questions

9.1 Introduction

Many forms of learning are based on generalizing from training data to unseen data by exploiting
the similarities between the two. With grouping models such as decision trees these similarities
take the form of an equivalence relation or partition of the instance space: two instances are
similar whenever they end up in the same segment of this partition. In this chapter we consider
learning methods that utilize more graded forms of similarity. There are many different ways in
which similarity can be measured, and in this section we take a look at the most important of
them. We discuss two key concepts in distance-based machine learning: neighbours and exemplars. We then consider what is perhaps the best-known distance-based learning method, the nearest-neighbour classifier, as well as K-means clustering and hierarchical clustering by constructing dendrograms.

9.2 Learning Objectives

 To comprehend distance-based learning algorithms

 To learn neighbours and exemplars

 To apply nearest neighbour classification

 To study distance-based clustering

 To explore more on hierarchical clustering

9.3 Neighbours and exemplars

Distance metrics are a key part of several machine learning algorithms. These distance metrics
are used in both supervised and unsupervised learning, generally to calculate the similarity
between data points. An effective distance metric improves the performance of our machine
learning model, whether that’s for classification tasks or clustering. Hence, we can calculate
the distance between points and then define the similarity between them. The four types of
Distance Metrics in Machine Learning are,

 Euclidean Distance

 Manhattan Distance

 Minkowski Distance

 Hamming Distance

Euclidean Distance

Euclidean Distance represents the shortest distance between two points. Most machine learning
algorithms including K-Means use this distance metric to measure the similarity between
observations. Let’s say we have two points, A = (p1, p2) and B = (q1, q2). Then the Euclidean Distance between these two points A and B is given by the formula:

d(A, B) = sqrt( (q1 − p1)² + (q2 − p2)² )



We use this formula when we are dealing with 2 dimensions. We can generalize it for an n-dimensional space as:

d(p, q) = sqrt( (q1 − p1)² + (q2 − p2)² + ... + (qn − pn)² )

Where,

n = number of dimensions

pi, qi = data points

Manhattan Distance

Manhattan Distance is the sum of absolute differences between points across all the dimensions.
Note that Manhattan Distance is also known as city block distance.

We can represent the Manhattan Distance between two points A = (p1, p2) and B = (q1, q2) as:

d(A, B) = |q1 − p1| + |q2 − p2|

And the generalized formula for an n-dimensional space is given as:

d(p, q) = |q1 − p1| + |q2 − p2| + ... + |qn − pn|

Where,

n = number of dimensions

pi, qi = data points

Minkowski Distance

Minkowski Distance is the generalized form of Euclidean and Manhattan Distance. The formula for Minkowski Distance is given as:

d(p, q) = ( |q1 − p1|^p + |q2 − p2|^p + ... + |qn − pn|^p )^(1/p)

Here, the exponent p represents the order of the norm: p = 1 gives the Manhattan Distance and p = 2 gives the Euclidean Distance.

Hamming Distance

Hamming Distance measures the similarity between two strings of the same length. The
Hamming Distance between two strings of the same length is the number of positions at which
the corresponding characters are different.

Let’s understand the concept using an example. Let’s say we have two strings:

“euclidean” and ”manhattan”

Since the length of these strings is equal, we can calculate the Hamming Distance. We will go
character by character and match the strings. The first character of both the strings (e and m
respectively) is different. Similarly, the second character of both the strings (u and a) is different.
and so on.

Look carefully – seven characters are different whereas two characters (the last two characters)
are similar:

Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between two strings, the more dissimilar those strings are (and vice versa).
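All four metrics are easy to compute directly, as the short sketch below shows for a pair of hypothetical points and for the two strings from the example.

import math

p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 8.0]

euclidean = math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))
manhattan = sum(abs(qi - pi) for pi, qi in zip(p, q))
minkowski = sum(abs(qi - pi) ** 3 for pi, qi in zip(p, q)) ** (1 / 3)   # order p = 3

hamming = sum(a != b for a, b in zip("euclidean", "manhattan"))

print(euclidean, manhattan, minkowski, hamming)   # the Hamming distance prints 7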

The two most important of these are: formulating the model in terms of a number of prototypical instances or exemplars, and defining the decision rule in terms of the nearest exemplars or
neighbours. The arithmetic mean minimizes squared Euclidean distance. The arithmetic mean
μ of a set of data points D in a Euclidean space is the unique point that minimizes the sum of
squared Euclidean distances to those data points. Notice that minimizing the sum of squared
Euclidean distances of a given set of points is the same as minimizing the average squared
Euclidean distance.

In certain situations, it makes sense to restrict an exemplar to be one of the given data points.
In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that
doesn’t have to occur in the data. Finding a medoid requires us to calculate, for each data point,
the total distance to all other data points, in order to choose the point that minimizes it. Regardless
of the distance metric used, this is an O(n²) operation for n points, so for medoids there is no computational reason to prefer one distance metric over another.

Once we have determined the exemplars, the basic linear classifier constructs the decision
boundary as the perpendicular bisector of the line segment connecting the two exemplars. An
alternative, distance-based way to classify instances without direct reference to a decision
boundary is by the following decision rule: if x is nearest to μ⊕ then classify it as positive, otherwise as negative; or equivalently, classify an instance to the class of the nearest exemplar.
So the basic linear classifier can be interpreted from a distance-based perspective as
constructing exemplars that minimize squared Euclidean distance within each class, and then
applying a nearest-exemplar decision rule.

To summarize, the main ingredients of distance-based models are

 distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among many others;

 exemplars: centroids that find a centre of mass according to a chosen distance metric, or medoids that find the most centrally located data point; and

 distance-based decision rules, which take a vote among the k nearest exemplars.

9.4 Nearest neighbour classification

K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining and intrusion detection. K Nearest Neighbour is a simple algorithm that
stores all the available cases and classifies the new data or case based on a similarity measure.
It is mostly used to classify a data point based on how its neighbours are classified. K-Nearest
Neighbour(K-NN) is one of the simplest Machine Learning algorithms based on Supervised
Learning technique. K-NN algorithm assumes the similarity between the new case/data and

available cases and puts the new case into the category that is most similar to the available
categories. The algorithm stores all the available data and classifies a new data point based on
the similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.

It can be used for Regression as well as for Classification but mostly it is used for the
Classification problems. It is a non-parametric algorithm, which means it does not make any
assumption on the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, it performs an action on the dataset. The K-NN algorithm at the training phase just
stores the dataset and when it gets new data, then it classifies that data into a category that is
much similar to the new data.

KNN can be used for both classification and regression predictive problems. However, it is
more widely used in classification problems in the industry. To evaluate any technique, we
generally look at 3 important aspects:

 Ease to interpret output

 Calculation time

 Predictive Power

Let’s take a simple case to understand this algorithm. Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of
K-NN, we can easily identify the category or class of a particular dataset. Consider the below
diagram Figure 9.1,

Figure 9.1 K-NN Algorithm

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors.

Step-2: Calculate the Euclidean distance of K number of neighbors.

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each category.

Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.

Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the
below Figure 9.2 :

Figure 9.2 K-NN Algorithm – Example

Firstly, we will choose the number of neighbors, so we will choose the k=5.

Next, we will calculate the Euclidean distance between the data points(Figure 9.3) . The Euclidean
distance is the distance between two points. It can be calculated as:

Figure 9.3 K-NN Algorithm – Euclidian Distance Example

By calculating the Euclidean distance, we got the nearest neighbors, as three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below Figure 9.4:

Figure 9.4 K-NN Algorithm – Euclidian Distance Example

As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A. Below are some points to remember while selecting the value of K in the
K-NN algorithm:

 There is no particular way to determine the best value for “K”, so we need to try some
values to find the best out of them. The most preferred value for K is 5.

 A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.

 Large values for K are more robust to noise, but they increase computation and may blur the boundaries between classes.

Advantages of KNN Algorithm:

It is simple to implement.

It is robust to noisy training data.

It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

We always need to determine the value of K, which may be complex at times.

The computation cost is high because of calculating the distance between the data points for
all the training samples.
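A typical way to apply K-NN in practice is through scikit-learn; the sketch below uses hypothetical points and k = 5, as suggested above.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [1.5, 1.5],    # category A
              [6, 6], [6, 7], [7, 6], [7, 7], [6.5, 6.5]])   # category B
y = np.array(["A"] * 5 + ["B"] * 5)

knn = KNeighborsClassifier(n_neighbors=5)    # Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[2.5, 2.0], [6.2, 6.8]]))   # expected: ['A' 'B']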

9.5 Distance based clustering

In a distance-based context, unsupervised learning is usually taken to refer to clustering, and we will now review a number of distance-based clustering methods. The ones considered in
this section are all exemplar-based and hence predictive: they naturally generalize to unseen
instances. Predictive distance-based clustering methods use the same ingredients as distance
based classifiers: a distance metric, away to construct exemplars and a distance-based decision
rule. In the absence of an explicit target variable, the assumption is that the distance metric
indirectly encodes the learning target, so that we aim to find clusters that are compact with
respect to the distance metric. Distance-based methods optimize a global criterion based on the distance between the patterns; k-means, CLARA and CLARANS are examples of distance-based clustering methods.

K-Means Clustering

K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. K-Means Clustering is an Unsupervised Learning
algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number
of pre-defined clusters that need to be created in the process, as if K=2, there will be two
clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties. It allows us to cluster
the data into different groups and a convenient way to discover the categories of groups in the
unlabeled dataset on its own without the need for any training. It is a centroid-based algorithm,
where each cluster is associated with a centroid. The main aim of this algorithm is to minimize
the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for K center points or centroids by an iterative process.

 Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster. Hence each cluster has data points with some
commonalities, and it is away from other clusters.

The below Figure 9.5 explains the working of the K-means Clustering Algorithm:

Figure 9.5 K-Means algorithm

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
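In practice the loop above is available off the shelf; the sketch below runs scikit-learn's KMeans with K = 2 on hypothetical data with two variables (M1, M2), mirroring the walk-through that follows.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],     # one group of points
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.2]])    # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assignment for each data point
print(kmeans.cluster_centers_)   # final centroids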



Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below in Figure 9.6:

Figure 9.6 K-Means algorithm – Example

Let’s take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.

We need to choose some random k points or centroid to form the cluster. These points can be
either the points from the dataset or any other point. So, here we are selecting the below two
points as k points, which are not part of our dataset. Consider the below Figure 9.7:

Figure 9.7 K-Means algorithm – Example

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the centroids. Consider the below
Figure 9.8,

Figure 9.8 K-Means algorithm – Example

From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let’s color them as
blue and yellow for clear visualization.

Figure 9.9 K-Means algorithm – Example

As we need to find the closest clusters, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster, and will find the new centroids as below:

Figure 9.10 K-Means algorithm – Example

Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

Figure 9.11 K-Means algorithm – Example

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.

Figure 9.12 K-Means algorithm – Example

As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.

We will repeat the process by finding the center of gravity of centroids, so the new centroids will
be as shown in the below image:

Figure 9.13 K-Means algorithm – Example

As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

Figure 9.14 K-Means algorithm – Example

The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some different
ways to find the optimal number of clusters, but here we are discussing the most appropriate
method to find the number of clusters or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares,
which defines the total variations within a cluster. The formula to calculate the value of WCSS
(for 3 clusters) is given below:

In the above formula of WCSS, “Pi in Cluster1 distance (Pi C1)2: It is the sum of the square of the
distances between each data point and its centroid within a cluster1 and the same for the other
two terms. To measure the distance between data points and centroid, we can use any method
such as Euclidean distance or Manhattan distance.

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).

o For each value of K, calculates the WCSS value.

o Plots a curve between calculated WCSS values and the number of clusters K.

o The sharp point of bend in the plot (where the curve looks like an arm or elbow) is considered as the best value of K.

Since the graph shows the sharp bend, which looks like an elbow, hence it is known as the
elbow method. The graph for the elbow method looks like the below image:

Figure 9.15 Elbow Method
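The elbow curve is straightforward to produce in code; the sketch below computes the WCSS (exposed as inertia_ in scikit-learn) for K from 1 to 10 on hypothetical data drawn from three well-separated groups.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 5, 10)])

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                 # within-cluster sum of squares for this k

for k, w in zip(range(1, 11), wcss):
    print(k, round(w, 1))                    # the sharp bend should appear around k = 3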

9.6 Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster

is distinct from each other cluster, and the objects within each cluster are broadly similar to
each other. Clustering is basically a technique that groups similar data points such that the
points in the same group are more similar to each other than the points in the other groups. The
group of similar data points is called a Cluster.

Hierarchical clustering Techniques:

Hierarchical clustering is one of the most popular and easy-to-understand clustering techniques. This clustering technique is divided into two types:

 Agglomerative

 Divisive

Agglomerative Hierarchical Clustering Technique: In this technique, initially each data point
is considered as an individual cluster. At each iteration, the similar clusters merge with other
clusters until one cluster or K clusters are formed.

The basic algorithm of Agglomerative clustering is straightforward.

 Compute the proximity matrix

 Let each data point be a cluster

 Repeat: Merge the two closest clusters and update the proximity matrix

 Until only a single cluster remains

The key operation is the computation of the proximity of two clusters. To understand this better, let’s see a pictorial representation of the Agglomerative Hierarchical clustering technique. Let’s say we have six data points {A, B, C, D, E, F}.

Step- 1: In the initial step, we calculate the proximity of individual points and consider all the six
data points as individual clusters as shown in the Figure 9.16 below.

Figure 9.16 Hierarchical clustering

 Step- 2: In step two, similar clusters are merged together and formed as a single cluster.
Let’s consider B,C, and D,E are similar clusters that are merged in step two. Now, we’re
left with four clusters which are A, BC, DE, F.

 Step- 3: We again calculate the proximity of new clusters and merge the similar clusters to
form new clusters A, BC, DEF.

 Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC are similar
and merged together to form a new cluster. We’re now left with two clusters A, BCDEF.

 Step- 5: Finally, all the clusters are merged together and form a single cluster.

The Hierarchical Clustering Technique can be visualized using a Dendrogram, which is presented in Figure 9.17. A Dendrogram is a tree-like diagram that records the sequences of
merges or splits.

Figure 9.17 Dendrogram
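Agglomerative clustering and the corresponding dendrogram can be obtained with SciPy; the six two-dimensional points below are hypothetical stand-ins for A to F.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0],    # A
              [1.2, 1.1],    # B
              [1.3, 0.9],    # C
              [5.0, 5.0],    # D
              [5.1, 5.2],    # E
              [5.3, 4.9]])   # F

Z = linkage(X, method="single")              # the sequence of merges (proximity computations)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                                # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the tree-like diagram when combined with matplotlib.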

Divisive Hierarchical Clustering Technique: Since the Divisive Hierarchical clustering technique is not much used in the real world, we give only a brief overview of it here.

In simple words, we can say that the Divisive Hierarchical clustering is exactly the opposite of
the Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we consider
all the data points as a single cluster and in each iteration, we separate the data points from the
cluster which are not similar. Each data point which is separated is considered as an individual
cluster. In the end, we’ll be left with n clusters. As we are dividing a single cluster into n clusters, it is named Divisive Hierarchical clustering.

Space and Time Complexity of Hierarchical clustering Technique:

Space complexity: The space required for the Hierarchical clustering technique is very high when the number of data points is high, as we need to store the similarity matrix in RAM. The space complexity is of the order of the square of n.

Space complexity = O(n²) where n is the number of data points.

Time complexity: Since we have to perform n iterations, and in each iteration we need to update and search the similarity matrix, the time complexity is also very high. The time complexity is of the order of the cube of n.

Time complexity = O(n³) where n is the number of data points.



Limitations of Hierarchical clustering Technique:

1. There is no single mathematical objective that hierarchical clustering directly optimizes.

2. All the approaches to calculating the similarity between clusters have their own disadvantages.

3. Hierarchical clustering has high space and time complexity, so this clustering algorithm cannot be used when we have huge amounts of data.

9.7 Summary

Two key concepts in distance-based machine learning are neighbours and exemplars. Distance
metrics are a key part of several machine learning algorithms. These distance metrics are
used in both supervised and unsupervised learning, generally to calculate the similarity between
data points. K-Nearest Neighbors is one of the most basic yet essential classification algorithms
in Machine Learning. It can be used for Regression as well as for Classification but mostly it is
used for the Classification problems. It is a non-parametric algorithm. Distance based methods
optimize a global criterion based on the distance between the patterns. k-means, CLARA,
CLARANS are examples of distance based clustering method. K-Means Clustering is an
unsupervised learning algorithm that is used to solve the clustering problems in machine learning
or data science. K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Hierarchical clustering, also known as hierarchical
cluster analysis, is an algorithm that groups similar objects into groups called clusters.

9.8 Keywords

Nearest-neighbour classifier, K-means clustering, Hierarchical clustering, Elbow Method, Distance-based clustering

9.9 Model Questions


1. What are the types of distance metrics used in machine learning algorithms?

2. With an example case study explain hierarchical clustering.

3. Elaborate on Nearest neighbour classification.

4. Explain distance-based clustering.

5. Discuss on K-means clustering with an example.



LESSON – 10

PROBABILISTIC MODELS
Structure

10.1 Introduction

10.2 Learning Objectives

10.3 The normal distribution and its geometric interpretations

10.4 Probabilistic models for categorical data

10.5 Summary

10.6 Keywords

10.7 Model Questions

10.1 Introduction

We have already seen how probabilities can be useful to express a model’s expectation about
the class of a given instance. For example, a probability estimation tree attaches a class
probability distribution to each leaf of the tree, and each instance that gets filtered down to a
particular leaf in a tree model is labelled with that particular class distribution. Similarly, a calibrated
linear model translates the distance from the decision boundary into a class probability. One of
the most attractive features of the probabilistic perspective is that it allows us to view learning
as a process of reducing uncertainty. The key point is that probabilities do not have to be
interpreted as estimates of relative frequencies, but can carry the more general meaning of
(possibly subjective) degrees of belief.

10.2 Learning Objectives

 To explore probability distribution

 To understand normal distribution with area properties



 To analyze the various plots for normal distribution

 To study on Probabilistic models for categorical data

10.3 The normal distribution and its geometric interpretations

The normal distribution is a core concept in statistics and the backbone of data science, and it arises constantly while performing exploratory data analysis. We can draw a connection between probabilistic and geometric models by considering probability distributions defined over Euclidean spaces. The
most common such distributions are normal distributions, also called Gaussians; here we review the most important facts concerning univariate and multivariate normal distributions. We start by considering the univariate, two-class case. Suppose the values of x ∈ R follow a mixture model: i.e., each class has its own probability distribution (a component of the mixture model). We will assume a Gaussian mixture model, which means that the components of the mixture are both Gaussians. We thus have

P(x | ⊕) = (1 / (√(2π) σ⊕)) exp( −(x − μ⊕)² / (2σ⊕²) )

P(x | ⊖) = (1 / (√(2π) σ⊖)) exp( −(x − μ⊖)² / (2σ⊖²) )

where μ⊕ and σ⊕ are the mean and standard deviation for the positive class, and μ⊖ and σ⊖ are the mean and standard deviation for the negative class.

Normal Distribution is an important concept in statistics and the backbone of Machine Learning.
A Data Scientist needs to know about Normal Distribution when they work with Linear Models
(perform well if the data is normally distributed), Central Limit Theorem, and exploratory data
analysis. As discovered by Carl Friedrich Gauss, Normal Distribution/Gaussian Distribution is
a continuous probability distribution. It has a bell-shaped curve that is symmetrical from the
mean point to both halves of the curve, which is presented in Figure 10.1.

Figure 10.1 Normal Distribution



Mathematical Definition:

A continuous random variable x is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by

f(x) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) ),  for −∞ < x < ∞

Such an x is also called a normal variate.

Standard Normal Variate:

If x is a normal variable with mean μ and standard deviation σ, then

z = (x − μ) / σ

where z is the standard normal variate.

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard Normal Distribution, has an expected value (mean) μ = 0 and standard deviation σ = 1, and is described by the probability density function

f(z) = (1 / √(2π)) exp( −z² / 2 )

Distribution Curve Characteristics:

1. The total area under the normal curve is equal to 1.

2. It is a continuous distribution.

3. It is symmetrical about the mean. Each half of the distribution is a mirror image of the
other half.

4. It is asymptotic to the horizontal axis.

5. It is unimodal.

Area Properties:

The normal distribution carries with it assumptions and can be completely specified by two
parameters: the mean and the standard deviation. If the mean and standard deviation are known,
you can access every data point on the curve.

The empirical rule is a handy quick estimate of the data’s spread given the mean and standard
deviation of a data set that follows a normal distribution. It states that(Figure 10.2):

 68.26% of the data will fall within 1 sd of the mean(μ±1σ)

 95.44% of the data will fall within 2 sd of the mean(μ±2σ)

 99.7% of the data will fall within 3 sd of the mean(μ±3σ)

 95% — (μ±1.96σ)

 99% — (μ±2.75σ)

Figure 10.2 Normal distribution Curve – Area



In Machine Learning, data satisfying Normal Distribution is beneficial for model building. It makes
math easier. Models like LDA, Gaussian Naive Bayes, Logistic Regression, Linear Regression,
etc., are explicitly calculated from the assumption that the distribution is a bivariate or multivariate
normal. Also, Sigmoid functions work most naturally with normally distributed data.

Many natural phenomena in the world follow a log-normal distribution, such as financial and forecasting data; by applying transformation techniques, we can often convert such data to a normal distribution. Many other processes follow normality, such as measurement errors in an experiment or the position of a particle undergoing diffusion. It is therefore better to explore the data critically and check the underlying distribution of each variable before fitting a model. Normality is an assumption of some ML models, but data need not always follow it: many models work very well on non-normally distributed data. Models such as decision trees and XGBoost make no normality assumption and work on raw data. Linear regression, too, is statistically valid as long as the model errors are Gaussian; the entire dataset need not be normally distributed.

Let’s look at a few different ways to check the normality of a distribution; a code sketch combining them is given after Figure 10.5 below.

Histogram

A Histogram visualizes the distribution of data over a continuous interval. Each bar (Figure
10.3) in a histogram represents the tabulated frequency at each interval/bin. In simple words,
height represents the frequency for the respective bin (interval).

Figure 10.3 Histogram



KDE Plots

A density plot is a smoothed, continuous version of a histogram (Figure 10.4) estimated from
the data. The most common form of estimation is known as kernel density estimation (KDE). In
this method, a continuous curve (the kernel) is drawn at every individual data point and all of
these curves are then added together to make a single smooth density estimation.

Figure 10.4 KDE Plots

Q_Q Plot

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. A Q_Q (quantile-quantile) plot compares the quantiles of the sample against the quantiles of a theoretical normal distribution; if the data are normal, the points fall approximately on a straight line. The plot is presented in Figure 10.5.

Figure 10.5 Q_Q Plot
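As an added illustration (not part of the original text; it assumes NumPy, Matplotlib, seaborn and SciPy are installed), the following Python sketch draws the three normality checks described above for a synthetic sample.

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy import stats

    data = np.random.normal(loc=50, scale=5, size=1000)    # synthetic, roughly normal sample

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    axes[0].hist(data, bins=30)                            # histogram of binned frequencies
    axes[0].set_title("Histogram")
    sns.kdeplot(data, ax=axes[1])                          # smoothed kernel density estimate
    axes[1].set_title("KDE plot")
    stats.probplot(data, dist="norm", plot=axes[2])        # Q_Q plot against normal quantiles
    axes[2].set_title("Q_Q plot")
    plt.show()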


• The 2-quantile is known as the Median.

• The 4-quantiles are known as Quartiles.

• The 10-quantiles are known as Deciles.

• The 100-quantiles are known as Percentiles.

10.4 Probabilistic models for categorical data

In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli
distribution, multinoulli distribution) is a discrete probability distribution that describes the possible
results of a random variable that can take on one of K possible categories, with the probability
of each category separately specified.

The Bernoulli distribution, named after the Swiss seventeenth-century mathematician Jacob Bernoulli, concerns Boolean or binary events with two possible outcomes: success (1) and failure (0). A Bernoulli distribution has a single parameter θ which gives the probability of success: hence P(X = 1) = θ and P(X = 0) = 1 − θ. The Bernoulli distribution has expected value E[X] = θ and variance E[(X − E[X])²] = θ(1 − θ).

The binomial distribution arises when counting the number of successes S in n independent Bernoulli trials with the same parameter θ. It is described by

P(S = s) = C(n, s) · θ^s · (1 − θ)^(n−s),   s = 0, 1, ..., n,

where C(n, s) = n!/(s!(n − s)!) is the number of ways of choosing s successes out of n trials.
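A minimal Python sketch (an added illustration assuming SciPy is installed) evaluating these two distributions:

    from scipy.stats import bernoulli, binom

    theta = 0.3
    print(bernoulli.pmf([0, 1], theta))                 # P(X=0)=0.7, P(X=1)=0.3
    print(bernoulli.mean(theta), bernoulli.var(theta))  # expected value 0.3, variance 0.21

    n = 10
    print(binom.pmf(3, n, theta))    # probability of exactly 3 successes in 10 trials
    print(binom.mean(n, theta))      # expected number of successes n*theta = 3.0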

Using a naive Bayes model for classification

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common principle,
i.e. every pair of features being classified is independent of each other.

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features,
all of these properties independently contribute to the probability that this fruit is an apple and
that is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

In the equation above,

 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).

 P(c) is the prior probability of class.

 P(x|c) is the likelihood which is the probability of predictor given class.

 P(x) is the prior probability of predictor.

Let’s understand it using an example (Figure 10.6). Consider a training data set of weather and
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on the weather conditions. Let’s follow the steps below to perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create a likelihood table by finding the probabilities; for example, the probability of Overcast is 0.29 and the probability of playing is 0.64.

Figure 10.6 Example

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method of posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is that the players will play.

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
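The hand calculation above can be reproduced with a few lines of Python (an added sketch using only the probabilities quoted in the text):

    # Probabilities read off the likelihood table in the text
    p_sunny_given_yes = 3 / 9      # P(Sunny | Yes)
    p_yes = 9 / 14                 # P(Yes)
    p_sunny = 5 / 14               # P(Sunny)

    # Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
    p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
    print(round(p_yes_given_sunny, 2))   # 0.6 -> 'play' is the more probable class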

Pros:

• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.

• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.

• It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique; one of the simplest smoothing techniques is called Laplace estimation.

• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.

Applications of Naive Bayes Algorithms

• Real-time prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.

• Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.

• Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).

• Recommendation systems: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

Probabilistic models with hidden variables

Suppose you are dealing with a four-class classification problem with classes A, B, C and D. If you have a sufficiently large and representative training sample of size n, you can use the relative frequencies in the sample nA, ..., nD to estimate the class prior p̂A = nA/n, ..., p̂D = nD/n, as we have done many times before. Conversely, if you know the prior and want to know the most likely class distribution in a random sample of n instances, you would use the prior to calculate the expected values E[nA] = pA·n, ..., E[nD] = pD·n. So, complete knowledge of one allows us to estimate or infer the other. However, sometimes we have a bit of knowledge about both. For example, we may know that pA = 1/2 and that C is twice as likely as B, without knowing the complete prior.

Expectation-Maximization

The expectation-maximization (EM) algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values of the latent variables, then optimizing the model, and then repeating these two steps until convergence. It is an effective and general approach and is most commonly used for density estimation with missing data, for example in clustering algorithms such as the Gaussian Mixture Model.

Under the domain of statistics, Maximum Likelihood Estimation (MLE) is the approach of estimating the parameters of a probability distribution by maximizing the likelihood function, so as to make the observed data most probable under the statistical model. There is a limitation with MLE: it assumes that the data are complete and fully observable, i.e., that all the model-associated variables are already present. In most cases, however, some relevant variables may be hidden, which causes inconsistencies. Such unobserved or hidden data variables are known as latent variables.

Probability density estimation is the forming of estimates on the basis of observed data: it involves picking a probability distribution function and the parameters of that function to explain the joint probability of the observed data.

Convergence is an intuition based on probability: when there is only a very small difference in probability between two successive estimates, they are said to have converged. Here, convergence implies that the values match each other.

A latent variable model consists of observable and unobservable variables. Observed variables are ones that can be measured or recorded, while latent (hidden) variables are those that cannot be observed directly and instead need to be inferred from the observed variables.

A Gaussian Mixture Model, or Mixture of Gaussians as it is sometimes called, is not so much a model as it is a probability distribution. It is a widely used model for generative unsupervised learning or clustering. It is also called Expectation-Maximization Clustering or EM Clustering and is based on an optimization strategy. Gaussian Mixture models are used for representing normally distributed subpopulations within an overall population. The advantage of mixture models is that they do not require knowing which subpopulation a data point belongs to: the model learns the subpopulations automatically. This constitutes a form of unsupervised learning.

A Gaussian is a type of distribution, and it is a popular and mathematically convenient type of distribution. A distribution is a listing of the outcomes of an experiment and the probability associated with each outcome.

Sometimes our data have multiple peaks rather than a single one, and this becomes apparent when we inspect the data set: the density appears to rise and fall two, three or even four times. If multiple Gaussian distributions can together represent such data, then we can build what is called a Gaussian Mixture Model.
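A minimal Python sketch (an added illustration on hypothetical data, assuming NumPy and scikit-learn are installed) fits a two-component Gaussian Mixture Model; the EM iterations run inside fit().

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two normally distributed subpopulations with different means
    data = np.concatenate([rng.normal(0, 1, 500),
                           rng.normal(5, 1.5, 500)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print(gmm.means_.ravel())    # estimated component means (close to 0 and 5)
    print(gmm.weights_)          # estimated mixing proportions (close to 0.5 each)
    labels = gmm.predict(data)   # hard cluster assignment for each point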

10.5 Summary

Probabilities can be useful to express a model’s expectation about the class of a given instance.
The normal distribution is a core concept in statistics, the backbone of data science. Normal
Distribution/Gaussian Distribution is a continuous probability distribution; it has a bell-shaped curve that is symmetrical about the mean. A Histogram visualizes
the distribution of data over a continuous interval. Each bar in a histogram represents the
tabulated frequency at each interval/bin. In probability theory and statistics, a categorical
distribution (also called a generalized Bernoulli distribution, Multinoulli distribution) is a discrete
probability distribution that describes the possible results of a random variable. Naive Bayes
classifiers are a collection of classification algorithms based on Bayes’ Theorem. Gaussian
Mixture Model or Mixture of Gaussian as it is sometimes called, is not so much a model as it is
a probability distribution.

10.6 Keywords

Normal distribution, Histogram, KDE Plots, Q_Q Plot, Bernoulli distribution, Naive Bayes model

10.7 Model Questions

1. Discuss about normal distribution with its properties.

2. How is conditional probability defined in the naïve Bayes algorithm?

3. Write notes on Histogram, KDE Plots, Q_Q Plot.

4. Explain Expectation-Maximization algorithm.

5. Deliberate on Probabilistic models for categorical data.



LESSON - 11

FEATURES
Structure

11.1 Introduction

11.2 Learning Objectives

11.3 Kinds of feature

11.4 Feature transformations

11.5 Feature construction and selection

11.6 Summary

11.7 Keywords

11.8 Model Questions

11.1 Introduction

Features, also called attributes, are defined as mappings fi : X → Fi from the instance space X to the feature domain Fi. We can distinguish features by their domain: common feature domains
include real and integer numbers, but also discrete sets such as colours, the Booleans, and so
on. We can also distinguish features by the range of permissible operations. For example, we
can calculate a group of people’s average age but not their average blood type, so taking the
average value is an operation that is permissible on some features but not on others.

Although many data sets come with pre-defined features, they can be manipulated in many
ways. For example, we can change the domain of a feature by rescaling or discretization; we
can select the best features from a larger set and only work with the selected ones; or we can
combine two or more features into a new feature. In fact, a model itself is a way of constructing
a new feature that solves the task at hand.

11.2 Learning Objectives

 To understand the importance of features

 To get insight on types of features

 To learn feature transformations

 To explore more on feature construction and selection

 Apply the techniques for better model building

11.3 Kinds of feature

Consider two features, one describing a person’s age and the other their house number. Both
features map into the integers, but the way we use those features can be quite different.
Calculating the average age of a group of people is meaningful, but an average house number
is probably not very useful! In other words, what matters is not just the domain of a feature, but
also the range of permissible operations. These, in turn, depend on whether the feature values
are expressed on a meaningful scale. Despite appearances, house numbers are not really
integers but ordinals: we can use them to determine that number 10’s neighbours are number
8 and number 12, but we cannot assume that the distance between 8 and 10 is the same as
the distance between 10 and 12. Because of the absence of a linear scale it is not meaningful
to add or subtract house numbers, which precludes operations such as averaging.

Calculations on features

Let’s take a closer look at the range of possible calculations on features, often referred to as
aggregates or statistics. Three main categories are statistics of central tendency, statistics of
dispersion and shape statistics. Each of these can be interpreted either as a theoretical property
of an unknown population or a concrete property of a given sample – here we will concentrate
on sample statistics. Starting with statistics of central tendency, the most important ones are

• the mean or average value;

• the median, which is the middle value if we order the instances from lowest to highest feature value; and

• the mode, which is the majority value or values.

Of these statistics, the mode is the one we can calculate whatever the domain of the feature:
so, for example, we can say that the most frequent blood type in a group of people is O+. In
order to calculate the median, we need to have an ordering on the feature values: so we can
calculate both the mode and the median house number in a set of addresses. In order to
calculate the mean, we need a feature expressed on some scale: most often this will be a
linear scale for which we calculate the familiar arithmetic mean. It is often suggested that the
median tends to lie between the mode and the mean, but there are plenty of exceptions to this
‘rule’. The famous statistician Karl Pearson suggested a more specific rule of thumb (with
therefore even more exceptions): the median tends to fall one-third of the way from mean to
mode.

The second kind of calculation on features are statistics of dispersion or ‘spread’. Two well-
known statistics of dispersion are the variance or average squared deviation from the (arithmetic)
mean, and its square root, the standard deviation. Variance and standard deviation essentially
measure the same thing, but the latter has the advantage that it is expressed on the same
scale as the feature itself. For example, the variance of the body weight in kilograms of a group
of people is measured in kg², whereas the standard deviation is measured in kilograms. The absolute difference between the mean and the median is never larger than the standard deviation – this is a consequence of Chebyshev’s inequality, which states that at most 1/k² of the values are more than k standard deviations away from the mean.

A simpler dispersion statistic is the difference between maximum and minimum value, which is
called the range. A natural statistic of central tendency to be used with the range is the midrange
point, which is the mean of the two extreme values. These definitions assume a linear scale but
can be adapted to other scales using suitable transformations. For example, for a feature
expressed on a logarithmic scale, such as frequency, we would take the ratio of the highest and
lowest frequency as the range, and the harmonic mean of these two extremes as the midrange
point. Other statistics of dispersion include percentiles.

The p-th percentile is the value such that p per cent of the instances fall below it. If we have 100
instances, the 80th percentile is the value of the 81st instance in a list of increasing values. If p
is a multiple of 25 the percentiles are also called quartiles, and if it is a multiple of 10 the
percentiles are also called deciles. Note that the 50th percentile, the 5th decile and the second
quartile are all the same as the median. Percentiles, deciles and quartiles are special cases of

quantiles. Once we have quantiles we can measure dispersion as the distance between different
quantiles. For instance, the interquartile range is the difference between the third and first quartile
(i.e., the 75th and 25th percentile). The skew and ‘peakedness’ of a distribution can be measured
by shape statistics such as skewness and kurtosis. The main idea is to calculate the third and
fourth central moment of the sample.

Categorical, ordinal and quantitative features

Given these various statistics we can distinguish three main kinds of feature: those with a
meaningful numerical scale, those without a scale but with an ordering, and those without
either. We will call features of the first type quantitative; they most often involve a mapping into
the reals (another term in common use is ‘continuous’). Even if a feature maps into a subset of
the reals, such as age expressed in years, the various statistics such as mean or standard
deviation still require the full scale of the reals.

Features with an ordering but without scale are called ordinal features. The domain of an ordinal
feature is some totally ordered set, such as the set of characters or strings. Even if the domain
of a feature is the set of integers, denoting the feature as ordinal means that we have to dispense
with the scale, as we did with house numbers. Another common example are features that
express a rank order: first, second, third, and so on. Ordinal features allow the mode and
median as central tendency statistics, and quantiles as dispersion statistics.

Features without ordering or scale are called categorical features (or sometimes ‘nominal’
features). They do not allow any statistical summary except the mode. One subspecies of the
categorical features is the Boolean feature, which maps into the truth values true and false.
Models treat these different kinds of feature in distinct ways. First, consider tree models such
as decision trees. A split on a categorical feature will have as many children as there are feature
values. Ordinal and quantitative features, on the other hand, give rise to a binary split, by selecting
a value v0 such that all instances with a feature value less than or equal to v0 go to one child,
and the remaining instances to the other child. It follows that tree models are insensitive to the
scale of quantitative features.

For example, whether a temperature feature is measured on the Celsius scale or on the
Fahrenheit scale will not affect the learned tree. Neither will switching from a linear scale to a
logarithmic scale have any effect: the split threshold will simply be logv0 instead of v0. In general,

tree models are insensitive to monotonic transformations on the scale of a feature, which are
those transformations that do not affect the relative order of the feature values. In effect, tree
models ignore the scale of quantitative features, treating them as ordinal. The same holds for
rule models.

Structured features

It is usually tacitly assumed that an instance is a vector of feature values. In other words, the
instance space is a Cartesian product of d feature domains: X = F1 × ... × Fd. This means that
there is no other information available about an instance apart from the information conveyed
by its feature values. Identifying an instance with its vector of feature values is what computer
scientists call an abstraction, which is the result of filtering out unnecessary information.
Representing an e-mail as a vector of word frequencies is an example of an abstraction.
However, sometimes it is necessary to avoid such abstractions, and to keep more information
about an instance than can be captured by a finite vector of feature values. For example, we
could represent an e-mail as a long string; or as a sequence of words and punctuation marks;
or as a tree that captures the HTML mark-up; and so on. Features that operate on such structured
instance spaces are called structured features.

Structured features can be constructed either prior to learning a model, or simultaneously with
it. The first scenario is often called propositionalisation because the features can be seen as a
translation from first-order logic to propositional logic without local variables. The main challenge
with propositionalisation approaches is how to deal with combinatorial explosion of the number
of potential features. Notice that structured features can be logically related: e.g., one feature may cover a subset of the instances covered by another. It is possible to exploit this if structured feature construction is integrated with model building, as in inductive logic programming.

11.4 Feature transformations

Feature transformations aim at improving the utility of a feature by removing, changing or adding
information. We could order feature types by the amount of detail they convey: quantitative
features are more detailed than ordinal ones, followed by categorical features, and finally Boolean
features. The best-known feature transformations are those that turn a feature of one type into
another of the next type down this list. But there are also transformations that change the scale
of quantitative features, or add a scale (or order) to ordinal, categorical and Boolean features.

An overview of possible feature transformations is presented below. Normalisation and calibration adapt the scale of quantitative features, or add a scale to features that don’t have one. Ordering adds or adapts the order of feature values without reference to a scale. The other operations abstract away from unnecessary detail, either in a deductive way (unordering, binarisation) or by introducing new information (thresholding, discretisation).

The simplest feature transformations are entirely deductive, in the sense that they achieve a
well-defined result that doesn’t require making any choices. Binarisation transforms a categorical
feature into a set of Boolean features, one for each value of the categorical feature. This loses
information since the values of a single categorical feature are mutually exclusive, but is
sometimes needed if a model cannot handle more than two feature values. Unordering trivially
turns an ordinal feature into a categorical one by discarding the ordering of the feature values.
This is often required since most learning models cannot handle ordinal features directly.

Thresholding and discretisation

Thresholding transforms a quantitative or an ordinal feature into a Boolean feature by finding a feature value to split on. Concretely, let f : X → ℝ be a quantitative feature and let t ∈ ℝ be a threshold; then ft : X → {true, false} is a Boolean feature defined by ft(x) = true if f(x) ≥ t and ft(x) = false if f(x) < t. We can choose such thresholds in an unsupervised or a supervised way.

Discretisation transforms a quantitative feature into an ordinal feature. Each ordinal value is referred to as a bin and corresponds to an interval of the original quantitative feature. Again, we can distinguish between supervised and unsupervised approaches. Unsupervised discretisation methods typically require one to decide the number of bins beforehand. A simple method that often works reasonably well is to choose the bins so that each bin has approximately the same number of instances: this is referred to as equal-frequency discretisation.

Another unsupervised discretisation method is equal-width discretisation, which chooses the bin boundaries so that each interval has the same width. The interval width can be established by dividing the feature range by the number of bins if the feature has upper and lower limits; alternatively, we can take the bin boundaries at an integer number of standard deviations above and below the mean. Switching now to supervised discretisation methods, we can distinguish between top-down or divisive discretisation methods on the one hand, and bottom-up or agglomerative discretisation methods on the other. Divisive methods work by progressively splitting bins, whereas agglomerative methods proceed by initially assigning each instance to its own bin and successively merging bins. In either case an important role is played by the stopping criterion, which decides whether a further split or merge is worthwhile.
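As an added illustration (hypothetical data; assumes NumPy and pandas are installed), the two unsupervised discretisation methods can be applied with pandas: qcut gives equal-frequency bins and cut gives equal-width bins.

    import numpy as np
    import pandas as pd

    ages = pd.Series(np.random.randint(18, 90, size=200))   # hypothetical quantitative feature

    equal_freq = pd.qcut(ages, q=4)      # four bins holding roughly the same number of instances
    equal_width = pd.cut(ages, bins=4)   # four bins of the same width

    print(equal_freq.value_counts())
    print(equal_width.value_counts())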

Normalisation and calibration

Thresholding and discretisation are feature transformations that remove the scale of a quantitative
feature. We now turn our attention to adapting the scale of a quantitative feature, or adding a
scale to an ordinal or categorical feature. If this is done in an unsupervised fashion it is usually
called normalisation, whereas calibration refers to supervised approaches taking in the (usually
binary) class labels. Feature normalisation is often required to neutralise the effect of different
quantitative features being measured on different scales.

Sometimes feature normalisation is understood in the stricter sense of expressing the feature on a [0,1] scale. This can be achieved in various ways. If we know the feature’s highest and lowest values h and l, then we can simply apply the linear scaling f → (f − l)/(h − l). We sometimes have to guess the value of h or l, and truncate any value outside [l, h]. For example, if the feature measures age in years, we may take l = 0 and h = 100, and truncate any f > h to 1.
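A minimal sketch of this linear scaling in Python (an added example; the age feature, the limits l = 0 and h = 100, and the truncation follow the assumptions stated above):

    import numpy as np

    def min_max_scale(f, l, h):
        f = np.clip(f, l, h)          # truncate values that fall outside [l, h]
        return (f - l) / (h - l)      # linear scaling f -> (f - l)/(h - l)

    ages = np.array([3, 25, 47, 102])
    print(min_max_scale(ages, 0, 100))   # [0.03 0.25 0.47 1.  ]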

Feature calibration is understood as a supervised feature transformation adding a meaningful scale carrying class information to arbitrary features. This has a number of important advantages. For instance, it allows models that require a scale, such as linear classifiers, to handle categorical and ordinal features. It also allows the learning algorithm to choose whether to treat a feature as categorical, ordinal or quantitative.

We will assume a binary classification context, and so a natural choice for the calibrated feature’s
scale is the posterior probability of the positive class, conditioned on the feature’s value. This

has the additional advantage that models that are based on such probabilities, such as naive
Bayes, do not require any additional training once the features are calibrated.

Incomplete features

At the end of this section on feature transformations we briefly consider what to do if we don’t know a feature’s value for some of the instances. Missing feature values are trickier to handle at training time than at prediction time. First of all, the very fact that a feature value is missing may be correlated with
the target variable. For example, the range of medical tests carried out on a patient is likely to
depend on their medical history. For such features it may be best to have a designated ‘missing’
value so that, for instance, a tree model can split on it.

However, this would not work for, say, a linear model. In such cases we can complete the
feature by ‘filling in’ the missing values, a process known as imputation. For instance, in a
classification problem we can calculate the per-class means, medians or modes over the
observed values of the feature and use this to impute the missing values. A somewhat more
sophisticated method takes feature correlation into account by building a predictive model for
each incomplete feature and uses that model to ‘predict’ the missing value.
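An added minimal sketch of imputation (hypothetical data; assumes NumPy and scikit-learn are installed); scikit-learn also offers model-based imputers that exploit feature correlation, in the spirit of the predictive approach mentioned above.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])              # missing values encoded as NaN

    imputer = SimpleImputer(strategy="mean")   # 'median' or 'most_frequent' are also possible
    print(imputer.fit_transform(X))            # missing entries filled with per-column means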

11.5 Feature construction and selection

The previous section on feature transformations makes it clear that there is a lot of scope in machine learning to play around with the original features given in the data. We can take this one step further by constructing new features from several original features. For example, we can construct a new feature from two Boolean or categorical features by forming their Cartesian product.

For example, if we have one feature Shape with values Circle, Triangle and Square, and another
feature Colour with values Red, Green and Blue, then their Cartesian product would be the
feature (Shape,Colour) with values (Circle,Red), (Circle,Green), (Circle,Blue), (Triangle,Red),
and so on. The effect that this would have depends on the model being trained. Constructing

Cartesian product features for a naive Bayes classifier means that the two original features are
no longer treated as independent, and so this reduces the strong bias that naive Bayes models
have. This is not the case for tree models, which can already distinguish between all possible
pairs of feature values. On the other hand, a newly introduced Cartesian product feature may
incur a high information gain, so it can possibly affect the model learned.

There are many other ways of combining features. For instance, we can take arithmetic or
polynomial combinations of quantitative features. One attractive possibility is to first apply concept
learning or subgroup discovery, and then use these concepts or subgroups as new Boolean
features.

Once we have constructed new features it is often a good idea to select a suitable subset of
them prior to learning. Not only will this speed up learning as fewer candidate features need to
be considered, it also helps to guard against overfitting. There are two main approaches to
feature selection. The filter approach scores features on a particular metric and the top-scoring
features are selected. Many of the metrics we have seen so far can be used for feature scoring,
including information gain, the χ² statistic and the correlation coefficient, to name just a few. An
interesting variation is provided by the Relief feature selection method, which repeatedly samples
a random instance x and finds its nearest hit h (instance of the same class) as well as its
nearest miss m (instance of the opposite class). The i-th feature’s score is then decreased by Dis(xi, hi)² and increased by Dis(xi, mi)², where Dis is some distance measure (e.g., Euclidean
distance for quantitative features, Hamming distance for categorical features). The intuition is
that we want to move closer to the nearest hit while differentiating from the nearest miss.

Matrix transformations and decompositions

We can also view feature construction and selection from a geometric perspective, assuming
quantitative features. To this end we represent our data set as a matrix X with n data points in
rows and d features in columns, which we want to transform into a new matrix W with n rows
and r columns by means of matrix operations. To simplify matters a bit, we assume that X is
zero-centred and that W = XT for some d-by-r transformation matrix T. For example, feature scaling corresponds to T being a d-by-d diagonal matrix; this can be combined with feature selection by removing some of T’s columns. A rotation is achieved by T being orthogonal, i.e., TᵀT = I.

One of the best-known algebraic feature construction methods is principal component analysis
(PCA). Principal components are new features constructed as linear combinations of the given
features. The first principal component is given by the direction of maximum variance in the
data; the second principal component is the direction of maximum variance orthogonal to the
first component, and so on. PCA can be explained in a number of different ways: here, we will

derive it by means of the singular value decomposition (SVD). Any n-by-d matrix X can be uniquely written as a product of three matrices with special properties:

X = UΣVᵀ

where U is an n-by-n orthogonal matrix, Σ is an n-by-d matrix that is zero except on its diagonal, which holds the non-negative singular values in decreasing order, and V is a d-by-d orthogonal matrix whose columns are the right singular vectors; for zero-centred data these right singular vectors give the principal component directions.
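An added minimal sketch (hypothetical data, assuming NumPy is installed) relating PCA to the SVD of a zero-centred data matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    X = X - X.mean(axis=0)                              # zero-centre the data, as assumed above

    U, S, Vt = np.linalg.svd(X, full_matrices=False)    # X = U @ np.diag(S) @ Vt
    components = Vt                                     # rows are the principal directions
    scores = X @ Vt.T                                   # coordinates of the data on the principal components
    explained_variance = S**2 / (len(X) - 1)            # variance captured by each component
    print(explained_variance)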

11.6 Summary

The features are the descriptive attributes, and the label is what you’re attempting to predict or
forecast. Features with an ordering but without scale are called ordinal features. Features without
ordering or scale are called categorical features. Structured features can be constructed either
prior to learning a model, or simultaneously with it. Feature transformations aim at improving
the utility of a feature by removing, changing or adding information. Thresholding and
discretisation are feature transformations that remove the scale of a quantitative feature.

11.7 Keywords

Features, Ordinal features, Quantitative features, Categorical features, Structured features, Feature transformations, Feature selection

11.8 Model Questions

1. Elaborate on types of features with example.

2. Explain in detail feature transformations.

3. Elaborate on feature construction and selection.



LESSON – 12

MODEL ENSEMBLES
Structure

12.1 Introduction

12.2 Learning Objectives

12.3 Bagging and Random Forest

12.4 Boosting

12.5 Summary

12.6 Keywords

12.7 Model Questions

12.1 Introduction

Combinations of models are generally known as model ensembles. They are among the most
powerful techniques in machine learning, often outperforming other methods. This comes at
the cost of increased algorithmic and model complexity. The main motivations came from
computational learning theory on the one hand, and statistics on the other. It is a well-known
statistical intuition that averaging measurements can lead to a more stable and reliable estimate
because we reduce the influence of random fluctuations in single measurements.

So if we were to build an ensemble of slightly different models from the same training data, we
might be able to similarly reduce the influence of random fluctuations in single models. The key
question here is how to achieve diversity between these different models. As we shall see, this
can often be achieved by training models on random subsets of the data, and even by constructing
them from random subsets of the available features. In essence, ensemble methods in machine
learning have the following two things in common:

 they construct multiple, diverse predictive models from adapted versions of the training
data (most often reweighted or resampled);

 they combine the predictions of these models in some way, often by simple averaging
or voting (possibly weighted).

It should, however, also be stressed that these commonalities span a very large and diverse
space, and that we should correspondingly expect some methods to be practically very different
even though superficially similar. For example, it makes a big difference whether the way in
which training data is adapted for the next iteration takes the predictions of the previous models
into account or not.

12.2 Learning Objectives

 To learn about ensemble methods

 To comprehend bagging method

 To study Random Forest algorithm and application

 To learn boosting techniques.

12.3 Bagging and Random Forest

Bagging, short for ‘bootstrap aggregating’, is a simple but highly effective ensemble method
that creates diverse models on different random samples of the original data set. These samples
are taken uniformly with replacement and are known as bootstrap samples. Because samples
are taken with replacement the bootstrap sample will in general contain duplicates, and hence
some of the original data points will be missing even if the bootstrap sample is of the same size
as the original data set. This is exactly what we want, as differences between the bootstrap
samples will create diversity among the models in the ensemble.

Bagging, also known as bootstrap aggregation, is the ensemble learning method that is
commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data
in a training set is selected with replacement—meaning that the individual data points can be
chosen more than once. After several data samples are generated, these weak models are
then trained independently, and depending on the type of task—regression or classification, for
example—the average or majority of those predictions yield a more accurate estimate. As a
note, the random forest algorithm is considered an extension of the bagging method, using
both bagging and feature randomness to create an uncorrelated forest of decision trees.

Ensemble learning gives credence to the idea of the “wisdom of crowds,” which suggests that
the decision-making of a larger group of people is typically better than that of an individual
expert. Similarly, ensemble learning refers to a group (or ensemble) of base learners, or models,
which work collectively to achieve a better final prediction. A single model, also known as a
base or weak learner, may not perform well individually due to high variance or high bias.
However, when weak learners are aggregated, they can form a strong learner, as their
combination reduces bias or variance, yielding better model performance.

Ensemble methods are frequently illustrated using decision trees as this algorithm can be
prone to overfitting (high variance and low bias) when it hasn’t been pruned and it can also lend
itself to underfitting (low variance and high bias) when it’s very small, like a decision stump,
which is a decision tree with one level. Remember, when an algorithm overfits or underfits to its
training set, it cannot generalize well to new datasets, so ensemble methods are used to
counteract this behavior to allow for generalization of the model to new datasets. While decision
trees can exhibit high variance or high bias, it’s worth noting that it is not the only modeling
technique that leverages ensemble learning to find the “sweet spot” within the bias-variance
tradeoff.

In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:

Bootstrapping: Bagging leverages a bootstrap sampling technique to create diverse samples. This resampling method generates different subsets of the training dataset by selecting data points at random and with replacement. This means that each time you select a data point from the training dataset, you are able to select the same instance multiple times. As a result, a value/instance may be repeated twice (or more) in a sample.

Parallel training: These bootstrap samples are then trained independently and in parallel with
each other using weak or base learners.

Aggregation: Finally, depending on the task (i.e., regression or classification), an average or a majority of the predictions is taken to compute a more accurate estimate. In the case of regression, an average is taken of all the outputs predicted by the individual classifiers; this is known as soft voting. For classification problems, the class with the highest majority of votes is accepted; this is known as hard voting or majority voting.
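A minimal Python sketch of these three steps (an added illustration on synthetic data; assumes scikit-learn is installed; by default BaggingClassifier uses decision trees as the base learners and aggregates them by voting):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)   # synthetic data set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 decision trees, each trained on its own bootstrap sample
    bag = BaggingClassifier(n_estimators=100, random_state=0)
    bag.fit(X_train, y_train)
    print(bag.score(X_test, y_test))     # accuracy of the majority-vote (hard-voting) ensemble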

There are a number of key advantages and challenges that the bagging method presents when
used for classification or regression problems. The key benefits of bagging include:

Ease of implementation: Python libraries such as scikit-learn (also known as sklearn) make
it easy to combine the predictions of base learners or estimators to improve model performance.

Reduction of variance: Bagging can reduce the variance within a learning algorithm. This is
particularly helpful with high-dimensional data, where missing values can lead to higher variance,
making it more prone to overfitting and preventing accurate generalization to new datasets.

The key challenges of bagging include:

Loss of interpretability: It’s difficult to draw very precise business insights through bagging because of the averaging involved across predictions. While the output is more precise than any individual data point, a more accurate or complete dataset could also yield more precision within a single classification or regression model.

Computationally expensive: Bagging slows down and grows more intensive as the number of iterations increases. Thus, it is not well suited for real-time applications. Clustered systems or a large number of processing cores are ideal for quickly creating bagged ensembles on large test sets.

Less flexible: As a technique, bagging works particularly well with algorithms that are less stable. Algorithms that are more stable or subject to high amounts of bias do not provide as much benefit, as there is less variation within the dataset of the model. As noted in the Hands-On Guide to Machine Learning, “bagging a linear regression model will effectively just return the original predictions for large enough b.”

Applications of Bagging

The bagging technique is used across a large number of industries, providing insights for both
real-world value and interesting perspectives, such as in the GRAMMY Debates with Watson.

Key use cases include:

Healthcare: Bagging has been used to form medical data predictions. For example, research
shows that ensemble methods have been used for an array of bioinformatics problems, such

as gene and/or protein selection to identify a specific trait of interest. More specifically, this
research delves into its use to predict the onset of diabetes based on various risk predictors.

IT: Bagging can also improve the precision and accuracy of IT systems, such as network intrusion detection systems. Research shows how bagging can improve the accuracy of network intrusion detection and reduce the rate of false positives.

Environment: Ensemble methods, such as bagging, have been applied within the field of
remote sensing. More specifically, this research shows how it has been used to map the types
of wetlands within a coastal landscape.

Finance: Bagging has also been leveraged with deep learning models in the finance industry, automating critical tasks such as fraud detection, credit risk evaluation and option pricing. Research demonstrates how bagging, among other machine learning techniques, has been used to assess loan default risk, and how it helps to minimize risk by preventing credit card fraud within banking and financial institutions.

Random Forest

Random forest is an ensemble model using bagging as the ensemble method and decision
tree as the individual model. Random Forest is a popular machine learning algorithm that belongs
to the supervised learning technique. It can be used for both Classification and Regression
problems in ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, “Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction
from each tree and based on the majority votes of predictions, and it predicts the final output.

The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.

The below diagram explains the working of the Random Forest algorithm(Figure 12.1):

Figure 12.1 Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:

o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.

o The predictions from each tree must have very low correlations.

Below are some points that explain why we should use the Random Forest algorithm:

o It takes less training time as compared to other algorithms.

o It predicts output with high accuracy, even for the large dataset it runs efficiently.

o It can also maintain accuracy when a large proportion of data is missing.

Random Forest works in two-phase first is to create the random forest by combining N decision
tree, and second is to make predictions for each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below Figure 12.2 :

Figure 12.2 Example
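The whole procedure is available off the shelf; the following added sketch (hypothetical data, assuming scikit-learn is installed) trains a Random Forest classifier and reports its majority-vote accuracy.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100,   # N decision trees
                                    random_state=0)
    forest.fit(X_train, y_train)             # each tree sees a bootstrap sample and random feature subsets
    print(forest.score(X_test, y_test))      # accuracy of the majority vote over all trees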



Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

Banking: Banking sector mostly uses this algorithm for the identification of loan risk.

Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.

Land Use: We can identify the areas of similar land use by this algorithm.

Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

Random Forest is capable of performing both Classification and Regression tasks.

It is capable of handling large datasets with high dimensionality.

It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

12.4 Boosting

Boosting is an ensemble modeling technique that attempts to build a strong classifier from the
number of weak classifiers. It is done by building a model by using weak models in series.
Firstly, a model is built from the training data. Then the second model is built which tries to
correct the errors present in the first model. This procedure is continued and models are added
until either the complete training data set is predicted correctly or the maximum number of
models are added.

There are several boosting algorithms. The original ones, proposed by Robert Schapire and
Yoav Freund were not adaptive and could not take full advantage of the weak learners. Schapire
and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious
Gödel Prize. AdaBoost was the first really successful boosting algorithm developed for the

purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very popular
boosting technique that combines multiple “weak classifiers” into a single “strong classifier”.

Algorithm:

1. Initialise the dataset and assign an equal weight to each data point.

2. Provide this as input to the model and identify the wrongly classified data points.

3. Increase the weights of the wrongly classified data points.

4. If the required results have been obtained, go to step 5; otherwise go to step 2.

5. End
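As an added illustration (hypothetical data, assuming scikit-learn is installed), AdaBoost can be run as follows; scikit-learn’s default base learner is a decision stump, i.e., a one-level decision tree.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    ada = AdaBoostClassifier(n_estimators=50, random_state=0)
    ada.fit(X_train, y_train)        # each new stump concentrates on previously misclassified points
    print(ada.score(X_test, y_test))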

Similarities Between Bagging and Boosting

Bagging and Boosting, both being the commonly used methods, have a universal similarity of
being classified as ensemble methods. Here we will explain the similarities between them.

Both are ensemble methods to get N learners from 1 learner.

Both generate several training data sets by random sampling.

Both make the final decision by averaging the N learners (or by taking the majority of them, i.e., majority voting).

Both are good at reducing variance and provide higher stability.

AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble
Method in Machine Learning. It is called Adaptive Boosting as the weights are re-assigned to
each instance, with higher weights assigned to incorrectly classified instances. Boosting is
used to reduce bias as well as variance for supervised learning. It works on the principle of

learners growing sequentially. Except for the first, each subsequent learner is grown from
previously grown learners. In simple words, weak learners are converted into strong ones. The
AdaBoost algorithm works on the same principle as boosting with a slight difference.

First, let us discuss how boosting works. It makes ‘n’ number of decision trees during the data
training period. As the first decision tree/model is made, the incorrectly classified record in the
first model is given priority. Only these records are sent as input for the second model. The
process goes on until we specify a number of base learners we want to create. Remember,
repetition of records is allowed with all boosting techniques.

Figure 12.3 Boosting

Figure 12.3 above shows how the first model is made and errors from the first model are
noted by the algorithm. The record which is incorrectly classified is used as input for the next
model. This process is repeated until the specified condition is met. As you can see in the
figure, there are ‘n’ number of models made by taking the errors from the previous model. This
is how boosting works. The models 1,2, 3,…, N are individual models that can be known as
decision trees. All types of boosting models work on the same principle.

Since we now know the boosting principle, it will be easy to understand the AdaBoost algorithm.
Let’s dive into how AdaBoost works. When a random forest is used, the algorithm makes ‘n’ trees: proper trees that consist of a root node with several leaf nodes. Some trees might be bigger than others, but there is no fixed depth in a random forest. With AdaBoost, however, the algorithm only makes a node with two leaves, known as a Stump.

12.5 Summary

Combinations of models are generally known as model ensembles. They are among the most
powerful techniques in machine learning, often outperforming other methods. Bagging, short
for ‘bootstrap aggregating’, is a simple but highly effective ensemble method that creates diverse
models on different random samples of the original data set. Random forest is an ensemble
model using bagging as the ensemble method and decision tree as the individual model. Boosting
is an ensemble modeling technique that attempts to build a strong classifier from the number of
weak classifiers.

12.6 Keywords

Ensemble, Bagging, Boosting, Bootstrap aggregation, Random Forest

12.7 Model Questions

1. Discuss on ensemble method.

2. Explain Random Forest algorithm with a suitable diagram.

3. Explain boosting techniques.

4. What are bagging methods in machine learning?

5. Compare and contrast boosting Vs bagging.



LESSON – 13

MACHINE LEARNING EXPERIMENTS


Structure

13.1 Introduction

13.2 Learning Objectives

13.3 What to measure

13.4 How to measure

13.5 How to interpret

13.6 Summary

13.7 Keywords

13.8 Model Questions

13.1 Introduction

Machine Learning is a practical subject as much as a computational one. While we may be able to prove that a particular learning algorithm converges to the theoretically optimal model under certain assumptions, we need actual data to investigate, e.g., the extent to which those assumptions are actually satisfied in the domain under consideration, or whether convergence happens quickly enough to be of practical use. We thus evaluate or run particular models or learning algorithms on one or more data sets, obtain a number of measurements and use these to answer particular questions we might be interested in. This broadly characterizes what is known as machine learning experiments.

Machine learning experiments pose questions about models that we try to answer by means of measurements on data. The following are common examples of the types of question we are interested in:

How does model m perform on data from domain D?

Which of these models has the best performance on data from domain D?

How do models produced by learning algorithm A perform on data from domain D?

Which of these learning algorithms gives the best model on data from domain D?

13.2 Learning Objectives

 To understand the process of developing machine learning projects.

 To learn evaluation measures.

 To get insight on how to measure the project and interpret the same.

13.3 What to measure

A good starting point for our measurements is the evaluation measures. The appropriateness
of any of these for our purposes depends on how we define performance in relation to the
question the experiment is designed to answer: let’s call it our experimental objective. It is
important not to confuse performance measures and experimental objectives: the former is
something we can measure, while the latter is what we are really interested in. There is often a
discrepancy between the two.

In machine learning the situation is usually more concrete, and our experimental objective –
accuracy, say – is something we can measure in principle, or at least estimate (since we’re
generally interested in accuracy on unseen data). However, there may be unknown factors we
have to account for. For example, the model may need to operate in different operating contexts
with different class distributions.

If you choose accuracy as your evaluation measure, you are making an implicit assumption
that the class distribution in the test set is representative for the operating context in which the
model is going to be deployed. Furthermore, if all you recorded in your experiments is accuracy,
you will not be able to switch to average recall later if you realise that you need to incorporate
varying class distributions.

It is therefore good practice to record sufficient information to be able to reproduce the


contingency table if needed. A sufficient set of measurements would be true positive rate, true
negative rate (or false positive rate), the class distribution and the size of the test set. As a
second example of how your choice of evaluation measures can carry implicit assumptions we
consider the case of precision and recall as often reported in the information retrieval literature.
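
As a small illustration (a sketch added here, not part of the original text), the following Python snippet uses made-up example values for exactly these four measurements, reconstructs the two-class contingency table and then derives accuracy, average recall and precision from it:

# assumed example measurements recorded from an experiment (hypothetical values)
tpr, tnr = 0.80, 0.90        # true positive rate and true negative rate
pos_fraction = 0.25          # class distribution: proportion of positives in the test set
n = 400                      # size of the test set

# reconstruct the contingency (confusion) table
pos = int(round(pos_fraction * n))   # positive test instances
neg = n - pos                        # negative test instances
tp = int(round(tpr * pos))           # true positives
fn = pos - tp                        # false negatives
tn = int(round(tnr * neg))           # true negatives
fp = neg - tn                        # false positives

# derive other evaluation measures from the reconstructed table
accuracy = (tp + tn) / n             # 0.875 for these example values
average_recall = (tpr + tnr) / 2     # 0.85
precision = tp / (tp + fp)           # about 0.73
print(tp, fn, fp, tn, accuracy, average_recall, precision)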

In summary, your choice of evaluation measures should reflect the assumptions you are making
about your experimental objective as well as possible contexts in which your models operate.
We have looked at the following cases:

 Accuracy is a good evaluation measure if the class distribution in your test set is
representative for the operating context.

 Average recall is the evaluation measure of choice if all class distributions are equally
likely.

 Precision and recall shift the focus from classification accuracy to a performance analysis
ignoring the true negatives.

 Predicted positive rate and AUC are relevant measures in a ranking context.

13.4 How to measure

The question of ‘how to measure it’ thus seems to have a very straightforward answer: construct
the contingency table from a test set and perform the relevant calculations. However, two issues
demand our attention: (i) which data to base our measurements on, and (ii) how to assess the
inevitable uncertainty associated with every measurement.

Cross-validation is a resampling procedure used to evaluate machine learning models on a


limited data sample. The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is often called k-fold
cross-validation. If we don’t have a lot of data, the following cross-validation procedure is often
applied: randomly partition the data in k parts or 'folds', set one fold aside for testing, train a
model on the remaining k − 1 folds and evaluate it on the test fold. This process is repeated k
times until each fold has been used for testing once.

This may seem curious at first since we are evaluating k models rather than a single one, but
this makes sense if we are evaluating a learning algorithm (whose output is a model, so we
want to average over models) rather than a single model (whose outputs are instance labels,
so we want to average over those). By averaging over training sets we get a sense of the
variance of the learning algorithm (i.e., its dependence on variations in the training data), although
it should be noted that the training sets in cross-validation have considerable overlap and are
clearly not independent. Once we are satisfied with the performance of our learning algorithm,
we can run it over the entire data set to obtain a single model.
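
A minimal sketch of this procedure in Python (assuming a NumPy feature array X and a label array y are already loaded; scikit-learn's KFold is used only to generate the fold indices, and a decision tree is just a stand-in learning algorithm):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

kf = KFold(n_splits=10, shuffle=True, random_state=42)        # k = 10 folds
fold_scores = []
for train_index, test_index in kf.split(X):
    model = DecisionTreeClassifier()                          # one model per fold
    model.fit(X[train_index], y[train_index])                 # train on the remaining k - 1 folds
    pred = model.predict(X[test_index])                       # evaluate on the held-out fold
    fold_scores.append(accuracy_score(y[test_index], pred))

print(np.mean(fold_scores), np.std(fold_scores))              # average over the k folds

final_model = DecisionTreeClassifier().fit(X, y)              # single model trained on all the data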

If we expect the learning algorithm to be sensitive to the class distribution we should apply
stratified cross-validation: this aims at achieving roughly the same class distribution in each
fold. Cross-validation runs can be repeated for different random partitions into folds and the
results averaged again to further reduce variance in our estimates: this is referred to as, e.g., 10
times 10-fold cross-validation. It should be kept in mind that this leads increasingly to
independence assumptions being violated – if we take this too far our accuracy estimate will
overfit the given data and not be representative for new data.
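
A hedged scikit-learn sketch of both ideas, again assuming arrays X and y: stratified splitting keeps roughly the same class distribution in each fold, and RepeatedStratifiedKFold repeats the whole procedure over different random partitions (10 times 10-fold here):

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=rskf, scoring='accuracy')
print(scores.mean(), scores.std())    # 100 fold-level estimates averaged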

13.5 How to interpret

Once we have estimates of a relevant evaluation measure for our models or learning algorithms
we can use them to select the best one. The fundamental issue here is how to deal with the
inherent uncertainty in these estimates. We will discuss two key concepts: confidence intervals
and significance tests. An understanding of these concepts is necessary if you want to appreciate
current practice in interpreting results from machine learning experiments – however, it is good
to realise that current practice is coming under increasing scrutiny.
187

It should also be noted that the methods described here represent only a tiny fraction of the vast
spectrum of possibilities. Suppose our estimate â follows a normal distribution around the true
mean a with standard deviation σ. Assuming for the moment that we know these parameters,
we can calculate for any interval the likelihood of the estimate falling in the interval, by calculating
the area under the normal density function in that interval.

For example, the likelihood of obtaining an estimate within ±1 standard deviation around the
mean is 68%. Thus, if we take 100 estimates from independent test sets, we expect 68 of them
to be within one standard deviation on either side of the mean – or equivalently, we expect the
true mean to fall within one standard deviation on either side of the estimate in 68 cases. This
is called the 68% confidence interval of the estimate.

For two standard deviations the confidence level is 95% – these values can be looked up in
probability tables or calculated using statistical packages such as Matlab or R. Notice that
confidence intervals for normally distributed estimates are symmetric because the normal
distribution is symmetric, but this is not generally the case: e.g., the binomial distribution is
asymmetric (except for p = 1/2). Notice also that, in case of symmetry, we can easily change
the interval into a one-sided interval: for example, we expect the mean to be more than one
standard deviation above the estimate in 16 cases out of 100, which gives a one-sided 84%
confidence interval from minus infinity to the estimate plus one standard deviation.

More generally, in order to construct confidence intervals we need to know (i) the sampling
distribution of the estimates, and (ii) the parameters of that distribution. We saw previously that
accuracy estimated from a single test set with n instances follows a scaled binomial distribution
with variance â(1 − â)/n. This would lead to asymmetric confidence intervals, but the skew in
the binomial distribution is only really noticeable if na(1 − a) < 5: if that is not the case the normal
distribution is a good approximation for the binomial one. So, we use the binomial expression
for the variance and use the normal distribution to construct the confidence intervals. Notice
that confidence intervals are statements about estimates rather than statements about the true
value of the evaluation measure.
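
As an illustration (example numbers assumed, not taken from the text), the following short Python sketch applies this recipe to an accuracy estimate of 0.8 on a test set of 100 instances:

import math

a_hat, n = 0.80, 100                       # assumed accuracy estimate and test set size
# na(1 - a) = 16 > 5 here, so the normal approximation is reasonable
std = math.sqrt(a_hat * (1 - a_hat) / n)   # binomial expression for the standard deviation (0.04)

ci68 = (a_hat - std, a_hat + std)          # roughly 68% confidence interval: (0.76, 0.84)
ci95 = (a_hat - 2 * std, a_hat + 2 * std)  # roughly 95% confidence interval: (0.72, 0.88)
print(std, ci68, ci95)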
188

13.6 Summary

Machine Learning is a practical subject as much as a computational one. This broadly


characterizes what is known as machine learning experiments. A good starting point for our
measurements is the evaluation measures. Cross-validation is a resampling procedure used
to evaluate machine learning models on a limited data sample. Once we have estimates of a
relevant evaluation measure for our models or learning algorithms, we can use them to select
the best one.

13.7 Keywords

Evaluation measure, Accuracy, Cross-validation, Recall, F-Measure

13.8 Model Questions

1. Discuss about the evaluation aspect of the machine learning project.

2. Write notes on Cross-Validation.


189

MODEL QUESTION PAPER


MASTER OF COMPUTER APPLICATIONS
SECOND YEAR - THIRD SEMESTER
PAPER - XI
MACHINE LEARNING
Time: 3 Hours Maximum: 80 Marks

Section – A (10 x 2 = 20 Marks)

Answer any 10 Questions.

1. Define Machine Learning.

2. What are the two types of Learning?

3. Differentiate between testing and training.

4. What do you mean by regression?

5. Define Scoring and ranking.

6. Write the two types of clustering.

7. What is feature selection?

8. Define Boosting.

9. Define confusion matrix.

10. What do you mean by concept learning?


190

Section – B (5x6 = 30 Marks)

Answer any five Questions.

11. Write short notes on classification and regression.

12. Explain hypothesis space and paths.

13. Explain rule learning.

14. Write a note on Linear models.

15. Write the Naïve Bayes algorithm with an example.

16. Explain model ensembles.

17. Explain SVM algorithm.

Section – C (3 x 10 = 30 Marks)

Answer any three Questions.

18. Write in detail any five real time applications of Machine learning.

19. Elaborate on Tree models.

20. Explain in detail k-means clustering algorithm.

21. Explain the fundamentals of Machine learning.

22. Explain in detail the bagging and boosting algorithms in machine learning.
SPCA 202N

MASTER OF
COMPUTER APPLICATIONS

SECOND YEAR
THIRD SEMESTER

CORE PAPER - XII

PRACTICAL-V

MACHINE LEARNING LAB

INSTITUTE OF DISTANCE EDUCATION


UNIVERSITY OF MADRAS
MASTER OF COMPUTER APPLICATIONS CORE PAPER - XII
SECOND YEAR - THIRD SEMESTER
PRACTICAL-V: MACHINE
LEARNING LAB

WELCOME
Warm Greetings.

It is with a great pleasure to welcome you as a student of Institute of Distance


Education, University of Madras. It is a proud moment for the Institute of Distance education
as you are entering into a cafeteria system of learning process as envisaged by the University
Grants Commission. Yes, we have framed and introduced as per AICTE Choice Based
Credit System(CBCS) in Semester pattern from the academic year 2020-21. You are free
to choose courses, as per the Regulations, to attain the target of total number of credits set
for each course and also each degree programme. What is a credit? To earn one credit in
a semester you have to spend 30 hours of learning process. Each course has a weightage
in terms of credits. Credits are assigned by taking into account of its level of subject content.
For instance, if one particular course or paper has 4 credits then you have to spend 120
hours of self-learning in a semester. You are advised to plan the strategy to devote hours of
self-study in the learning process. You will be assessed periodically by means of tests,
assignments and quizzes either in class room or laboratory or field work. In the case of PG
(UG), Continuous Internal Assessment for 20(25) percentage and End Semester University
Examination for 80 (75) percentage of the maximum score for a course / paper. The theory
paper in the end semester examination will bring out your various skills: namely basic
knowledge about subject, memory recall, application, analysis, comprehension and
descriptive writing. We will always have in mind while training you in conducting experiments,
analyzing the performance during laboratory work, and observing the outcomes to bring
out the truth from the experiment, and we measure these skills in the end semester
examination. You will be guided by well experienced faculty.

I invite you to join the CBCS in Semester System to gain rich knowledge leisurely at
your will and wish. Choose the right courses at right times so as to erect your flag of
success. We always encourage and enlighten to excel and empower. We are the cross
bearers to make you a torch bearer to have a bright future.

With best wishes from mind and heart,

DIRECTOR

(i)
MASTER OF COMPUTER APPLICATIONS CORE PAPER - XII
SECOND YEAR - THIRD SEMESTER
PRACTICAL-V: MACHINE
LEARNING LAB

COURSE WRITER

Dr. S. SASIKALA
Associate Professor in Computer Science
Institute of Distance Education
University of Madras
Chepauk, Chennai - 600 005.

CO-ORDINATION & EDITING

Dr. S. SASIKALA
Associate Professor in Computer Science
Institute of Distance Education
University of Madras
Chepauk, Chennai - 600 005.

© UNIVERSITY OF MADRAS, CHENNAI 600 005.

(ii)
MASTER OF COMPUTER APPLICATIONS

SECOND YEAR

THIRD SEMESTER

CORE PAPER - XII

PRACTICAL- V

MACHINE LEARNING LAB

SYLLABUS

Course Objective: To introduce the basic concepts and techniques of Machine Learning.

To develop skills of using recent machine learning software for solving practical problems

and gain experience of doing independent study and research.

Course Outcomes: After successful completion of this course, students will be able to
identify the characteristics of datasets and compare trivial data and big data

for various applications.

Machine Learning Tools and Applications:

Machine learning platform: WEKA machine learning workbench, R platform, Python Scipy.

Machine Learning Library: scikit-learn in Python, JSAT in Java, Accord Framework in .NET

GUIs: KNIME, RapidMiner, Orange.

Applications: Prediction using data, Speech recognition, Healthcare, Object recognition in


images, Natural Language Processing, Online search

(iii)
1. Data Preprocessing:

a. Data Cleaning

b. Data Transformation

c. Data Reduction

d. Feature extraction

2. Supervised learning:

a. Decision tree classification

b. Classification using Support Vector Machines

c. Classification using Multilayer perceptron

3. Unsupervised learning:

a. Regression

b. K-Means clustering

c. Hierarchical clustering

Mini Project: Application of Data Preprocessing techniques and Machine Learning techniques
on a data set selected from UCI repository / Kaggle / Government and submission of a
report.

(iv)
INSTITUTE OF DISTANCE EDUCATION

RECORD OF PRACTICALS

MCA SECOND YEAR


THIRD SEMESTER

2020-2021

Practical - V

MACHINE LEARNING LAB

Name :

Enrolment Number :

Group No :

UNIVERSITY OF MADRAS
CHENNAI - 600 005

(v)
INSTITUTE OF DISTANCE EDUCATION
UNIVERSITY OF MADRAS
CHENNAI - 600 005.

Certified that this is the Bonafide Record of work done by ____________________

with Enrolment Number _______________________ of MCA Second Year - Third Semester

Degree Course in the Institute of Distance Education, University of Madras during the year

2020 -2021 in respect of Subject Code - SPCA 202N / Practical - V - MACHINE

LEARNING LAB

Date: Co-ordinator

Submitted for Third Year M.C.A. Degree Course Practical Examination held on

_____________________ at IDE, University of Madras.

Date: Examiners

1. Name:

Signature:

2. Name:

Signature:

(vi)
MASTER OF COMPUTER APPLICATIONS

SECOND YEAR - THIRD SEMESTER

CORE PAPER - XII

PRACTICAL- V

MACHINE LEARNING LAB

SCHEME OF LESSONS

Sl.No. Title Page

DATA PREPROCESSING

1 Data Cleaning 10

2 Data Transformation 16

3 Data Reduction 21

4 Feature Extraction 28

SUPERVISED LEARNING

5 Decision Tree Algorithm 30

6 Classification using Support Vector Machine 37

7 Classification using Multilayer Perceptron 42

UNSUPERVISED LEARNING

8 Regression 47

9 K-Means Clustering 53

10 Hierarchical Clustering 60

(vii)
1

Python Language

Python is meant for all those who want to build a strong career in the field of programming.

With its increasing demand and adoption in top MNCs because of its credibility and smooth
deliverability, Python has helped young aspiring coders form a firm base in this field.

First released in 1991, Python is developed, supported, managed and monitored by the Python
Software Foundation (PSF).

Python is flexible and runs smoothly on almost all major operating systems such as Windows,
Linux and macOS, thus merging easily with the back-end processes of organizations worldwide.

Python has some exceptional features that make it a distinctive language to learn.

So, let us move ahead and look at some of these features, which make programming with
Python an easy task for beginners.

Features of Python

The concepts, logic and features being unchanged, it is always up to the student whether to learn
a subject the easy way or the hard way. Learning Python the hard way can break the interest of
the student and ultimately force them to opt out of the field.
2

1. Python is an Open Source Language and Free of Cost:

Python can be easily downloaded and installed on any major OS for free from its official website
and has a free open-source license, which also remains valid for commercial purposes.

2. Python is easy to Learn, Code and Implement:

Even though Python is suitable for beginners, it is considered an advanced programming language,
and most of its instructions closely resemble the English language.

Its resemblance to the object-oriented structures of other languages also makes it convenient for
professionals who already have expertise in software other than Python.

3. Python is Fast, Flexible and Portable:

Python is so flexible that if one developer writes an instruction in Python, it is easily
understandable by another developer, who can even modify those instructions as per their
requirements. Similarly, code that is developed on Windows can be executed and improved on
another OS.

Python being an interpreted language, the code is checked at execution time and then runs on
the system statement by statement.

4. Python Supports Multiple Domains:

A detailed listing of the packages offered by the Python language is provided in the Python
Package Index. To include modules like GUI, Test, Automation, DB, Networking, Web
Development, Image Processing, Text Processing, etc., Python provides several standard
libraries, which play the following roles:

• Machine Learning- Libraries such as TensorFlow and Keras help in building and improving
machine learning and AI models.

• Hadoop- With the help of Hadoop, Python employs the Pydoop library to support
Big Data processing.

• Web Development- Frameworks such as Django, Pylons and Flask, which are coded
in Python, are considered stable choices for developing websites.
3

• Automated Testing- Automated testing tools like Selenium and Splinter have application
programming interfaces that can be driven from Python. A developer can also
test across platforms and browsers with the help of Pytest.

• Graphics- By using Python’s Tkinter library, GUI applications can be written and run
effortlessly.

• Image Processing- PIL, the Python Imaging Library, supports image files in
various formats.

5. Python also Supports Scientific Libraries:

Python's support for scientific libraries and data processing is improving rapidly. Python
helps in removing the roadblocks in statistical data modelling by using its NumPy, SciPy,
Pandas and Matplotlib libraries, which are described below (a small example follows the list)-

• Numpy- Also known as Numeric Python as it supports higher level mathematical


calculations.

• Scipy- SciPy supports several scientific and mathematical computations such as linear algebra,
Fourier transforms, interpolation tools, signal processing and statistics.

• Pandas- Python's Pandas library is helpful in delivering data frame functionality and data
munging. It also supports SQL databases, CSV, Excel and text files.

• Matplotlib- Matplotlib is an advanced library used by Python learners to develop graphs.
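
The following tiny sketch (with made-up marks data) shows how these libraries are typically combined; it only illustrates the idea and is not part of the syllabus exercises:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

marks = np.array([65, 72, 80, 90, 55])        # NumPy array of numeric data
df = pd.DataFrame({'Marks': marks})           # Pandas data frame built from the array
print(df.describe())                          # summary statistics

plt.plot(marks)                               # simple Matplotlib line graph
plt.ylabel('Marks')
plt.show()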

6. Python follows both Procedural and OOP Coding Patterns:

Python comes with a mix of procedural and OOP coding patterns, which allows its users to
code both in a procedural style and in an Object Oriented Programming style.

In Python, a coder can write long programs in the procedural pattern with a mix of code
and data to feed. The OOP pattern involves programming with classes, objects and methods, which
opens the road to inheritance, abstraction and polymorphism (a short example follows the list below).

• The structural unit that groups data and functions with reuse capability is called a
class. A function defined inside a class is known as a method.
4

• The purpose of the object is to create a class instance during run-time or when the
code is made operational.

• A process that is used in class to hide complex procedures and to simplify its appearance
is defined as an abstraction.

• The subclass, which inherits and uses the functions and attributes of the primary or
parent class deploys a phenomenon called Inheritance to reuse code

• The time when inheritance is used, polymorphism is employed which helps the inherited
class to perform the same functions of the parent class differently.
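
A short illustrative sketch of these ideas follows; the class and object names are hypothetical and chosen only for this example:

class Student:                           # class: grouped data and methods with reuse capability
    def __init__(self, name):
        self.name = name

    def greet(self):                     # method of the class
        return 'Hello, ' + self.name

class Graduate(Student):                 # inheritance: subclass reuses the parent class
    def greet(self):                     # polymorphism: same method, different behaviour
        return 'Hello, ' + self.name + ' (graduate)'

s = Student('Asha')                      # objects: class instances created at run-time
g = Graduate('Ravi')
print(s.greet())
print(g.greet())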

How to Install Python:

Installation on Windows

Visit the link https://www.python.org/downloads/ to download the latest release of Python. In


this process, we will install Python 3.10.1 on our Windows operating system. When we click on
the above link, it will bring us the following page.

Step - 1: Select the Python version to download.

Click on the download button.


5

Step - 2: Click on the Install Now

Double-click the downloaded executable file; the following window will open. Select
Customize installation and proceed. Check the Add Python to PATH check box; it will set the Python
path automatically.

For all recent versions of Python, the recommended installation options include Pip and IDLE.
Older versions might not include such additional features.

Step – 3: Installation in Process


6

Now, try to run Python on the command prompt. Type the command python --version (or
python -V) to check the installed version.

Step – 4: The next dialog will prompt you to select whether to Disable path length limit.
Choosing this option will allow Python to bypass the 260-character MAX_PATH limit. Effectively,
it will enable Python to use long path names.

The Disable path length limit option will not affect any other system settings. Turning it on
will resolve potential name length issues that may arise with Python projects developed in
Linux.

Step - 5: Verify Python Was Installed On Windows

a) Navigate to the directory in which Python was installed on the system.

b) Double-click python.exe.

c) The output should be similar to what you can see below:


7

Step – 6: Verify Pip Was Installed

If you opted to install an older version of Python, it is possible that it did not come with Pip
preinstalled. Pip is a powerful package management system for Python software packages.
Thus, make sure that you have it installed. We recommend using Pip for most Python packages,
especially when working in virtual environments.

To verify whether Pip was installed:

1. Open the Start menu and type “cmd.”

2. Select the Command Prompt application.

3. Enter pip -V in the console. If Pip was installed successfully, you should see the following
output:
8

Packages installation:

Package: A package contains all the files you need for a module.

Check if PIP is Installed

Navigate your command line to the location of Python’s script directory, and type the following:

Example:

Check PIP version:
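
For example (the exact output will depend on your installation):

pip --version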

Download a Package

Downloading a package is very easy.

Open the command line interface and tell PIP to download the package you want.

Example

Download a package named “camelcase”:
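
Assuming an internet connection, the standard command is:

pip install camelcase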


9

Remove a Package

Use the uninstall command to remove a package:

Example

Uninstall the package named “camelcase”:
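
The standard command is (pip will ask for confirmation before removing the package):

pip uninstall camelcase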


10

LESSON -1

DATA CLEANING
AIM:

To perform various data cleaning tasks such as handling missing values, duplicate values,
irrelevant data and manual errors on the given “Employee Details” dataset in Python.

INPUT DATAFRAME:

CODING:

“””

READ THE VALUES

“””

import numpy as np

import pandas as pd

employee=pd.read_csv(r’D:\Machine Learning\DataCleaning.CSV’)

print(employee)

“””

REMOVE THE DUPLICATE VALUES

“””
11

employee.drop_duplicates(subset=”Name”,keep=’first’,inplace=True)

print (employee)

“””

MISSING VALUE TREATMENT

“””

missing=employee.isnull()

print(missing)

“””

DROP THE MISSING VALUES

“””

employee=employee.dropna(axis=0)

print(employee)

“””

REMOVAL OF IRRELEVANT DATA

“””

del employee[‘Sr.No’]

“””

MANUAL ERROR WHILE TYPING

“””

employee[‘Project’]=employee[‘Project’].str.replace(‘mobile’,’Mobile’)

“””

RENAMING COLUMNS

“””

employee.columns=[‘EmployeeName’,’Address’,’Mobile’,’Domain’,’E-mailid’]

print(employee)
12

OUTPUT:

“””

REMOVE THE DUPLICATE VALUES

“””
13

“””

MISSING VALUE TREATMENT

“””

Sr.No Name Address Mobile no. Project emailid

0 False False False False False False

1 False False False False False False

2 False False False True True False

3 False False False False False False

4 False False False False False False

5 False False False False False False

6 False False False False False False

8 False False False False False False

“””

DROP THE MISSING VALUES

“””
14

“””

REMOVAL OF IRRELEVANT DATA

“””

“””

MANUAL ERROR WHILE TYPING

“””
15

“””

RENAMING COLUMNS

“””
16

LESSON - 2

DATA TRANSFORMATION
AIM:

To perform data transformation tasks such as converting categorical data into numeric format
on the given “Student” dataset in Python.

INPUT DATAFRAME:

Student_id Age Gender Grade Employed

1 19 Male 1st Class yes

2 20 Female 2nd Class no

3 18 Male 1st Class no

4 21 Female 2nd Class no

5 19 Male 1st Class no

6 20 Male 2nd Class yes

7 19 Female 3rd Class yes

8 21 Male 3rd Class yes

9 22 Female 3rd Class yes

10 21 Male 1st Class no

11 20 Female 3rd Class yes

12 20 Male 1st Class no

13 19 Male 1st Class yes

14 20 Female 3rd Class no

15 19 Male 1st Class yes

16 19 Female 2nd Class no

17 20 Male 2nd Class yes

18 18 Female 3rd Class no

19 21 Male 2nd Class yes

20 19 Male 2nd Class no


17

CODING:

import numpy as np

import pandas as pd

df=pd.read_csv(r’d:\Machine Learning\Student.CSV’,header=0)

print(df)

“””

CATEGORICAL COLUMN and SEPARATE WITH DIFFERENT DATA FRAME

“””

df_categorial=df.select_dtypes(exclude=[np.number])

print(df_categorial)

“””

Distinct Unique values in the GRADE Column

“””

x =df_categorial[‘Grade’].unique()

print(x)

“””

Grade .value counts

“””

y1=df_categorial[‘Grade’].value_counts()

print(y1)

y2 =df_categorial[‘Gender’].value_counts()

print(y2)

“””

Replace

“””

df.Grade.replace({“1st Class”:1,”2nd Class”:2,”3rd Class”:3},inplace=True)

print(df)
18

df.Employed.replace({“yes”:1,”no”:0},inplace=True)

print(df)

df.head()

OUTPUT:

“””

READ DATA

“””

Student_id Age Gender Grade Employed

0 1 19 Male 1st Class yes

1 2 20 Female 2nd Class no

2 3 18 Male 1st Class no

3 4 21 Female 2nd Class no

4 5 19 Male 1st Class no

.. ... ... ... ... ...

227 228 21 Female 1st Class no

228 229 20 Male 2nd Class no

229 230 20 Male 3rd Class yes

230 231 19 Female 1st Class yes

231 232 20 Male 3rd Class yes

“””

CATEGORICAL COLUMN and SEPARATE WITH DIFFERENT DATA FRAME

“””

Gender Grade Employed

0 Male 1st Class yes

1 Female 2nd Class no

2 Male 1st Class no

3 Female 2nd Class no


19

4 Male 1st Class no

.. ... ... ...

227 Female 1st Class no

228 Male 2nd Class no

229 Male 3rd Class yes

230 Female 1st Class yes

231 Male 3rd Class yes

“””

Distinct Unique values in the GRADE Column

“””

[‘1st Class’ ‘2nd Class’ ‘3rd Class’]

“””

Grade .value counts

“””

2nd Class 80

3rd Class 80

1st Class 72

Name: Grade, dtype: int64

Male 136

Female 96

Name: Gender, dtype: int64

“””

Replace

“””

Student_id Age Gender Grade Employed

0 1 19 Male 1 yes

1 2 20 Female 2 no

2 3 18 Male 1 no
20

3 4 21 Female 2 no

4 5 19 Male 1 no

.. ... ... ... ... ...

227 228 21 Female 1 no

228 229 20 Male 2 no

229 230 20 Male 3 yes

230 231 19 Female 1 yes

231 232 20 Male 3 yes

[232 rows x 5 columns]

Student_id Age Gender Grade Employed

0 1 19 Male 1 1

1 2 20 Female 2 0

2 3 18 Male 1 0

3 4 21 Female 2 0

4 5 19 Male 1 0

.. ... ... ... ... ...

227 228 21 Female 1 0

228 229 20 Male 2 0

229 230 20 Male 3 1

230 231 19 Female 1 1

231 232 20 Male 3 1

[232 rows x 5 columns]


21

LESSON - 3

DATA REDUCTION
AIM:

To perform data reduction using Principal Component Analysis (PCA) on the given “Iris” dataset
in Python.

INPUT DATAFRAME:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

1 5.1 3.5 1.4 0.2 Iris-setosa

2 4.9 3 1.4 0.2 Iris-setosa

3 4.7 3.2 1.3 0.2 Iris-setosa

4 4.6 3.1 1.5 0.2 Iris-setosa

5 5 3.6 1.4 0.2 Iris-setosa

6 5.4 3.9 1.7 0.4 Iris-setosa

7 4.6 3.4 1.4 0.3 Iris-setosa

8 5 3.4 1.5 0.2 Iris-setosa

9 4.4 2.9 1.4 0.2 Iris-setosa

10 4.9 3.1 1.5 0.1 Iris-setosa

11 5.4 3.7 1.5 0.2 Iris-setosa

12 4.8 3.4 1.6 0.2 Iris-setosa

13 4.8 3 1.4 0.1 Iris-setosa

14 4.3 3 1.1 0.1 Iris-setosa


22

15 5.8 4 1.2 0.2 Iris-setosa

16 5.7 4.4 1.5 0.4 Iris-setosa

17 5.4 3.9 1.3 0.4 Iris-setosa

18 5.1 3.5 1.4 0.3 Iris-setosa

19 5.7 3.8 1.7 0.3 Iris-setosa

CODING:

from sklearn.decomposition import PCA

import pandas as pd

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

df =pd.read_csv(r”D:\Machine Learning\iris.csv”)

print(df.head)

labels =df[‘Species’]

x = df.drop([“Id”,”Species”],axis =1)

print(x)

plt.scatter(x["SepalWidthCm"], x["PetalLengthCm"])

plt.show()

variables = [“SepalLengthCm”, “SepalWidthCm”,”PetalLengthCm”, “PetalWidthCm”]

x = df.loc[:, variables].values

y = df.loc[:,[“Species”]].values

x = StandardScaler().fit_transform(x)

x = pd.DataFrame(x)

x.head()

pca = PCA()
23

x_pca = pca.fit_transform(x)

x_pca = pd.DataFrame(x_pca)

x_pca.head()

explained_variance = pca.explained_variance_ratio_

explained_variance

x_pca[“Species”]=y

x_pca.columns = [“PC1”,”PC2",”PC3",”PC4",’Species’]

x_pca.head()

fig = plt.figure()

ax = fig.add_subplot(1,1,1)

ax.set_xlabel(“Principal Component 1”)

ax.set_ylabel(“Principal Component 2”)

ax.set_title(“2 component PCA”)

targets = [“Iris-setosa”, “Iris-versicolor”, “Iris-virginica”]

colors = [“r”, “g”, “b”]

for target, color in zip(targets,colors):

indicesToKeep = x_pca[“Species”] == target

ax.scatter(x_pca.loc[indicesToKeep, “PC1”]

, x_pca.loc[indicesToKeep, “PC2”]

, c = color

, s = 50)

ax.legend(targets)

ax.grid()
24

OUTPUT:

“””

READ DATA

“””

Id SepalLengthCm .. PetalWidthCm Species

0 1 5.1 ... 0.2 Iris-setosa

1 2 4.9 ... 0.2 Iris-setosa

2 3 4.7 ... 0.2 Iris-setosa

3 4 4.6 ... 0.2 Iris-setosa

4 5 5.0 ... 0.2 Iris-setosa

.. ... ... ... ... ...

145 146 6.7 ... 2.3 Iris-virginica

146 147 6.3 ... 1.9 Iris-virginica

147 148 6.5 ... 2.0 Iris-virginica

148 149 6.2 ... 2.3 Iris-virginica

149 150 5.9 ... 1.8 Iris-virginica

“””

LABEL DATA

“””

SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

.. ... ... ... ...

145 6.7 3.0 5.2 2.3


25

146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

“””

PLOT DATA

“””


26

“””

TRANSFORM DATA

“””

0 1 2 3

0 -0.900681 1.032057 -1.341272 -1.312977

1 -1.143017 -0.124958 -1.341272 -1.312977

2 -1.385353 0.337848 -1.398138 -1.312977

3 -1.506521 0.106445 -1.284407 -1.312977

4 -1.021849 1.263460 -1.341272 -1.312977

Fit the data using PCA

0 1 2 3

0 -2.264542 0.505704 -0.121943 -0.023073

1 -2.086426 -0.655405 -0.227251 -0.103208

2 -2.367950 -0.318477 0.051480 -0.027825

3 -2.304197 -0.575368 0.098860 0.066311

4 -2.388777 0.674767 0.021428 0.037397

“””

CALCULATING VARIANCE

“””

0.727705

0.230305

0.0368383

0.00515193
27

PC1 PC2 PC3 PC4 Species

145 1.870522 0.382822 0.254532 -0.388890 Iris-virginica

146 1.558492 -0.905314 -0.025382 -0.221322 Iris-virginica

147 1.520845 0.266795 0.179277 -0.118903 Iris-virginica

148 1.376391 1.016362 0.931405 -0.024146 Iris-virginica

149 0.959299 -0.022284 0.528794 0.163676 Iris-virginica

“””

PLOT DATA USING PCA

“””
28

LESSON - 4

FEATURE EXTRACTION
AIM:

To perform feature extraction on the given image in Python.

CODING:

import numpy as np

from scipy import ndimage

import matplotlib.pyplot as plt

im = np.zeros((256, 256))

im[64:-6, 64:-4] = 1

plt.imshow(im)

plt.show()

im = ndimage.rotate(im, 15, mode=’constant’)

im = ndimage.gaussian_filter(im, 8)

plt.imshow(im)

plt.show()

sx = ndimage.sobel(im, axis=0, mode=’constant’)

sy = ndimage.sobel(im, axis=1, mode=’constant’)

sob = np.hypot(sx, sy)

plt.imshow(sob)

plt.show()

plt.figure(figsize=(16, 5))

plt.subplot(141)

plt.imshow(im, cmap=plt.cm.gray)

plt.axis(‘off’)
29

plt.title(‘square’, fontsize=20)

plt.subplot(142)

plt.imshow(sx)

plt.axis(‘off’)

plt.title(‘Sobel (x direction)’, fontsize=20)

plt.subplot(143)

plt.imshow(sob)

plt.axis(‘off’)

plt.title(‘Sobel filter’, fontsize=20)

im += 0.07*np.random.random(im.shape)  # add noise so the last panel really shows a noisy image

sx = ndimage.sobel(im, axis=0, mode=’constant’)

sy = ndimage.sobel(im, axis=1, mode=’constant’)

sob = np.hypot(sx, sy)

plt.subplot(144)

plt.imshow(sob)

plt.axis(‘off’)

plt.title(‘Sobel for noisy image’, fontsize=20)

plt.subplots_adjust(wspace=0.02, hspace=0.02, top=1, bottom=0, left=0, right=0.9)

plt.show()

OUTPUT:
30

LESSON - 5

DECISION TREE ALGORITHM


AIM:

To demonstrate the working of Decision Tree algorithm to classify the “Balance Scale”
dataset in Python.

INPUT DATAFRAME:

Class L-Weight L-Distance R-Weight R-Distance

B 1 1 1 1

R 1 1 1 2

R 1 1 1 3

R 1 1 1 4

R 1 1 1 5

R 1 1 2 1

R 1 1 2 2

R 1 1 2 3

R 1 1 2 4

R 1 1 2 5

R 1 1 3 1

R 1 1 3 2

R 1 1 3 3

R 1 1 3 4

R 1 1 3 5

R 1 1 4 1

R 1 1 4 2

R 1 1 4 3

R 1 1 4 4
31

CODING:

import numpy as np

import pandas as pd

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.metrics import classification_report

# Function importing Dataset

def importdata():

balance_data = pd.read_csv(r’D:\Machine Learning\balance-scale.csv’)

# Printing the dataset shape

print (“Dataset Length: “, len(balance_data))

print (“Dataset Shape: “, balance_data.shape)

# Printing the dataset observations

print (“Dataset: “,balance_data.head())

return balance_data

# Function to split the dataset

def splitdataset(balance_data):

# Separating the target variable

X = balance_data.values[:, 1:5]

Y = balance_data.values[:, 0]

# Splitting the dataset into train and test

X_train, X_test, y_train, y_test = train_test_split(

X, Y, test_size = 0.3, random_state = 100)


32

return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with giniIndex.

def train_using_gini(X_train, X_test, y_train):

# Creating the classifier object

clf_gini = DecisionTreeClassifier(criterion = “gini”,

random_state = 100,max_depth=3, min_samples_leaf=5)

# Performing training

clf_gini.fit(X_train, y_train)

return clf_gini

# Function to perform training with entropy.

def train_using_entropy(X_train, X_test, y_train):

# Decision tree with entropy

clf_entropy = DecisionTreeClassifier(

criterion = “entropy”, random_state = 100,

max_depth = 3, min_samples_leaf = 5)

# Performing training

clf_entropy.fit(X_train, y_train)

return clf_entropy

# Function to make predictions

def prediction(X_test, clf_object):

# Predicton on test with giniIndex

y_pred = clf_object.predict(X_test)

print(“Predicted values:”)

print(y_pred)
33

return y_pred

# Function to calculate accuracy

def cal_accuracy(y_test, y_pred):

print(“Confusion Matrix: “,

confusion_matrix(y_test, y_pred))

print (“Accuracy : “,

accuracy_score(y_test,y_pred)*100)

print(“Report : “,

classification_report(y_test, y_pred))

def main():

# Building Phase

data = importdata()

X, Y, X_train, X_test, y_train, y_test = splitdataset(data)

clf_gini = train_using_gini(X_train, X_test, y_train)

clf_entropy = train_using_entropy(X_train, X_test, y_train)

# Operational Phase

print(“Results Using Gini Index:”)

# Prediction using gini

y_pred_gini = prediction(X_test, clf_gini)

cal_accuracy(y_test, y_pred_gini)

print(“Results Using Entropy:”)

# Prediction using entropy

y_pred_entropy = prediction(X_test, clf_entropy)

cal_accuracy(y_test, y_pred_entropy)
34

# Calling main function

if __name__==”__main__”:

main()

OUTPUT:

READ DATA

Dataset Length: 625

Dataset Shape: (625, 5)

Dataset: Class L-Weight L-Distance R-Weight R-Distance

0 B 1 1 1 1

1 R 1 1 1 2

2 R 1 1 1 3

3 R 1 1 1 4

4 R 1 1 1 5

Results Using Gini Index:

Predicted values:

[‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’

‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’

‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’

‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’

‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’

‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’

‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’

‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’

‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’

‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’

‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’]


35

Confusion Matrix:

[[ 0 6 7]

[ 0 67 18]

[ 0 19 71]]

Accuracy : 73.40425531914893

Report : precision recall f1-score support

B 0.00 0.00 0.00 13

L 0.73 0.79 0.76 85

R 0.74 0.79 0.76 90

accuracy 0.73 188

macro avg 0.49 0.53 0.51 188

weighted avg 0.68 0.73 0.71 188

Results Using Entropy:

Predicted values:

[‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’

‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’

‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’

‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’

‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’

‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’

‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’

‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’

‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’

‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’

‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’]


36

Confusion Matrix:

[[ 0 6 7]

[ 0 63 22]

[ 0 20 70]]

Accuracy : 70.74468085106383

Report : precision recall f1-score support

B 0.00 0.00 0.00 13

L 0.71 0.74 0.72 85

R 0.71 0.78 0.74 90

accuracy 0.71 188

macro avg 0.47 0.51 0.49 188

weighted avg 0.66 0.71 0.68 188


37

LESSON - 6

CLASSIFICATION USING SUPPORT VECTOR MACHINE

AIM:

To implement Support Vector Machine algorithm to classify “Iris” dataset in Python.

INPUT DATAFRAME:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

1 5.1 3.5 1.4 0.2 Iris-setosa

2 4.9 3 1.4 0.2 Iris-setosa

3 4.7 3.2 1.3 0.2 Iris-setosa

4 4.6 3.1 1.5 0.2 Iris-setosa

5 5 3.6 1.4 0.2 Iris-setosa

6 5.4 3.9 1.7 0.4 Iris-setosa

7 4.6 3.4 1.4 0.3 Iris-setosa

8 5 3.4 1.5 0.2 Iris-setosa

9 4.4 2.9 1.4 0.2 Iris-setosa

10 4.9 3.1 1.5 0.1 Iris-setosa

11 5.4 3.7 1.5 0.2 Iris-setosa

12 4.8 3.4 1.6 0.2 Iris-setosa

13 4.8 3 1.4 0.1 Iris-setosa


38

14 4.3 3 1.1 0.1 Iris-setosa

15 5.8 4 1.2 0.2 Iris-setosa

16 5.7 4.4 1.5 0.4 Iris-setosa

17 5.4 3.9 1.3 0.4 Iris-setosa

18 5.1 3.5 1.4 0.3 Iris-setosa

19 5.7 3.8 1.7 0.3 Iris-setosa

CODING:

import numpy as np

import matplotlib.pyplot as plt

from sklearn import svm, datasets

def make_meshgrid(x, y, h=.02):

“””Create a mesh of points to plot in

Parameters

—————

x: data to base x-axis meshgrid on

y: data to base y-axis meshgrid on

h: stepsize for meshgrid, optional

Returns

———

xx, yy : ndarray

“””
39

x_min, x_max = x.min() - 1, x.max() + 1

y_min, y_max = y.min() - 1, y.max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),

np.arange(y_min, y_max, h))

return xx, yy

def plot_contours(ax, clf, xx, yy, **params):

“””Plot the decision boundaries for a classifier.

Parameters

—————

ax: matplotlib axes object

clf: a classifier

xx: meshgrid ndarray

yy: meshgrid ndarray

params: dictionary of params to pass to contourf, optional

“””

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

out = ax.contourf(xx, yy, Z, **params)

return out

# import some data to play with

iris = datasets.load_iris()

# Take the first two features. We could avoid this by using a two-dim dataset

X = iris.data[:, :2]

y = iris.target
40

# we create an instance of SVM and fit out data. We do not scale our

# data since we want to plot the support vectors

C = 1.0 # SVM regularization parameter

models = (svm.SVC(kernel=’linear’, C=C),

svm.LinearSVC(C=C, max_iter=10000),

svm.SVC(kernel=’rbf’, gamma=0.7, C=C),

svm.SVC(kernel=’poly’, degree=3, gamma=’auto’, C=C))

models = (clf.fit(X, y) for clf in models)

# title for the plots

titles = (‘SVC with linear kernel’,

‘LinearSVC (linear kernel)’,

‘SVC with RBF kernel’,

‘SVC with polynomial (degree 3) kernel’)

# Set-up 2x2 grid for plotting.

fig, sub = plt.subplots(2, 2)

plt.subplots_adjust(wspace=0.4, hspace=0.4)

X0, X1 = X[:, 0], X[:, 1]

xx, yy = make_meshgrid(X0, X1)

for clf, title, ax in zip(models, titles, sub.flatten()):

plot_contours(ax, clf, xx, yy,

cmap=plt.cm.coolwarm, alpha=0.8)
41

ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors=’k’)

ax.set_xlim(xx.min(), xx.max())

ax.set_ylim(yy.min(), yy.max())

ax.set_xlabel(‘Sepal length’)

ax.set_ylabel(‘Sepal width’)

ax.set_xticks(())

ax.set_yticks(())

ax.set_title(title)

plt.show()

OUTPUT:
42

LESSON - 7

CLASSIFICATION USING MULTILAYER PERCEPTRON


AIM:

To implement Multilayer Perceptron and use it to classify the “Bank Notes” dataset in
Python.

INPUT DATAFRAME:

Image.Var Image.Skew Image.Curt Entropy Class

3.6216 8.6661 -2.8073 -0.44699 0

4.5459 8.1674 -2.4586 -1.4621 0

3.866 -2.6383 1.9242 0.10645 0

3.4566 9.5228 -4.0112 -3.5944 0

0.32924 -4.4552 4.5718 -0.9888 0

4.3684 9.6718 -3.9606 -3.1625 0

3.5912 3.0129 0.72888 0.56421 0

2.0922 -6.81 8.4636 -0.60216 0

3.2032 5.7588 -0.75345 -0.61251 0

1.5356 9.1772 -2.2718 -0.73535 0

1.2247 8.7779 -2.2135 -0.80647 0

3.9899 -2.7066 2.3946 0.86291 0

1.8993 7.6625 0.15394 -3.1108 0

-1.5768 10.843 2.5462 -2.9362 0

3.404 8.7261 -2.9915 -0.57242 0

4.6765 -3.3895 3.4896 1.4771 0

2.6719 3.0646 0.37158 0.58619 0

0.80355 2.8473 4.3439 0.6017 0

1.4479 -4.8794 8.3428 -2.1086 0


43

CODING:

import numpy as np

import pandas as pd

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier

import matplotlib.pyplot as plt

from sklearn.metrics import classification_report

# loading the data from csv to the pandas

bnotes= pd.read_csv(r’D:\Machine Learning\bank_note_data.csv’)

bnotes.head()

bnotes.tail()

bnotes.shape

bnotes.isnull().sum()

print(bnotes.head())

print(bnotes[‘Class’].unique())

bnotes.describe(include=’all’)

x=bnotes.drop(‘Class’,axis=1)

y=bnotes[‘Class’]

print(x.head(2))

print(y.head(2))

#splitting the data

# Splitting the dataset into train and test

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

print(x_train.shape)
44

print(y_train.shape)

# training under sklearn

mlp = MLPClassifier(hidden_layer_sizes=(3,2),max_iter=500,activation=’relu’)

#fitting the data

mlp.fit(x_train,y_train)

# predicting the data and calculating the confusion matrix and classification report

pred=mlp.predict(x_test)

confusion_matrix(y_test,pred)

print(classification_report(y_test,pred))

OUTPUT:

READ DATA

Image.Var Image.Skew Image.Curt Entropy Class

0 3.62160 8.6661 -2.8073 -0.44699 0

1 4.54590 8.1674 -2.4586 -1.46210 0

2 3.86600 -2.6383 1.9242 0.10645 0

3 3.45660 9.5228 -4.0112 -3.59440 0

4 0.32924 -4.4552 4.5718 -0.98880 0

Image.Var Image.Skew Image.Curt Entropy Class

1367 0.40614 1.34920 -1.4501 -0.55949 1

1368 -1.38870 -4.87730 6.4774 0.34179 1

1369 -3.75030 -13.45860 17.5932 -2.77710 1

1370 -3.56370 -8.38270 12.3930 -1.28230 1

1371 -2.54190 -0.65804 2.6842 1.19520 1

(1372, 5)
45

NULL VALUE

Image.Var 0

Image.Skew 0

Image.Curt 0

Entropy 0

Class 0

CLASS LABEL

[0 1]

Image.Var Image.Skew Image.Curt Entropy

0 3.6216 8.6661 -2.8073 -0.44699

1 4.5459 8.1674 -2.4586 -1.46210

0 0

1 0

Name: Class, dtype: int64

SHAPE

(960, 4)

(960,)

PREDICTED VALUES

[1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 0 1

0010000101000001000001000110010010110

0000001001011010010100110000111111010

0001111100001000100111001001010111000

1111010010000111101101111101010001000

0010011001111101110010001110100011100

0110000100010001110000100010010010101
46

1100001011110010010111001100001000010

1011011110100001111000110101000101000

0110100010001011010100000010011010001

1100001010000011001010011001100000011

0 0 0 1 1]

CONFUSION MATRIX

[[240 1]

[ 0 171]]

CLASSIFICATION REPORT

precision recall f1-score support

0 1.00 1.00 1.00 241

1 0.99 1.00 1.00 171

accuracy 1.00 412

macro avg 1.00 1.00 1.00 412

weighted avg 1.00 1.00 1.00 412


47

LESSON - 8

REGRESSION

AIM:

To implement Linear Regression to predict the salary value from the “Salary” dataset in
Python.

INPUT DATAFRAME:

YearsExperience Salary

1.1 39343

1.3 46205

1.5 37731

2 43525

2.2 39891

2.9 56642

3 60150

3.2 54445

3.2 64445

3.7 57189

3.9 63218
48

4 55794

4 56957

4.1 57081

4.5 61111

4.9 67938

5.1 66029

5.3 83088

5.9 81363

CODING:

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

import matplotlib.pyplot as plt

# loading the data from csv to the pandas

Salary_Data= pd.read_csv(r’D:\Machine Learning\Salary_Data.csv’)

Salary_Data.head()

Salary_Data.tail()

Salary_Data.shape

Salary_Data.isnull().sum()

# splitting the feature of the target

X = Salary_Data.iloc[:,:-1].values

y= Salary_Data.iloc[:,1].values
49

print(X)

print(y)

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

regressor = LinearRegression()

regressor.fit(X_train,y_train)

# predicting the test results

y_pred=regressor.predict(X_test)

print(y_pred)

# calculate the coefficient

print(regressor.coef_)

print(regressor.intercept_)

# calculate r2 values

n =r2_score(y_test,y_pred)

print(n)

plt.scatter(X_test,y_test,color=’red’)

plt.plot(X_test,y_pred,color=’blue’)

plt.show()

OUTPUT:

Loading the data from csv to the pandas

Years Experience Salary

0 1.1 39343.0

1 1.3 46205.0

2 1.5 37731.0

3 2.0 43525.0

4 2.2 39891.0
50

Years Experience Salary

25 9.0 105582.0

26 9.5 116969.0

27 9.6 112635.0

28 10.3 122391.0

29 10.5 121872.0

Out[3]: (30, 2)

Splitting the feature of the target

YearsExperience 0

Salary 0

[[ 1.1]

[ 1.3]

[ 1.5]

[ 2. ]

[ 2.2]

[ 2.9]

[ 3. ]

[ 3.2]

[ 3.2]

[ 3.7]

[ 3.9]

[ 4. ]

[ 4. ]

[ 4.1]

[ 4.5]
51

[ 4.9]

[ 5.1]

[ 5.3]

[ 5.9]

[ 6. ]

[ 6.8]

[ 7.1]

[ 7.9]

[ 8.2]

[ 8.7]

[ 9. ]

[ 9.5]

[ 9.6]

[10.3]

[10.5]]

[ 39343. 46205. 37731. 43525. 39891. 56642. 60150. 54445. 64445.

57189. 63218. 55794. 56957. 57081. 61111. 67938. 66029. 83088.

81363. 93940. 91738. 98273. 101302. 113812. 109431. 105582. 116969.

112635. 122391. 121872.]

Predicting the test results

[ 40748.96184072 122699.62295594 64961.65717022 63099.14214487

115249.56285456 107799.50275317]

Calculate the Coefficient and Intercept

[9312.57512673]

26780.099150628186
52

Calculate r2 values

0.988169515729126
53

LESSON - 9

K-MEANS CLUSTERING
AIM:

To implement K-Means clustering algorithm to cluster the “Mall Customers” data in


Python.

INPUT DATAFRAME:

CustomerID Gender Age Annual_Income_(k$) Spending_Score

1 Male 19 15 39

2 Male 21 15 81

3 Female 20 16 6

4 Female 23 16 77

5 Female 31 17 40

6 Female 22 17 76

7 Female 35 18 6

8 Female 23 18 94

9 Male 64 19 3

10 Female 30 19 72

11 Male 67 19 14

12 Female 35 19 99

13 Female 58 20 15

14 Female 24 20 77

15 Male 37 20 13

16 Male 22 20 79

17 Female 35 21 35

18 Male 20 21 66

19 Male 52 23 29
54

CODING:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

import seaborn as sns

# loading the data from csv to the pandas

df = pd.read_csv(r’D:\Machine Learning\Mall_Customers.csv’)

df.head()

df.info()

df.isnull().sum()

# splitting the feature of the target

x = df.iloc[:,[3,4]].values

print(x)

# within cluster sum of squares-

wcss=[]

for i in range(1,11):

kmeans=KMeans(n_clusters=i,init=’k-means++’,random_state=42)

kmeans.fit(x)

wcss.append(kmeans.inertia_)

# plot an elbow graph- to show minimumcluster value

sns.set()

plt.plot(range(1,11),wcss)

plt.title(‘The elbow point graph’)

plt.xlabel(‘No of Clusters’)

plt.ylabel(‘wcss’)

plt.show()
55

# training kmeans cluser model

kmeans= KMeans(n_clusters=5,init =”k-means++”,random_state = 0)

y =kmeans.fit_predict(x)

print(y)

# visualising the plot

plt.figure(figsize=(8,8))

plt.scatter(x[y==0,0],x[y==0,1],s=50,c=”green”,label=”cluster1")

plt.scatter(x[y==1,0],x[y==1,1],s=50,c=”red”,label=”cluster2")

plt.scatter(x[y==2,0],x[y==2,1],s=50,c=”yellow”,label=”cluster3")

plt.scatter(x[y==3,0],x[y==3,1],s=50,c=”violet”,label=”cluster4")

plt.scatter(x[y==4,0],x[y==4,1],s=50,c=”blue”,label=”cluster5")

# Plot Kmeans Cluster

plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],s=100,c=”cyan”,label=”centroid”)

plt.title(“customer Group”)

plt.xlabel(“Annual Income”)

plt.ylabel(“Spending”)

plt.show()

OUTPUT:

Loading the data from csv to the pandas

RangeIndex: 200 entries, 0 to 199

Data columns (total 5 columns):

# Column Count Non-Null Dtype

— -—— ——— ——————— ——

0 CustomerID 200 non-null int64

1 Genre 200 non-null object

2 Age 200 non-null int64

3 Annual_Income_(k$) 200 non-null int64

4 Spending_Score 200 non-null int64


56

Splitting the feature of the target

[[ 15 39]

[ 15 81]

[ 16 6]

[ 16 77]

[ 17 40]

[ 17 76]

[ 18 6]

[ 18 94]

[ 19 3]

[ 19 72]

[ 19 14]

[ 19 99]

[ 20 15]

[ 20 77]

[ 20 13]

[ 20 79]

[ 21 35]

[ 21 66]

[ 23 29]

[ 23 98]

[ 24 35]

[ 24 73]

[ 25 5]

[ 25 73]

[103 69]
57

[113 8]

[113 91]

[120 16]

[120 79]

[126 28]

[126 74]

[137 18]

[137 83]]

Within cluster sum of squares

[269981.28]

[269981.28, 181363.59595959596]

[269981.28, 181363.59595959596, 106348.37306211118]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371, 37265.86520484347]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371, 37265.86520484347, 30241.34361793659]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371, 37265.86520484347, 30241.34361793659, 25336.946861471864]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371, 37265.86520484347, 30241.34361793659, 25336.946861471864,
21850.165282585633]

[269981.28, 181363.59595959596, 106348.37306211118, 73679.78903948834,


44448.45544793371, 37265.86520484347, 30241.34361793659, 25336.946861471864,
21850.165282585633, 19634.55462934998]
58

1 181363.59595959596

106348.37306211118

73679.78903948834

44448.45544793371

37265.86520484347

30241.34361793659

25336.946861471864

21850.165282585633

19634.55462934998

Plot an elbow graph

Training kmeans cluser model

[3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3

1313130310000000000000000000000000000

0000000000000000000000000000000000000
59

0000000000002420242420242424242024242

4242424242424242424242424242424242424

2 4 2 4 2 4 2 4 2 4 2 4 2 4 2]

Visualizing the plot


60

LESSON - 10

HIERARCHICAL CLUSTERING
AIM:

To implement Hierarchical Clustering on the given “Student” dataset in Python.

INPUT DATAFRAME:

Marks StudentID

18 A1

22 A2

43 A3

42 A4

27 A5

25 A6

CODING:

# Importing Modules

from scipy.cluster.hierarchy import linkage, dendrogram

import matplotlib.pyplot as plt

import pandas as pd

# Reading the DataFrame

seeds_df = pd.read_csv(r’d:\StudentInf.csv’)

# Remove the student IDs from the DataFrame, save them for later

varieties = list(seeds_df.pop('StudentID'))

print(varieties)

# Extract the measurements as a NumPy array

samples = seeds_df.values
61

“””

Perform hierarchical clustering on samples using the

linkage() function with the method=’complete’ keyword argument.

Assign the result to mergings.

“””

mergings = linkage(samples, method=’complete’)

“””

Plot a dendrogram using the dendrogram() function on mergings,

specifying the keyword arguments labels=varieties, leaf_rotation=90,

and leaf_font_size=6.

“””

dendrogram(mergings,

labels=varieties,

leaf_rotation=90,

leaf_font_size=6)

plt.title('Cluster')

plt.xlabel('Student Id')

plt.ylabel('Marks')

plt.show()

OUTPUT:

Student Id List

[‘A1’, ‘A2’, ‘A3’, ‘A4’, ‘A5’, ‘A6’]


62

We can choose a threshold value along the tallest vertical stretch of the dendrogram that is not
crossed by any horizontal merge; cutting the tree there forms the clusters. In this graph, a threshold
value of 12 forms two clusters: one cluster contains A3 and A4, while the other is made up of A1,
A2, A5 and A6.
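
To turn the chosen threshold into actual cluster labels, a small sketch is given below; it assumes the mergings array and varieties list from the coding section above and uses SciPy's fcluster function with the threshold of 12 discussed here:

from scipy.cluster.hierarchy import fcluster

labels = fcluster(mergings, t=12, criterion='distance')   # cut the dendrogram at height 12
print(dict(zip(varieties, labels)))                       # A3, A4 in one cluster; A1, A2, A5, A6 in the other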
