PCATG
MASTER OF
COMPUTER APPLICATIONS
SECOND YEAR
THIRD SEMESTER
PAPER - XI
MACHINE LEARNING
WELCOME
Warm Greetings.
I invite you to join the CBCS Semester System to gain rich knowledge leisurely, at your will and wish. Choose the right courses at the right times so as to raise your flag of success. We always encourage and enlighten you to excel and empower you. We are the cross-bearers who make you a torch-bearer with a bright future.
DIRECTOR
MASTER OF COMPUTER APPLICATIONS
PAPER - XI : MACHINE LEARNING
SECOND YEAR - THIRD SEMESTER
COURSE WRITER
Dr. S. Sasikala
Associate Professor in Computer Science
Institute of Distance Education
University of Madras
Chepauk, Chennai - 600 005.
MASTER OF COMPUTER APPLICATIONS
SECOND YEAR
THIRD SEMESTER
PAPER - XI
MACHINE LEARNING
SYLLABUS
Objective of the course: To introduce the basic concepts and techniques of Machine Learning and to develop skills in using recent machine learning software for solving practical problems.
Course Outcomes: After successful completion of this course, the students should be able to recognize the characteristics of machine learning that make it useful for real-world problems, and understand the foundation of generative models.
Unit 1: The Fundamentals of Machine Learning: The Machine Learning Landscape - Types
of Machine Learning Systems - Main Challenges of Machine Learning - Testing and
Validating. End-to-End Machine Learning Project - Look at the Big Picture - Get the Data -
Discover and Visualize the Data to Gain Insights - Prepare the Data for Machine Learning
Algorithms - Select and Train a Model - Fine-Tune Your Model - Launch, Monitor, and Maintain
Your System.
Unit 3: Tree Models: Decision trees – Ranking and probability estimation trees – tree
learning as variance reduction. Rule Models: Learning ordered rule lists – learning unordered
rule sets – descriptive rule learning – first–order rule learning. Linear Models: The least-
squares method – The perceptron – Support vector machines.
Unit 4: Distance-based Models: Neighbours and exemplars – Nearest-neighbour
classification – Distance-based clustering – K-Means algorithm – Hierarchical clustering.
Probabilistic Models: The normal distribution and its geometric interpretations – probabilistic
models for categorical data – Naïve Bayes model for classification – probabilistic models
with hidden values – Expectation-Maximization.
Text Books:
1. Flach, P, “Machine Learning: The Art and Science of Algorithms that Make Sense of
Data”, Cambridge University Press, 2012
2. Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems”, First Edition, 2017
(Chapters 1 and 2)
References
3. Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press, Third Edition, 2014
MASTER OF COMPUTER APPLICATIONS
SECOND YEAR
THIRD SEMESTER
PAPER - XI
MACHINE LEARNING
SCHEME OF LESSONS
11. Features
LESSON -1
1.1 Introduction
1.9 Summary
1.10 Keywords
1.1 Introduction
The goal of Machine Learning (ML) is to construct computer programs that can learn from data.
Machine learning is a method of data analysis that automates analytical model building. It is a
branch of artificial intelligence based on the idea that systems can learn from data, identify
patterns and make decisions with minimal human intervention. Machine learning is a subfield of
computer science that evolved from the study of pattern recognition and computational learning
theory in artificial intelligence. Machine learning explores the construction and study of algorithms
that can learn from and make predictions on data.
Machine Learning is the “field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959).
Machine Learning can also help humans learn (Figure 1.1): ML algorithms can be inspected to see what they have learned. Machine Learning is particularly well suited to:
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules:
one Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional approach:
the best Machine Learning techniques can find a solution.
There are so many different types of Machine Learning systems (Figure 1.2) that it is useful to
classify them in broad categories based on:
• Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and Reinforcement Learning)
• Whether or not they can learn incrementally on the fly (online versus batch learning)
• Whether they work by simply comparing new data points to known data points, or instead
detect patterns in the training data and build a predictive model, much like scientists do
(instance-based versus model-based learning)
Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semi-supervised learning, and Reinforcement Learning.
Output: Purchased, i.e. 0 or 1; 1 means the customer will purchase and 0 means that the customer won't purchase it.
• Figure B: It is a meteorological dataset that serves the purpose of predicting wind speed based on different parameters.
While training the model, the data is usually split in the ratio 80:20, i.e. 80% as training data and the rest as testing data. For the training data, we feed both the input and the output. The model learns from the training data only. We use different machine learning algorithms (discussed in detail in later lessons) to build our model. By learning, we mean that the model builds some logic of its own. Once the model is ready, it can be tested. At the time of testing, the input is fed from the remaining 20% of the data, which the model has never seen before; the model predicts some value and we compare it with the actual output to calculate the accuracy. The supervised algorithms are categorized as below, and a code sketch of this split follows the list of algorithms.
For example, in Figure B above, the output (wind speed) does not have a discrete value but is continuous within a particular range. The goal here is to predict a value as close to the actual output value as our model can, and evaluation is then done by calculating the error value. The smaller the error, the greater the accuracy of our regression model.
• Linear Regression
• Nearest Neighbor
• Decision Trees
• Random Forest
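As a small illustration of the 80:20 split described above, the following sketch (using scikit-learn and its built-in Iris dataset, chosen here purely for illustration) trains one of the listed algorithms, a decision tree, on 80% of the data and measures accuracy on the held-out 20%:

```python
# A minimal sketch of the 80:20 train/test split, assuming scikit-learn
# and its built-in Iris dataset purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                 # learn from the 80% training portion

y_pred = model.predict(X_test)              # predict on the unseen 20%
print("Test accuracy:", accuracy_score(y_test, y_pred))
```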
Applications
• Spam Classification: If you use a modern email system, chances are you’ve
encountered a spam filter. That spam filter is a supervised learning system. Fed email
examples and labels (spam/not spam), these systems learn how to preemptively filter
out malicious emails so that their user is not harassed by them. Many of these systems also let the user provide new labels, allowing the filter to learn the user's preferences.
• Face Recognition: Do you use Facebook? Most likely your face has been used in a supervised learning algorithm that is trained to recognize your face. Having a system that takes a photo, finds faces, and guesses who is in the photo (suggesting a tag) is a supervised process. It has multiple layers to it, finding faces and then identifying them, but it is still supervised nonetheless.
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns
or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition. Unsupervised learning is very much
the opposite of supervised learning. It features no labels. Instead, our algorithm would be fed a
lot of data and given the tools to understand the properties of the data. From there, it can learn
to group, cluster, and/or organize the data in a way such that a human (or other intelligent
algorithm) can come in and make sense of the newly organized data.
No labels are given to the learning algorithm, leaving it on its own to find structure in its input.
Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means
towards an end (feature learning).
Applications
Recommender Systems: If you've ever used YouTube or Netflix, you've most likely encountered a video recommendation system. These systems are oftentimes placed in the unsupervised domain. We know things about videos, such as their length, their genre, etc. We also know the watch history of many users. By taking into account users that have watched similar videos as you and then enjoyed other videos that you have yet to see, a recommender system can see this relationship in the data and prompt you with such a suggestion.
Buying Habits: It is likely that your buying habits are contained in a database somewhere and
that data is being bought and sold actively at this time. These buying habits can be used in
unsupervised learning algorithms to group customers into similar purchasing segments. This
helps companies market to these grouped segments and can even resemble recommender
systems.
Grouping User Logs: Less user-facing, but still very relevant, we can use unsupervised learning
to group user logs and issues. This can help companies identify central themes to issues their
customers face and rectify these issues, through improving a product or designing an FAQ to
handle common issues. Either way, it is something that is actively done and if you’ve ever
submitted an issue with a product or submitted a bug report, it is likely that it was fed to an
unsupervised learning algorithm to cluster it with other similar issues.
News Sections: Google News uses unsupervised learning to categorize articles on the same
story from various online news outlets. For example, the results of a presidential election could
be categorized under their label for “US” news.
Computer vision: Unsupervised learning algorithms are used for visual perception tasks, such
as object recognition.
Anomaly detection: Unsupervised learning models can comb through large amounts of data
and discover atypical data points within a dataset. These anomalies can raise awareness around
faulty equipment, human error, or breaches in security.
Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find the hidden patterns in the data, and then a suitable algorithm such as k-means clustering or hierarchical clustering is applied. Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized (Figure 1.6) into two types of
problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make a marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule mining is Market Basket Analysis.
o K-means clustering
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Apriori algorithm
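A minimal sketch of one of these methods, K-means clustering with scikit-learn, is shown below; the synthetic blob data is an assumption made purely so there is something to cluster:

```python
# A minimal k-means sketch with scikit-learn on synthetic, unlabeled data
# (make_blobs is assumed only to give us something to cluster).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # group points into 3 clusters

print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("Cluster centres:\n", kmeans.cluster_centers_)
```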
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has
to be hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of any Unsupervised Learning is that its application spectrum is limited.
Intuitively, one may imagine the three types of learning algorithms as follows: Supervised learning is where a student is under the supervision of a teacher at both home and school; Unsupervised learning is where a student has to figure out a concept on their own; and Semi-Supervised learning is where a teacher teaches a few concepts in class and gives homework questions based on similar concepts.
Applications
1. Speech Analysis: Since labeling of audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.
3. Protein Sequence Classification: Since DNA strands are typically very large in size, the
rise of Semi-Supervised learning has been imminent in this field.
Example: The problem is as follows (Figure 1.7): We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example illustrates the problem more concretely.
The above image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, that is the diamond, and to avoid the hurdles, that is the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the least hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that is the diamond.
Input: The input should be an initial state from which the model will start.
Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
Training: The training is based upon the input; the model will return a state and the user will decide whether to reward or punish the model based on its output.
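The reward-and-punish loop described above can be sketched with a tiny Q-learning example. The one-dimensional corridor environment, learning rate, and other values below are assumptions invented for illustration, not the robot-and-diamond example itself:

```python
# A toy Q-learning sketch of the reward-seeking idea described above.
# The environment is an assumed 1-D corridor with the reward at the right end.
import numpy as np

n_states, n_actions = 5, 2        # actions: 0 = move left, 1 = move right
goal = n_states - 1               # reaching the last state yields the reward
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != goal:
        # Explore occasionally, otherwise act greedily on current estimates.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
        reward = 1.0 if next_state == goal else -0.01   # small penalty per wasted step
        # Q-learning update rule.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned policy (0=left, 1=right):", Q.argmax(axis=1))
```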
1. Positive
• Maximizes Performance
• Too much Reinforcement can lead to overload of states which can diminish the result
2. Negative
• Increases Behavior
• RL can be used to create training systems that provide custom instruction and materials
according to the requirement of students.
3. The only way to collect information about the environment is to interact with it.
Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms
to work properly. Even for very simple problems you typically need thousands of examples, and
for complex problems such as image or speech recognition you may need millions of examples
(unless you can reuse parts of an existing model).
In order to generalize well, it is crucial that the training data be representative of the new cases
we want to generalize to. This is true whether we use instance-based learning or model-based
learning.
Poor-Quality Data
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality
measurements), it will be harder for the system to detect the underlying patterns, so the system is less likely to perform well. It is often well worth the effort to spend time cleaning up the training
data. For example:
• If some instances are clearly outliers, it may help to simply discard them or try to fix the
errors manually.
• If some instances are missing a few features (e.g., 5% of your customers did not specify
their age), we must decide whether we want to ignore this attribute altogether, ignore
these instances, fill in the missing values (e.g., with the median age), or train one model
with the feature and one model without it, and so on.
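A hedged sketch of two of these options for missing values, filling with the median and dropping the incomplete instances, is shown below using pandas; the tiny customer table is invented for illustration:

```python
# A small sketch of the missing-age options mentioned above, using pandas.
# The tiny DataFrame is invented purely for illustration.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 41, None, 33],
                   "income": [30, 42, 55, 28, 39]})

median_age = df["age"].median()
df_filled = df.fillna({"age": median_age})   # option: fill missing ages with the median

df_dropped = df.dropna(subset=["age"])       # option: drop the incomplete instances

print(df_filled)
print(df_dropped)
```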
Irrelevant Features
A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
• Feature selection: selecting the most useful features to train on among existing features.
• Feature extraction: combining existing features to produce a more useful one (as we
saw earlier, dimensionality reduction algorithms can help).
Overgeneralizing is something that we humans do all too often, and unfortunately machines
can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it
means that the model performs well on the training data, but it does not generalize well.
Underfitting is the opposite of overfitting: it occurs when the model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples. One of the main options for fixing underfitting is:
• Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
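The following sketch illustrates how a regularization hyperparameter constrains a linear model; scikit-learn's Ridge regression, the synthetic data, and the alpha values are all assumptions chosen for illustration:

```python
# A sketch of how a regularization hyperparameter constrains a linear model;
# synthetic data and alpha values are assumptions for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=200)   # only one truly useful feature

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in (100.0, 1.0, 0.01):          # stronger -> weaker regularization
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:>6}: test R^2 = {model.score(X_test, y_test):.3f}")
```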
Effective machine learning (ML) algorithms require quality training and testing data —
and often lots of it — to make accurate predictions. Different datasets serve different
purposes in preparing an algorithm to make predictions and decisions based on real-
world data.
Training Dataset: The sample of data used to fit the model. The actual dataset that we use to
train the model. The model sees and learns from this data. This type of data builds up the machine
learning algorithm. The data scientist feeds the algorithm input data, which corresponds to an
expected output. The model evaluates the data repeatedly to learn more about the data’s behavior
and then adjusts itself to serve its intended purpose.
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit
on the training dataset while tuning model hyperparameters. During training, validation data
infuses new data into the model that it hasn’t evaluated before. Validation data provides the first
test against unseen data, allowing data scientists to evaluate how well the model makes
predictions based on the new data. Not all data scientists use validation data, but it can provide
some helpful information to optimize hyperparameters, which influence how the model assesses
data. The evaluation becomes more biased as skill on the validation dataset is incorporated
into the model configuration. The validation set is used to evaluate a given model, but this is for
frequent evaluation. Hence the model occasionally sees this data, but never does it “Learn”
from this. So the validation set affects a model, but only indirectly. The validation set is also
known as the Dev set or the Development set. This makes sense since this dataset helps
during the “development” stage of the model.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on
the training dataset. The Test dataset provides the gold standard used to evaluate the model. It
is only used once a model is completely trained (using the train and validation sets). The test
set is generally what is used to evaluate competing models. After the model is built, testing data
once again validates that it can make accurate predictions. If training and validation data include
labels to monitor performance metrics of the model, the testing data should be unlabeled. Test
data provides a final, real-world check of an unseen dataset to confirm that the ML algorithm
was trained effectively. Many times, the validation set is used as the test set, but this is not good
practice. The test set is generally well curated. It contains carefully sampled data that spans
the various classes that the model would face, when used in the real world. It is common to use
80% of the data for training and hold out 20% for testing.
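The following sketch shows one common way to carve a dataset into training, validation, and test portions with two successive splits; the 60/20/20 proportions and the Iris dataset are assumptions for illustration:

```python
# A minimal sketch of creating train / validation / test portions
# (assumed 60/20/20 split) with two successive calls to train_test_split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets (0.25 * 0.8 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%
```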
ML algorithms require training data to achieve an objective. The algorithm will analyze this
training dataset, classify the inputs and outputs, then analyze it again. Trained enough, an
algorithm will essentially memorize all of the inputs and outputs in a training dataset — this
becomes a problem when it needs to consider data from other sources, such as real-world
customers.
Here is where validation data is useful. Validation data provides an initial check that the model
can return useful predictions in a real-world setting, which training data cannot do. The ML
algorithm can assess training data and validation data at the same time. Validation data is an
entirely separate segment of data, though a data scientist might carve out part of the training
dataset for validation — as long as the datasets are kept separate throughout the entirety of
training and testing.
For example, let’s say an ML algorithm is supposed to analyze a picture of a vertebrate and
provide its scientific classification. The training dataset would include lots of pictures of mammals,
but not all pictures of all mammals, let alone all pictures of all vertebrates. So, when the validation
data provides a picture of a squirrel, an animal the model hasn’t seen before, the data scientist
can assess how well the algorithm performs in that task. This is a check against an entirely
different dataset than the one it was trained on. Based on the accuracy of the predictions after the validation stage, data scientists can adjust hyperparameters such as the learning rate, input features and hidden layers. These adjustments prevent overfitting, in which the algorithm can make excellent determinations on the training data but can't effectively adjust predictions for additional data. The opposite problem, underfitting, occurs when the model isn't complex enough to make accurate predictions against either training data or new data.
Not all data scientists rely on both validation data and testing data. To some degree, both datasets
serve the same purpose: make sure the model works on real data. However, there are some
practical differences between validation data and testing data. Validation data occurs as part of
the model training process. Conversely, the model acts as a black box when we run testing
data through it. Thus, validation data tunes the model, whereas testing data simply confirms
that it works.
1.9 Summary
In this lesson, we have discussed the fundamental notions of machine learning and the process.
The various types of learning mechanisms are delineated in detail. Testing, training, and validating machine learning models are further explained with examples.
1.10 Keywords
LESSON – 2
2.1 Introduction
2.10 Summary
2.11 Keywords
2.1 Introduction
End-to-end machine learning is concerned with preparing data, training a model on it, and then deploying that model. The cycle of a machine learning project involves a sequence of steps, including selecting a model and fine-tuning it for better performance.
When learning about Machine Learning, it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places we can look to get data:
o Kaggle datasets
o http://dataportals.org/
o http://opendatamonitor.eu/
o http://quandl.com/
o Quora.com questions
o Datasets subreddit
Visualizing information is a critical part of data analysis. Being able to gain a visual understanding
of information creates a solid foundation from which to base points on. Data that stands alone
and is stored on computers over time becomes invisible. To be able to see information or
interpret it, it needs to be visualized. This turns the once invisible non-actionable data into
understandable pictures and images. When creating visualizations, tables alone are not enough
to be able to correctly and accurately interpret the data available.
Converting the information into a rudimentary graph doesn’t allow us to be able to identify a
pattern immediately. For this reason, it’s good to get creative and think outside the box a little.
For example, when referencing geographic locations by using a map of the area with embedded
visual data, it makes it considerably easier to digest the available statistics.
Where insights are readily available from an array of sources that focus on a wide range of
business issues, opportunities, developments, and risks, the accurate interpretation of data
can pay dividends for companies and secure their long-term future in the process (Figure 2.1).
Every new visualization is likely to give us some insights into our data. Some of those insights
might be already known (but perhaps not yet proven) while other insights might be completely
new or even surprising to us. Some new insights might mean the beginning of a story, while
others could just be the result of errors in the data, which are most likely to be found by visualizing
the data.
Data visualization is capable of producing significant levels of insight, provided that said data is
extracted in a clear and non-convoluted manner. Such insights can be used to delve deeper
into data sets, and have the ability to extract actionable information for businesses and clients
alike. For example, if an organization has been found to be operating at a loss, it would be wise
to draw on data to identify where changes can be made. This not only makes it easier for
employers to take mitigating measures, but it also makes such measures much easier to
explain to board members and employees.
While insights can be largely generated automatically, it's important to be mindful when crafting your visualization. As the infographic above shows, graphs need to not only contain the relevant data but also be functional. This means that your X and Y axes must be well labelled, in order to display information in the right context, and shouldn't take on too much data. Diffuse and convoluted visualizations can ultimately make identified issues appear much more difficult for viewers to grasp. The ideal data visualization must be easy to understand, informative, and eye-catching. Charts that are too complex, lacking context, missing vital information, or difficult to interpret due to design flaws can severely undermine the process.
Now that we understand how data visualization can be used, let’s apply the different types of
data visualization to their uses. There are numerous tools available to help create data
visualizations. Some are more manual and some are automated, but either way they should
allow you to make any of the following types of visualizations.
Line chart
A line chart illustrates changes over time. The x-axis is usually a period of time, while the y-axis
is quantity. So, this could illustrate a company’s sales for the year broken down by month or
how many units a factory produced each day for the past week.
Area chart
An area chart is an adaptation of a line chart where the area under the line is filled in to emphasize
its significance. The color fill for the area under each line should be somewhat transparent so
that overlapping areas can be discerned.
Bar chart
A bar chart also illustrates changes over time. But if there is more than one variable, a bar chart
can make it easier to compare the data for each variable at each moment in time. For example,
a bar chart could compare the company’s sales from this year to last year.
Histogram
A histogram looks like a bar chart, but measures frequency rather than trends over time. The x-
axis of a histogram lists the “bins” or intervals of the variable, and the y-axis is frequency, so
each bar represents the frequency of that bin. For example, you could measure the frequencies
of each answer to a survey question. The bins would be the answer: “unsatisfactory,” “neutral,”
and “satisfactory.” This would tell you how many people gave each answer.
Scatter plot
Scatter plots are used to find correlations. Each point on a scatter plot means “when x = this,
then y equals this.” That way, if the points trend a certain way (upward to the left, downward to
the right, etc.) there is a relationship between them. If the plot is truly scattered with no trend at
all, then the variables do not affect each other at all.
Bubble chart
A bubble chart is an adaptation of a scatter plot, where each point is illustrated as a bubble
whose area has meaning in addition to its placement on the axes. A pain point associated with
bubble charts is the limitations on sizes of bubbles due to the limited space within the axes. So,
not all data will fit effectively in this type of visualization.
Pie chart
A pie chart is the best option for illustrating percentages, because it shows each element as
part of a whole. So, if your data explains a breakdown in percentages, a pie chart will clearly
present the pieces in the proper proportions.
Gauge
A gauge can be used to illustrate the distance between intervals. This can be presented as a
round clock-like gauge or as a tube type gauge resembling a liquid thermometer. Multiple gauges
can be shown next to each other to illustrate the difference between multiple intervals.
Map
Much of the data dealt with in businesses has a location element, which makes it easy to
illustrate on a map. An example of a map visualization is mapping the number of purchases
customers made in each state in the U.S. In this example, each state would be shaded in and
states with less purchases would be a lighter shade, while states with more purchases would
be darker shades. Location information can also be very valuable for business leadership to
understand, making this an important data visualization to use.
Heat map
A heat map is basically a color-coded matrix. A formula is used to shade each cell of the matrix to represent the relative value or risk of that cell. Usually heat map colors range from green to red, with green being a better result and red being worse. This type of visualization is helpful because colors are quicker to interpret than numbers.
Frame diagram
Frame diagrams are basically tree maps that clearly show a hierarchical relationship structure. A frame diagram consists of branches, each of which has further branches connecting to it, with each level of the diagram consisting of more and more branches.
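As a brief illustration of two of the chart types above (a line chart and a histogram), the following matplotlib sketch uses invented monthly-sales and survey-rating data:

```python
# A brief sketch of a line chart and a histogram with matplotlib;
# the sales and survey data are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
sales = np.array([12, 14, 13, 17, 19, 22, 21, 24, 23, 26, 28, 30])
survey = np.random.default_rng(0).integers(1, 6, size=200)   # ratings 1-5

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, sales, marker="o")          # line chart: change over time
ax1.set_xlabel("Month")
ax1.set_ylabel("Sales (units)")
ax1.set_title("Monthly sales")

ax2.hist(survey, bins=5)                     # histogram: frequency per bin
ax2.set_xlabel("Survey rating")
ax2.set_ylabel("Frequency")
ax2.set_title("Survey responses")

plt.tight_layout()
plt.show()
```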
Data preparation may be one of the most difficult steps in any machine learning project.
The reason is that each dataset is different and highly specific to the project. Nevertheless,
there are enough commonalities across predictive modeling projects that we can define a
loose sequence of steps and subtasks that you are likely to perform.
This process provides a context in which we can consider the data preparation required for the
project, informed both by the definition of the project performed before data preparation and the
evaluation of machine learning algorithms performed after.
We can define data preparation as the transformation of raw data into a form that is more
suitable for modeling. Nevertheless, there are steps in a predictive modeling project before and
after the data preparation step that are important and inform the data preparation that is to be
performed. The process of applied machine learning consists of a sequence of steps; the details differ from project to project, but all projects share the same general steps. We are concerned here with the data preparation step (step 2), and there are common or standard
tasks that you may use or explore during the data preparation step in a machine learning project.
Feature Selection: Identifying those input variables that are most relevant to the task.
Data Cleaning
Data cleaning involves fixing systematic problems or errors in “messy” data. The most useful
data cleaning involves deep domain expertise and could involve identifying and addressing
specific observations that may be incorrect.
There are many reasons data may have incorrect values, such as being mistyped, corrupted,
duplicated, and so on. Domain expertise may allow obviously erroneous observations to be
identified as they are different from what is expected, such as a person’s height of 200 feet.
Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed.
This might involve removing a row or a column. Alternately, it might involve replacing observations
with new values.
Nevertheless, there are general data cleaning operations that can be performed, such as:
Identifying columns that have the same value or no variance and removing them.
Data cleaning is an operation that is typically performed first, prior to other data preparation
operations.
Feature Selection
Feature selection refers to techniques for selecting a subset of input features that are most
relevant to the target variable that is being predicted.
This is important as irrelevant and redundant input variables can distract or mislead learning
algorithms possibly resulting in lower predictive performance. Additionally, it is desirable to
develop models only using the data that is required to make a prediction, e.g. to favor the
simplest possible well performing model.
Feature selection techniques are generally grouped into those that use the target variable
(supervised) and those that do not (unsupervised). Additionally, the supervised techniques
can be further divided into models that automatically select features as part of fitting the model
(intrinsic), those that explicitly choose features that result in the best performing model
(wrapper) and those that score each input feature and allow a subset to be selected (filter).
Statistical methods, such as correlation, are popular for scoring input features. The features can then be ranked by their scores and a subset with the largest scores used as input to a model. The choice of statistical measure depends on the data types of the input and output variables.
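A minimal sketch of filter-style feature selection is shown below: each input feature is scored against the target and only the top k are kept. scikit-learn's SelectKBest, the ANOVA F-test score, and k=2 are assumptions chosen for illustration:

```python
# A sketch of filter-style feature selection: score each input feature
# against the target and keep the top k (k=2 is an arbitrary choice here).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Feature scores:", selector.scores_.round(1))
print("Selected shape:", X_selected.shape)     # only the 2 highest-scoring features remain
```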
Data Transforms
Data transforms are used to change the type or distribution of data variables. This is a large
umbrella of different techniques and they may be just as easily applied to input and output
variables. Data may have one of a few types, such as numeric or categorical, with subtypes
for each, such as integer and real-valued for numeric, and nominal, ordinal, and Boolean for
categorical.
Feature Engineering
Feature engineering refers to the process of creating new input variables from the available
data. Engineering new features is highly specific to your data and data types. As such, it often
requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data. This specialization makes it a challenging topic to generalize to
general methods.
Nevertheless, there are some techniques that can be reused, such as:
Adding new variables for each component of a compound variable, such as a date-time. A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables, referred to as polynomial features.
Polynomial Transform: Create copies of numerical input variables that are raised to a
power.
The theme of feature engineering is to add broader context to a single observation or decompose
a complex variable, both in an effort to provide a more straightforward perspective on the input
data.
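The polynomial transform mentioned above can be sketched as follows with scikit-learn's PolynomialFeatures; the tiny two-feature matrix and degree 2 are assumptions for illustration:

```python
# A small sketch of the polynomial transform: copies of the numeric inputs
# raised to a power and multiplied together.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())   # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
print(X_poly)
```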
Dimensionality Reduction
The number of input features for a dataset may be considered the dimensionality of the data.
For example, two input variables together can define a two-dimensional area where each row
of data defines a point in that space. This idea can then be scaled to any number of input
variables to create large multi-dimensional hyper-volumes.
The problem is, the more dimensions this space has (e.g. the more input variables), the more
likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that
space. This is referred to as the curse of dimensionality. This motivates feature selection,
although an alternative to feature selection is to create a projection of the data into a lower-
dimensional space that still preserves the most important properties of the original data.
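A minimal sketch of such a projection using principal component analysis (PCA) is shown below; reducing the four Iris features to two components is an arbitrary choice for illustration:

```python
# A minimal sketch of projecting data into a lower-dimensional space with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)           # 4 input features -> 2 components

print("Reduced shape:", X_reduced.shape)
print("Variance explained:", pca.explained_variance_ratio_.round(3))
```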
Model selection is the process of choosing one among many candidate models for a predictive
modeling problem. There may be many competing concerns when performing model selection
beyond model performance, such as complexity, maintainability, and available resources. The
two main classes of model selection techniques are probabilistic measures and resampling
methods.
Model selection is the process of selecting one final machine learning model from among a
collection of candidate machine learning models for a training dataset. Model selection is a
process that can be applied both across different types of models (e.g. logistic regression,
SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM). For example, we may have a dataset for which
we are interested in developing a classification or regression predictive model.
We do not know beforehand which model will perform best on this problem, as it is unknowable. Therefore, we fit and evaluate a suite of different models on the problem. Model
selection is the process of choosing one of the models as the final model that addresses the
problem. Model selection is different from model assessment. For example, we evaluate or
assess candidate models in order to choose the best one, and this is model selection. Whereas
once a model is chosen, it can be evaluated in order to communicate how well it is expected to
perform in general; this is model assessment.
Fitting models is relatively straightforward, although selecting among them is the true challenge
of applied machine learning. Firstly, we need to get over the idea of a “best” model. All models
have some predictive error, given the statistical noise in the data, the incompleteness of the
data sample, and the limitations of each different model type. Therefore, the notion of a perfect
or best model is not useful. Instead, we must seek a model that is “good enough.”
The project stakeholders may have specific requirements, such as maintainability and limited
model complexity. As such, a model that has lower skill but is simpler and easier to understand
may be preferred. Alternately, if model skill is prized above all other concerns, then the ability of
the model to perform well on out-of-sample data will be preferred regardless of the computational
complexity involved.
Therefore, a “good enough” model may refer to many things and is specific to your project,
such as:
A model that is sufficiently skillful given the time and resources available.
For example, we are not selecting a fit model, as all models will be discarded. This is because
once we choose a model, we will fit a new final model on all available data and start using it to
make predictions.
The best approach to model selection requires “sufficient” data, which may be nearly infinite
depending on the complexity of the problem. In this ideal situation, we would split the data
into training, validation, and test sets, then fit candidate models on the training set, evaluate and
select them on the validation set, and report the performance of the final model on the test set.
Instead, there are two main classes of techniques to approximate the ideal case of model
selection; they are:
Probabilistic Measures
Probabilistic measures involve analytically scoring a candidate model using both its performance
on the training dataset and the complexity of the model. It is known that training error is
optimistically biased, and therefore is not a good basis for choosing a model. The performance
can be penalized based on how optimistic the training error is believed to be. This is typically
achieved using algorithm-specific methods, often linear, that penalize the score based on the
complexity of the model.
Resampling Methods
Resampling methods seek to estimate the performance of a model (or more precisely, the
model development process) on out-of-sample data. This is achieved by splitting the training
dataset into sub train and test sets, fitting a model on the sub train set, and evaluating it on the
test set. This process may then be repeated multiple times and the mean performance across
each trial is reported.
Bootstrap.
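A brief sketch of one widely used resampling method, k-fold cross-validation, is shown below; the logistic regression model, the Iris data, and five folds are assumptions for illustration:

```python
# A sketch of a resampling method (k-fold cross-validation): the data is split
# repeatedly, a model is fit on each training portion and scored on the
# held-out portion, and the mean performance is reported.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```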
Model training is the phase in the data science development lifecycle where practitioners try to
fit the best combination of weights and bias to a machine learning algorithm to minimize a loss
function over the prediction range. The purpose of model training is to build the best mathematical
representation of the relationship between data features and a target label (in supervised learning)
or among the features themselves (unsupervised learning). Loss functions are a critical aspect
of model training since they define how to optimize the machine learning algorithms. Depending on the objective, the type of data and the algorithm, data science practitioners use different types of loss functions. One of the popular examples of a loss function is Mean Square Error (MSE).
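As a quick illustration of the MSE loss mentioned above, the sketch below computes it directly and with scikit-learn on a few invented predictions:

```python
# Mean Square Error computed directly and via scikit-learn;
# the true and predicted values are invented for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)       # average of squared errors
print(mse_manual, mean_squared_error(y_true, y_pred))   # both give 0.875
```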
Model training is the key step in machine learning that results in a model ready to be validated,
tested, and deployed. The performance of the model determines the quality of the applications
that are built using it. Quality of training data and the training algorithm are both important assets
during the model training phase. Typically, training data is split for training, validation and testing.
The training algorithm is chosen based on the end use case. There are a number of tradeoff points in deciding on the best algorithm: model complexity, interpretability, performance, compute requirements, etc. All these aspects of model training make it both an involved and important
process in the overall machine learning development cycle.
There are a lot of factors in play for deciding how much machine learning training data you
need. First and foremost is how important accuracy is. Say you're creating a sentiment analysis algorithm: the problem is complex, but a sentiment algorithm that achieves 85 or 90% accuracy is more than enough for most people's needs, and a false positive or negative here or there isn't going to substantively change much of anything.
Of course, more complicated use cases generally require more data than less complex ones. A computer vision model that only needs to identify foods will, as a rule of thumb, need less training data than one that is trying to identify objects in general. Note that there's really no such thing as too much high-quality data. Better training data, and more of it, will improve your models.
Fine-tuning takes a model that has already been trained for a particular task and then tweaks it to make it perform a second, similar task. Fine-tuning a machine learning predictive model is a crucial step to improve the accuracy of the forecasted results. Sometimes, we have to explore how model parameters can enhance the forecasting accuracy of our machine learning model.
Tuning is usually a trial-and-error process by which you change some hyperparameters (for example, the number of trees in a tree-based algorithm or the value of alpha in a linear algorithm), run the algorithm on the data again, then compare its performance on your validation set in order to determine which set of hyperparameters results in the most accurate model.
All machine learning algorithms have a “default” set of hyperparameters, a hyperparameter being “a configuration that is external to the model and whose value cannot be estimated from data.” Different algorithms have different hyperparameters. For example, regularized regression models have coefficient penalties, decision trees have a set number of branches, and neural networks have a set number of layers. When building models, analysts and data scientists choose the default configuration of these hyperparameters after running the model on several datasets.
While the generic set of hyperparameters for each algorithm provides a starting point for analysis and will generally result in a well-performing model, it may not have the optimal configuration for your particular dataset and business problem. In order to find the best hyperparameters for your data, you need to tune them. Model tuning allows you to customize your models so they generate the most accurate outcomes and give you highly valuable insights into your data, enabling you to make the most effective business decisions.
Bayesian Optimization
Bayesian Optimization has emerged as an efficient tool for hyperparameter tuning of machine
learning algorithms, more specifically, for complex models like deep neural networks. It offers
an efficient framework for optimising the highly expensive black-box functions without knowing
its form. It has been applied in several fields including learning optimal robot mechanics,
sequential experimental design, and synthetic gene design.
Evolutionary Algorithms
Evolutionary algorithms (EA) are optimization algorithms that work by modifying a set of candidate
solutions (a population) according to certain rules called operators. One of the main advantages of EAs is their generality: they can be used in a broad range of conditions due to their simplicity and independence from the underlying problem. In hyperparameter tuning problems,
evolutionary algorithms have proved to perform better than grid search techniques based on an
accuracy-speed ratio.
Gradient-Based Optimization
This method computes the gradient of a model selection criterion with respect to the hyperparameters and follows it to adjust them. This hyperparameter tuning methodology can be applied when some differentiability and continuity conditions of the training criterion are satisfied.
Grid Search
Grid search is a basic method for hyperparameter tuning. It performs an exhaustive search over the hyperparameter set specified by the user. The approach is the most straightforward and leads to the most accurate predictions; using this tuning method, users can find the optimal combination. Grid search is applicable for several hyperparameters, however, only with a limited search space.
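A minimal grid-search sketch is shown below; the SVM model, the grid values, and five-fold cross-validation are assumptions chosen for illustration:

```python
# A minimal grid-search sketch: exhaustively try every combination in a small,
# user-specified hyperparameter grid (the grid values here are assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```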
Keras Tuner
Keras Tuner is a library that allows users to find optimal hyperparameters for machine learning or deep learning models. The library helps to find kernel sizes, the learning rate for optimization, and other hyperparameters. Keras Tuner can be used to get the best parameters for various deep learning models for the highest accuracy.
Population-based Optimization
Population-based methods are essentially a series of random search methods based on genetic
algorithms, such as evolutionary algorithms, particle swarm optimization, among others. One
of the most widely used population-based methods is population-based training (PBT), proposed
by DeepMind. PBT is a unique method in two aspects:
ParamILS
ParamILS uses default and random settings for initialization and employs iterative first
improvement as a subsidiary local search procedure. It also uses a fixed number of random
moves for perturbation and always accepts better or equally good parameter configurations, but re-initializes the search at random with a certain probability.
Random Search
Random search can be seen as a basic improvement on grid search. The method refers to a randomised search over hyperparameters drawn from certain distributions over possible parameter values. The searching process continues till the desired accuracy is reached. Random search is similar to grid search but has proven to produce better results than the latter. The approach is often applied as the baseline of hyperparameter optimization (HPO) to measure the efficiency of newly designed algorithms. Though random search is more effective than grid search, it is still a computationally intensive method.
2.11 Summary
Understanding the machine learning project is crucial. The cycle of the entire process is
discussed in a step-by-step manner. Visualizing information is a critical part of data analysis.
There are numerous tools available to help create data visualizations. Model selection is the
process of choosing one among many candidate models for a predictive modeling problem.
The two main classes of techniques that approximate the ideal case of model selection are probabilistic measures and resampling methods. Fine-tuning a machine learning predictive model is a crucial step to improve the accuracy of the forecasted results.
2.12 Keywords
LESSON - 3
3.1 Introduction
3.3 Tasks
3.4 Models
3.5 Features
3.6 Summary
3.7 Keywords
3.1 Introduction
Machine learning is all about using the right features to build the right models that achieve the
right tasks. In essence, features define a ‘language’ in which we describe the relevant objects in
our domain, be they e-mails or complex organic molecules. We should not normally have to go
back to the domain objects themselves once we have a suitable feature representation, which
is why features play such an important role in machine learning.
The aim is to get a deep insight into logical models, geometric models, probabilistic models, and grouping and grading models.
3.3 Tasks: The problems that can be solved with machine learning
Various kinds of problems can be addressed by machine learning techniques. In machine
learning, multiclass or multinomial classification is the problem of classifying instances into
one of three or more classes. (Classifying instances into one of two classes is called binary
classification.) While some classification algorithms naturally permit the use of more than two
classes, others are by nature binary algorithms; these can, however, be turned into multinomial
classifiers by a variety of strategies. Multiclass classification should not be confused with multi-
label classification, where multiple labels are to be predicted for each instance.
Regression analysis is a set of statistical processes for estimating the relationships between a
dependent variable (often called the ‘outcome variable’) and one or more independent variables
(often called ‘predictors’, ‘covariates’, or ‘features’). The most common form of regression
analysis is linear regression, in which a researcher finds the line (or a more complex linear
function) that most closely fits the data according to a specific mathematical criterion.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense) to each other than to
those in other groups (clusters). It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including machine learning and pattern recognition.
Applications of machine learning are many, including external (client-centric) applications such as product recommendation, customer service, and demand forecasting, as well as internal applications that help businesses improve products or speed up manual and time-consuming processes. Machine learning algorithms are typically used in areas where the solution requires continuous improvement post-deployment. Adaptable machine learning solutions are incredibly dynamic and are adopted by companies across verticals. Some of them are described below.
Identifying Spam
Spam e-mail recognition was one example of the task or problem in machine learning. It
constitutes a binary classification task, which is easily the most common task in machine
learning. Spam identification is one of the most basic applications of machine learning. Most of
our email inboxes also have an unsolicited, bulk, or spam inbox, where our email provider
automatically filters unwanted spam emails.
One obvious variation is to consider classification problems with more than two classes. For
instance, we may want to distinguish different kinds of ham e-mails, e.g., work-related e-mails
and private messages. We could approach this as a combination of two binary classification
tasks: the first task is to distinguish between spam and ham, and the second task is, among
ham e-mails, to distinguish between work-related and private ones.
Recommender systems are one of the most characteristic and ubiquitous machine learning
use cases in day-to-day life. These systems are used everywhere by search engines, e-
commerce websites (Amazon), entertainment platforms (Google Play, Netflix), and multiple
web & mobile apps.
Prominent online retailers like Amazon and eBay often show a list of recommended products
individually for each of their consumers. These recommendations are typically based on
behavioral data and parameters such as previous purchases, item views, page views, clicks,
form fill-ins, purchases, item details (price, category), and contextual data (location, language,
device), and browsing history.
These recommender systems allow businesses to drive more traffic, increase customer
engagement, reduce churn rate, deliver relevant content and boost profits. All such recommended
products are based on a machine learning model’s analysis of customer’s behavioral data. It is
an excellent way for online retailers to offer extra value and enjoy various upselling opportunities
using machine learning.
Customer Segmentation
Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the
main challenges faced by any marketer. Businesses have a huge amount of marketing relevant
data from various sources such as email campaigns, website visitors and lead data.
Using data mining and machine learning, an accurate prediction for individual marketing offers
and incentives can be achieved. Using ML, savvy marketers can eliminate guesswork involved
in data-driven marketing. For example, given the pattern of behavior by a user during a trial
period and the past behaviors of all users, the chances of conversion to the paid version can be predicted. A model of this decision problem would allow a program to trigger customer
interventions to persuade the customer to convert early or better engage in the trial.
Advances in deep learning (a subset of machine learning) have stimulated rapid progress in
image & video recognition techniques over the past few years. They are used for multiple
areas, including object detection, face recognition, text detection, visual search, logo and
landmark detection, and image composition.
Since machines are good at processing images, Machine Learning algorithms can train Deep
Learning frameworks to recognize and classify images in the dataset with much more accuracy
than humans. Similar to image recognition, companies such as Shutterstock, eBay, Salesforce,
Amazon, and Facebook use Machine Learning for video recognition where videos are broken
down frame by frame and classified as individual digital images.
Fraudulent Transactions
Fraudulent banking transactions are quite a common occurrence today. However, it is not feasible
(in terms of cost involved and efficiency) to investigate every transaction for fraud, translating to
a poor customer service experience. Machine Learning in finance can automatically build super-accurate predictive models to identify and prioritize all kinds of possible fraudulent activities. Businesses can then create a data-based queue and investigate the high-priority incidents.
This allows businesses to deploy resources in an area where they will see the greatest return on the investigative
investment. Further, it also helps to optimize customer satisfaction by protecting their accounts
and not challenging valid transactions. Such fraud detection using machine learning can help
banks and financial organizations save money on disputes/chargebacks as one can train Machine
Learning models to flag transactions that appear fraudulent based on specific characteristics.
Demand Forecasting
The concept of demand forecasting is used in multiple industries, from retail and e-commerce
to manufacturing and transportation. It feeds historical data to Machine Learning algorithms
and models to predict the number of products, services, power, and more. It allows businesses
to efficiently collect and process data from the entire supply chain, reducing overheads and
increasing efficiency. ML-powered demand forecasting is very accurate, rapid, and transparent.
Businesses can generate meaningful insights from a constant stream of supply/demand data
and adapt to changes accordingly.
From Alexa and Google Assistant to Cortana and Siri, we have multiple virtual personal assistants
to find accurate information using our voice instruction, such as calling someone, opening an
email, scheduling an appointment, and more. These virtual assistants use Machine Learning
algorithms to record our voice instructions, send them over the server to the cloud, decode them using Machine Learning algorithms, and act accordingly.
Sentiment Analysis
Sentiment analysis is one of the most useful real-time machine learning applications; it helps
determine the emotion or opinion of the speaker or writer. For instance, if you've written
a review, an email, or any other kind of document, a sentiment analyser can assess
the actual thought and tone of the text. Sentiment analysis can be applied in
decision-making applications, review-based websites, and more.
Managing an increasing number of online customer interactions has become a pain point for
most businesses, because they simply don't have the customer support staff available to
deal with the sheer number of inquiries they receive daily. Machine learning algorithms have
made it possible, and quite easy, for chatbots and other similar automated systems to fill this
gap. This application of machine learning enables companies to automate routine and low-priority
tasks, freeing up their employees to handle higher-level customer service tasks.
Further, Machine Learning technology can access the data, interpret behaviours and recognize
patterns easily. This could also be used for customer support systems that work much like
a real human being and resolve customers' unique queries. The Machine Learning
models behind these voice assistants are trained on human languages and variations in the
human voice, because they have to translate the voice to words efficiently and then produce an
on-topic, intelligent response. If implemented the right way, machine learning
can streamline the entire process of customer issue resolution and offer much-needed
assistance along with enhanced customer satisfaction.
3.4 Models
Models form the central concept in machine learning, as they are what is learned from the
data in order to solve a given task. There is a considerable – not to say bewildering – range of
machine learning models to choose from. The basic idea for creating a taxonomy of algorithms
is that we divide the instance space in one of three ways:
Logical models
Geometric models
Probabilistic models
Logical models
Logic models are hypothesized descriptions of the chain of causes and effects leading to an
outcome of interest (e.g. prevalence of cardiovascular diseases, annual traffic collisions, etc.).
While they can be in a narrative form, logic models usually take the form of a graphical depiction of
the "if-then" (causal) relationships between the various elements leading to the outcome. However,
the logic model is more than the graphical depiction: it is also the theories, scientific evidence,
assumptions and beliefs that support it and the various processes behind it.
Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models. A logical expression is an expression that returns a Boolean value,
i.e., a True or False outcome. Once the data is grouped using a logical expression, it is
divided into homogeneous groupings for the problem we are trying to solve. For example, for a
classification problem, all the instances in a group belong to one class.
There are mainly two kinds of logical models: Tree models and Rule models.
Rule models consist of a collection of implications or IF-THEN rules. For tree-based models,
the 'if-part' defines a segment and the 'then-part' defines the behaviour of the model for this
segment. Rule models follow the same reasoning.
Tree models can be seen as a particular type of rule model where the if-parts of the rules are
organised in a tree structure. Both Tree models and Rule models use the same approach to
supervised learning. The approach can be summarised in two strategies: we could first find
the body of the rule (the concept) that covers a sufficiently homogeneous set of examples and
then find a label to represent the body. Alternately, we could approach it from the other direction,
i.e., first select a class we want to learn and then find rules that cover examples of the class.
A simple tree-based model is shown in Figure 3.1. The tree shows survival numbers of
passengers on the Titanic (“sibsp” is the number of spouses or siblings aboard). The values
under the leaves show the probability of survival and the percentage of observations in the leaf.
The model can be summarised as: The chances of survival were good if you were (i) a female
or (ii) a male younger than 9.5 years with less than 2.5 siblings.
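As an illustrative sketch (not the actual data behind Figure 3.1), the following Python snippet grows a small tree model with scikit-learn on a handful of made-up passenger records; the feature names and labels are assumptions for demonstration only.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: sex (0 = female, 1 = male), age, sibsp (siblings/spouses aboard)
X = [[0, 29, 0], [1, 4, 1], [1, 35, 0], [1, 54, 0], [0, 8, 3], [1, 6, 1]]
y = [1, 1, 0, 0, 1, 1]  # 1 = survived, 0 = did not survive (made-up labels)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=["sex", "age", "sibsp"]))  # prints the learned if-then splits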
To understand logical models further, we need to understand the idea of Concept Learning.
Concept Learning involves learning logical expressions or concepts from examples. The idea
of Concept Learning fits in well with the idea of Machine learning, i.e., inferring a general function
from specific training examples. Concept learning forms the basis of both tree-based and rule-
based models. More formally, Concept Learning involves acquiring the definition of a general
category from a given set of positive and negative training examples of the category. A Formal
Definition for Concept Learning is “The inferring of a Boolean-valued function from training
examples of its input and output.” In concept learning, we only learn a description for the positive
class and label everything that doesn’t satisfy that description as negative.
The following example (Figure 3.2) explains this idea in more detail.
A Concept Learning task called "Enjoy Sport", as shown above, is defined by a set of data from
some example days. Each example is described by six attributes. The task is to learn to predict the
value of Enjoy Sport for an arbitrary day based on the values of its attributes. The problem
can be represented by a series of hypotheses, each described by a conjunction
of constraints on the attributes. The training data represents a set of positive and negative
examples of the target function. In the example above, each hypothesis is a vector of six
constraints, specifying the values of the six attributes – Sky, AirTemp, Humidity, Wind, Water,
and Forecast. The training phase involves learning the set of days (as a conjunction of attributes)
for which Enjoy Sport = yes.
The instances X represent the set of all possible days, each described by the attributes Sky,
AirTemp, Humidity, Wind, Water, and Forecast.
We can also formulate Concept Learning as a search problem: we can think of Concept
Learning as searching through a predefined space of potential hypotheses to identify the
hypothesis that best fits the training examples. Concept learning is also an example of Inductive
Learning. Inductive learning, also known as discovery learning, is a process where the learner
discovers rules by observing examples; it differs from deductive learning,
where students are given rules that they then need to apply. Inductive learning is based on
the inductive learning hypothesis, which postulates that any hypothesis found to approximate
the target function well over a sufficiently large set of training examples is expected to
approximate the target function well over other unobserved examples.
This idea is the fundamental assumption of inductive learning.
Geometric models
In the previous section, we have seen that with logical models, such as decision trees, a logical
expression is used to partition the instance space. Two instances are similar when they end up
in the same logical segment. In this section, we consider models that define similarity by
considering the geometry of the instance space. In Geometric models, features could be
described as points in two dimensions (x- and y-axis) or a three-dimensional space (x, y, and z).
Even when features are not intrinsically geometric, they could be modelled in a geometric
manner (for example, temperature as a function of time can be modelled in two axes). In
geometric models, there are two ways we could impose similarity.
We could use geometric concepts like lines or planes to segment (classify) the
instance space. These are called Linear models.
Linear models
Linear models are relatively simple. In this case, the function is represented as a linear
combination of its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension
and a and b are arbitrary scalars, then ax1 + bx2 represents a linear combination of x1 and x2.
In the simplest case where f(x) represents a straight line, we have an equation of the form f (x)
= mx + c where c represents the intercept and m represents the slope.
Linear models are parametric, which means that they have a fixed form with a small number
of numeric parameters that need to be learned from data. For example, in f(x) = mx + c, m and c
are the parameters that we are trying to learn from the data. This technique
is different from tree or rule models, where the structure of the model (e.g., which features to
use in the tree, and where) is not fixed in advance.
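As a minimal sketch of this idea, the snippet below learns the two parameters m and c of f(x) = mx + c from noisy data with NumPy's least-squares routine; the data values are invented purely for illustration.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.random.normal(scale=0.1, size=x.shape)  # noisy points around a line

A = np.vstack([x, np.ones_like(x)]).T      # design matrix with a column of ones for the intercept
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(m, c)                                # learned slope and intercept, close to 2 and 1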
Linear models are stable, i.e., small variations in the training data have only a limited impact on
the learned model. In contrast, tree models tend to vary more with the training data, as the
choice of a different split at the root of the tree typically means that the rest of the tree is
different as well. As a result of having relatively few parameters, Linear models have low
variance and high bias. This implies that Linear models are less likely to overfit the training
data than some other models. However, they are more likely to underfit. For example, if we
want to learn the boundaries between countries based on labelled data, then linear models are
not likely to give a good approximation.
Distance-based models
Distance-based models are the second class of Geometric models. Like Linear models, distance-
based models are based on the geometry of data. As the name implies, distance-based models
work on the concept of distance. In the context of Machine learning, the concept of distance is
not based on merely the physical distance between two points. Instead, we could think of the
distance between two points considering the mode of transport between two points. Travelling
between two cities by plane covers less distance physically than by train because a plane is
unrestricted. Similarly, in chess, the concept of distance depends on the piece used – for
example, a Bishop can move diagonally. Thus, depending on the entity and the mode of travel,
the concept of distance can be experienced differently. The distance metrics commonly used
are Euclidean, Minkowski, Manhattan, and Mahalanobis.
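The snippet below is a small sketch of these distance metrics using SciPy, computed for two illustrative points; the covariance matrix used for the Mahalanobis distance is estimated from random toy data purely for demonstration.

import numpy as np
from scipy.spatial import distance

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(distance.euclidean(a, b))        # straight-line distance: sqrt(3^2 + 4^2) = 5.0
print(distance.cityblock(a, b))        # Manhattan distance: |3| + |4| = 7.0
print(distance.minkowski(a, b, p=3))   # Minkowski distance with p = 3

toy = np.random.rand(50, 2)                        # toy sample, only to obtain a covariance matrix
VI = np.linalg.inv(np.cov(toy, rowvar=False))      # inverse covariance required by Mahalanobis
print(distance.mahalanobis(a, b, VI))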
Distance is applied through the concepts of neighbours and exemplars. Neighbours are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids, which find a centre of mass according to a chosen distance metric, or medoids,
which find the most centrally located data point. The most commonly used centroid is the arithmetic
mean, which minimizes squared Euclidean distance to all other points.
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position
of all the points in the figure from the centroid point. This definition extends to any object in n-
dimensional space: its centroid is the mean position of all the points. Medoids are similar in
concept to means or centroids. Medoids are most commonly used on data when a mean or
centroid cannot be defined. They are used in contexts where the centroid is not representative
of the dataset, such as in image data. Examples of distance-based models include the nearest-
neighbour models, which use the training data as exemplars – for example, in classification.
The K-means clustering algorithm also uses exemplars to create clusters of similar data points.
Probabilistic models
The third family of machine learning algorithms is the probabilistic models. A probabilistic classifier
is a classifier that is able to predict, given an observation of an input, a probability distribution
over a set of classes, rather than only outputting the most likely class that the observation
should belong to. Probabilistic classifiers provide classification that can be useful in its own
right or when combining classifiers into ensembles.
We have seen before that the k-nearest neighbour algorithm uses the idea of distance (e.g.,
Euclidean distance) to classify entities, and logical models use a logical expression to partition
the instance space. In this section, we see how probabilistic models use the idea of
probability to classify new entities.
Probabilistic models see features and target variables as random variables. The process of
modelling represents and manipulates the level of uncertainty with respect to these variables.
There are two types of probabilistic models: predictive and generative. Predictive probability
models use the idea of a conditional probability distribution P(Y | X), from which Y can be
predicted given X. Generative models estimate the joint distribution P(Y, X). Once we know
the joint distribution of a generative model, we can derive any conditional or marginal
distribution involving the same variables. Thus, the generative model is capable of creating new
data points and their labels, knowing the joint probability distribution. The joint distribution looks
for a relationship between two variables. Once this relationship is inferred, it is possible to infer
new data points.
The goal of any probabilistic classifier is, given a set of features (x_0 through x_n) and a set of
classes (c_0 through c_k), to determine the probability of the features occurring in each
class and to return the most likely class. Therefore, for each class we need to calculate
P(c_i | x_0, …, x_n). We can do this using Bayes' rule, defined below.
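In its standard form, Bayes' rule for this setting can be written as:

P(c_i | x_0, …, x_n) = P(x_0, …, x_n | c_i) · P(c_i) / P(x_0, …, x_n)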
The Naïve Bayes algorithm is based on the idea of Conditional Probability. Conditional probability
is based on finding the probability that something will happen, given that something else has
already happened. The task of the algorithm then is to look at the evidence and to determine the
likelihood of a specific class and assign a label accordingly to each entity.
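A minimal sketch of this idea with scikit-learn's Naive Bayes implementation is shown below; the feature counts and class labels are hypothetical, chosen only to demonstrate the calls.

from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts for four training documents and their class labels
X = [[2, 1, 0], [3, 0, 1], [0, 2, 3], [1, 0, 4]]
y = [0, 0, 1, 1]

model = MultinomialNB()
model.fit(X, y)
print(model.predict([[2, 0, 1]]))        # most likely class for a new document
print(model.predict_proba([[2, 0, 1]]))  # estimated P(class | features) for each class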
Grouping models do this by breaking up the instance space into groups or segments, the
number of which is determined at training time. One could say that grouping models have a
fixed and finite ‘resolution’ and cannot distinguish between individual instances beyond this
resolution.
3.5 Features
In machine learning, features are individual independent variables that act as inputs to your
system. Models use these features to make predictions, and through the feature engineering
process new features can also be derived from existing ones. Put simply, you can consider one
column of your data set to be one feature; features are also known as variables or attributes,
and the number of features is known as the dimensionality. Depending on what
you are trying to analyze, the features you include in your dataset can vary widely.
Feature engineering is the process of using domain knowledge of the data to create features
that make machine learning algorithms work properly. If feature engineering is performed
well, it helps to improve the predictive power of machine learning algorithms by creating
features from the raw data that facilitate the learning process.
Features are very important in machine learning, being the building blocks of datasets; the quality of
the features in your dataset has a major impact on the quality of the insights you will gain when
using it for machine learning. However, the relevant features are not necessarily the same across
different business problems and industries, so you need a strong understanding of the business
goal of your data science project.
On the other hand, using the feature selection and feature engineering processes you
can improve the quality of your dataset's features, although this can be a tedious and difficult process.
If these techniques work well, you will get an optimal dataset with all of the important features
bearing on your specific business problem, which leads to the best possible model
and the most useful insights.
The top methods of feature selection in machine learning include:
Universal Selection
Correlation Matrix with Heatmap
Feature Importance
Feature Creation: Creating features involves creating new variables which will be most helpful
for our model. This can be adding or removing some features.
Feature Extraction: Feature extraction is the process of extracting features from a data set to
identify useful information. Without distorting the original relationships or significant information,
this compresses the amount of data into manageable quantities for algorithms to process.
Exploratory Data Analysis: Exploratory data analysis (EDA) is a powerful and simple tool that
can be used to improve your understanding of your data, by exploring its properties. The technique
is often applied when the goal is to create new hypotheses or find patterns in the data. It’s often
used on large amounts of qualitative or quantitative data that haven’t been analyzed before.
Some of the techniques listed may work better with certain algorithms or datasets, while others
may be useful in all situations.
Imputation
When it comes to preparing data for machine learning, missing values are one of the most
typical issues. Human errors, data flow interruptions, privacy concerns, and other factors can
all contribute to missing values. Whatever the cause, missing values affect the performance of
machine learning models, and the main goal of imputation is to handle these missing
values. There are two types of imputation:
Numerical Imputation: To figure out what numbers should be assigned to people currently in
the population, we usually use data from completed surveys or censuses. These data sets can
include information about how many people eat different types of food, whether they live in a city
or country with a cold climate, and how much they earn every year. That is why numerical
imputation is used to fill gaps in surveys or censuses when certain pieces of information are
missing.
Categorical Imputation: When dealing with categorical columns, replacing missing values
with the most frequently occurring value in the column is a smart solution. However, if you believe the
values in the column are evenly distributed and there is no dominating value, imputing a category
like "Other" is a better choice, as your imputation is more likely to converge to a random selection in
this scenario.
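A small sketch of both kinds of imputation with scikit-learn's SimpleImputer is shown below; the column names and values are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [52000, np.nan, 61000, 58000],
                   "region": ["north", "south", np.nan, "south"]})

num_imp = SimpleImputer(strategy="median")          # numerical imputation
cat_imp = SimpleImputer(strategy="most_frequent")   # categorical imputation (or strategy="constant", fill_value="Other")

df["income"] = num_imp.fit_transform(df[["income"]]).ravel()
df["region"] = cat_imp.fit_transform(df[["region"]]).ravel()
print(df)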
Handling Outliers
Outlier handling is a technique for removing outliers from a dataset. It can be applied
on a variety of scales to produce a more accurate representation of the data. Outliers affect
the model's performance; depending on the model, the effect can be large or minimal – linear
regression, for example, is particularly susceptible to outliers. This procedure should be
completed prior to model training. The various methods of handling outliers include:
Removal: Outlier-containing entries are deleted from the distribution. However, if there are
outliers across numerous variables, this strategy may result in a big chunk of the dataset
being lost.
Replacing values: Alternatively, the outliers could be handled as missing values and replaced
with suitable imputation.
Capping: Using an arbitrary value or a value from a variable distribution to replace the maximum
and minimum values.
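The capping approach can be sketched as below, clipping values outside chosen percentile bounds; the 1st and 99th percentiles are an illustrative choice, not a fixed rule.

import numpy as np

values = np.array([12, 14, 15, 13, 14, 250, 11, -90, 15])   # two obvious outliers
low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)                          # replace extremes with the boundary values
print(capped)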
Log Transform
Log transform is one of the most used techniques among data scientists. It is mostly used to turn a
skewed distribution into a normal, or less skewed, distribution. We take the log of the values in a
column and use those values as the new column in this transform. It is used to handle heavily skewed
data, and after the transformation the distribution becomes closer to normal.
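A minimal sketch of the log transform is shown below; log1p (the log of 1 + x) is used so that zero values in a column do not cause errors.

import numpy as np

skewed = np.array([1, 2, 2, 3, 5, 8, 400, 1200])   # a heavily right-skewed column
transformed = np.log1p(skewed)
print(transformed)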
One-hot encoding
Scaling
Feature scaling is one of the most pervasive and difficult problems in machine learning, yet it is
one of the most important things to get right. In order to train a predictive model, we need data
with a known set of features that needs to be scaled up or down as appropriate. This section
explains how feature scaling works and why it is important, as well as some tips for getting
started with it.
After a scaling operation, the continuous features become similar in terms of range. Although
this step isn’t required for many algorithms, it’s still a good idea to do so. Distance-based
algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as
model input. There are two common ways for scaling:
Normalization: All values are scaled in a specified range between 0 and 1 via normalization (or
min-max normalization). This modification has no influence on the feature’s distribution; however,
it does exacerbate the effects of outliers due to lower standard deviations. As a result, it is
advised that outliers be dealt with prior to normalization.
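A small sketch of min-max normalization with scikit-learn is shown below, on an illustrative single-feature column containing an outlier.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0], [4.0], [6.0], [100.0]])        # note the outlier at 100
scaled = MinMaxScaler().fit_transform(X)
print(scaled.ravel())                               # all values mapped into the range [0, 1]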
Feature Tools
It helps you construct meaningful features for machine learning and predictive
modelling by combining your raw data with what you know about your data.
It provides APIs to verify that only legitimate data is utilised for calculations, preventing
label leakage in your feature vectors.
Feature tools includes a low-level function library that may be layered to generate
features.
Its AutoML library (EvalML) helps you build, optimize, and evaluate machine learning
pipelines.
AutoFeat
AutoFeat helps to build linear prediction models with automated feature engineering and
selection. AutoFeat allows you to specify the units of the input variables in order to avoid the
construction of physically nonsensical features.
TsFresh
TsFresh automatically calculates a large number of time-series characteristics, or features, and
provides methods to evaluate the explanatory power and importance of those characteristics for
regression and classification tasks.
OneBM
OneBM interacts directly with a database’s raw tables. It slowly joins the tables, taking different
paths on the relational tree. It recognises simple data types (numerical or categorical) and
complicated data types (set of numbers, set of categories, sequences, time series, and texts)
in the joint results and applies pre-defined feature engineering approaches to the supplied types.
ExploreKit
Based on the idea that extremely informative features are typically the consequence of
manipulating basic ones, ExploreKit identifies common operators to alter each feature
independently or combine multiple of them. Instead of running feature selection on all developed
features, which can be quite huge, meta learning is used to rank candidate features.
3.6 Summary
Machine learning is all about using the right features to build the right models that achieve the
right tasks. Features are very important in machine learning, being the building blocks of datasets.
Top methods of feature selection in ML are Universal Selection, Correlation Matrix with
Heatmap, and Feature Importance. Models form the central concept in machine learning, as
they are what is learned from the data in order to solve a given task.
3.7 Keywords
LESSON – 4
4.1 Introduction
4.2 Classification
4.9 Summary
4.10 Keywords
4.1 Introduction
In this lesson and the next we take a bird's-eye view of the wide range of different tasks that can
be solved with machine learning techniques. 'Task' here refers to whatever it is that machine
learning is intended to improve performance at. For a classification task, for example, we need to
learn an appropriate classifier from training data. Many different types of classifiers exist: linear
classifiers, Bayesian classifiers, distance-based classifiers, to name a few. For each of these
tasks we will discuss what it is, what variants exist, how performance at the task could be
assessed, and how it relates to other tasks.
The objects of interest in machine learning are usually referred to as instances. The set of all
possible instances is called the instance space. To illustrate, X could be the set of all possible
e-mails. We furthermore distinguish between the label space L and the output space Y. The
label space is used in supervised learning to label the examples. In order to achieve the task
under consideration we need a model: a mapping from the instance space to the output space.
For instance, in classification the output space is a set of classes, while in regression it is the
set of real numbers. In order to learn such a model we require a training set Tr of labelled
instances (x, l(x)), also called examples, where l : X → L is a labelling function.
The most commonly encountered machine learning scenario is where the label space coincides
with the output space. That is, Y = L and we are trying to learn an approximation l̂ : X → L to the
true labelling function l, which is only known through the labels it assigned to the training data.
This scenario covers both classification and regression. In cases where the label space and
the output space differ, this usually serves the purpose of learning a model that outputs more
than just a label – for instance, a score for each possible label. In this case we have Y = R^k,
with k = |L| the number of labels.
4.3 Classification
Classification is the most common task in machine learning. A classifier is a mapping ĉ : X → C,
where C = {C1, C2, . . . , Ck} is a finite and usually small set of class labels. We will sometimes
also use Ci to indicate the set of examples of that class. We use the 'hat' to indicate that ĉ(x)
is an estimate of the true but unknown function c(x). Examples for a classifier take the form
(x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance. Learning a classifier
involves constructing the function ĉ such that it matches c as closely as possible (and not just
on the training set, but ideally on the entire instance space X).
In the simplest case we have only two classes, which are usually referred to as positive and
negative, ⊕ and ⊖, or +1 and −1. Two-class classification is often called binary classification (or
concept learning, if the positive class can be meaningfully called a concept). Spam e-mail
filtering is a good example of binary classification, in which spam is conventionally taken as the
positive class, and ham as the negative class (clearly, positive here doesn't mean 'good'!).
Other examples of binary classification include medical diagnosis (the positive class here is
having a particular disease) and credit card fraud detection.
Binary classification refers to those classification tasks that have two class labels, such as
spam detection and medical diagnosis.
Typically, binary classification tasks involve one class that is the normal state and another
class that is the abnormal state. For example “not spam” is the normal state and “spam” is the
abnormal state. Another example is “cancer not detected” is the normal state of a task that
involves a medical test and “cancer detected” is the abnormal state. The class for the normal
state is assigned the class label 0 and the class with the abnormal state is assigned the class
label 1.
Popular algorithms that can be used for binary classification include:
Logistic Regression
k-Nearest Neighbors
Decision Trees
Naive Bayes
Multi-class classification refers to those classification tasks that have more than two class
labels.
Examples include:
Face classification.
Unlike binary classification, multi-class classification does not have the notion of normal and
abnormal outcomes. Instead, examples are classified as belonging to one among a range of
known classes. The number of class labels may be very large on some problems. For example,
a model may predict a photo as belonging to one among thousands or tens of thousands of
faces in a face recognition system.
Problems that involve predicting a sequence of words, such as text translation models, may
also be considered a special type of multi-class classification. Each word in the sequence of
words to be predicted involves a multi-class classification where the size of the vocabulary
defines the number of possible classes that may be predicted and could be tens or hundreds of
thousands of words in size.
It is common to model a multi-class classification task with a model that predicts a Multinoulli
probability distribution for each example. The Multinoulli distribution is a discrete probability
distribution that covers the case where an event has a categorical outcome, e.g. k in {1, 2,
3, …, K}. For classification, this means that the model predicts the probability of an example
belonging to each class label. Many algorithms used for binary classification can also be used for
multi-class classification; examples include:
k-Nearest Neighbors.
Decision Trees.
Naive Bayes.
Random Forest.
Gradient Boosting.
Algorithms that are designed for binary classification can be adapted for multi-class
problems. This involves fitting multiple binary classification models, either one model for each
class versus all other classes (called one-vs-rest) or one model for each pair of classes (called
one-vs-one).
One-vs-Rest: Fit one binary classification model for each class vs. all other classes.
One-vs-One: Fit one binary classification model for each pair of classes.
Binary classification algorithms that can use these strategies for multi-class classification
include:
Logistic Regression.
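A minimal sketch of the one-vs-rest strategy with scikit-learn is shown below, wrapping logistic regression around a synthetic 3-class problem; swapping in OneVsOneClassifier would give the one-vs-one strategy. The dataset is generated purely for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))   # one binary model per class
ovr.fit(X, y)
print(ovr.predict(X[:5]), y[:5])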
Multi-label classification refers to those classification tasks that have two or more class labels,
where one or more class labels may be predicted for each example. Consider the example
of photo classification, where a given photo may have multiple objects in the scene and a
model may predict the presence of multiple known objects in the photo, such as “bicycle,”
“apple,” “person,” etc.
This is unlike binary classification and multi-class classification, where a single class label is
predicted for each example. It is common to model multi-label classification tasks with a model
that predicts multiple outputs, with each output predicted as a Bernoulli probability
distribution. This is essentially a model that makes multiple binary classification predictions for
each example.
Classification algorithms used for binary or multi-class classification cannot be used directly
for multi-label classification. Specialized versions of standard classification algorithms,
so-called multi-label versions of the algorithms, can be used instead.
Imbalanced classification refers to classification tasks where the number of examples in each
class is unequally distributed. Typically, imbalanced classification tasks are binary classification
tasks where the majority of examples in the training dataset belong to the normal class and a
minority of examples belong to the abnormal class.
Examples include:
Fraud detection.
Outlier detection.
These problems are modeled as binary classification tasks, although they may require specialized
techniques. Specialized techniques may be used to change the composition of samples in the
training dataset by undersampling the majority class or oversampling the minority class.
Examples include:
Random Undersampling.
SMOTE Oversampling.
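A small sketch of both resampling strategies is shown below; it assumes the imbalanced-learn (imblearn) package is installed, and the class weights are chosen only to create an artificial imbalance.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                                               # imbalanced class counts

X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)       # oversample the minority class
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)  # undersample the majority class
print(Counter(y_over), Counter(y_under))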
Specialized modeling algorithms may be used that pay more attention to the minority class
when fitting the model on the training dataset, such as cost-sensitive versions of standard
machine learning algorithms.
Understanding True Positive, True Negative, False Positive and False Negative in a Confusion
Matrix
True Positive (TP): The actual value was positive and the model predicted a positive value.
True Negative (TN): The actual value was negative and the model predicted a negative value.
False Positive (FP): The actual value was negative but the model predicted a positive value.
False Negative (FN): The actual value was positive but the model predicted a negative value.
For example, suppose we had a classification dataset with 1000 data points. We fit a classifier
on it and get the below confusion matrix (Figure 4.2) :
True Positive (TP) = 560; meaning 560 positive class data points were correctly classified
by the model
True Negative (TN) = 330; meaning 330 negative class data points were correctly
classified by the model
False Positive (FP) = 60; meaning 60 negative class data points were incorrectly classified
as belonging to the positive class by the model
False Negative (FN) = 50; meaning 50 positive class data points were incorrectly classified
as belonging to the negative class by the model
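The four counts can be read off a confusion matrix directly; the sketch below uses scikit-learn on a pair of hypothetical label vectors (the 1000-point example above is only summarised by its counts, so the vectors here are not the same data).

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)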
Before we answer this question, let’s think about a hypothetical classification problem.
Let's say you want to predict how many people are infected with a contagious virus before
they show symptoms, and isolate them from the healthy population. The two values for our
target variable would be Sick and Not Sick.
Our dataset is an example of an imbalanced dataset: there are 947 data points for the negative
class and 3 data points for the positive class. This is how we'll calculate the accuracy:
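In terms of the confusion matrix cells, accuracy is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)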
96%! Not bad! But it is giving the wrong idea about the result. Our model is saying “I can predict
sick people 96% of the time”. However, it is doing the opposite. It is predicting the people who
will not get sick with 96% accuracy while the sick are spreading the virus! Do you think this is a
correct metric for our model given the seriousness of the issue? Shouldn’t we be measuring
how many positive cases we can predict correctly to arrest the spread of the contagious virus?
Or maybe, out of the correctly predicted cases, how many are positive cases to check the
reliability of our model?
This is where we come across the dual concept of Precision and Recall.
Precision tells us how many of the correctly predicted cases actually turned out to be positive.
Recall tells us how many of the actual positive cases we were able to predict correctly with our
model. And here’s how we can calculate Recall(Figure 4.4):
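In terms of the confusion matrix, the two measures are defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)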
We can easily calculate Precision and Recall for our model by plugging the values into the
above equations:
Fifty percent of the correctly predicted cases turned out to be positive cases, whereas 75% of
the actual positives were successfully predicted by our model. Precision is a useful metric in cases
where False Positives are a higher concern than False Negatives. Precision is important in music
or video recommendation systems, e-commerce websites, and so on, where wrong results could
lead to customer churn and harm the business.
Recall is a useful metric in cases where False Negative trumps False Positive. Recall is important
in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive
cases should not go undetected! In our example, Recall would be a better metric because we
don’t want to accidentally discharge an infected person and let them mix with the healthy
population thereby spreading the contagious virus. Now you can understand why accuracy
was a bad metric for our model. But there will be cases where there is no clear distinction
between whether Precision is more important or Recall. What should we do in those cases?
We combine them!
F1-Score
In practice, when we try to increase the precision of our model, the recall goes down, and vice-
versa. The F1-score captures both the trends in a single value:
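The F1-score is defined as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)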
F1-score is the harmonic mean of Precision and Recall, and so it gives a combined idea
about these two metrics. It is maximum when Precision is equal to Recall. But there is a catch
here: the interpretability of the F1-score is poor, because we don't know whether our
classifier is maximizing precision or recall. So we use it in combination with other evaluation
metrics, which gives us a complete picture of the result.
Example:
Let’s start with an example confusion matrix for a binary classifier (though it can easily be
extended to the case of more than two classes):
There are two possible predicted classes: “yes” and “no”. If we were predicting
the presence of a disease, for example, “yes” would mean they have the disease,
and “no” would mean they don’t have the disease.
The classifier made a total of 165 predictions (e.g., 165 patients were being
tested for the presence of that disease).
Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55
times.
In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let’s now define the most basic terms, which are whole numbers (not rates):
true positives (TP): These are cases in which we predicted yes (they have the
disease), and they do have the disease.
true negatives (TN): We predicted no, and they don’t have the disease.
false positives (FP): We predicted yes, but they don’t actually have the disease.
(Also known as a “Type I error.”)
false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a “Type II error.”)
This is a list of rates that are often computed from a confusion matrix for a binary classifier
(their defining formulas, in terms of TP, TN, FP and FN, are collected after the list):
True Positive Rate: When it’s actually yes, how often does it predict yes?
False Positive Rate: When it’s actually no, how often does it predict yes?
True Negative Rate: When it’s actually no, how often does it predict no?
Prevalence: How often does the yes condition actually occur in our sample?
Null Error Rate: This is how often you would be wrong if you always predicted the
majority class. (In our example, the null error rate would be 60/165=0.36 because if you
always predicted yes, you would only be wrong for the 60 “no” cases.) This can be a
useful baseline metric to compare your classifier against. However, the best classifier
for a particular application will sometimes have a higher error rate than the null error
rate, as demonstrated by the Accuracy Paradox.
F Score: This is a weighted average of the true positive rate (recall) and precision.
ROC Curve: This is a commonly used graph that summarizes the performance
of a classifier over all possible thresholds. It is generated by plotting the True
Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the
threshold for assigning observations to a given class.
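For reference, the rates listed above can be written in terms of the confusion matrix cells as:

True Positive Rate (Recall, Sensitivity) = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
True Negative Rate (Specificity) = TN / (TN + FP)
Prevalence = (TP + FN) / (TP + TN + FP + FN)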
The ROC curve (Figure 4.6) is plotted with TPR against the FPR where TPR is on the y-axis
and FPR is on the x-axis.
An excellent model has an AUC near 1, which means it has a good measure of separability.
A poor model has an AUC near 0, which means it has the worst measure of separability; in fact,
it is reversing the result, predicting 0s as 1s and 1s as 0s. And when the AUC is
0.5, the model has no class separation capacity whatsoever.
In the figures that follow, the red distribution curve is for the positive class (patients with the disease)
and the green distribution curve is for the negative class (patients without the disease).
Figure 4.7 shows the ideal situation: when the two curves do not overlap at all, the model has an
ideal measure of separability and is perfectly able to distinguish between the positive class and
the negative class.
When the two distributions overlap (Figure 4.8), we introduce type 1 and type 2 errors. Depending
on the threshold, we can minimize or maximize them. When the AUC is 0.7, there is a
70% chance that the model will be able to distinguish between the positive class and the negative
class.
The worst situation is shown in Figure 4.9: when the AUC is approximately 0.5, the model has no
discrimination capacity to distinguish between the positive class and the negative class.
Sensitivity and Specificity are inversely proportional to each other: when we increase
Sensitivity, Specificity decreases, and vice versa. When we decrease the classification threshold,
we get more positive predictions, which increases sensitivity and decreases specificity. Similarly,
when we increase the threshold, we get more negative predictions, giving higher specificity and
lower sensitivity. Since FPR is 1 − specificity, when TPR increases, FPR also increases, and vice versa.
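A minimal sketch of producing the data behind such a curve with scikit-learn is shown below; the synthetic dataset and model are illustrative only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_te, scores))               # area under the ROC curve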
Scoring is widely used in machine learning to mean the process of generating new values,
given a model and some new input. The generic term “score” is used, rather than “prediction,”
because the scoring process can generate so many different types of values:
A probability value, indicating the likelihood that a new input belongs to some existing
category.
The new data that you provide as input generally needs to have the same columns that were
used to train the model, minus the label, or outcome column. Columns that are used solely as
identifiers are usually excluded when training a model, and thus should be excluded when
scoring as well.
However, identifiers such as primary keys can easily be re-combined with the scoring dataset
later, by using the Add Columns module. Before you perform scoring on your dataset, always
check for missing values and nulls. When data used as input for scoring has missing values,
the missing values are used as inputs. Because nulls are propagated, the result is usually a
missing value.
Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:
Assign Data to Clusters: Assigns data to clusters by using an existing trained clustering
model.
Use this module if you want to cluster new data based on an existing K-Means clustering
model.
This module replaces the Assign to Clusters (deprecated) module, which has been
deprecated but is still available for use in existing experiments.
Score Matchbox Recommender: Scores predictions for a dataset by using the Matchbox
recommender.
Use this module if you want to generate recommendations, find related items or
users, or predict ratings.
Use this module for all other regression and classification models, as well as
some anomaly detection models.
The purpose of a ranking model is to rank, i.e., to produce a permutation of items in new, unseen lists
in a way that is similar to the rankings in the training data. Ranking is a central part of many information
retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and
online advertising. Training data consists of queries and documents matched together
with a relevance degree for each match. It may be prepared manually by human assessors (or
raters, as Google calls them), who check results for some queries and determine relevance of
each result. It is not feasible to check the relevance of all documents, and so typically a technique
called pooling is used — only the top few documents, retrieved by some existing ranking models
are checked. Alternatively, training data may be derived automatically by analyzing clickthrough
logs (i.e. search results which got clicks from users), query chains, or such search engines’
features as Google’s Search Wiki.
Training data is used by a learning algorithm to produce a ranking model which computes the
relevance of documents for actual queries. Typically, users expect a search query to complete
in a short time (such as a few hundred milliseconds for web search), which makes it impossible
to evaluate a complex ranking model on each document in the corpus, and so a two-phase
scheme is used.
First, a small number of potentially relevant documents are identified using simpler retrieval
models which permit fast query evaluation, such as the vector space model or the Boolean model.
This phase is called top-k document retrieval, and many heuristics have been proposed in the
literature to accelerate it, such as using a document's static quality score and tiered indexes. In the
second phase, a more accurate but computationally expensive machine-learned model is used
to re-rank these documents.
Learning to rank algorithms have been applied in areas other than information retrieval:
In software engineering, learning-to-rank methods have been used for fault localization.
Class probability estimation is obviously more difficult than classification. Given a way of
generating class probabilities, classification error is minimized as long as the correct class is
predicted with maximum probability. However, a method for classification does not imply a
method of generating accurate probability estimates: the estimates that yield the correct
classification may be quite poor when assessed according to the quadratic or informational
loss.
Consider the case of probability estimation for a dataset with two classes. If the predicted
probabilities are on the correct side of the 0.5 threshold commonly used for classification, no
classification errors will be made. However, this does not mean that the probability estimates
themselves are accurate. They may be systematically too optimistic—too close to either 0 or
1—or too pessimistic—not close enough to the extremes. This type of bias will increase the
measured quadratic or informational loss, and will cause problems when attempting to minimize
the expected cost of classifications based on a given cost matrix.
As with classifiers, we can now ask the question of how good these class probability estimators
are. A slight complication here is that, as already remarked, we do not have access to the true
probabilities. One trick that is often applied is to define a binary vector (I[c(x) = C1], . . . , I[c(x) = Ck]),
which has the i-th bit set to 1 if x's true class is Ci and all other bits set to 0, and use
these as the 'true' probabilities. We can then define the squared error (SE) of the predicted
probability vector p̂(x) = (p̂1(x), . . . , p̂k(x)) as follows,
and the mean squared error (MSE) as the average squared error over all instances in the test
set:
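Writing Te for the test set, one common convention (which includes a factor of one half so that the squared error of a single prediction lies between 0 and 1) is:

SE(x) = 1/2 · Σ_{i=1..k} (p̂_i(x) − I[c(x) = C_i])²
MSE(Te) = (1 / |Te|) · Σ_{x ∈ Te} SE(x)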
4.9 Summary
Classification is the most common task in machine learning. Binary classification refers to
those classification tasks that have two class labels. Multi-class classification refers to those
classification tasks that have more than two class labels. Multi-label classification refers to
those classification tasks that have two or more class labels, where one or more class labels
may be predicted for each example. The performance of such classifiers can be summarised
by means of a table known as a contingency table or confusion matrix.
4.10 Keywords
LESSON – 5
CONCEPT LEARNING
Structure
5.1 Introduction
5.4 Learnability
5.5 Summary
5.6 Keywords
5.1 Introduction
In this lesson we consider methods for learning logical expressions or concepts from examples,
which lies at the basis of both tree models and rule models. In concept learning we only learn a
description for the positive class, and label everything that doesn’t satisfy that description as
negative. We will pay particular attention to the generality ordering that plays an important role
in logical models. Inducing general functions from specific training examples is a main issue of
machine learning. Concept learning is acquiring the definition of a general category from
given sample positive and negative training examples of the category. Concept learning can
be seen as the problem of searching through a predefined space of potential hypotheses for the
hypothesis that best fits the training examples.
The simplest concept learning setting is where we restrict the logical expressions describing
concepts to conjunctions of literals. In most supervised machine learning algorithms, our main
goal is to find a possible hypothesis from the hypothesis space that could map the
inputs to the proper outputs. The hypothesis space has a general-to-specific ordering of
hypotheses, and the search can be efficiently organized by taking advantage of this naturally
occurring structure over the hypothesis space.
The following Figure 5.1, shows the common method to find out the possible hypothesis from
the Hypothesis space:
Hypothesis space is the set of all possible legal hypotheses. This is the set from which the
machine learning algorithm determines the best possible hypothesis (only one) that would best
describe the target function or the outputs.
Hypothesis (h):
A hypothesis is a function that best describes the target in supervised machine learning. The
hypothesis that an algorithm comes up with depends upon the data and also upon
the restrictions and bias that we have imposed on the data. To better understand the hypothesis
space and hypothesis, consider the following coordinate plane that shows the distribution of some
data:
Suppose we have test data for which we have to determine the outputs or results. The test
data is as shown below:
The way in which the coordinate plane would be divided depends on the data, the algorithm and the
constraints. All the legal possible ways in which we can divide the coordinate plane to predict the
outcome of the test data compose the hypothesis space.
• Each hypothesis will be a vector of six constraints, specifying the values of the six attributes.
The most general hypothesis – that every day is a positive example – is represented by
<?, ?, ?, ?, ?, ?>
and the most specific hypothesis – that no day is a positive example – is represented by
<Ø, Ø, Ø, Ø, Ø, Ø>
• EnjoySport concept learning task requires learning the sets of days for which EnjoySport=yes,
describing this set by a conjunction of constraints over the instance attributes.
Given
– Training examples D: positive and negative examples of the target function
Determine
– A hypothesis h in H such that h(x) = c(x) for all x in X
Many algorithms for concept learning organize the search through the hypothesis space by
relying on a general-to-specific ordering of hypotheses. By taking advantage of this naturally
occurring structure over the hypothesis space, we can design learning algorithms that
exhaustively search even infinite hypothesis spaces without explicitly enumerating every
hypothesis.
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
• Now consider the sets of instances that are classified positive by h1 and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive.
– In fact, any instance classified positive by h1 will also be classified positive by h2.
More-General-Than Relation
For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if h(x) = 1.
More-General-Than-Or-Equal Relation: h1 is more-general-than-or-equal-to h2 (written h1 ≥ h2)
if and only if any instance that satisfies h2 also satisfies h1.
h1 is (strictly) more-general-than h2 (h1 > h2) if and only if h1 ≥ h2 is true and h2 ≥ h1 is false. We also
say h2 is more-specific-than h1.
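The generality ordering can be illustrated with a short sketch (not from the text) in which a hypothesis is a tuple of six constraints, '?' matches any value, and one hypothesis is at least as general as another if each of its constraints is no tighter:

def satisfies(instance, h):
    # an instance satisfies h if every constraint is '?' or matches the instance value
    return all(c == "?" or c == v for c, v in zip(h, instance))

def more_general_or_equal(ha, hb):
    # ha >= hb: every constraint of ha is '?' or equal to the corresponding constraint of hb
    return all(ca == "?" or ca == cb for ca, cb in zip(ha, hb))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True: h2 is more general than h1
print(more_general_or_equal(h1, h2))   # False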
5.5 Learnability
The downside of a more expressive concept language is that it may be harder to learn. The field
of computational learning theory studies exactly this question of learnability. To kick things off
we need a learning model: a clear statement of what we mean if we say that a concept language
is learnable. One of the most common learning models is the model of probably approximately
correct (PAC) learning. PAC-learnability means that there exists a learning algorithm that gets
it mostly right, most of the time. The model makes an allowance for mistakes on non-typical
examples: hence the ‘mostly right’ or ‘approximately correct’. The model also makes an
allowance for sometimes getting it completely wrong.
The only realistic expectation of a good learner is that with high probability it will learn a close
approximation to the target concept. In Probably Approximately Correct (PAC) learning, one
requires that, given small parameters ε and δ, with probability at least 1 − δ a learner produces a
hypothesis with error at most ε. The only reason we can hope for this is the consistent distribution
assumption.
Consider a concept class C defined over an instance space X (containing instances of length
n), and a learner L using a hypothesis space H. The concept class C is PAC learnable by L
using H if, for all f ∈ C, for all distributions D over X, and for fixed 0 < ε, δ < 1, given m examples
sampled independently according to D, the algorithm L produces, with probability at least (1 − δ),
a hypothesis h ∈ H that has error at most ε, where m is polynomial in 1/ε, 1/δ, n and size(H).
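For a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training examples, a standard bound on the number of examples needed is:

m ≥ (1/ε) · (ln |H| + ln(1/δ))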
Two questions arise here:
– Is there enough information in a polynomial-size sample to approximate f?
– Is there an efficient algorithm that can process the sample and produce a
good hypothesis h?
5.6 Summary
Concept learning is acquiring the definition of a general category from given sample positive
and negative training examples of the category. A hypothesis is a function that best describes
the target in supervised machine learning. The field of computational learning theory studies
the question of learnability. One of the most common learning models is the model of
probably approximately correct (PAC) learning.
5.7 Keywords
LESSON – 6
TREE MODELS
Structure
6.1 Introduction
6.7 Summary
6.8 Keywords
6.1 Introduction
Tree models are among the most popular models in machine learning. Trees are expressive
and easy to understand, and of particular appeal to computer scientists due to their recursive
‘divide-and-conquer’ nature. A feature tree is a tree such that each internal node (the nodes that
are not leaves) is labelled with a feature, and each edge emanating from an internal node is
labelled with a literal. The set of literals at a node is called a split. Each leaf of the tree represents
a logical expression, which is the conjunction of literals encountered on the path from the root
of the tree to the leaf. The extension of that conjunction (the set of instances covered by it) is
called the instance space segment associated with the leaf. Let us discuss in detail in this
lesson.
BestSplit(D, F) returns the best set of literals to be put at the root of the tree.
Algorithm 6.1 GrowTree(D, F)
Step 1. if D is homogeneous then return a leaf labelled with Label(D);
Step 2. S ← BestSplit(D, F);
Step 3. split D into subsets Di according to the literals in S;
Step 4. for each i do
Step 5. if Di is not empty then Ti ← GrowTree(Di, F), else Ti is a leaf labelled with Label(D);
Step 6. end
Step 7. return a tree whose root is labelled with S and whose children are Ti
The above Algorithm is a divide-and-conquer algorithm: it divides the data into subsets, builds a
tree for each of those and then combines those subtrees into a single tree. Divide-and-conquer
algorithms are a tried-and-tested technique in computer science. They are usually implemented
recursively, because each subproblem (to build a tree for a subset of the data) is of the same
form as the original problem.
This works as long as there is a way to stop the recursion, which is what the first line of the
algorithm does. However, it should be noted that such algorithms are greedy: whenever there is
a choice (such as choosing the best split), the best alternative is selected on the basis of the
information then available, and this choice is never reconsidered. This may lead to sub-optimal
choices. An alternative would be to use a backtracking search algorithm which can return an
optimal solution, at the expense of increased computation time and memory requirements.
A Decision Tree is a supervised learning technique that can be used for both classification and
regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured
classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome. In a decision tree, there are two
types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions
and have multiple branches, whereas leaf nodes are the outputs of those decisions
and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset. It is a
graphical representation for getting all the possible solutions to a problem/decision based on
given conditions. It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure. In order to build
a tree, we use the CART algorithm, which stands for Classification and Regression Tree
algorithm. A decision tree simply asks a question, and based on the answer (Yes/No), it further
split the tree into subtrees.
Below diagram (Figure 6.1) explains the general structure of a decision tree:
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand. The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next
node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using Algorithm 6.2 below:
Algorithm 6.2
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values for the best attribute.
Step-4: Generate the decision tree node which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.
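The short Python sketch below (not part of the original lesson) illustrates these steps with scikit-learn; the tiny job-offer style dataset, the feature names and the values are invented purely for illustration, and criterion="entropy" tells the library to use information gain as its attribute selection measure.

# A minimal sketch of training a decision tree with scikit-learn.
# The tiny "job offer" style dataset below is invented for illustration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary_in_lakhs, distance_km, cab_facility(0/1)]
X = [[12, 5, 1], [6, 20, 0], [15, 25, 1], [7, 3, 0], [14, 30, 0], [9, 8, 1]]
y = ["Accept", "Decline", "Accept", "Decline", "Decline", "Accept"]

# criterion="entropy" makes the tree use information gain as its ASM
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned decision rules and predict for a new candidate
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[10, 10, 1]]))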
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels.
The next decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider
the below figure 6.2:
While implementing a Decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
Information Gain:
o Information gain measures the change in entropy after the dataset is split on an attribute.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(Parent) − [Weighted Average] × Entropy(Children)
Suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not. The first feature is "Energy", and the second feature is "Motivation", which takes three values: "No motivation", "Neutral" and "Highly motivated".
Let's see how our decision tree will be made using these two features. We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.
Once we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
Our parent entropy was near 0.99, and after looking at this value of information gain we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
Now that we know what entropy is and what its formula is, we need to know how exactly it works in this algorithm. Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells how random our data is. A pure sub-split means that you should be getting either all "yes" or all "no". Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no". We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to make a decision tree, we need to calculate the impurity of each split, and when the purity is 100%, we make it a leaf node.
To check the impurity of feature 2 (Figure 6.4) and feature 3, we will use the Entropy formula.
For feature 3,
We can clearly see from the tree itself that the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater number of "yes" and it is easy to decide here. Always remember that the higher the Entropy, the lower the purity and the higher the impurity.
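As a quick check of these numbers, here is a minimal Python sketch (not from the text) that computes entropy and the information gain of the 8 "yes"/4 "no" split described above, with children of 5/2 and 3/2.

# A small sketch computing entropy and information gain for the split
# described above (8 "yes"/4 "no" parent; children 5/2 and 3/2).
import math

def entropy(yes, no):
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:                      # 0 * log(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

parent = entropy(8, 4)                         # about 0.918
left, right = entropy(5, 2), entropy(3, 2)     # impurity of each child node
weighted_children = (7 / 12) * left + (5 / 12) * right
information_gain = parent - weighted_children
print(round(parent, 3), round(information_gain, 3))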
Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
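For a node with class proportions pj, the Gini index is 1 − Σ pj². The sketch below (an illustration, not part of the original text) scores the same 5/2 and 3/2 child nodes used in the entropy example; CART would prefer the split with the lower weighted Gini.

# A minimal sketch of the Gini index, Gini = 1 - sum(p_j^2),
# applied to the same 5/2 and 3/2 child nodes used above.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

left, right = gini([5, 2]), gini([3, 2])
# Weighted Gini of the split (lower is purer, so preferred by CART)
split_gini = (7 / 12) * left + (5 / 12) * right
print(round(left, 3), round(right, 3), round(split_gini, 3))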
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree. A too-large tree increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset. Therefore, a technique that decreases the size of the
learning tree without reducing accuracy is known as Pruning. There are mainly two types of tree pruning techniques used:
Advantages of the Decision Tree:
o It is simple to understand as it follows the same process which a human follows while making any decision in real life.
Disadvantages of the Decision Tree:
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
Probability estimation trees (PETs) generalize classification trees in that they assign class
probability distributions instead of class labels to examples that are to be classified. It has
further been shown that the use of probability correction improves the performance of PETs.
Tree induction is one of the most effective and widely used methods for building classification
models. However, many applications require cases to be ranked by the probability of class
membership. Probability estimation trees (PETs) have the same attractive features as
classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on
large data sets).
All probabilities add to 1.0 (which is always a good check). For example, for two tosses of a fair coin the probability of getting at least one Head is 0.25 + 0.25 + 0.25 = 0.75.
Reduction in Variance is a method for splitting the node used when the target variable is
continuous, i.e., regression problems. It is so-called because it uses variance as a measure for
deciding the feature on which node is split into child nodes.
Variance is used for calculating the homogeneity of a node. If a node is entirely homogeneous,
then the variance is zero. Here are the steps to split a decision tree using reduction in variance (see the sketch after this list):
1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of the child nodes
3. Select the split with the lowest weighted variance
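The following rough Python sketch (an illustration, not from the text) scores two candidate regression splits by their reduction in variance; the target values are invented.

# A rough sketch of scoring a regression split by reduction in variance.
# The target values are invented for illustration.
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_variance(left, right):
    n = len(left) + len(right)
    return (len(left) / n) * variance(left) + (len(right) / n) * variance(right)

parent = [10, 12, 11, 30, 32, 29]         # target values reaching this node
split_a = ([10, 12, 11], [30, 32, 29])    # a candidate split
split_b = ([10, 30, 11], [12, 32, 29])    # another candidate split

for name, (l, r) in [("A", split_a), ("B", split_b)]:
    reduction = variance(parent) - weighted_variance(l, r)
    print(name, round(reduction, 2))      # split A gives the larger reduction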
6.7 Summary
Tree models are among the most popular models in machine learning. Trees are expressive
and easy to understand, and of particular appeal to computer scientists due to their recursive
‘divide-and-conquer’ nature. Decision Tree is a Supervised learning technique that can be used
for both classification and Regression problems, but mostly it is preferred for solving
Classification problems. There are two popular techniques for ASM, which are Information Gain
and Gini Index. Pruning is a process of deleting the unnecessary nodes from a tree in order to
get the optimal decision tree. Probability estimation trees (PETs) generalize classification trees
in that they assign class probability distributions instead of class labels to examples that are to
be classified. Reduction in Variance is a method for splitting the node used when the target
variable is continuous, i.e., regression problems.
6.8 Keywords
Tree Models, Decision Trees, Ranking, Probability Estimation Trees, Variance Reduction, Attribute Selection Measure (ASM)
LESSON – 7
RULE MODELS
Structure
7.1 Introduction
7.6 Summary
7.7 Keywords
7.1 Introduction
Rule models are the second major type of logical machine learning models. Generally speaking,
they offer more flexibility than tree models: for instance, while decision tree branches are mutually
exclusive, the potential overlap of rules may give additional information. This flexibility comes at
a price, however: while it is very tempting to view a rule as a single, independent piece of
information, this is often not adequate because of the way the rules are learned. Particularly in
supervised learning, a rule model is more than just a set of rules: the specification of how the
rules are to be combined to form predictions is a crucial part of the model.
There are essentially two approaches to supervised rule learning. One is inspired by decision
tree learning: find a combination of literals – the body of the rule, which is what we previously
called a concept – that covers a sufficiently homogeneous set of examples, and find a label to
put in the head of the rule. The second approach goes in the opposite direction: first select a
class you want to learn, and then find rule bodies that cover (large subsets of) the examples of
that class. The first approach naturally leads to a model consisting of an ordered sequence of
rules – a rule list.
The key idea of this kind of rule learning algorithm is to keep growing a conjunctive rule body by
adding the literal that most improves its homogeneity. A decision rule is a simple IF-THEN
statement consisting of a condition (also called antecedent) and a prediction. For example: IF it
rains today AND if it is April (condition), THEN it will rain tomorrow (prediction). A single decision
rule or a combination of several rules can be used to make predictions.
Decision rules follow a general structure: IF the conditions are met THEN make a certain
prediction. Decision rules are probably the most interpretable prediction models. Their IF-THEN
structure semantically resembles natural language and the way we think, provided that the
condition is built from intelligible features, the length of the condition is short (a small number of feature=value pairs combined with an AND) and there are not too many rules. In programming,
it is very natural to write IF-THEN rules. New in machine learning is that the decision rules are
learned through an algorithm.
Imagine using an algorithm to learn decision rules for predicting the value of a house (low,
medium or high). One decision rule learned by this model could be: If a house is bigger than
100 square meters and has a garden, then its value is high. More formally: IF size>100 AND
garden=1 THEN value=high.
The two conditions are connected with an ‘AND’ to create a new condition. Both must
be true for the rule to apply.
A decision rule uses at least one feature=value statement in the condition, with no upper limit on
how many more can be added with an ‘AND’. An exception is the default rule that has no explicit
IF-part and that applies when no other rule applies, but more about this later. The usefulness of
a decision rule is usually summarized in two numbers: Support and accuracy.
Support or coverage of a rule: The percentage of instances to which the condition of a rule
applies is called the support. Take for example the rule size=big AND location=good THEN
value=high for predicting house values. Suppose 100 of 1000 houses are big and in a good
location, then the support of the rule is 10%. The prediction (THEN-part) is not important for the
calculation of support.
Accuracy or confidence of a rule: The accuracy of a rule is a measure of how accurate the
rule is in predicting the correct class for the instances to which the condition of the rule applies.
For example: Let us say of the 100 houses, where the rule size=big AND location=good THEN
value=high applies, 85 have value=high, 14 have value=medium and 1 has value=low, then the
accuracy of the rule is 85%.
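To make the two numbers concrete, here is a minimal Python sketch (not from the text) that computes the support and accuracy of the house-value rule over a tiny invented table of records.

# A minimal sketch computing the support and accuracy of the rule
# "IF size=big AND location=good THEN value=high" over invented records.
houses = [
    {"size": "big", "location": "good", "value": "high"},
    {"size": "big", "location": "good", "value": "medium"},
    {"size": "big", "location": "bad", "value": "medium"},
    {"size": "small", "location": "good", "value": "low"},
]

condition = lambda h: h["size"] == "big" and h["location"] == "good"
covered = [h for h in houses if condition(h)]

support = len(covered) / len(houses)                       # coverage of the rule
accuracy = sum(h["value"] == "high" for h in covered) / len(covered)
print(f"support={support:.0%}, accuracy={accuracy:.0%}")   # 50%, 50%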
Usually there is a trade-off between accuracy and support: By adding more features to the
condition, we can achieve higher accuracy, but lose support. To create a good classifier for
predicting the value of a house you might need to learn not only one rule, but maybe 10 or 20.
Then things can get more complicated and you can run into one of the following problems:
Rules can overlap: What if I want to predict the value of a house and two or more rules apply
and they give me contradictory predictions?
No rule applies: What if I want to predict the value of a house and none of the rules apply?
There are two main strategies for combining multiple rules: Decision lists (ordered) and decision
sets (unordered). Both strategies imply different solutions to the problem of overlapping rules.
A decision list introduces an order to the decision rules. If the condition of the first rule is true for
an instance, we use the prediction of the first rule. If not, we go to the next rule and check if it
applies and so on. Decision lists solve the problem of overlapping rules by only returning the
prediction of the first rule in the list that applies. A decision set resembles a democracy of the
rules, except that some rules might have a higher voting power. In a set, the rules are either
mutually exclusive, or there is a strategy for resolving conflicts, such as majority voting, which
may be weighted by the individual rule accuracies or other quality measures. Interpretability
suffers potentially when several rules apply.
Both decision lists and sets can suffer from the problem that no rule applies to an instance.
This can be resolved by introducing a default rule. The default rule is the rule that applies when
no other rule applies. The prediction of the default rule is often the most frequent class of the
data points which are not covered by other rules. If a set or list of rules covers the entire feature
space, we call it exhaustive. By adding a default rule, a set or list automatically becomes
exhaustive.
The following three algorithms are chosen to cover a wide range of general ideas for learning rules, so all three of them represent very different approaches.
1. OneR learns rules from a single feature. OneR is characterized by its simplicity, interpretability
and its use as a benchmark.
2. Sequential covering is a general procedure that iteratively learns rules and removes the
data points that are covered by the new rule. This procedure is used by many rule learning
algorithms.
3. Bayesian Rule Lists combine pre-mined frequent patterns into a decision list using Bayesian
statistics. Using pre-mined patterns is a common approach used by many rule learning
algorithms
The OneR algorithm suggested by Holte (1993) is one of the simplest rule induction
algorithms. From all the features, OneR selects the one that carries the most information
about the outcome of interest and creates decision rules from this feature.
Despite the name OneR, which stands for “One Rule”, the algorithm generates more than
one rule: It is actually one rule per unique feature value of the selected best feature. A better
name would be OneFeatureRules.
For each feature:
o Create a cross table between the feature values and the (categorical) outcome.
o For each value of the feature, create a rule which predicts the most frequent class of the instances that have this particular feature value (this can be read from the cross table).
Then calculate the total error of the rules for each feature and select the feature with the smallest total error.
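The rough Python sketch below (not part of the original text) follows this recipe on a tiny invented weather-style table: it builds a cross table per feature, turns the majority class per value into a rule, and keeps the feature with the lowest total error.

# A rough sketch of OneR: for each feature, predict the majority class per
# feature value, then keep the feature with the lowest total error.
# The tiny weather-style dataset is invented for illustration.
from collections import Counter, defaultdict

data = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

def one_r(rows, features, target):
    best = None
    for feature in features:
        table = defaultdict(Counter)                  # cross table: value -> class counts
        for row in rows:
            table[row[feature]][row[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in table.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
        if best is None or errors < best[2]:
            best = (feature, rules, errors)
    return best

print(one_r(data, ["outlook", "windy"], "play"))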
Sequential Covering
Sequential covering is a general procedure that repeatedly learns a single rule to create a
decision list (or set) that covers the entire dataset rule by rule. Many rule-learning algorithms
are variants of the sequential covering algorithm. This chapter introduces the main recipe
and uses RIPPER, a variant of the sequential covering algorithm for the examples. The
idea is simple: First, find a good rule that applies to some of the data points. Remove all
data points which are covered by the rule.
A data point is covered when the conditions apply, regardless of whether the points are
classified correctly or not. Repeat the rule-learning and removal of covered points with the
remaining points until no more points are left or another stop condition is met. The result is
a decision list.
This approach of repeated rule-learning and removal of covered data points is called
“separate-and-conquer”. Suppose we already have an algorithm that can create a single
rule that covers part of the data. The sequential covering algorithm for two classes (one
positive, one negative) works like this:
1. Start with an empty list of rules (rlist).
2. Learn a rule r.
3. While the list of rules is below a certain quality threshold (or positive examples are not yet covered): add rule r to rlist, remove all data points covered by rule r, and learn another rule on the remaining data.
4. Return the decision list.
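The following rough Python sketch (an illustration, not the book's algorithm) captures the separate-and-conquer loop just described; learn_one_rule is a hypothetical helper standing in for any single-rule learner, and rules are assumed to expose a covers() test.

# A rough sketch of the separate-and-conquer (sequential covering) loop.
# learn_one_rule is a hypothetical helper standing in for any single-rule learner.
def sequential_covering(examples, learn_one_rule, max_rules=20):
    rule_list = []
    remaining = list(examples)
    while remaining and len(rule_list) < max_rules:
        rule = learn_one_rule(remaining)          # find a good rule on the remaining data
        if rule is None:                          # no acceptable rule found: stop
            break
        rule_list.append(rule)
        # remove every example covered by the rule, correctly classified or not
        remaining = [x for x in remaining if not rule.covers(x)]
    return rule_list                              # the ordered rules form a decision list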
A general recipe for combining pre-mined patterns into rule models has two steps: pre-mine frequent patterns from the data that can be used as conditions for the decision rules, and then learn a decision list from a selection of the pre-mined rules. A specific approach using this
recipe is called Bayesian Rule Lists or BRL for short. BRL uses Bayesian statistics to learn
decision lists from frequent patterns which are pre-mined with the FP-tree algorithm.
The goal of the BRL algorithm is to learn an accurate decision list using a selection of the
pre-mined conditions, while prioritizing lists with few rules and short conditions. BRL
addresses this goal by defining a distribution of decision lists with prior distributions for the
length of conditions (preferably shorter rules) and the number of rules (preferably a shorter
list).
The posterior probability distribution of lists makes it possible to say how likely a decision list is, given assumptions of shortness and how well the list fits the data. Our goal is to find the list that maximizes this posterior probability.
Since it is not possible to find the exact best list directly from the distributions of lists, BRL
suggests the following recipe:
1) Generate an initial decision list, which is randomly drawn from the prior distribution.
2) Iteratively modify the list by adding, switching or removing rules, ensuring that the resulting lists follow the posterior distribution of lists.
3) Select the decision list from the sampled lists with the highest probability according to the posterior distribution.
One of the most expressive and human readable representations for learned hypotheses
is sets of production rules (if-then rules). Rules can be derived from other representations
(e.g., decision trees) or they can be learned directly. Here, we are concentrating on the
direct method. An important aspect of direct rule-learning algorithms is that they can learn
sets of first-order rules which have much more representational power than the propositional
rules that can be derived from decision trees. Propositional Logic does not include variables
and thus cannot express general relations among the values of the attributes.
First-order logic is much more expressive than propositional logic: it allows a finer grain of specification and reasoning when representing knowledge. In the context of machine learning, consider learning the relational concept Daughter(x, y) defined over pairs of persons x, y, where persons are represented by the attributes Name, Mother, Father, Male, Female. A training example is then a tuple of these attribute values for a pair of persons together with the target value, e.g. ⟨..., Daughter1,2 = T⟩. A first-order rule learner can learn the rule: IF Father(y, x) ∧ Female(y) THEN Daughter(x, y). This rule, which you cannot write in Propositional Logic, applies to any family!
7.6 Summary
Rule models are the second major type of logical machine learning models. The key idea of this
kind of rule learning algorithm is to keep growing a conjunctive rule body by adding the literal
that most improves its homogeneity. Decision lists solve the problem of overlapping rules by
only returning the prediction of the first rule in the list that applies. Sequential covering is a
general procedure that repeatedly learns a single rule to create a decision list (or set) that
covers the entire dataset rule by rule. A specific approach using this recipe is called Bayesian
Rule Lists or BRL for short. BRL uses Bayesian statistics to learn decision lists from frequent
patterns which are pre-mined with the FP-tree algorithm.
7.7 Keywords
Rule models, Decision rules, learning models, Support, Accuracy, Decision lists, OneR
LESSON – 8
LINEAR MODELS
Structure
8.1 Introduction
8.5 Summary
8.6 Keywords
8.1 Introduction
Linearity plays a fundamental role in mathematics and related disciplines, and the mathematics
of linear models is well-understood. In machine learning, linear models are of particular interest
because of their simplicity. Here are a couple of manifestations of this simplicity. Linear models
are parametric, meaning that they have a fixed form with a small number of numeric parameters
that need to be learned from data. This is different from tree or rule models, where the structure
of the model (e.g., which features to use in the tree, and where) is not fixed in advance. Linear
models are stable, which is to say that small variations in the training data have only limited
impact on the learned model. A linear model is a model that assumes the data is linearly separable.
Tree models tend to vary more with the training data, as the choice of a different split at the root
of the tree typically means that the rest of the tree is different as well. Linear models are less
likely to overfit the training data than some other models, largely because they have relatively
few parameters. The flipside of this is that they sometimes lead to underfitting: e.g., imagine
you are learning where the border runs between two countries from labelled samples, then a
linear model is unlikely to give a good approximation.
The last two points can be summarised by saying that linear models have low variance but high
bias. Such models are often preferable when you have limited data and want to avoid overfitting.
High variance–low bias models such as decision trees are preferable if data is abundant but
underfitting is a concern.
It is usually a good idea to start with simple, high-bias models such as linear models and only
move on to more elaborate models if the simpler ones appear to be underfitting. Linear models
exist for all predictive tasks, including classification, probability estimation and regression. Linear
regression, in particular, is a well-studied problem that can be solved by the least-squares
method. In the field of machine learning, the goal of statistical classification is to use an object’s
characteristics to identify which class (or group) it belongs to. A linear classifier achieves this
by making a classification decision based on the value of a linear combination of the
characteristics.
We’ll explore two types of linear models: Linear regression, which is used for regression
(numerical predictions), and Logistic regression, which is used for classification (categorical
predictions).
Linear regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence it is called linear regression. Since linear regression shows the linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable. The linear regression model provides
a sloped straight line representing the relationship between the variables. Consider the below
Figure 8.1,
y = a0 + a1x + ε
Where,
y = dependent (target) variable
x = independent (predictor) variable
a0 = intercept of the line
a1 = linear regression coefficient (slope)
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Linear regression can be further divided into two types of the algorithm:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis,
then such a relationship is termed as a Positive linear relationship. It is depicted as below,
If the dependent variable decreases on the Y-axis and independent variable increases on the X-
axis, then such a relationship is called a negative linear relationship. It is depicted as below,
When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error. Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to do this we use a cost function.
Cost function
o The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σi=1..N (Yi − (a0 + a1xi))²
Where,
N = total number of observations
Yi = actual value
(a0 + a1xi) = predicted value
Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function. A regression model uses gradient descent to update the coefficients of the line by reducing the cost function. It starts with a random selection of coefficient values and then iteratively updates the values to reach the minimum of the cost function.
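A minimal Python sketch of this idea is given below (not from the text): it runs plain gradient descent on the MSE cost for y = a0 + a1x, using a small invented dataset that roughly follows y = 2x + 1.

# A minimal sketch of gradient descent minimizing the MSE cost for y = a0 + a1*x.
# The small dataset (y roughly equal to 2x + 1) is invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

a0, a1 = 0.0, 0.0          # start from arbitrary coefficients
eta = 0.01                 # learning rate
n = len(xs)

for _ in range(5000):
    errors = [(a0 + a1 * x) - y for x, y in zip(xs, ys)]
    grad_a0 = (2 / n) * sum(errors)                             # dMSE/da0
    grad_a1 = (2 / n) * sum(e * x for e, x in zip(errors, xs))  # dMSE/da1
    a0 -= eta * grad_a0
    a1 -= eta * grad_a1

mse = sum(((a0 + a1 * x) - y) ** 2 for x, y in zip(xs, ys)) / n
print(round(a0, 2), round(a1, 2), round(mse, 4))   # roughly a0 ~ 1, a1 ~ 2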
The Goodness of fit determines how the line of regression fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be achieved
by below method:
R-squared method:
o R-squared measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted values and the actual values, and hence represents a good model.
Below are some important assumptions of Linear Regression. These are formal checks to perform while building a Linear Regression model, which ensure we get the best possible result from the given dataset.
o Linear relationship between the features and the target:
Linear regression assumes a linear relationship between the dependent and independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow the normal distribution. If the error terms are not normally distributed, then the confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients.
o No autocorrelations:
The linear regression model assumes no autocorrelation in error terms. If there will be
any correlation in the error term, then it will drastically reduce the accuracy of the model.
Autocorrelation usually occurs if there is a dependency between residual errors.
Simple Linear Regression is a type of Regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/
real value. However, the independent variable can be measured on continuous or categorical
values.
Simple Linear Regression is mainly used to model the relationship between two variables, such as the relationship between income and expenditure or between experience and salary, and to forecast new observations, such as weather forecasting according to temperature or the revenue of a company according to its investments in a year.
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0 = the intercept of the regression line (it can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the random error
In the previous topic, we have learned about Simple Linear Regression, where a single
Independent/Predictor(X) variable is used to model the response variable (Y). But there may be
various cases in which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used. Moreover, Multiple Linear
Regression is an extension of Simple Linear regression as it takes more than one predictor
variable to predict the response variable. We can define it as: Multiple Linear Regression is one
of the important regression algorithms which models the linear relationship between a single
dependent continuous variable and more than one independent variable.
In Multiple Linear Regression, the target variable (Y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of Simple Linear Regression, the same idea is applied to the multiple linear regression equation, which becomes:
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn + ε
Where,
Y = Output/Response variable
b0, b1, b2, ..., bn = coefficients of the model
x1, x2, x3, ..., xn = independent/predictor variables
ε = random error
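A minimal scikit-learn sketch of Multiple Linear Regression follows (not from the text); the two predictors and the response values are invented purely for illustration.

# A minimal sketch of Multiple Linear Regression with scikit-learn.
# The two predictors (e.g. size and age of a house) and prices are invented.
from sklearn.linear_model import LinearRegression

X = [[50, 10], [80, 5], [120, 2], [65, 20], [150, 1]]   # [x1, x2] per observation
y = [110, 190, 300, 130, 380]                           # response variable Y

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)     # b0 and [b1, b2]
print(model.predict([[100, 4]]))         # prediction for a new observation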
A linear relationship should exist between the target and the predictor variables.
The least-squares method is a form of mathematical regression analysis used to determine the line of best fit for a set of data points; least-squares regression is used to predict the behavior of dependent variables. The least-squares method provides the overall rationale for the placement of the line of best fit among the data points being studied.
This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis graph. An analyst using the least-squares method will generate a line of best fit that explains the potential relationship between the independent and dependent variables. The most common application of this method, which is sometimes referred to as "linear" or "ordinary" least squares, aims to create a straight line that minimizes the sum of the squares of the errors generated by the associated equations, i.e. the squared residuals resulting from the differences between the observed values and the values anticipated based on the model.
The line of best fit determined from the least square’s method has an equation that tells the
story of the relationship between the data points. Line of best-fit equations may be determined
by computer software models, which include a summary of outputs for analysis, where the
coefficients and summary outputs explain the dependence of the variables being tested.
If the data shows a linear relationship between two variables, the line that best fits this linear relationship is known as the least-squares regression line, which minimizes the vertical distances from the data points to the regression line. The term "least squares" is used because the fitted line has the smallest possible sum of squared errors. In regression analysis, dependent variables are illustrated on the vertical y-axis, while independent variables are illustrated on the horizontal x-axis. These designations form the equation for the line of best fit, which is determined from the least-squares method.
In contrast to a linear problem, a non-linear least-squares problem has no closed solution and
is generally solved by iteration. The discovery of the least-squares method is attributed to Carl
Friedrich Gauss, who discovered the method in 1795.
An example of the least-squares method is an analyst who wishes to test the relationship
between a company’s stock returns, and the returns of the index for which the stock is a
component. In this example, the analyst seeks to test the dependence of the stock returns on
the index returns. To achieve this, all of the returns are plotted on a chart. The index returns are
then designated as the independent variable, and the stock returns are the dependent variable.
The line of best fit provides the analyst with coefficients explaining the level of dependence. The
least-squares method is a mathematical technique that allows the analyst to determine the
best way of fitting a curve on top of a chart of data points.
It is widely used to make scatter plots easier to interpret and is associated with regression
analysis. These days, the least-squares method can be used as part of most statistical software
programs. The least-squares method is used in a wide variety of fields, including finance and
investing. For financial analysts, the method can help to quantify the relationship between two
or more variables—such as a stock’s share price and its earnings per share (EPS). By
performing this type of analysis, investors may attempt to forecast the future behavior of stock
prices or other factors.
To illustrate, consider the case of an investor deciding whether to invest in a gold mining company. The investor might wish to know how sensitive the company's stock price is to changes
in the market price of gold. To study this, the investor could use the least-squares method to
trace the relationship between those two variables over time onto a scatter plot. This analysis
could help the investor predict the degree to which the stock’s price would likely rise or fall for
any given increase or decrease in the price of gold.
Consider an example. Tom, the owner of a retail shop, recorded the price of different T-shirts versus the number of T-shirts sold at his shop over a period of one week.
Table 8.1
Let us use the concept of least squares regression to find the line of best fit for the above data.
Once you substitute the values, it should look something like this:
Table 8.2
Let's construct a graph that represents the y = mx + c line of best fit (Figure 8.4).
Now Tom can use the above equation to estimate how many T-shirts of price $8 he can sell at the retail shop.
This comes down to 13 T-shirts! That's how simple it is to make predictions using Linear Regression.
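Since Table 8.1 is not reproduced here, the short NumPy sketch below uses invented price/quantity pairs simply to show how a least-squares line y = mx + c can be fitted and then used for a prediction at a price of $8; np.polyfit performs the least-squares fit.

# A sketch of fitting a least-squares line y = m*x + c with NumPy.
# The price/quantity pairs below are invented, since Table 8.1 is not reproduced here.
import numpy as np

price = np.array([2, 3, 5, 7, 9, 11, 12], dtype=float)      # x
sold = np.array([35, 30, 27, 22, 18, 14, 12], dtype=float)  # y

m, c = np.polyfit(price, sold, deg=1)     # slope and intercept minimizing squared error
print(round(m, 2), round(c, 2))
print(round(m * 8 + c))                   # estimated sales at a price of $8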
A linear classifier that will achieve perfect separation on linearly separable data is the perceptron,
originally proposed as a simple neural network. The perceptron iterates over the training set,
updating the weight vector every time it encounters an incorrectly classified example. For example, let xi be a misclassified positive example; then we have yi = +1 and w·xi < t. We therefore want to find w' such that w'·xi > w·xi, which moves the decision boundary towards, and hopefully past, xi. This can be achieved by calculating the new weight vector as w' = w + ηxi, where 0 < η ≤ 1 is the learning rate.
The perceptron training algorithm, in pseudocode, takes as input labelled training data D (in homogeneous coordinates) and a learning rate η, and outputs a weight vector w defining the classifier ŷ = sign(w·x):
w ← 0;  // other initialisations of the weight vector are possible
converged ← false;
while converged = false do
    converged ← true;
    for i = 1 to |D| do
        if yi·w·xi ≤ 0  // i.e. xi is misclassified
        then
            w ← w + η·yi·xi;
            converged ← false;  // w changed, so we have not converged yet
        end
    end
end
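The NumPy sketch below is a direct translation of this pseudocode (an illustration, not the book's code): labels are +1/−1 and a constant 1 is appended to each input so that the threshold is absorbed into the weight vector; the toy data points are invented and linearly separable.

# A NumPy translation of the perceptron pseudocode above: labels are +1/-1 and a
# constant 1 is appended to each input (homogeneous coordinates absorb the threshold).
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    Xh = np.hstack([X, np.ones((len(X), 1))])   # add the bias feature
    w = np.zeros(Xh.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(Xh, y):
            if yi * np.dot(w, xi) <= 0:         # xi is misclassified
                w += eta * yi * xi              # nudge the boundary towards xi
                converged = False
        if converged:
            break
    return w

# Linearly separable toy data (invented for illustration)
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 4.0]])
y = np.array([-1, -1, 1, 1])
w = perceptron(X, y)
Xh = np.hstack([X, np.ones((len(X), 1))])
print(w, np.sign(Xh @ w))                       # the signs reproduce the labels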
The algorithm iterates through the training examples until all examples are correctly classified.
The algorithm can easily be turned into an online algorithm that processes a stream of examples,
updating the weight vector only if the last received example is misclassified. The perceptron is
guaranteed to converge to a solution if the training data is linearly separable, but it won’t converge
otherwise.
The key point of the perceptron algorithm is that, every time an example xi is misclassified, we add yi xi to the weight vector. After training has completed, each example has been misclassified zero or more times; denote this number αi for example xi. Using this notation the weight vector can be expressed as
w = Σi αi yi xi
that is, as a linear combination of the training examples.
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used
for both classification and regression challenges. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well. Support Vectors are simply the coordinates of the individual observations closest to this hyper-plane. The SVM classifier is a frontier (a hyper-plane/line) that best segregates the two classes.
Let’s understand:
Identify the right hyper-plane (Scenario-1) in Figure 8.5: Here, we have three hyper-planes (A,
B, and C). Now, identify the right hyper-plane to classify stars and circles.
We need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane
which segregates the two classes better”. In this scenario, hyper-plane “B” has excellently
performed this job.
Identify the right hyper-plane (Scenario-2) in Figure 8.6: Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. Now, how can we identify the right hyper-plane?
Here, maximizing the distances between nearest data point (either class) and hyper-plane will
help us to decide the right hyper-plane. This distance is called as Margin. Let’s look at the below
Figure 8.7:
Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name C as the right hyper-plane. Another important reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3) in Figure 8.8, Hint: Use the rules as discussed in
previous section to identify the right hyper-plane.
But, here is the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.
Can we classify the two classes (Scenario-4, Figure 8.9)? Here we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier. The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.
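A minimal scikit-learn sketch of a linear (soft-margin) SVM follows; the toy points are invented, and the parameter C controls how much the classifier tolerates outliers, in the spirit of the scenario just described.

# A minimal sketch of a (soft-margin) linear SVM with scikit-learn.
# The toy points are invented; a smaller C tolerates outliers more.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)     # the points that define the maximum-margin hyper-plane
print(clf.predict([[3, 2], [7, 6]]))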
8.5 Summary
Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data. We explored two types of linear models: Linear regression, which is used for regression, and Logistic regression, which is used for classification. The least-squares method
is a form of mathematical regression analysis used to determine the line of best fit for a set of
data, providing a visual demonstration of the relationship between the data points. A linear
classifier that will achieve perfect separation on linearly separable data is the perceptron, originally
proposed as a simple neural network. “Support Vector Machine” (SVM) is a supervised machine
learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems.
8.6 Keywords
Linear Models, Linear Regression, Logistic Regression, Least-Squares Method, Perceptron, Support Vector Machines.
1. Identify the basics of choosing between linear regression and logistic regression for solving
machine learning problems.
5. What are the assumptions of linear regression and multiple linear regression?
LESSON - 9
DISTANCE-BASED MODELS
Structure
9.1 Introduction
9.7 Summary
9.8 Keywords
9.1 Introduction
Many forms of learning are based on generalizing from training data to unseen data by exploiting the similarities between the two. With grouping models such as decision trees these similarities take the form of an equivalence relation or partition of the instance space: two instances are similar whenever they end up in the same segment of this partition. In this chapter we consider learning methods that utilize more graded forms of similarity. There are many different ways in which similarity can be measured, and in this section we take a look at the most important of them. We discuss two key concepts in distance-based machine learning: neighbours and exemplars. We then consider what is perhaps the best-known distance-based learning method, the nearest-neighbour classifier, as well as K-means clustering and hierarchical clustering by constructing dendrograms.
Distance metrics are a key part of several machine learning algorithms. These distance metrics
are used in both supervised and unsupervised learning, generally to calculate the similarity
between data points. An effective distance metric improves the performance of our machine
learning model, whether that’s for classification tasks or clustering. Hence, we can calculate
the distance between points and then define the similarity between them. The four types of
Distance Metrics in Machine Learning are,
Euclidean Distance
Manhattan Distance
Minkowski Distance
Hamming Distance
Euclidean Distance
Euclidean Distance represents the shortest distance between two points. Most machine learning
algorithms including K-Means use this distance metric to measure the similarity between
observations. Let's say we have two points as shown below: A(p1, p2) and B(q1, q2). The Euclidean Distance between these two points A and B will be:
d(A, B) = √((q1 − p1)² + (q2 − p2)²)
We use this formula when we are dealing with 2 dimensions. We can generalize this for an n-dimensional space as:
d = √( Σi=1..n (qi − pi)² )
Where,
n = number of dimensions
Manhattan Distance
Manhattan Distance is the sum of the absolute differences between two points across all the dimensions. Note that Manhattan Distance is also known as city block distance. It is given as:
d = Σi=1..n |qi − pi|
Where,
n = number of dimensions
Minkowski Distance
Minkowski Distance is the generalized form of the Euclidean and Manhattan Distances. The formula for Minkowski Distance is given as:
d = ( Σi=1..n |qi − pi|^p )^(1/p)
Here p = 1 gives the Manhattan Distance and p = 2 gives the Euclidean Distance.
Hamming Distance
Hamming Distance measures the similarity between two strings of the same length: the Hamming Distance between two strings of the same length is the number of positions at which the corresponding characters are different.
Let's understand the concept using an example. Let's say we have two strings: "euclidean" and "manhattan".
Since the lengths of these strings are equal, we can calculate the Hamming Distance. We will go character by character and match the strings. The first characters of the two strings (e and m respectively) are different. Similarly, the second characters (u and a) are different, and so on.
Look carefully: seven characters are different, whereas two characters (the last two characters) are the same.
Hence, the Hamming Distance here will be 7. Note that the larger the Hamming Distance between two strings, the more dissimilar those strings will be (and vice versa).
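The small Python sketch below (an illustration, not part of the text) implements the four distance metrics discussed above and checks them on simple inputs, including the two strings from the Hamming example.

# A small sketch of the four distance metrics discussed above.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(s1, s2):
    assert len(s1) == len(s2), "Hamming distance needs equal-length strings"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(euclidean((1, 2), (4, 6)))          # 5.0
print(manhattan((1, 2), (4, 6)))          # 7
print(minkowski((1, 2), (4, 6), p=2))     # 5.0 (same as Euclidean for p=2)
print(hamming("euclidean", "manhattan"))  # 7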
Two of the most important ideas in distance-based models are: formulating the model in terms of a number of prototypical instances or exemplars, and defining the decision rule in terms of the nearest exemplars or neighbours. The arithmetic mean minimizes squared Euclidean distance: the arithmetic mean μ of a set of data points D in a Euclidean space is the unique point that minimizes the sum of squared Euclidean distances to those data points. Notice that minimizing the sum of squared Euclidean distances of a given set of points is the same as minimizing the average squared Euclidean distance.
In certain situations, it makes sense to restrict an exemplar to be one of the given data points.
In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that
doesn’t have to occur in the data. Finding a medoid requires us to calculate, for each data point,
the total distance to all other data points, in order to choose the point that minimizes it. Regardless
of the distance metric used, this is an O(n²) operation for n points, so for medoids there is no computational reason to prefer one distance metric over another.
Once we have determined the exemplars, the basic linear classifier constructs the decision
boundary as the perpendicular bisector of the line segment connecting the two exemplars. An
alternative, distance-based way to classify instances without direct reference to a decision
boundary is by the following decision rule: if x is nearest to μ⊕ then classify it as positive, otherwise as negative; or equivalently, classify an instance to the class of the nearest exemplar.
So the basic linear classifier can be interpreted from a distance-based perspective as
constructing exemplars that minimize squared Euclidean distance within each class, and then
applying a nearest-exemplar decision rule.
To summarize, the main ingredients of distance-based models are: distance metrics; exemplars, which are either centroids that find a centre of mass according to a chosen distance metric, or medoids that find the most centrally located data point; and distance-based decision rules, which take a vote among the k nearest exemplars.
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds intense application in pattern
recognition, data mining and intrusion detection. K-Nearest Neighbour is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It is mostly used to classify a data point based on how its neighbours are classified. K-Nearest Neighbour (K-NN) is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique. The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. The algorithm stores all the available data and classifies a new data point based on this similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
It can be used for Regression as well as for Classification but mostly it is used for the
Classification problems. It is a non-parametric algorithm, which means it does not make any
assumption on the underlying data. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset. The K-NN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
KNN can be used for both classification and regression predictive problems. However, it is
more widely used in classification problems in the industry. To evaluate any technique, we generally look at important aspects such as:
Calculation time
Predictive Power
Let’s take a simple case to understand this algorithm. Suppose there are two categories, i.e.,
Category A and Category B, and we have a new data point x1, so this data point will lie in which
of these categories. To solve this type of problem, we need a K-NN algorithm. With the help of
K-NN, we can easily identify the category or class of a particular dataset. Consider the below
diagram Figure 9.1,
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of neighbours.
Step-2: Calculate the Euclidean distance between the new data point and the existing data points.
Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
Step-4: Among these K neighbours, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
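The short scikit-learn sketch below follows these steps with K = 5; the two-category toy points are invented purely for illustration.

# A minimal sketch of the K-NN steps above using scikit-learn, with K = 5.
# The two-category toy points are invented for illustration.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [6, 7], [7, 6], [7, 7], [8, 8]]
y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: choose K
knn.fit(X, y)                               # K-NN just stores the data here

# Steps 2-5: distances, nearest neighbours and the majority vote happen inside predict
print(knn.predict([[3, 2], [6, 5]]))        # expected: ['A' 'B']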
Suppose we have a new data point and we need to put it in the required category. Consider the
below Figure 9.2 :
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points (Figure 9.3). The Euclidean distance is the straight-line distance between two points, calculated as shown earlier.
By calculating the Euclidean distance, we got the nearest neighbors, as three nearest neighbors
in category A and two nearest neighbors in category B. Consider the below Figure 9.4:
As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A. Below are some points to remember while selecting the value of K in the
K-NN algorithm:
There is no particular way to determine the best value for “K”, so we need to try some
values to find the best out of them. The most preferred value for K is 5.
A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
Large values for K are good, but the algorithm may then face some difficulties.
Advantages of K-NN:
o It is simple to implement.
Disadvantages of K-NN:
o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between the new data point and all the training samples.
K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. K-Means Clustering is an Unsupervised Learning
algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number
of pre-defined clusters that need to be created in the process, as if K=2, there will be two
clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training. It is a centroid-based algorithm,
where each cluster is associated with a centroid. The main aim of this algorithm is to minimize
the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters,
and repeats the process until it does not find the best clusters. The value of k should be
predetermined in this algorithm.
The k-means algorithm mainly performs two tasks:
o Determines the best values for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points which are near to a particular k-center create a cluster. Hence each cluster has data points with some commonalities, and it is away from other clusters.
The below Figure 9.5 explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other than those from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step-3, i.e. reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the clusters are final.
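The minimal scikit-learn sketch below carries out these steps with K = 2 on invented 2-D points; the library handles the assignment and centroid-update iterations internally.

# A minimal sketch of the K-means steps above with scikit-learn and K = 2.
# The 2-D points (two loose blobs) are invented for illustration.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)                      # iterates: assign points, recompute centroids

print(kmeans.labels_)              # cluster index assigned to each point
print(kmeans.cluster_centers_)     # the two final centroids
print(kmeans.inertia_)             # WCSS for this clustering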
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below in Figure 9.6:
Let’s take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.
We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as k points, which are not part of our dataset. Consider the below Figure 9.7:
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance
between two points. So, we will draw a median between both the centroids. Consider the below
Figure 9.8,
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest clusters, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of each cluster and will obtain new centroids as below:
Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median line will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
Since reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
We will repeat the process by finding the center of gravity of centroids, so the new centroids will
be as shown in the below image:
As we got the new centroids so again will draw the median line and reassign the data points.
So, the image will be:
The performance of the K-means clustering algorithm depends upon highly efficient clusters
that it forms. But choosing the optimal number of clusters is a big task. There are some different
ways to find the optimal number of clusters, but here we are discussing the most appropriate
method to find the number of clusters or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of Squares,
which defines the total variations within a cluster. The formula to calculate the value of WCSS
(for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS, the term Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms. To measure the distance between the data points and the centroid, we can use any method such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes the K-means clustering on a given dataset for different K values (ranges
from 1-10).
o Plot a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the approach is known as the elbow method. The graph for the elbow method looks like the below image:
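A rough Python sketch of this procedure is given below (not from the text): it runs K-means for K = 1 to 10 on invented toy data, records the WCSS (exposed by scikit-learn as inertia_) and plots the curve whose bend suggests the best K.

# A rough sketch of the elbow method: run K-means for K = 1..10 and plot WCSS.
# The toy data (three loose blobs) is invented for illustration.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9], [9, 9],
     [1, 8], [2, 9], [1, 9], [2, 8]]

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)       # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                         # the bend ("elbow") suggests the best K (here 3)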
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster
is distinct from each other cluster, and the objects within each cluster are broadly similar to
each other. Clustering is basically a technique that groups similar data points such that the
points in the same group are more similar to each other than the points in the other groups. The
group of similar data points is called a Cluster.
Hierarchical clustering is one of the popular and easy-to-understand clustering techniques. This clustering technique is divided into two types:
Agglomerative
Divisive
Agglomerative Hierarchical Clustering Technique: In this technique, initially each data point
is considered as an individual cluster. At each iteration, the similar clusters merge with other
clusters until one cluster or K clusters are formed.
The basic agglomerative algorithm is: compute the proximity matrix; let each data point be a cluster; then repeat (merge the two closest clusters and update the proximity matrix) until only a single cluster remains. The key operation is the computation of the proximity of two clusters. To understand this better, let's see a pictorial representation of the Agglomerative Hierarchical clustering technique. Let's say we have six data points {A, B, C, D, E, F}.
Step- 1: In the initial step, we calculate the proximity of individual points and consider all the six
data points as individual clusters as shown in the Figure 9.16 below.
Step- 2: In step two, similar clusters are merged together and formed as a single cluster.
Let’s consider B,C, and D,E are similar clusters that are merged in step two. Now, we’re
left with four clusters which are A, BC, DE, F.
Step- 3: We again calculate the proximity of new clusters and merge the similar clusters to
form new clusters A, BC, DEF.
Step- 4: Calculate the proximity of the new clusters. The clusters DEF and BC are similar
and merged together to form a new cluster. We’re now left with two clusters A, BCDEF.
Step- 5: Finally, all the clusters are merged together and form a single cluster.
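The short SciPy sketch below (an illustration, not from the text) performs this bottom-up merging on six invented 2-D points standing in for A to F and draws the resulting dendrogram.

# A rough sketch of agglomerative clustering and a dendrogram with SciPy.
# The six 2-D points standing in for A..F are invented for illustration.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = [[1, 1], [2, 1], [2, 2], [8, 8], [8, 9], [9, 9]]   # A..F
Z = linkage(points, method="single")       # repeatedly merges the two closest clusters

dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("merge distance")
plt.show()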
In simple words, we can say that the Divisive Hierarchical clustering is exactly the opposite of
the Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we consider
all the data points as a single cluster and in each iteration, we separate the data points from the
cluster which are not similar. Each data point which is separated is considered as an individual
cluster. In the end, we’ll be left with n clusters. As we’re dividing the single clusters into n
clusters, it is named as Divisive Hierarchical clustering.
Space complexity: The space required for the Hierarchical clustering technique is very high when the number of data points is high, as we need to store the similarity matrix in RAM. The space complexity is of the order of the square of n.
Time complexity: Since we have to perform n iterations, and in each iteration we need to update and restore the similarity matrix, the time complexity is also very high: of the order of the cube of n.
2. All the approaches for calculating the similarity between clusters have their own disadvantages.
3. The high space and time complexity of hierarchical clustering means this clustering
algorithm cannot be used when we have huge amounts of data.
9.7 Summary
Two key concepts in distance-based machine learning are neighbors and exemplars. Distance
metrics are a key part of several machine learning algorithms; they are used in both supervised
and unsupervised learning, generally to calculate the similarity between data points. K-Nearest
Neighbors is one of the most basic yet essential classification algorithms in machine learning.
It can be used for regression as well as classification, but it is mostly applied to classification
problems, and it is a non-parametric algorithm. Distance-based methods optimise a global criterion
based on the distance between the patterns; k-means, CLARA and CLARANS are examples of
distance-based clustering methods. K-Means Clustering is an unsupervised learning algorithm
that groups an unlabeled dataset into different clusters and is used to solve clustering problems
in machine learning and data science. Hierarchical clustering, also known as hierarchical
cluster analysis, is an algorithm that groups similar objects into groups called clusters.
9.8 Keywords
LESSON – 10
PROBABILISTIC MODELS
Structure
10.1 Introduction
10.5 Summary
10.6 Keywords
10.1 Introduction
We have already seen how probabilities can be useful to express a model’s expectation about
the class of a given instance. For example, a probability estimation tree attaches a class
probability distribution to each leaf of the tree, and each instance that gets filtered down to a
particular leaf in a tree model is labelled with that particular class distribution. Similarly, a calibrated
linear model translates the distance from the decision boundary into a class probability. One of
the most attractive features of the probabilistic perspective is that it allows us to view learning
as a process of reducing uncertainty. The key point is that probabilities do not have to be
interpreted as estimates of relative frequencies, but can carry the more general meaning of
(possibly subjective) degrees of belief.
The normal distribution is a core concept in statistics and the backbone of data science, and it
arises constantly while performing exploratory data analysis. We can draw a connection between
probabilistic and geometric models by considering probability distributions defined over Euclidean
spaces. The most common such distributions are normal distributions, also called Gaussians;
here we summarise the most important facts concerning univariate and multivariate normal
distributions. We start by considering the univariate, two-class case. Suppose the values of
x ∈ ℝ follow a mixture model: i.e., each class has its own probability distribution (a component
of the mixture model). We will assume a Gaussian mixture model, which means that both
components of the mixture are Gaussians. We thus have

P(x | ⊕) = 1/(√(2π) σ⊕) exp(−(x − μ⊕)² / 2σ⊕²),   P(x | ⊖) = 1/(√(2π) σ⊖) exp(−(x − μ⊖)² / 2σ⊖²)

where μ⊕ and σ⊕ are the mean and standard deviation for the positive class, and μ⊖ and σ⊖ are
the mean and standard deviation for the negative class.
Normal Distribution is an important concept in statistics and the backbone of Machine Learning.
A Data Scientist needs to know about Normal Distribution when they work with Linear Models
(perform well if the data is normally distributed), Central Limit Theorem, and exploratory data
analysis. As characterised by Carl Friedrich Gauss, the Normal Distribution (Gaussian Distribution)
is a continuous probability distribution. It has a bell-shaped curve that is symmetrical about the
mean, as presented in Figure 10.1.
Mathematical Definition:
A continuous random variable x is said to follow a normal distribution with parameters μ (mean)
and σ (standard deviation) if its probability density function is given by

f(x) = 1/(σ√(2π)) exp(−(x − μ)² / 2σ²),   −∞ < x < ∞.

The simplest case of the normal distribution, known as the Standard Normal Distribution, has
an expected value (mean) μ = 0 and standard deviation σ = 1, and is described by the probability
density function

f(x) = 1/√(2π) exp(−x² / 2).
2. It is a continuous distribution.
3. It is symmetrical about the mean. Each half of the distribution is a mirror image of the
other half.
5. It is unimodal.
Area Properties:
The normal distribution carries with it assumptions and can be completely specified by two
parameters: the mean and the standard deviation. If the mean and standard deviation are known,
you can access every data point on the curve.
The empirical rule is a handy quick estimate of the data's spread given the mean and standard
deviation of a data set that follows a normal distribution. It states that (Figure 10.2):
68% — (μ ± 1σ)
95% — (μ ± 1.96σ)
99% — (μ ± 2.58σ)
(A quick numerical check of this rule appears in the sketch below.)
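As a minimal sketch (NumPy only; the simulated data and its mean and standard deviation are hypothetical), the empirical rule can be verified numerically:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50.0, scale=10.0, size=100_000)   # simulated normal data
mu, sigma = x.mean(), x.std()

for k in (1.0, 1.96, 2.58):
    inside = np.mean(np.abs(x - mu) <= k * sigma)
    print("fraction within mu +/-", k, "sigma:", round(inside, 3))
# Expected output is close to 0.68, 0.95 and 0.99 respectively.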
In Machine Learning, data satisfying Normal Distribution is beneficial for model building. It makes
math easier. Models like LDA, Gaussian Naive Bayes, Logistic Regression, Linear Regression,
etc., are explicitly calculated from the assumption that the distribution is a bivariate or multivariate
normal. Also, Sigmoid functions work most naturally with normally distributed data.
Many natural phenomena in the world follow a log-normal distribution, such as financial
data and forecasting data. By applying transformation techniques, we can convert the data into
a normal distribution. Also, many processes follow normality, such as many measurement errors
in an experiment, the position of a particle that experiences diffusion, etc. So it’s better to
critically explore the data and check for the underlying distributions for each variable before
going to fit the model. Normality is an assumption for the ML models. It is not mandatory that
data should always follow normality. ML models work very well in the case of non-normally
distributed data also. Models like decision tree, XgBoost, don’t assume any normality and work
on raw data as well. Also, linear regression is statistically effective if only the model errors are
Gaussian, not exactly the entire dataset.
Let’s see a few different ways to check the normality of the distribution that we have,
Histogram
A Histogram visualizes the distribution of data over a continuous interval. Each bar (Figure
10.3) in a histogram represents the tabulated frequency at each interval/bin. In simple words,
height represents the frequency for the respective bin (interval).
KDE Plots
A density plot is a smoothed, continuous version of a histogram (Figure 10.4) estimated from
the data. The most common form of estimation is known as kernel density estimation (KDE). In
this method, a continuous curve (the kernel) is drawn at every individual data point and all of
these curves are then added together to make a single smooth density estimation.
Q_Q Plot
Quantiles are cut points dividing the range of a probability distribution into continuous intervals
with equal probabilities or dividing the observations in a sample in the same way. The plot is
presented in Figure 10.5.
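A short sketch (assuming matplotlib, seaborn and SciPy are available; the data is simulated rather than taken from the lesson) of the three normality checks just described:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)     # simulated sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(data, bins=30)                          # histogram
axes[0].set_title("Histogram")
sns.kdeplot(data, ax=axes[1])                        # kernel density estimate
axes[1].set_title("KDE plot")
stats.probplot(data, dist="norm", plot=axes[2])      # Q-Q plot against the normal
axes[2].set_title("Q-Q plot")
plt.tight_layout()
plt.show()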
In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli
distribution, multinoulli distribution) is a discrete probability distribution that describes the possible
results of a random variable that can take on one of K possible categories, with the probability
of each category separately specified.
The Bernoulli distribution, named after the seventeenth-century Swiss mathematician Jacob
Bernoulli, concerns Boolean or binary events with two possible outcomes: success or 1, and
failure or 0. A Bernoulli distribution has a single parameter θ which gives the probability of
success: hence P(X = 1) = θ and P(X = 0) = 1 − θ. The Bernoulli distribution has expected value
E[X] = θ and variance E[(X − E[X])²] = θ(1 − θ).
The binomial distribution arises when counting the number of successes S in n independent
Bernoulli trials with the same parameter θ. It is described by

P(S = s) = (n choose s) θ^s (1 − θ)^(n−s),   for s = 0, 1, ..., n.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
It is not a single algorithm but a family of algorithms where all of them share a common principle,
i.e. every pair of features being classified is independent of each other.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features,
all of these properties independently contribute to the probability that this fruit is an apple and
that is why it is known as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c). Look at the equation below:

P(c|x) = P(x|c) · P(c) / P(x)

Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes);
P(c) is the prior probability of the class;
P(x|c) is the likelihood, i.e., the probability of the predictor given the class;
P(x) is the prior probability of the predictor.
Let’s understand it using an example (Figure 10.6). Consider a training data set of weather and
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on weather condition. Let’s follow the below steps to
perform it.
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of prediction.
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different class based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
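A minimal sketch in plain Python reproducing the calculation above (the counts come from the hypothetical weather/Play table used in the example):

# Counts taken from the worked example above (hypothetical weather/Play data).
p_sunny_given_yes = 3 / 9      # P(Sunny | Yes)
p_sunny = 5 / 14               # P(Sunny)
p_yes = 9 / 14                 # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6, so 'Yes' is the more probable class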
Pros:
It is easy and fast to predict class of test data set. It also performs well in multi class
prediction
Cons:
If categorical variable has a category (in test data set), which was not observed in
training data set, then model will assign a 0 (zero) probability and will be unable to make
a prediction. This is often known as “Zero Frequency”. To solve this, we can use the
smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.
Real time Prediction: Naive Bayes is an eager learning classifier and it is very fast.
Thus, it can be used for making predictions in real time.
Multi class Prediction: This algorithm is also well known for multi class prediction
feature. Here we can predict the probability of multiple classes of target variable.
Suppose you are dealing with a four-class classification problem with classes A, B, C and D. If
you have a sufficiently large and representative training sample of size n, you can use the
relative frequencies in the sample nA, ..., nD to estimate the class prior p̂A = nA/n, ..., p̂D = nD/n,
as we have done many times before. Conversely, if you know the prior and want to know the
most likely class distribution in a random sample of n instances, you would use the prior to
calculate the expected values E[nA] = pA·n, ..., E[nD] = pD·n. So complete knowledge of one
allows us to estimate or infer the other. However, sometimes we have a bit of knowledge about
both. For example, we may know that pA = 1/2 and that C is twice as likely as B, without knowing
the complete prior.
Expectation-Maximization
Under the domain of statistics, Maximum Likelihood Estimation (MLE) is the approach of estimating
the parameters of a probability distribution by maximising the likelihood function, so as to make
the observed data most probable under the statistical model. MLE has a limitation: it assumes
that the data are complete and fully observable, i.e., that all the model-associated variables are
already present. In most cases, however, some relevant variables may be hidden, which causes
inconsistencies. Such unobserved or hidden data variables are known as latent variables.
Probability density estimation is the forming of estimates on the basis of observed data; it
involves picking a probability distribution function and the parameters of that function that
explain the joint probability of the observed data.
Convergence here is an intuition based on probability: if the difference in probability between
two random variables is very small, they are said to have converged. In other words, convergence
implies that the values match each other.
A latent variable model consists of observable and unobservable variables. Observed variables
are ones that can be measured or recorded, and latent/ hidden variables are those that can’t be
observed directly instead need to be inferred from the observed variables.
Sometimes our data has multiple distributions, i.e., multiple peaks. It does not always have
one peak, and one can notice that simply by looking at the data set: it will look as if there are
multiple peaks happening here and there, with the data going up and down two, three or even
four times. If multiple Gaussian distributions can together represent this data, then we can
build what is called a Gaussian Mixture Model.
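A brief sketch (assuming scikit-learn, whose GaussianMixture class fits the mixture using the EM algorithm discussed above; the two-peaked data is simulated) of building such a model:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated one-dimensional data with two peaks (two latent components).
x = np.concatenate([rng.normal(-3.0, 1.0, 500),
                    rng.normal(4.0, 1.5, 500)]).reshape(-1, 1)

# EM alternates between assigning responsibilities (E-step) and
# re-estimating the means, covariances and weights (M-step).
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.means_.ravel())      # estimated component means (close to -3 and 4)
print(gmm.weights_)            # estimated mixing proportions (close to 0.5 each)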
10.5 Summary
Probabilities can be useful to express a model’s expectation about the class of a given instance.
The normal distribution is a core concept in statistics, the backbone of data science. Normal
Distribution/Gaussian Distribution is a continuous probability distribution. It has a bell-shaped
curve that is symmetrical from the mean point to both halves of the curve. A Histogram visualizes
the distribution of data over a continuous interval. Each bar in a histogram represents the
tabulated frequency at each interval/bin. In probability theory and statistics, a categorical
distribution (also called a generalized Bernoulli distribution, Multinoulli distribution) is a discrete
probability distribution that describes the possible results of a random variable. Naive Bayes
classifiers are a collection of classification algorithms based on Bayes’ Theorem. Gaussian
Mixture Model or Mixture of Gaussian as it is sometimes called, is not so much a model as it is
a probability distribution.
10.6 Keywords
Normal distribution, Histogram, KDE Plots, Q_Q Plot, Bernoulli distribution, Naive Bayes model
LESSON - 11
FEATURES
Structure
11.1 Introduction
11.6 Summary
11.7 Keywords
11.1 Introduction
Features, also called attributes, are defined as mappings fi : X → Fi from the instance space X to
the feature domain Fi. We can distinguish features by their domain: common feature domains
include real and integer numbers, but also discrete sets such as colours, the Booleans, and so
on. We can also distinguish features by the range of permissible operations. For example, we
can calculate a group of people’s average age but not their average blood type, so taking the
average value is an operation that is permissible on some features but not on others.
Although many data sets come with pre-defined features, they can be manipulated in many
ways. For example, we can change the domain of a feature by rescaling or discretization; we
can select the best features from a larger set and only work with the selected ones; or we can
combine two or more features into a new feature. In fact, a model itself is a way of constructing
a new feature that solves the task at hand.
Consider two features, one describing a person’s age and the other their house number. Both
features map into the integers, but the way we use those features can be quite different.
Calculating the average age of a group of people is meaningful, but an average house number
is probably not very useful! In other words, what matters is not just the domain of a feature, but
also the range of permissible operations. These, in turn, depend on whether the feature values
are expressed on a meaningful scale. Despite appearances, house numbers are not really
integers but ordinals: we can use them to determine that number 10’s neighbours are number
8 and number 12, but we cannot assume that the distance between 8 and 10 is the same as
the distance between 10 and 12. Because of the absence of a linear scale it is not meaningful
to add or subtract house numbers, which precludes operations such as averaging.
Calculations on features
Let’s take a closer look at the range of possible calculations on features, often referred to as
aggregates or statistics. Three main categories are statistics of central tendency, statistics of
dispersion and shape statistics. Each of these can be interpreted either as a theoretical property
of an unknown population or a concrete property of a given sample – here we will concentrate
on sample statistics. Starting with statistics of central tendency, the most important ones are
the mean or average value; the median, which is the middle value if we order the instances from
lowest to highest feature value; and the mode, which is the majority value or values.
Of these statistics, the mode is the one we can calculate whatever the domain of the feature:
so, for example, we can say that the most frequent blood type in a group of people is O+. In
order to calculate the median, we need to have an ordering on the feature values: so we can
calculate both the mode and the median house number in a set of addresses. In order to
calculate the mean, we need a feature expressed on some scale: most often this will be a
linear scale for which we calculate the familiar arithmetic mean. It is often suggested that the
median tends to lie between the mode and the mean, but there are plenty of exceptions to this
‘rule’. The famous statistician Karl Pearson suggested a more specific rule of thumb (with
therefore even more exceptions): the median tends to fall one-third of the way from mean to
mode.
The second kind of calculation on features are statistics of dispersion or ‘spread’. Two well-
known statistics of dispersion are the variance or average squared deviation from the (arithmetic)
mean, and its square root, the standard deviation. Variance and standard deviation essentially
measure the same thing, but the latter has the advantage that it is expressed on the same
scale as the feature itself. For example, the variance of the body weight in kilograms of a group
of people is measured in kg², whereas the standard deviation is measured in kilograms. The
absolute difference between the mean and the median is never larger than the standard deviation
– this is a consequence of Chebyshev's inequality, which states that at most 1/k² of the values
are more than k standard deviations away from the mean.
A simpler dispersion statistic is the difference between maximum and minimum value, which is
called the range. A natural statistic of central tendency to be used with the range is the midrange
point, which is the mean of the two extreme values. These definitions assume a linear scale but
can be adapted to other scales using suitable transformations. For example, for a feature
expressed on a logarithmic scale, such as frequency, we would take the ratio of the highest and
lowest frequency as the range, and the harmonic mean of these two extremes as the midrange
point. Other statistics of dispersion include percentiles.
The p-th percentile is the value such that p per cent of the instances fall below it. If we have 100
instances, the 80th percentile is the value of the 81st instance in a list of increasing values. If p
is a multiple of 25 the percentiles are also called quartiles, and if it is a multiple of 10 the
percentiles are also called deciles. Note that the 50th percentile, the 5th decile and the second
quartile are all the same as the median. Percentiles, deciles and quartiles are special cases of
quantiles. Once we have quantiles we can measure dispersion as the distance between different
quantiles. For instance, the interquartile range is the difference between the third and first quartile
(i.e., the 75th and 25th percentile). The skew and ‘peakedness’ of a distribution can be measured
by shape statistics such as skewness and kurtosis. The main idea is to calculate the third and
fourth central moment of the sample.
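A small sketch (pandas and SciPy; the weights are hypothetical values) computing the statistics of central tendency, dispersion and shape discussed above:

import pandas as pd
from scipy import stats

# Hypothetical quantitative feature: body weight in kilograms.
weights = pd.Series([52.0, 61.5, 58.0, 61.5, 66.1, 90.4, 75.0, 68.3])

print("mean:", weights.mean())
print("median:", weights.median())
print("mode:", weights.mode().tolist())
print("variance:", weights.var())                    # in squared units (kg^2)
print("standard deviation:", weights.std())          # in the feature's own units (kg)
print("range:", weights.max() - weights.min())
print("interquartile range:", weights.quantile(0.75) - weights.quantile(0.25))
print("skewness:", stats.skew(weights))              # based on the third central moment
print("kurtosis:", stats.kurtosis(weights))          # based on the fourth central moment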
Given these various statistics we can distinguish three main kinds of feature: those with a
meaningful numerical scale, those without a scale but with an ordering, and those without
either. We will call features of the first type quantitative; they most often involve a mapping into
the reals (another term in common use is ‘continuous’). Even if a feature maps into a subset of
the reals, such as age expressed in years, the various statistics such as mean or standard
deviation still require the full scale of the reals.
Features with an ordering but without scale are called ordinal features. The domain of an ordinal
feature is some totally ordered set, such as the set of characters or strings. Even if the domain
of a feature is the set of integers, denoting the feature as ordinal means that we have to dispense
with the scale, as we did with house numbers. Another common example is features that
express a rank order: first, second, third, and so on. Ordinal features allow the mode and
median as central tendency statistics, and quantiles as dispersion statistics.
Features without ordering or scale are called categorical features (or sometimes ‘nominal’
features). They do not allow any statistical summary except the mode. One subspecies of the
categorical features is the Boolean feature, which maps into the truth values true and false.
Models treat these different kinds of feature in distinct ways. First, consider tree models such
as decision trees. A split on a categorical feature will have as many children as there are feature
values. Ordinal and quantitative features, on the other hand, give rise to a binary split, by selecting
a value v0 such that all instances with a feature value less than or equal to v0 go to one child,
and the remaining instances to the other child. It follows that tree models are insensitive to the
scale of quantitative features.
For example, whether a temperature feature is measured on the Celsius scale or on the
Fahrenheit scale will not affect the learned tree. Neither will switching from a linear scale to a
logarithmic scale have any effect: the split threshold will simply be logv0 instead of v0. In general,
tree models are insensitive to monotonic transformations on the scale of a feature, which are
those transformations that do not affect the relative order of the feature values. In effect, tree
models ignore the scale of quantitative features, treating them as ordinal. The same holds for
rule models.
Structured features
It is usually tacitly assumed that an instance is a vector of feature values. In other words, the
instance space is a Cartesian product of d feature domains: X = F1 × ... × Fd. This means that
there is no other information available about an instance apart from the information conveyed
by its feature values. Identifying an instance with its vector of feature values is what computer
scientists call an abstraction, which is the result of filtering out unnecessary information.
Representing an e-mail as a vector of word frequencies is an example of an abstraction.
However, sometimes it is necessary to avoid such abstractions, and to keep more information
about an instance than can be captured by a finite vector of feature values. For example, we
could represent an e-mail as a long string; or as a sequence of words and punctuation marks;
or as a tree that captures the HTML mark-up; and so on. Features that operate on such structured
instance spaces are called structured features.
Structured features can be constructed either prior to learning a model, or simultaneously with
it. The first scenario is often called propositionalisation because the features can be seen as a
translation from first-order logic to propositional logic without local variables. The main challenge
with propositionalisation approaches is how to deal with combinatorial explosion of the number
of potential features. Notice that features can be logically related: e.g., the second clause above
covers a subset of the instances covered by the first one. It is possible to exploit this if structured
feature construction is integrated with model building, as in inductive logic programming.
Feature transformations aim at improving the utility of a feature by removing, changing or adding
information. We could order feature types by the amount of detail they convey: quantitative
features are more detailed than ordinal ones, followed by categorical features, and finally Boolean
features. The best-known feature transformations are those that turn a feature of one type into
another of the next type down this list. But there are also transformations that change the scale
of quantitative features, or add a scale (or order) to ordinal, categorical and Boolean features.
The simplest feature transformations are entirely deductive, in the sense that they achieve a
well-defined result that doesn’t require making any choices. Binarisation transforms a categorical
feature into a set of Boolean features, one for each value of the categorical feature. This loses
information since the values of a single categorical feature are mutually exclusive, but is
sometimes needed if a model cannot handle more than two feature values. Unordering trivially
turns an ordinal feature into a categorical one by discarding the ordering of the feature values.
This is often required since most learning models cannot handle ordinal features directly.
Discretisation transforms a quantitative feature into an ordinal feature. Each ordinal value is
referred to as a bin and corresponds to an interval of the original quantitative feature. Again, we
can distinguish between supervised and unsupervised approaches. Unsupervised discretisation
methods typically require one to decide the number of bins beforehand. A simple
method that often works reasonably well is to choose the bins so that each bin has approximately
the same number of instances: this is referred to as equal-frequency discretisation.
Divisive discretisation methods proceed top-down by progressively splitting bins, whereas
agglomerative methods proceed by initially assigning each instance to
its own bin and successively merging bins. In either case an important role is played by the
stopping criterion, which decides whether a further split or merge is worthwhile.
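A short sketch (assuming pandas; pd.qcut gives equal-frequency bins and pd.cut equal-width bins; the ages and the number of bins are hypothetical) of unsupervised discretisation:

import pandas as pd

# Hypothetical quantitative feature: ages of ten people.
ages = pd.Series([23, 25, 31, 35, 41, 47, 52, 58, 63, 71])

# Equal-frequency discretisation: each of the 5 bins holds roughly the same number of instances.
equal_freq = pd.qcut(ages, q=5)

# Equal-width discretisation for comparison: each bin spans an interval of the same length.
equal_width = pd.cut(ages, bins=5)

print(pd.concat({"age": ages, "eq_freq_bin": equal_freq, "eq_width_bin": equal_width}, axis=1))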
Thresholding and discretisation are feature transformations that remove the scale of a quantitative
feature. We now turn our attention to adapting the scale of a quantitative feature, or adding a
scale to an ordinal or categorical feature. If this is done in an unsupervised fashion it is usually
called normalisation, whereas calibration refers to supervised approaches taking in the (usually
binary) class labels. Feature normalisation is often required to neutralise the effect of different
quantitative features being measured on different scales.
Sometimes feature normalisation is understood in the stricter sense of expressing the feature
on a [0,1] scale. This can be achieved in various ways. If we know the feature’s highest and
lowest values h and l, then we can simply apply the linear scaling f ↦ (f − l)/(h − l). We sometimes
have to guess the value of h or l, and truncate any value outside [l, h]. For example, if the
feature measures age in years, we may take l = 0 and h = 100, and truncate any f > h to 1.
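A minimal sketch of this [0, 1] scaling (using scikit-learn's MinMaxScaler, plus a manual NumPy equivalent with the assumed bounds l = 0 and h = 100 mentioned above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[2.0], [15.0], [37.0], [64.0], [98.0]])   # hypothetical ages in years

# Bounds learned from the data: l = min, h = max, then f -> (f - l) / (h - l).
print(MinMaxScaler().fit_transform(ages).ravel())

# Manual equivalent with assumed bounds l = 0 and h = 100,
# clipping (truncating) anything that falls outside [l, h].
l, h = 0.0, 100.0
print(np.clip((ages - l) / (h - l), 0.0, 1.0).ravel())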
We will assume a binary classification context, and so a natural choice for the calibrated feature’s
scale is the posterior probability of the positive class, conditioned on the feature’s value. This
has the additional advantage that models that are based on such probabilities, such as naive
Bayes, do not require any additional training once the features are calibrated.
Incomplete features
At the end of this section on feature transformations we briefly consider what to do if we don’t
know a feature’s value for some of the instances. Missing feature values at training time are
trickier to handle. First of all, the very fact that a feature value is missing may be correlated with
the target variable. For example, the range of medical tests carried out on a patient is likely to
depend on their medical history. For such features it may be best to have a designated ‘missing’
value so that, for instance, a tree model can split on it.
However, this would not work for, say, a linear model. In such cases we can complete the
feature by ‘filling in’ the missing values, a process known as imputation. For instance, in a
classification problem we can calculate the per-class means, medians or modes over the
observed values of the feature and use this to impute the missing values. A somewhat more
sophisticated method takes feature correlation into account by building a predictive model for
each incomplete feature and uses that model to ‘predict’ the missing value.
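A brief sketch (scikit-learn's SimpleImputer; the small matrix is hypothetical) of simple mean imputation; per-class or model-based imputation would follow the same pattern:

import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing values encoded as NaN.
X = np.array([[1.0, 20.0],
              [2.0, np.nan],
              [np.nan, 40.0],
              [4.0, 80.0]])

# Replace each missing value by the mean of the observed values of that feature.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))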
The previous section on feature transformation makes it clear that there is a lot of scope in
machine learning to play around with the original features given in the data. We can take this
one step further by constructing new features from several original features. For instance, we can construct
a new feature from two Boolean or categorical features by forming their Cartesian product.
For example, if we have one feature Shape with values Circle, Triangle and Square, and another
feature Colour with values Red, Green and Blue, then their Cartesian product would be the
feature (Shape,Colour) with values (Circle,Red), (Circle,Green), (Circle,Blue), (Triangle,Red),
and so on. The effect that this would have depends on the model being trained. Constructing
Cartesian product features for a naive Bayes classifier means that the two original features are
no longer treated as independent, and so this reduces the strong bias that naive Bayes models
have. This is not the case for tree models, which can already distinguish between all possible
pairs of feature values. On the other hand, a newly introduced Cartesian product feature may
incur a high information gain, so it can possibly affect the model learned.
There are many other ways of combining features. For instance, we can take arithmetic or
polynomial combinations of quantitative features. One attractive possibility is to first apply concept
learning or subgroup discovery, and then use these concepts or subgroups as new Boolean
features.
Once we have constructed new features it is often a good idea to select a suitable subset of
them prior to learning. Not only will this speed up learning as fewer candidate features need to
be considered, it also helps to guard against overfitting. There are two main approaches to
feature selection. The filter approach scores features on a particular metric and the top-scoring
features are selected. Many of the metrics we have seen so far can be used for feature scoring,
including information gain, the χ² statistic and the correlation coefficient, to name just a few. An
interesting variation is provided by the Relief feature selection method, which repeatedly samples
a random instance x and finds its nearest hit h (an instance of the same class) as well as its
nearest miss m (an instance of the opposite class). The i-th feature's score is then decreased by
Dis(xi, hi)² and increased by Dis(xi, mi)², where Dis is some distance measure (e.g., Euclidean
distance for quantitative features, Hamming distance for categorical features). The intuition is
that we want to move closer to the nearest hit while differentiating from the nearest miss.
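The Relief idea can be sketched in plain NumPy as follows (a simplified single nearest-hit/nearest-miss variant with squared per-feature differences, not a full ReliefF implementation; the function name and the data are hypothetical):

import numpy as np

def relief_scores(X, y, n_samples=50, seed=0):
    # Simplified Relief: repeatedly sample an instance, find its nearest hit and
    # nearest miss, and update the per-feature scores with squared differences.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        x, label = X[i], y[i]
        dists = np.linalg.norm(X - x, axis=1)
        dists[i] = np.inf                                  # exclude the instance itself
        same = (y == label)
        hit = np.argmin(np.where(same, dists, np.inf))     # nearest instance of the same class
        miss = np.argmin(np.where(~same, dists, np.inf))   # nearest instance of the other class
        scores -= (x - X[hit]) ** 2                        # move closer to the nearest hit
        scores += (x - X[miss]) ** 2                       # move away from the nearest miss
    return scores

# Hypothetical data: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([y + rng.normal(0, 0.1, 100), rng.normal(0, 1, 100)])
print(relief_scores(X, y))     # feature 0 should receive the higher score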
We can also view feature construction and selection from a geometric perspective, assuming
quantitative features. To this end we represent our data set as a matrix X with n data points in
rows and d features in columns, which we want to transform into a new matrix W with n rows
and r columns by means of matrix operations. To simplify matters a bit, we assume that X is
zero-centred and that W = XT for some d-by-r transformation matrix T. For example, feature
scaling corresponds to T being a d-by-d diagonal matrix; this can be combined with feature
selection by removing some of T's columns. A rotation is achieved by T being orthogonal, i.e.,
TᵀT = I.
One of the best-known algebraic feature construction methods is principal component analysis
(PCA). Principal components are new features constructed as linear combinations of the given
features. The first principal component is given by the direction of maximum variance in the
data; the second principal component is the direction of maximum variance orthogonal to the
first component, and so on. PCA can be explained in a number of different ways: here, we will
derive it by means of the singular value decomposition (SVD). Any n-by-d matrix X can be uniquely
written as a product of three matrices with special properties:

X = U Σ Vᵀ

where U is an n-by-r matrix with orthonormal columns, Σ is an r-by-r diagonal matrix holding the
non-negative singular values in decreasing order, and V is a d-by-r matrix with orthonormal
columns (with r the rank of X).
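A minimal sketch (NumPy only; the data matrix is hypothetical) of obtaining principal components through the SVD of the zero-centred data matrix:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 points in 3 dimensions with correlated features.
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])

Xc = X - X.mean(axis=0)                        # zero-centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = S ** 2 / (len(X) - 1)     # variance along each principal component
W = Xc @ Vt[:2].T                              # project onto the first two principal components

print(explained_variance)
print(W.shape)                                 # (100, 2)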
11.6 Summary
The features are the descriptive attributes, and the label is what you’re attempting to predict or
forecast. Features with an ordering but without scale are called ordinal features. Features without
ordering or scale are called categorical features. Structured features can be constructed either
prior to learning a model, or simultaneously with it. Feature transformations aim at improving
the utility of a feature by removing, changing or adding information. Thresholding and
discretisation are feature transformations that remove the scale of a quantitative feature.
11.7 Keywords
LESSON – 12
MODEL ENSEMBLES
Structure
12.1 Introduction
12.4 Boosting
12.5 Summary
12.6 Keywords
12.1 Introduction
Combinations of models are generally known as model ensembles. They are among the most
powerful techniques in machine learning, often outperforming other methods. This comes at
the cost of increased algorithmic and model complexity. The main motivations came from
computational learning theory on the one hand, and statistics on the other. It is a well-known
statistical intuition that averaging measurements can lead to a more stable and reliable estimate
because we reduce the influence of random fluctuations in single measurements.
So if we were to build an ensemble of slightly different models from the same training data, we
might be able to similarly reduce the influence of random fluctuations in single models. The key
question here is how to achieve diversity between these different models. As we shall see, this
can often be achieved by training models on random subsets of the data, and even by constructing
them from random subsets of the available features. In essence, ensemble methods in machine
learning have the following two things in common:
they construct multiple, diverse predictive models from adapted versions of the training
data (most often reweighted or resampled);
they combine the predictions of these models in some way, often by simple averaging
or voting (possibly weighted).
It should, however, also be stressed that these commonalities span a very large and diverse
space, and that we should correspondingly expect some methods to be practically very different
even though superficially similar. For example, it makes a big difference whether the way in
which training data is adapted for the next iteration takes the predictions of the previous models
into account or not.
Bagging, short for ‘bootstrap aggregating’, is a simple but highly effective ensemble method
that creates diverse models on different random samples of the original data set. These samples
are taken uniformly with replacement and are known as bootstrap samples. Because samples
are taken with replacement the bootstrap sample will in general contain duplicates, and hence
some of the original data points will be missing even if the bootstrap sample is of the same size
as the original data set. This is exactly what we want, as differences between the bootstrap
samples will create diversity among the models in the ensemble.
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is
commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data
in a training set is selected with replacement—meaning that the individual data points can be
chosen more than once. After several data samples are generated, these weak models are
then trained independently, and depending on the type of task—regression or classification, for
example—the average or majority of those predictions yield a more accurate estimate. As a
note, the random forest algorithm is considered an extension of the bagging method, using
both bagging and feature randomness to create an uncorrelated forest of decision trees.
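As a small illustrative sketch (assuming scikit-learn; the breast-cancer toy dataset and the parameter values are arbitrary choices, and BaggingClassifier bags decision trees by default), bagged trees can be compared with a single tree as follows:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default BaggingClassifier bags decision trees: here 100 trees, each
# trained on a bootstrap sample drawn with replacement from the training set.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("bagged accuracy:", bag.score(X_test, y_test))

# A single unpruned tree for comparison (typically higher variance).
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single-tree accuracy:", tree.score(X_test, y_test))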
Ensemble learning gives credence to the idea of the “wisdom of crowds,” which suggests that
the decision-making of a larger group of people is typically better than that of an individual
expert. Similarly, ensemble learning refers to a group (or ensemble) of base learners, or models,
which work collectively to achieve a better final prediction. A single model, also known as a
base or weak learner, may not perform well individually due to high variance or high bias.
However, when weak learners are aggregated, they can form a strong learner, as their
combination reduces bias or variance, yielding better model performance.
Ensemble methods are frequently illustrated using decision trees as this algorithm can be
prone to overfitting (high variance and low bias) when it hasn’t been pruned and it can also lend
itself to underfitting (low variance and high bias) when it’s very small, like a decision stump,
which is a decision tree with one level. Remember, when an algorithm overfits or underfits to its
training set, it cannot generalize well to new datasets, so ensemble methods are used to
counteract this behavior to allow for generalization of the model to new datasets. While decision
trees can exhibit high variance or high bias, it’s worth noting that it is not the only modeling
technique that leverages ensemble learning to find the “sweet spot” within the bias-variance
tradeoff.
In 1996, Leo Breiman introduced the bagging algorithm, which has three basic steps:
Bootstrapping: Random samples of the training data are drawn with replacement to produce
several bootstrap samples.
Parallel training: These bootstrap samples are then trained independently and in parallel with
each other using weak or base learners.
Aggregation: The predictions of the individual learners are combined, by averaging for regression
or majority voting for classification, to produce the final prediction.
There are a number of key advantages and challenges that the bagging method presents when
used for classification or regression problems. The key benefits of bagging include:
Ease of implementation: Python libraries such as scikit-learn (also known as sklearn) make
it easy to combine the predictions of base learners or estimators to improve model performance.
Reduction of variance: Bagging can reduce the variance within a learning algorithm. This is
particularly helpful with high-dimensional data, where missing values can lead to higher variance,
making it more prone to overfitting and preventing accurate generalization to new datasets.
Loss of interpretability: It's difficult to draw very precise business insights through bagging
because of the averaging involved across predictions. While the output is more precise
than any individual data point, a more accurate or complete dataset could also yield more
precision within a single classification or regression model.
Computationally expensive: Bagging slows down and grows more intensive as the number
of iterations increases. Thus, it's not well-suited for real-time applications. Clustered systems or
a large number of processing cores are ideal for quickly creating bagged ensembles on large
test sets.
Less flexible: As a technique, bagging works particularly well with algorithms that are less
stable. Algorithms that are more stable or subject to high amounts of bias do not provide as much
benefit, as there's less variation within the dataset of the model. As noted in the Hands-On
Guide to Machine Learning, "bagging a linear regression model will effectively just return the
original predictions for large enough b."
Applications of Bagging
The bagging technique is used across a large number of industries, providing insights for both
real-world value and interesting perspectives, such as in the GRAMMY Debates with Watson.
Healthcare: Bagging has been used to form medical data predictions. For example, research
shows that ensemble methods have been used for an array of bioinformatics problems, such
as gene and/or protein selection to identify a specific trait of interest. More specifically, this
research delves into its use to predict the onset of diabetes based on various risk predictors.
IT: Bagging can also improve the precision and accuracy of IT systems, such as network
intrusion detection systems; research has examined how bagging can improve the accuracy of
network intrusion detection and reduce the rate of false positives.
Environment: Ensemble methods, such as bagging, have been applied within the field of
remote sensing. More specifically, this research shows how it has been used to map the types
of wetlands within a coastal landscape.
Finance: Bagging has also been leveraged with deep learning models in the finance industry,
automating critical tasks including fraud detection, credit risk evaluation, and option pricing
problems. Research demonstrates how bagging, among other machine learning techniques, has
been leveraged to assess loan default risk, and highlights how bagging helps to minimize risk
by helping to prevent credit card fraud within banking and financial institutions.
Random Forest
Random forest is an ensemble model using bagging as the ensemble method and decision
tree as the individual model. Random Forest is a popular machine learning algorithm that belongs
to the supervised learning technique. It can be used for both Classification and Regression
problems in ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, “Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
The below diagram explains the working of the Random Forest algorithm(Figure 12.1):
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
o It predicts output with high accuracy, and it runs efficiently even on large datasets.
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions for each of the trees created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below Figure 12.2 :
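A minimal sketch (scikit-learn; the built-in iris data stands in for the fruit-image example, and N = 100 trees is an arbitrary choice) of training a random forest:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 decision trees, each grown on a bootstrap sample and random feature
# subsets; the final class is decided by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
print("prediction for one new data point:", forest.predict(X_test[:1]))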
There are mainly four sectors where Random Forest is mostly used:
Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
Land Use: We can identify the areas of similar land use by this algorithm.
It enhances the accuracy of the model and prevents the overfitting issue.
Although random forest can be used for both classification and regression tasks, it is less
suitable for regression tasks.
12.4 Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a
number of weak classifiers. It is done by building a model using weak models in series.
Firstly, a model is built from the training data. Then the second model is built which tries to
correct the errors present in the first model. This procedure is continued and models are added
until either the complete training data set is predicted correctly or the maximum number of
models are added.
There are several boosting algorithms. The original ones, proposed by Robert Schapire and
Yoav Freund were not adaptive and could not take full advantage of the weak learners. Schapire
and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious
Gödel Prize. AdaBoost was the first really successful boosting algorithm developed for the
purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very popular
boosting technique that combines multiple “weak classifiers” into a single “strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results have been obtained,
       go to step 5;
   else
       go to step 2.
5. End
Bagging and Boosting are both commonly used methods, and they share the universal similarity
of being classified as ensemble methods. Here we note a key similarity between them:
both make the final decision by averaging the N learners (or by taking the majority of them, i.e.,
majority voting).
AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble
Method in Machine Learning. It is called Adaptive Boosting as the weights are re-assigned to
each instance, with higher weights assigned to incorrectly classified instances. Boosting is
used to reduce bias as well as variance for supervised learning. It works on the principle of
learners growing sequentially. Except for the first, each subsequent learner is grown from
previously grown learners. In simple words, weak learners are converted into strong ones. The
AdaBoost algorithm works on the same principle as boosting with a slight difference.
First, let us discuss how boosting works. It makes ‘n’ number of decision trees during the data
training period. As the first decision tree/model is made, the incorrectly classified record in the
first model is given priority. Only these records are sent as input for the second model. The
process goes on until we specify a number of base learners we want to create. Remember,
repetition of records is allowed with all boosting techniques.
The above Figure 12.3, shows how the first model is made and errors from the first model are
noted by the algorithm. The record which is incorrectly classified is used as input for the next
model. This process is repeated until the specified condition is met. As you can see in the
figure, there are ‘n’ number of models made by taking the errors from the previous model. This
is how boosting works. The models 1,2, 3,…, N are individual models that can be known as
decision trees. All types of boosting models work on the same principle.
Since we now know the boosting principle, it will be easy to understand the AdaBoost algorithm.
Let’s dive into AdaBoost’s working. When the random forest is used, the algorithm makes an ‘n’
number of trees. It makes proper trees that consist of a start node with several leaf nodes.
Some trees might be bigger than others, but there is no fixed depth in a random forest. With
AdaBoost, however, the algorithm only makes a node with two leaves, known as Stump.
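A brief sketch (scikit-learn's AdaBoostClassifier, which by default boosts decision stumps, i.e., depth-1 trees; the dataset and the number of estimators are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 stumps trained sequentially; each round re-weights the training instances
# so that previously misclassified points receive more attention.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))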
12.5 Summary
Combinations of models are generally known as model ensembles. They are among the most
powerful techniques in machine learning, often outperforming other methods. Bagging, short
for ‘bootstrap aggregating’, is a simple but highly effective ensemble method that creates diverse
models on different random samples of the original data set. Random forest is an ensemble
model using bagging as the ensemble method and decision tree as the individual model. Boosting
is an ensemble modeling technique that attempts to build a strong classifier from a number of
weak classifiers.
12.6 Keywords
LESSON – 13
13.1 Introduction
13.6 Summary
13.7 Keywords
13.1 Introduction
Machine learning experiments pose questions about models that we try to answer by means
of measurements on data. The following are common examples of the types of question we are
interested in:
Which of these models has the best performance on data from domain D?
Which of these learning algorithms gives the best model on data from domain D?
The aim is to gain insight into how to measure such quantities and how to interpret the measurements.
A good starting point for our measurements is the evaluation measures. The appropriateness
of any of these for our purposes depends on how we define performance in relation to the
question the experiment is designed to answer: let’s call it our experimental objective. It is
important not to confuse performance measures and experimental objectives: the former is
something we can measure, while the latter is what we are really interested in. There is often a
discrepancy between the two.
In machine learning the situation is usually more concrete, and our experimental objective –
accuracy, say – is something we can measure in principle, or at least estimate (since we’re
generally interested in accuracy on unseen data). However, there may be unknown factors we
have to account for. For example, the model may need to operate in different operating contexts
with different class distributions.
If you choose accuracy as your evaluation measure, you are making an implicit assumption
that the class distribution in the test set is representative for the operating context in which the
model is going to be deployed. Furthermore, if all you recorded in your experiments is accuracy,
you will not be able to switch to average recall later if you realise that you need to incorporate
varying class distributions.
In summary, your choice of evaluation measures should reflect the assumptions you are making
about your experimental objective as well as possible contexts in which your models operate.
We have looked at the following cases:
Accuracy is a good evaluation measure if the class distribution in your test set is
representative for the operating context.
Average recall is the evaluation measure of choice if all class distributions are equally
likely.
Precision and recall shift the focus from classification accuracy to a performance analysis
ignoring the true negatives.
Predicted positive rate and AUC are relevant measures in a ranking context.
The question of ‘how to measure it’ thus seems to have a very straightforward answer: construct
the contingency table from a test set and perform the relevant calculations. However, two issues
demand our attention: (i) which data to base our measurements on, and (ii) how to assess the
inevitable uncertainty associated with every measurement.
In k-fold cross-validation we randomly partition the data into k folds of roughly equal size, set
one fold aside for testing, train a model on the remaining k − 1 folds and evaluate it on the test
fold. This process is repeated k times until each fold has been used for testing once.
This may seem curious at first since we are evaluating k models rather than a single one, but
this makes sense if we are evaluating a learning algorithm (whose output is a model, so we
want to average over models) rather than a single model (whose outputs are instance labels,
so we want to average over those). By averaging over training sets we get a sense of the
variance of the learning algorithm (i.e., its dependence on variations in the training data), although
it should be noted that the training sets in cross-validation have considerable overlap and are
clearly not independent. Once we are satisfied with the performance of our learning algorithm,
we can run it over the entire data set to obtain a single model.
If we expect the learning algorithm to be sensitive to the class distribution we should apply
stratified cross-validation: this aims at achieving roughly the same class distribution in each
fold. Cross-validation runs can be repeated for different random partitions into folds and the
results averaged again to further reduce variance in our estimates: this is referred to as, e.g.,
10 times 10-fold cross-validation. It should be kept in mind that this leads increasingly to
independence assumptions being violated – if we take this too far our accuracy estimate will
overfit the given data and not be representative for new data.
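A minimal sketch (assuming scikit-learn; the pipeline, dataset and fold count are illustrative choices) of stratified 10-fold cross-validation of a learning algorithm:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression())

# Stratified folds keep roughly the same class distribution in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)

print("per-fold accuracy:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))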
Once we have estimates of a relevant evaluation measure for our models or learning algorithms
we can use them to select the best one. The fundamental issue here is how to deal with the
inherent uncertainty in these estimates. We will discuss two key concepts: confidence intervals
and significance tests. An understanding of these concepts is necessary if you want to appreciate
current practice in interpreting results from machine learning experiments – however, it is good
to realise that current practice is coming under increasing scrutiny.
It should also be noted that the methods described here represent only a tiny fraction of the vast
spectrum of possibilities. Suppose our estimate â follows a normal distribution around the true
mean a with standard deviation σ. Assuming for the moment that we know these parameters,
we can calculate for any interval the likelihood of the estimate falling in the interval, by calculating
the area under the normal density function in that interval.
For example, the likelihood of obtaining an estimate within ±1 standard deviation around the
mean is 68%. Thus, if we take 100 estimates from independent test sets, we expect 68 of them
to be within one standard deviation on either side of the mean – or equivalently, we expect the
true mean to fall within one standard deviation on either side of the estimate in 68 cases. This
is called the 68% confidence interval of the estimate.
For two standard deviations the confidence level is 95% – these values can be looked up in
probability tables or calculated using statistical packages such as Matlab or R. Notice that
confidence intervals for normally distributed estimates are symmetric because the normal
distribution is symmetric, but this is not generally the case: e.g., the binomial distribution is
asymmetric (except for p = 1/2). Notice also that, in case of symmetry, we can easily change
the interval into a one-sided interval: for example, we expect the mean to be more than one
standard deviation above the estimate in 16 cases out of 100, which gives a one-sided 84%
confidence interval from minus infinity to the mean plus one standard deviation.
More generally, in order to construct confidence intervals we need to know (i) the sampling
distribution of the estimates, and (ii) the parameters of that distribution. We saw previously that
accuracy estimated from a single test set with n instances follows a scaled binomial distribution
with variance â(1 − â)/n. This would lead to asymmetric confidence intervals, but the skew in
the binomial distribution is only really noticeable if na(1 − a) < 5: if that is not the case the normal
distribution is a good approximation for the binomial one. So, we use the binomial expression
for the variance and use the normal distribution to construct the confidence intervals. Notice
that confidence intervals are statements about estimates rather than statements about the true
value of the evaluation measure.
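A short sketch (NumPy/SciPy; the test-set size and the number of correct predictions are made-up figures) of normal-approximation confidence intervals for an accuracy estimate, using the binomial variance â(1 − â)/n described above:

import numpy as np
from scipy import stats

n = 200            # hypothetical test set size
correct = 176      # hypothetical number of correctly classified instances
acc = correct / n

se = np.sqrt(acc * (1 - acc) / n)          # binomial standard error of the estimate
for level in (0.68, 0.95):
    z = stats.norm.ppf(0.5 + level / 2)    # two-sided critical value from the normal
    print(int(level * 100), "percent CI:",
          round(acc - z * se, 3), "to", round(acc + z * se, 3))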
13.6 Summary
13.7 Keywords
8. Define Boosting.
Section – C (3 x 10 = 30 Marks)
18. Write in detail about any five real-time applications of machine learning.
22. Explain in detail the bagging and boosting algorithms in machine learning.
SPCA 202N
MASTER OF
COMPUTER APPLICATIONS
SECOND YEAR
THIRD SEMESTER
PRACTICAL-V
MASTER OF COMPUTER APPLICATIONS CORE PAPER - XII
SECOND YEAR - THIRD SEMESTER
PRACTICAL-V: MACHINE
LEARNING LAB
COURSE WRITER
Dr. S. SASIKALA
Associate Professor in Computer Science
Institute of Distance Education
University of Madras
Chepauk, Chennai - 600 005.
(ii)
MASTER OF COMPUTER APPLICATIONS
SECOND YEAR
THIRD SEMESTER
PRACTICAL- V
SYLLABUS
Course Objective: To introduce the basic concepts and techniques of Machine Learning.
To develop skills of using recent machine learning software for solving practical problems
Course Outcomes: After successful completion of this course, the student will be able to
identify the characteristics of datasets and compare trivial data with big data.
Machine learning platforms: WEKA machine learning workbench, R platform, Python SciPy.
Machine learning libraries: scikit-learn in Python, JSAT in Java, Accord Framework in .NET
(iii)
1. Data Preprocessing:
a. Data Cleaning
b. Data Transformation
c. Data Reduction
d. Feature extraction
2. Supervised learning:
a. Decision tree
b. Support vector machine
c. Multilayer perceptron
d. Regression
3. Unsupervised learning:
a. K-Means clustering
b. Hierarchical clustering
Mini Project: Application of data preprocessing and machine learning techniques to a data set
selected from the UCI repository / Kaggle / Government data sources, and submission of a
report.
(iv)
INSTITUTE OF DISTANCE EDUCATION
RECORD OF PRACTICALS
2020-2021
Practical - V
Name :
Enrolment Number :
Group No :
UNIVERSITY OF MADRAS
CHENNAI - 600 005
(v)
INSTITUTE OF DISTANCE EDUCATION
UNIVERSITY OF MADRAS
CHENNAI - 600 005.
Degree Course in the Institute of Distance Education, University of Madras during the year
LEARNING LAB
Date: Co-ordinator
Submitted for Third Year M.C.A. Degree Course Practical Examination held on
Date: Examiners
1. Name:
Signature:
2. Name:
Signature:
(vi)
MASTER OF COMPUTER APPLICATIONS
PRACTICAL- V
SCHEME OF LESSONS
DATA PREPROCESSING
1 Data Cleaning 10
2 Data Transformation 16
3 Data Reduction 21
4 Feature Extraction 28
SUPERVISED LEARNING
5 Decision Tree
6 Support Vector Machine
7 Multilayer Perceptron
8 Regression 47
UNSUPERVISED LEARNING
9 K-Means Clustering 53
10 Hierarchical Clustering 60
(vii)
Python Language
Python suits anyone who wants to build a career in programming. Because of its reliability and
ease of delivery, it is widely adopted in industry and gives aspiring programmers a firm foundation.
Created by Guido van Rossum in the late 1980s and first released in 1991, the language is now
developed, supported, managed and maintained by the Python Software Foundation (PSF).
Python is flexible and runs smoothly on all the major operating systems, such as Windows, Linux
and macOS, so it integrates easily with the back-end processes of organisations worldwide.
Python has a number of distinctive features that make it a rewarding language to learn. Let us
look at some of these features, which make programming in Python an easy task for beginners.
Features of Python
The concepts and logic of a subject stay the same whichever language is used, so it is up to the
student whether to learn it the easy way or the hard way. Learning a language the hard way can
break a student's interest and eventually push them out of the field, which is why Python's gentle
learning curve matters.
Python can be downloaded and installed free of charge on any major OS from its official website,
and its open-source licence is also valid for commercial use.
Even though Python is suitable for beginners, it is a powerful language, and most of its instructions
read almost like English.
Because its object-oriented structure resembles that of other languages, a professional who
already knows another object-oriented language will find Python convenient to pick up, and the
skills transfer in the other direction as well.
Python is an interpreted language: the code is checked and executed statement by statement at
run time rather than being compiled in advance.
A detailed listing of the packages available for Python is provided in the Python Package Index
(PyPI). For areas such as GUIs, testing, automation, databases, networking, web development,
image processing and text processing, Python offers several widely used libraries, which play
the following roles:
• Hadoop – Through the Pydoop library, Python programs can interact with Hadoop and
support Big Data processing.
• Web Development – Frameworks such as Django, Pylons and Flask, which are written
in Python, provide a stable base for developing websites.
• Automated Testing – Automated testing tools such as Selenium and Splinter expose
application programming interfaces that can be driven from Python. A developer can also
run tests across platforms and browsers with the help of pytest.
• Graphics – Using Python's Tkinter library, GUI applications can be written and run
effortlessly.
• Image Processing – PIL, the Python Imaging Library, supports image files in a wide
range of formats.
Python's support for scientific computing is improving rapidly. Libraries such as NumPy, SciPy,
Pandas and Matplotlib remove many of the obstacles in statistical data modelling. For example:
• Pandas – Pandas provides data-frame functionality and data munging. It can read from
and write to SQL databases, CSV, Excel and text files (a short sketch follows this list).
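As a minimal illustration (not part of the original syllabus material), the sketch below loads a CSV file with Pandas, inspects it and writes it back out to Excel; the file name employee.csv is only a placeholder for any comma-separated file.

import pandas as pd

# load a CSV file into a DataFrame (the file name is only a placeholder)
df = pd.read_csv('employee.csv')
print(df.head())        # first five rows
print(df.describe())    # basic statistics of the numeric columns
# writing to Excel requires the openpyxl package to be installed
df.to_excel('employee.xlsx', index=False)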
Python supports both procedural and object-oriented (OOP) coding styles, so users can write
programs either as sequences of procedures or as collections of classes. In the procedural style,
a coder writes the program as a series of statements operating on data. The OOP style involves
programming with classes, objects and methods, which opens the road to inheritance, abstraction
and polymorphism.
• A class is the structural unit that groups data together with the functions that operate on
it, so that the code can be reused. A function defined inside a class is called a method.
• An object is an instance of a class, created at run time when the code is executed.
• Abstraction is the process of hiding a class's complex internal procedures and exposing
only a simple interface.
• Inheritance lets a subclass reuse the functions and attributes of its primary or parent
class.
• Polymorphism goes hand in hand with inheritance: it lets the inheriting class perform the
same functions as the parent class in a different way.
A short Python sketch tying these ideas together is given after this list.
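The sketch below is a minimal illustration of these terms; the classes Employee and Manager are invented purely for the example.

class Employee:                        # class: grouped data plus methods
    def __init__(self, name):
        self.name = name               # object state created at run time
    def describe(self):                # method
        return f"{self.name} is an employee"

class Manager(Employee):               # inheritance: reuse of the parent class
    def describe(self):                # polymorphism: same method, different behaviour
        return f"{self.name} manages a team"

for person in (Employee("Asha"), Manager("Ravi")):
    # the caller works through the common interface (abstraction)
    print(person.describe())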
Installation on Windows
Double-click the downloaded executable file; the installation window will open. Tick the Add
Python to PATH check box so that the Python path is set automatically, then select Customize
installation and proceed.
For all recent versions of Python, the recommended installation options include Pip and IDLE.
Older versions might not include such additional features.
Now try to run Python from the command prompt: type the command python --version (or
python -V) to check the installed version.
Step – 4: The next dialog will prompt you to select whether to Disable path length limit.
Choosing this option will allow Python to bypass the 260-character MAX_PATH limit. Effectively,
it will enable Python to use long path names.
The Disable path length limit option will not affect any other system settings. Turning it on
will resolve potential name length issues that may arise with Python projects developed in
Linux.
Alternatively, double-click python.exe in the installation folder to open the Python interactive shell.
If you opted to install an older version of Python, it is possible that it did not come with Pip
preinstalled. Pip is a powerful package management system for Python software packages.
Thus, make sure that you have it installed. We recommend using Pip for most Python packages,
especially when working in virtual environments.
Enter pip -V in the console. If Pip was installed successfully, its version number and install
location are printed.
Packages installation:
Package: A package contains all the files you need for a module.
Navigate your command line to the location of Python's Scripts directory and check that pip works:
Example: pip --version
Download a Package
Open the command-line interface and tell pip to install the package you want.
Example: pip install <package_name>
Remove a Package
Example: pip uninstall <package_name>
LESSON -1
DATA CLEANING
AIM:
To perform various data cleaning tasks such as handling missing values, duplicate values,
irrelevant data and manual errors on the given “Employee Details” dataset in Python.
INPUT DATAFRAME:
CODING:
"""
READ DATA
"""
import numpy as np
import pandas as pd
employee = pd.read_csv(r'D:\Machine Learning\DataCleaning.CSV')
print(employee)
"""
REMOVE DUPLICATE ROWS
"""
employee.drop_duplicates(subset="Name", keep='first', inplace=True)
print(employee)
"""
DETECT MISSING VALUES
"""
missing = employee.isnull()
print(missing)
"""
DROP ROWS WITH MISSING VALUES
"""
employee = employee.dropna(axis=0)
print(employee)
"""
REMOVE AN IRRELEVANT COLUMN
"""
del employee['Sr.No']
"""
CORRECT MANUAL ERRORS
"""
employee['Project'] = employee['Project'].str.replace('mobile', 'Mobile')
"""
RENAMING COLUMNS
"""
employee.columns = ['EmployeeName', 'Address', 'Mobile', 'Domain', 'E-mailid']
print(employee)
OUTPUT:
(The dataframes printed after each cleaning step, covering duplicate removal, missing-value
detection and removal, column deletion, error correction and column renaming, are not
reproduced here.)
LESSON - 2
DATA TRANSFORMATION
AIM:
To perform data transformation tasks such as converting categorical data into numeric format
on the given “Employee Details” dataset in Python.
INPUT DATAFRAME:
CODING:
"""
READ DATA
"""
import numpy as np
import pandas as pd
df = pd.read_csv(r'd:\Machine Learning\Student.CSV', header=0)
print(df)
"""
SELECT CATEGORICAL COLUMNS
"""
df_categorial = df.select_dtypes(exclude=[np.number])
print(df_categorial)
"""
UNIQUE VALUES
"""
x = df_categorial['Grade'].unique()
print(x)
"""
VALUE COUNTS
"""
y1 = df_categorial['Grade'].value_counts()
print(y1)
y2 = df_categorial['Gender'].value_counts()
print(y2)
"""
Replace
"""
# map the grade categories to numbers (mapping assumed from the output listing)
df.Grade.replace({"1st Class": 1, "2nd Class": 2, "3rd Class": 3}, inplace=True)
print(df)
df.Employed.replace({"yes": 1, "no": 0}, inplace=True)
print(df)
df.head()
OUTPUT:
"""
READ DATA
"""
(The Student dataframe and its categorical columns are printed; full listings omitted.)
"""
VALUE COUNTS
"""
2nd Class    80
3rd Class    80
1st Class    72
Male      136
Female     96
"""
Replace
"""
0  1  19  Male    1  yes
1  2  20  Female  2  no
2  3  18  Male    1  no
3  4  21  Female  2  no
4  5  19  Male    1  no

0  1  19  Male    1  1
1  2  20  Female  2  0
2  3  18  Male    1  0
3  4  21  Female  2  0
4  5  19  Male    1  0
LESSON - 3
DATA REDUCTION
AIM:
To perform data reduction using Principal Component Analysis (PCA) on the given “Iris” dataset
in Python.
INPUT DATAFRAME:
CODING:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv(r"D:\Machine Learning\iris.csv")
print(df.head())
labels = df['Species']
x = df.drop(["Id", "Species"], axis=1)
print(x)
plt.scatter(x["SepalWidthCm"], x["PetalLengthCm"])
plt.show()
# standardise the four measurement columns before applying PCA
variables = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
x = df.loc[:, variables].values
y = df.loc[:, "Species"].values
x = StandardScaler().fit_transform(x)
x = pd.DataFrame(x)
x.head()
pca = PCA()
x_pca = pca.fit_transform(x)
x_pca = pd.DataFrame(x_pca)
x_pca.head()
explained_variance = pca.explained_variance_ratio_
explained_variance
x_pca["Species"] = y
x_pca.columns = ["PC1", "PC2", "PC3", "PC4", "Species"]
x_pca.head()
# plot the first two principal components, one colour per species
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
targets = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
colors = ["red", "green", "blue"]
for target, color in zip(targets, colors):
    indicesToKeep = x_pca["Species"] == target
    ax.scatter(x_pca.loc[indicesToKeep, "PC1"],
               x_pca.loc[indicesToKeep, "PC2"],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()
OUTPUT:
(The READ DATA, LABEL DATA, PLOT DATA and TRANSFORM DATA listings and the scatter
plot of SepalWidthCm against PetalLengthCm are not reproduced here.)
"""
CALCULATING VARIANCE
"""
0.727705
0.230305
0.0368383
0.00515193
(A scatter plot of PC1 against PC2, coloured by species, is produced.)
LESSON - 4
FEATURE EXTRACTION
AIM:
To perform feature extraction (edge detection with Gaussian smoothing and a Sobel filter) on a
synthetic image in Python.
CODING:
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage

# create a synthetic image containing a white square
im = np.zeros((256, 256))
im[64:-64, 64:-64] = 1
plt.imshow(im)
plt.show()

# smooth the square with a Gaussian filter
im = ndimage.gaussian_filter(im, 8)
plt.imshow(im)
plt.show()

# extract edges with the Sobel operator along each axis
sx = ndimage.sobel(im, axis=0, mode='constant')
sy = ndimage.sobel(im, axis=1, mode='constant')
sob = np.hypot(sx, sy)
plt.imshow(sob)
plt.show()

plt.figure(figsize=(16, 5))
plt.subplot(141)
plt.imshow(im, cmap=plt.cm.gray)
plt.axis('off')
plt.title('square', fontsize=20)
plt.subplot(142)
plt.imshow(sx)
plt.axis('off')
plt.subplot(143)
plt.imshow(sob)
plt.axis('off')
# im += 0.07*np.random.random(im.shape)   # optionally add noise before filtering
plt.subplot(144)
plt.imshow(sob)
plt.axis('off')
plt.show()
OUTPUT:
(The original square, the smoothed image and the Sobel edge maps are displayed as plots.)
LESSON - 5
DECISION TREE
AIM:
To demonstrate the working of the Decision Tree algorithm to classify the "Balance Scale"
dataset in Python.
INPUT DATAFRAME:
Class  Left-Weight  Left-Distance  Right-Weight  Right-Distance
B 1 1 1 1
R 1 1 1 2
R 1 1 1 3
R 1 1 1 4
R 1 1 1 5
R 1 1 2 1
R 1 1 2 2
R 1 1 2 3
R 1 1 2 4
R 1 1 2 5
R 1 1 3 1
R 1 1 3 2
R 1 1 3 3
R 1 1 3 4
R 1 1 3 5
R 1 1 4 1
R 1 1 4 2
R 1 1 4 3
R 1 1 4 4
CODING:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

def importdata():
    # read the Balance Scale data; the file location is assumed, adjust as needed
    balance_data = pd.read_csv(r'D:\Machine Learning\balance-scale.csv', header=None)
    return balance_data

def splitdataset(balance_data):
    # columns 1-4 are the features, column 0 is the class label
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

def train_using_gini(X_train, y_train):
    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=100,
                                      max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

def train_using_entropy(X_train, y_train):
    clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100,
                                         max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

def prediction(X_test, clf_object):
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, y_train)
    clf_entropy = train_using_entropy(X_train, y_train)
    # Operational Phase
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

if __name__ == "__main__":
    main()
OUTPUT:
READ DATA
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
Predicted values:
[‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’
‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’
‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’
‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’
‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’
‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’
‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’
‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’
‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’
‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’
Confusion Matrix:
[[ 0 6 7]
[ 0 67 18]
[ 0 19 71]]
Accuracy : 73.40425531914893
Predicted values:
[‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’
‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’
‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’
‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’
‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’
‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘L’ ‘R’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’
‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’
‘L’ ‘L’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’
‘L’ ‘R’ ‘R’ ‘L’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’ ‘L’ ‘R’ ‘R’ ‘R’ ‘R’ ‘R’
‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘R’ ‘L’ ‘L’ ‘L’ ‘L’ ‘L’ ‘R’
Confusion Matrix:
[[ 0 6 7]
[ 0 63 22]
[ 0 20 70]]
Accuracy : 70.74468085106383
LESSON - 6
SUPPORT VECTOR MACHINE
AIM:
To implement Support Vector Machine (SVM) classifiers with different kernels and visualise
their decision boundaries on the "Iris" dataset in Python.
INPUT DATAFRAME:
(The Iris dataset is loaded directly from scikit-learn.)
CODING:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

def make_meshgrid(x, y, h=0.02):
    """Create a mesh of points to plot in.
    Parameters: x, y - data to base the meshgrid on; h - step size.
    Returns: xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.
    Parameters: ax - matplotlib axes; clf - a classifier; xx, yy - meshgrid arrays.
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the decision boundaries on the raw features
C = 1.0  # SVM regularisation parameter
models = (svm.SVC(kernel='linear', C=C),
          svm.LinearSVC(C=C, max_iter=10000),
          svm.SVC(kernel='rbf', gamma=0.7, C=C),
          svm.SVC(kernel='poly', degree=3, gamma='auto', C=C))
models = (clf.fit(X, y) for clf in models)
titles = ('SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel')

fig, sub = plt.subplots(2, 2)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title, ax in zip(models, titles, sub.flatten()):
    plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
plt.show()
OUTPUT:
(Four panels showing the decision boundaries of the four SVM variants on sepal length and
sepal width.)
LESSON - 7
MULTILAYER PERCEPTRON
AIM:
To implement a Multilayer Perceptron and use it to classify the "Bank Notes" dataset in Python.
INPUT DATAFRAME:
CODING:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report

# read the Bank Notes data; the file location is assumed, adjust as needed
bnotes = pd.read_csv(r'D:\Machine Learning\bank_note_data.csv')
bnotes.head()
bnotes.tail()
bnotes.shape
bnotes.isnull().sum()
print(bnotes.head())
print(bnotes['Class'].unique())
bnotes.describe(include='all')
x = bnotes.drop('Class', axis=1)
y = bnotes['Class']
print(x.head(2))
print(y.head(2))
# split into training and test sets (a 70/30 split is assumed from the output shapes)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
print(x_train.shape)
print(y_train.shape)
mlp = MLPClassifier(hidden_layer_sizes=(3, 2), max_iter=500, activation='relu')
mlp.fit(x_train, y_train)
# predicting the data and calculating the confusion matrix and classification report
pred = mlp.predict(x_test)
confusion_matrix(y_test, pred)
print(classification_report(y_test, pred))
OUTPUT:
READ DATA
(1372, 5)
NULL VALUE
Image.Var 0
Image.Skew 0
Image.Curt 0
Entropy 0
Class 0
CLASS LABEL
[0 1]
0 0
1 0
SHAPE
(960, 4)
(960,)
PREDICTED VALUES
[1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 0 1 0 0 1 0 1
0010000101000001000001000110010010110
0000001001011010010100110000111111010
0001111100001000100111001001010111000
1111010010000111101101111101010001000
0010011001111101110010001110100011100
0110000100010001110000100010010010101
1100001011110010010111001100001000010
1011011110100001111000110101000101000
0110100010001011010100000010011010001
1100001010000011001010011001100000011
0 0 0 1 1]
CONFUSION MATRIX
[[240 1]
[ 0 171]]
CLASSIFICATION REPORT
LESSON - 8
REGRESSION
AIM:
To implement Linear Regression to predict the salary value from the “Salary” dataset in
Python.
INPUT DATAFRAME:
YearsExperience Salary
1.1 39343
1.3 46205
1.5 37731
2 43525
2.2 39891
2.9 56642
3 60150
3.2 54445
3.2 64445
3.7 57189
3.9 63218
4 55794
4 56957
4.1 57081
4.5 61111
4.9 67938
5.1 66029
5.3 83088
5.9 81363
CODING:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# read the Salary data; the file location is assumed, adjust as needed
Salary_Data = pd.read_csv(r'D:\Machine Learning\Salary_Data.csv')
Salary_Data.head()
Salary_Data.tail()
Salary_Data.shape
Salary_Data.isnull().sum()
X = Salary_Data.iloc[:, :-1].values
y = Salary_Data.iloc[:, 1].values
print(X)
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print(y_pred)
print(regressor.coef_)
print(regressor.intercept_)
# calculate the r2 value
n = r2_score(y_test, y_pred)
print(n)
plt.scatter(X_test, y_test, color='red')
plt.plot(X_test, y_pred, color='blue')
plt.show()
OUTPUT:
0 1.1 39343.0
1 1.3 46205.0
2 1.5 37731.0
3 2.0 43525.0
4 2.2 39891.0
25 9.0 105582.0
26 9.5 116969.0
27 9.6 112635.0
28 10.3 122391.0
29 10.5 121872.0
Out[3]: (30, 2)
YearsExperience 0
Salary 0
[[ 1.1]
[ 1.3]
[ 1.5]
[ 2. ]
[ 2.2]
[ 2.9]
[ 3. ]
[ 3.2]
[ 3.2]
[ 3.7]
[ 3.9]
[ 4. ]
[ 4. ]
[ 4.1]
[ 4.5]
[ 4.9]
[ 5.1]
[ 5.3]
[ 5.9]
[ 6. ]
[ 6.8]
[ 7.1]
[ 7.9]
[ 8.2]
[ 8.7]
[ 9. ]
[ 9.5]
[ 9.6]
[10.3]
[10.5]]
115249.56285456 107799.50275317]
[9312.57512673]
26780.099150628186
Calculate r2 values
0.988169515729126
LESSON - 9
K-MEANS CLUSTERING
AIM:
To implement the K-Means clustering algorithm to group customers of the "Mall Customers"
dataset by annual income and spending score in Python.
INPUT DATAFRAME:
CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
6 Female 22 17 76
7 Female 35 18 6
8 Female 23 18 94
9 Male 64 19 3
10 Female 30 19 72
11 Male 67 19 14
12 Female 35 19 99
13 Female 58 20 15
14 Female 24 20 77
15 Male 37 20 13
16 Male 22 20 79
17 Female 35 21 35
18 Male 20 21 66
19 Male 52 23 29
CODING:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv(r'D:\Machine Learning\Mall_Customers.csv')
df.head()
df.info()
df.isnull().sum()
# use Annual Income and Spending Score as the clustering features
x = df.iloc[:, [3, 4]].values
print(x)
# find a suitable number of clusters with the elbow method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
sns.set()
plt.plot(range(1, 11), wcss)
plt.xlabel('No of Clusters')
plt.ylabel('wcss')
plt.show()
# train the final model with five clusters (the elbow point)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y = kmeans.fit_predict(x)
print(y)
plt.figure(figsize=(8, 8))
plt.scatter(x[y == 0, 0], x[y == 0, 1], s=50, c="green", label="cluster1")
plt.scatter(x[y == 1, 0], x[y == 1, 1], s=50, c="red", label="cluster2")
plt.scatter(x[y == 2, 0], x[y == 2, 1], s=50, c="yellow", label="cluster3")
plt.scatter(x[y == 3, 0], x[y == 3, 1], s=50, c="violet", label="cluster4")
plt.scatter(x[y == 4, 0], x[y == 4, 1], s=50, c="blue", label="cluster5")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=100, c="cyan", label="centroid")
plt.title("Customer Group")
plt.xlabel("Annual Income")
plt.ylabel("Spending")
plt.legend()
plt.show()
OUTPUT:
[[ 15 39]
[ 15 81]
[ 16 6]
[ 16 77]
[ 17 40]
[ 17 76]
[ 18 6]
[ 18 94]
[ 19 3]
[ 19 72]
[ 19 14]
[ 19 99]
[ 20 15]
[ 20 77]
[ 20 13]
[ 20 79]
[ 21 35]
[ 21 66]
[ 23 29]
[ 23 98]
[ 24 35]
[ 24 73]
[ 25 5]
[ 25 73]
[103 69]
[113 8]
[113 91]
[120 16]
[120 79]
[126 28]
[126 74]
[137 18]
[137 83]]
[269981.28]
[269981.28, 181363.59595959596]
1 181363.59595959596
106348.37306211118
73679.78903948834
44448.45544793371
37265.86520484347
30241.34361793659
25336.946861471864
21850.165282585633
19634.55462934998
[3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3 1 3
1313130310000000000000000000000000000
0000000000000000000000000000000000000
0000000000002420242420242424242024242
4242424242424242424242424242424242424
2 4 2 4 2 4 2 4 2 4 2 4 2 4 2]
LESSON - 10
HIERARCHICAL CLUSTERING
AIM:
To implement hierarchical (agglomerative) clustering on the given "Student Marks" dataset and
visualise the result as a dendrogram in Python.
INPUT DATAFRAME:
Marks StudentID
18 A1
22 A2
43 A3
42 A4
27 A5
25 A6
CODING:
# Importing Modules
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

seeds_df = pd.read_csv(r'd:\StudentInf.csv')
# Remove the student IDs from the DataFrame, save them for labelling later
varieties = list(seeds_df.pop('StudentID'))
print(varieties)
samples = seeds_df.values
"""
Perform hierarchical clustering on samples and plot the dendrogram,
using varieties as labels, with leaf_rotation=90
and leaf_font_size=6.
"""
# 'complete' linkage is assumed here; other linkage methods can be used
mergings = linkage(samples, method='complete')
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6)
plt.title('Cluster')
plt.xlabel('Students Id')
plt.ylabel('Marks')
plt.show()
OUTPUT:
Student Id List
['A1', 'A2', 'A3', 'A4', 'A5', 'A6']
(A dendrogram of the six students is displayed.)
We can choose a threshold value along the tallest vertical line of the dendrogram that best
divides it into clusters. In this graph, a threshold value of 12 forms two clusters: Cluster 1
contains A1 and A2, while Cluster 2 contains A3, A4, A5 and A6.
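If the flat clusters are wanted programmatically rather than by reading the dendrogram, SciPy's fcluster function can cut the tree at the chosen threshold. The sketch below assumes the mergings and varieties variables from the coding section and the threshold of 12 discussed above.

from scipy.cluster.hierarchy import fcluster

# cut the dendrogram at distance 12, giving flat cluster labels for each student
cluster_labels = fcluster(mergings, t=12, criterion='distance')
for student, label in zip(varieties, cluster_labels):
    print(student, label)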