ML Unit 2 Part 1
Modelling and Evaluation
Introduction
● The objective is to introduce the basic concepts of learning.
● In this regard, the information shared concerns model selection and application.
● It also imparts knowledge of how to judge the effectiveness of a model at a specific learning task.
● The basic learning process, irrespective of whether the learner is a human or a machine, can be divided into three parts:
1. Data Input
2. Abstraction
3. Generalization
● A machine, like a human learner, also starts from the same kind of input data.
● Abstraction is a significant step, as it represents the raw input data in a summarized and structured format, such that meaningful insight can be obtained from the data.
● This structured representation of the raw input data as a meaningful pattern is called a model.
● A model might take different forms: a mathematical equation, a graph or tree structure, etc.
● The process of choosing a model and fitting that model to a data set is called model training.
● Generating actionable insight from broad-based knowledge is very difficult; this is where generalization comes into play.
The models which are used for prediction of target features of categorical value
are known as classification models.
Predictive models may also be used to predict numerical values of the target
feature based on the predictor features. Below are some examples:
1. Prediction of revenue growth in the succeeding year
2. Prediction of rainfall amount in the coming monsoon
3. Prediction of potential flu patients and demand for flu shots next winter
● The models which are used for prediction of the numerical value of the target feature of a data instance are known as regression models.
● Linear regression and logistic regression are popular examples (though logistic regression, despite its name, models the probability of a categorical outcome and is most often used for classification).
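As a minimal sketch of training a regression model (assuming scikit-learn is available; the feature and target values below are invented purely for illustration):

```python
# Hypothetical example: predict revenue from years of operation.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]            # predictor feature (year)
y = [10.0, 19.5, 30.2, 39.8, 50.1]       # target feature (revenue, made up)

model = LinearRegression().fit(X, y)     # model training (fitting)
print(model.predict([[6]]))              # predicted revenue for year 6
```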
Descriptive models
● Models for unsupervised learning, or descriptive models, are used to describe a data set or to gain insight from it.
● Descriptive models which group together similar data instances, i.e. instances having similar values of the different features, into clusters are called clustering models.
● The most popular clustering model is k-Means (a minimal sketch follows below).
● Descriptive models related to pattern discovery are used for market basket analysis of transactional data.
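A minimal k-Means sketch, again assuming scikit-learn is available; the 2-D points are made up for illustration:

```python
from sklearn.cluster import KMeans

points = [[1, 1], [1.5, 2], [8, 8], [8.5, 9], [0.5, 1.5], [9, 8.5]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster assignment of each data instance
print(km.cluster_centers_)   # centroid of each cluster
```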
TRAINING A MODEL (FOR SUPERVISED
LEARNING)
● Holdout method
● K-fold Cross-validation method
● Bootstrap sampling
● Lazy vs. Eager learner
Holdout method
In this method, a part of the labelled input data is held back for validating the trained model; that is how the name 'holdout' originates.
● The model is trained using the remaining labelled input data.
● Holding back a part of the input data is necessary because separate test data is usually not available.
● This subset of the input data is used as the test data for
evaluating the performance of a trained model.
● In general 70%–80% of the input data (which is obviously
labelled) is used for model training. The remaining 20%–30% is
used as test data for validation of the performance of the model.
● A different proportion of dividing the input data into training and
test data is also acceptable.
● Random numbers are used to assign data items to the partitions
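A minimal holdout sketch, assuming scikit-learn is available; X and y are placeholder data:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]                 # 10 labelled data instances
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# random_state fixes the random numbers used to assign items to partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))             # 7 for training, 3 held out
```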
Drawback
● The problem with this method is that the division of data of the different classes into the training and test partitions may not be proportionate.
● This can be addressed by stratified random sampling: the whole data is first broken into several homogeneous groups (strata), one per class, and items are drawn from each stratum so that class proportions are preserved.
K-fold Cross-validation method
• In k-fold cross-validation, the data set is divided into k completely distinct, non-overlapping random partitions called folds.
• Because multiple holdouts are drawn, the training and test data are more likely to contain representative data from all classes and to resemble the original input data closely.
• The value of ‘k’ in k-fold cross-validation can be set to any number
• Two approaches which are extremely popular:
1. 10-fold cross-validation (10-fold CV)
2. Leave-one-out cross-validation (LOOCV)
● 10-fold cross-validation is the most popular approach.
● In this approach, the data is divided into 10 folds, each comprising approximately 10% of the data.
● One of the folds is used as the test data for validating the model trained on the remaining 9 folds (or 90% of the data).
● This is repeated 10 times, with each of the 10 folds used once as the test data and the remaining folds as the training data.
● The average performance across all folds is reported.
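A minimal 10-fold CV sketch, assuming scikit-learn is available and using its bundled iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the 10 folds serves once as test data; the rest train the model.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean())        # average performance across all folds
```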
Bootstrap sampling
● Bootstrapping is a popular way to identify training and test data sets from
the input data set.
● It uses the technique of Simple Random Sampling with Replacement (SRSWR).
● Bootstrapping randomly picks data instances from the input data set, with the possibility of the same data instance being picked multiple times.
● This technique is particularly useful for input data sets of small size, i.e. having very few data instances.
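A minimal bootstrap (SRSWR) sketch using only the Python standard library; the data set is a placeholder:

```python
import random

data = list(range(10))                    # a small input data set
random.seed(0)

# Sample n instances WITH replacement: an instance may be picked repeatedly
train = [random.choice(data) for _ in range(len(data))]
test = [d for d in data if d not in train]   # never-picked ("out-of-bag") items

print(train)   # training sample, possibly with repeats
print(test)    # remaining instances can serve as test data
```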
Difference Between Bootstrapping and Cross-Validation
● Bootstrapping samples with replacement, so the same data instance may appear several times in the training data; cross-validation partitions the data into non-overlapping folds, so each instance appears in the test data exactly once.
● Any number of bootstrap samples can be drawn from the input data, whereas cross-validation is limited to k train–test splits.
Lazy vs. Eager learner
● There are two types of learners: 1) the lazy learner and 2) the eager learner.
● An eager learner follows the general principle of machine learning: it tries to construct a generalized, input-independent target function during the model training phase.
● The eager learner thus follows the typical steps of learning: abstraction and generalization come up with a trained model at the end of the training phase.
● Hence, when the test data comes in for classification, the eager learner is ready with the model and doesn't need to refer back to the training data.
● Eager learners take more time in the learning phase than lazy learners.
● Lazy learning, on the other hand, completely skips the abstraction and generalization processes.
● A lazy learner doesn't really 'learn' anything.
● It uses the training data as-is and uses that knowledge to classify the unlabelled test data.
● It is also known as rote learning.
● Due to its heavy dependency on the given training data instances, it is also known as instance learning.
● Lazy learners are also called non-parametric learners; they take very little time in training (a minimal sketch follows below).
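As a minimal sketch of a lazy learner, here is a 1-nearest-neighbour classifier written from scratch for illustration; it stores the training data verbatim and only consults it when a query arrives:

```python
def predict_1nn(train_X, train_y, query):
    """Classify query by the label of its nearest stored training instance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], query))
    return train_y[nearest]

train_X = [(1, 1), (2, 2), (8, 8), (9, 9)]   # stored as-is: no training phase
train_y = ["A", "A", "B", "B"]
print(predict_1nn(train_X, train_y, (1.5, 1.2)))   # -> "A"
```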
MODEL REPRESENTATION AND
INTERPRETABILITY
● A key consideration in learning the target function from the training data is the extent of generalization.
● The fitness of the target function approximated by a learning algorithm determines how correctly the model is able to classify unseen data.
● Underfitting: underfitting happens when sufficient data is unavailable; it results in poor performance on the training data as well as poor generalization to the test data.
● A typical case of underfitting may occur when trying to represent non-linear data with a linear model.
● Underfitting can be avoided by:
1. Using more training data
2. Reducing features by effective feature selection
● Overfitting refers to a situation where the model has been designed in such a way that it emulates the training data too closely.
● Any specific deviation in the training data, like noise or outliers, gets embedded in the model.
● Overfitting, in many cases, occurs as a result of trying to fit an excessively complex model to closely match the training data.
● The result is poor generalization and hence poor performance on the test data set.
Overfitting can be avoided by:
● 1. Using re-sampling techniques like k-fold cross-validation
● 2. Holding back a validation data set
● 3. Removing the nodes which have little or no predictive power for the given machine learning problem
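A small sketch of under- and over-fitting, assuming NumPy is available: polynomials of increasing degree are fitted to noisy non-linear data, and training error alone is misleading because the high-degree fit chases the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy non-linear data

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                    # fit a polynomial model
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # degree 1 underfits (high training error); degree 9 starts to chase noise,
    # giving low training error but poor generalization to unseen data
    print(degree, round(float(train_error), 4))
```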
Bias – variance trade-off
● This error in learning can be of two types: errors due to 'bias' and errors due to 'variance'.
● Let's try to understand each of them in detail.
Errors due to 'bias'
● Errors due to bias arise from the simplifying assumptions made by the model to make the target function less complex or easier to learn.
Errors due to 'variance'
● Errors due to variance arise from differences between the training data sets used to train the model.
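A standard way to express this trade-off (not derived in these notes, but consistent with them) is the decomposition of the expected squared prediction error:

\[
E\big[(y - \hat{f}(x))^2\big] = \big(\mathrm{Bias}[\hat{f}(x)]\big)^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2
\]

Here the bias term captures the simplifying assumptions of the model, the variance term captures its sensitivity to the particular training set, and \(\sigma^2\) is the irreducible noise in the data.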
EVALUATING PERFORMANCE OF A MODEL
● Model performance is evaluated for three types of models: classification, regression, and clustering models.
● As a running example for classification, consider a model that predicts the result (win/loss) of a cricket match.
● Normally, the model predicts the target feature (match result) based on the values of other features, like:
whether the team won the toss,
number of spinners in the team,
number of wins the team had in the tournament, etc.
● To evaluate the performance of the model, the number of correct
classifications or predictions made by the model has to be recorded.
● Based on the number of correct and incorrect classifications or
predictions made by a model, the accuracy of the model is calculated.
● e.g. if in 99 out of 100 games the model's prediction matches the actual outcome, then the model's accuracy is said to be 99%.
There are four possibilities with regard to the cricket match win/loss prediction:
1. The model predicted win and the team won: True Positive (TP)
2. The model predicted win but the team lost: False Positive (FP)
3. The model predicted loss but the team won: False Negative (FN)
4. The model predicted loss and the team lost: True Negative (TN)
● Confusion Matrix: A matrix containing the correct and incorrect predictions in the form of TPs, FPs, FNs and TNs is known as a confusion matrix.
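In terms of the confusion matrix entries, accuracy (and its complement, the error rate) is computed as:

\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad \text{Error rate} = 1 - \text{Accuracy}
\]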
Specificity
Specificity is another good measure to indicate whether a model strikes a good balance between being excessively conservative and excessively aggressive.
There are two other performance measures of a supervised learning model which are similar to sensitivity and specificity: precision and recall.
Precision
● Precision gives the proportion of positive predictions which are truly positive.
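The standard formulas for these measures, in terms of the confusion matrix entries, are:

\[
\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}
\]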
● For a certain value of x, the model predicts the value of y as ŷ (y-hat), whereas the actual value of y is y.
● The distance between the actual value and the fitted or predicted value, i.e. y − ŷ, is known as the residual.
● The smaller the residuals, the better the regression model fits the data.
● R-squared is also a good measure to evaluate model fitness; it is known as the coefficient of determination.
● The R-squared value lies between 0 and 1 (0%–100%), with a larger value representing a better fit.
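In symbols, the residual of the i-th data instance and the resulting R-squared are:

\[
e_i = y_i - \hat{y}_i, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]

where \(\bar{y}\) is the mean of the actual target values.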
Unsupervised learning - clustering
● Clustering, by nature, is very subjective, and whether a clustering is good or bad is open to interpretation.
● It has been noted that 'clustering is in the eye of the beholder'.
● This stems from two inherent challenges in the process of clustering: