ML Unit 2 Part 1

Unit 2: Modelling and Evaluation
Introduction
● The objective is to introduce the basic concepts of learning.
● In this regard, the information shared concerns the aspects of model selection and application.
● It also imparts knowledge on how to judge the effectiveness of a model in doing a specific learning task.
● The basic learning process, irrespective of the fact
that the learner is a human or a machine, can be
divided into three parts:

1. Data Input
2. Abstraction
3. Generalization
● The machine learner also starts from the same kind of input data.
● Abstraction is a significant step as it represents raw input data in a summarized and structured format, such that meaningful insight is obtained from the data.
● This structured representation of raw input data to the meaningful
pattern is called a model.
● The model might have different forms.
● It might be a mathematical equation, it might be a graph or tree
structure etc...
● The process of choosing a model, and fitting that specific model to a data set, is called model training.
● Generating actionable insight from broad-based knowledge is very difficult. This is where generalization comes into play.
● Generalization condenses the huge set of abstracted knowledge into a small and manageable set of key findings. Heuristic search is an example of how this is done.
SELECTING A MODEL
● In the machine learning paradigm, the potential influencing factors are called predictors, attributes, features, independent variables, or simply variables.
E.g.: the average income of the local population, weapon sales, the inflow of immigrants, etc. are input variables.
● The number of criminal incidents is an output variable (also called
response or dependent variable)
● Input variables can be denoted collectively by X, individual input variables by X1, X2, X3, …, Xn, and the output variable by the symbol Y.
● The relationship between X and Y is represented in the general form: Y = f(X) + e,
● where ‘f’ is the target function and ‘e’ is a random error term.
● The most important factors in selecting a model are:
(i) the kind of problem we want to solve using machine learning, and
(ii) the nature of the underlying data.
● Machine learning algorithms are broadly of two types:
1. models for supervised learning, which primarily focus on solving predictive problems, and
2. models for unsupervised learning, which solve descriptive problems.
Predictive models: Models for supervised learning, or predictive models, try to predict a certain value using the values in an input data set.

Below are some examples:


1. Predicting win/loss in a cricket match
2. Predicting whether a transaction is fraud
3. Predicting whether a customer may move to another product

The models which are used for prediction of target features of categorical value are known as classification models.
Predictive models may also be used to predict numerical values of the target feature based on the predictor features. Below are some examples:
1. Prediction of revenue growth in the succeeding year
2. Prediction of rainfall amount in the coming monsoon
3. Prediction of potential flu patients and demand for flu shots next winter
● The models which are used for prediction
of the numerical value of the target
feature of a data instance are known as
regression models.
● Linear Regression and Logistic Regression
models are popular regression models.
Descriptive models
● Models for unsupervised learning or descriptive models
are used to describe a data set or gain insight from a
data set
● Descriptive models which group together similar data instances, i.e. instances having similar values of the different features, into clusters are called clustering models.
● The most popular model for clustering is k-Means.
● Descriptive models related to pattern discovery are used for market basket analysis of transactional data.
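To illustrate the clustering idea, here is a minimal 1-D k-Means sketch in plain Python. The data points and starting centroids are made up for illustration; a real application would typically use a library implementation such as scikit-learn's KMeans.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Toy 1-D k-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, repeating
    until the centroids stop changing."""
    for _ in range(max_iter):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        new_centroids = [sum(m) / len(m) if m else c
                         for c, m in clusters.items()]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]       # two obvious groups
centres = sorted(k_means_1d(points, [0.0, 5.0]))
print(centres)                                  # cluster centres near 1.0 and 9.5
```

The same assign-then-update loop generalizes to higher dimensions by replacing the absolute difference with Euclidean distance.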
TRAINING A MODEL (FOR SUPERVISED
LEARNING)

● Holdout method
● K-fold Cross-validation method
● Bootstrap sampling
● Lazy vs. Eager learner
Holdout method

In the holdout method, a part of the labelled input data is held back for validating the trained model.
● The model is trained using the rest of the labelled input data.
● Since separate test data is usually not available, a part of the input data is held back (that is how the name holdout originates) for evaluation of the model.
● This held-back subset of the input data is used as the test data for evaluating the performance of the trained model.
● In general 70%–80% of the input data (which is obviously
labelled) is used for model training. The remaining 20%–30% is
used as test data for validation of the performance of the model.
● A different proportion of dividing the input data into training and
test data is also acceptable.
● Random numbers are used to assign data items to the partitions
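The holdout split described above can be sketched with the standard library alone; the 70/30 proportion, seed, and names here are illustrative:

```python
import random

def holdout_split(data, test_fraction=0.3, seed=42):
    """Randomly partition labelled data into training and test subsets."""
    shuffled = data[:]                      # copy, so the original order is kept
    random.Random(seed).shuffle(shuffled)   # random assignment to partitions
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]   # (training data, test data)

records = list(range(100))                  # stand-in for 100 labelled instances
train, test = holdout_split(records)
print(len(train), len(test))                # → 70 30
```

In practice, library helpers such as scikit-learn's train_test_split do the same job and additionally support stratified splits.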
Drawback
● A problem with this method is that the division of data of different classes into the training and test data may not be proportionate.
● This can be addressed by stratified random sampling, in which the whole data is first broken into several homogeneous groups (strata) by class, and instances are then sampled proportionately from each group.
K-fold Cross-validation method
• In k-fold cross-validation, the data set is divided into k completely distinct, non-overlapping random partitions called folds.
• Since multiple holdout samples are drawn, the training and test data are more likely to contain representative data from all classes and to resemble the original input data closely.
• The value of ‘k’ in k-fold cross-validation can be set to any number
• Two approaches which are extremely popular:
1. 10-fold cross-validation (10-fold CV)
2. Leave-one-out cross-validation (LOOCV)
● 10-fold cross-validation is the most popular approach.
● In this approach, the data is divided into 10 folds, each comprising approximately 10% of the data.
● One of the folds is used as the test data for validating the performance of the model trained on the remaining 9 folds (or 90% of the data).
● This is repeated 10 times, once with each of the 10 folds used as the test data and the remaining folds as the training data.
● The average performance across all folds is reported.
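The fold construction above can be sketched as follows (stdlib only; in practice the indices would be shuffled first, which is omitted here for clarity):

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k non-overlapping, near-equal folds."""
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        # distribute any remainder one extra index at a time
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

folds = k_fold_indices(50, k=10)
for test_fold in folds:
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # train on train_idx, evaluate on test_fold, then average the 10 scores
print(len(folds), len(folds[0]))            # → 10 5
```

Library routines such as scikit-learn's KFold wrap exactly this pattern, with shuffling and stratification options.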
Bootstrap sampling
● Bootstrapping is a popular way to identify training and test data sets from
the input data set.
● It uses the technique of Simple Random Sampling with Replacement (SRSWR).
● Bootstrapping randomly picks data instances from the input data set, with the possibility of the same data instance being picked multiple times.
● This technique is particularly useful for input data sets of small size, i.e. having a very small number of data instances.
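A minimal sketch of SRSWR with the standard library: random.choices samples with replacement, and the instances never drawn (the "out-of-bag" instances) can serve as test data. The data and seed are illustrative.

```python
import random

def bootstrap_sample(data, seed=0):
    """Draw a training set of the same size as `data`, with replacement."""
    rng = random.Random(seed)
    train = rng.choices(data, k=len(data))            # SRSWR: repeats allowed
    out_of_bag = [x for x in data if x not in train]  # never-drawn instances
    return train, out_of_bag

data = list(range(20))
train, oob = bootstrap_sample(data)
print(len(train), len(oob))     # training set is full-size; OOB set is what's left
```

On average roughly one third of the instances end up out-of-bag, which is why bootstrapping stretches a small data set further than a single holdout split.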
BOOTSTRAPPING
Difference Between Bootstrapping & Cross Validation
Lazy vs. Eager learner
● Two types of learners are found: 1) Lazy learner and 2) Eager learner.
● An eager learner follows the general principle of machine learning: it tries to construct a generalized, input-independent target function during the model training phase.
● Here, abstraction and generalization produce a trained model at the end of the training phase.
● Hence, when the test data comes in for classification, the eager learner is ready with the model and doesn’t need to refer back to the training data.
● Eager learners take more time in the learning phase than lazy learners.
● Lazy learning, on the other hand, completely skips the abstraction and generalization processes.
● A lazy learner doesn’t ‘learn’ anything; it uses the training data exactly as-is, and uses that knowledge to classify the unlabelled test data.
● It is also known as rote learning.
● Due to its heavy dependency on the given training data instances, it is also known as instance learning.
● Lazy learners are also called non-parametric learners. They take very little time in training.
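A 1-nearest-neighbour classifier is the classic lazy learner: "training" merely stores the data, and all distance computation is deferred to prediction time. The class name and toy data below are illustrative.

```python
class OneNN:
    """1-nearest-neighbour classifier over 1-D features (a lazy learner)."""

    def fit(self, X, y):
        # Lazy: no abstraction or generalization — just store the data.
        self.X, self.y = X, y
        return self

    def predict(self, x):
        # All the work happens at query time, against the raw training data.
        distances = [abs(x - xi) for xi in self.X]
        return self.y[distances.index(min(distances))]

model = OneNN().fit([1.0, 2.0, 8.0, 9.0], ["low", "low", "high", "high"])
print(model.predict(1.5), model.predict(8.5))   # → low high
```

Note the asymmetry: fit is instantaneous, while predict must scan every stored instance — the opposite trade-off to an eager learner.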
MODEL REPRESENTATION AND
INTERPRETABILITY
● A key consideration in learning the target function from the training data is the extent of generalization.
● The fitness of the target function approximated by a learning algorithm determines how correctly the model is able to classify unseen data.
● Underfitting: Underfitting happens when the model is too simple, or sufficient data is unavailable; it results in both poor performance on the training data and poor generalization to the test data.
● A typical case of underfitting may occur when trying to represent non-linear data with a linear model.
● Underfitting can be avoided by:
● 1. Using more training data
● 2. Increasing the model complexity, e.g. by adding more relevant features
● Overfitting refers to a situation where the model has been designed in such a way that it emulates the training data too closely.
● Any specific deviation in the training data, like noise or outliers, gets embedded in the model.
● Overfitting, in many cases, occurs as a result of trying to fit an excessively complex model to closely match the training data.
● It leads to poor generalization and hence poor performance on the test data set.
Overfitting can be avoided by:
● 1. using re-sampling techniques like k-fold cross-validation
● 2. holding back a validation data set
● 3. removing the nodes which have little or no predictive power for the given machine learning problem
Bias – variance trade-off
● The error in learning can be of two types:
● errors due to ‘bias’ and errors due to ‘variance’.
● Let’s try to understand each of them in detail.
Errors due to ‘bias’
● Errors due to bias arise from simplifying assumptions made by the model to make the target function less complex or easier to learn.
Errors due to ‘variance’
● Errors due to variance arise from differences in the training data sets used to train the model.
EVALUATING PERFORMANCE OF A MODEL
● Model performance is evaluated differently for three types of learning:
1) Supervised learning – classification
2) Supervised learning – regression
3) Unsupervised learning – clustering
Supervised learning - classification
● The major task in supervised classification is to predict the class of the target feature, e.g. WIN or LOSS for a cricket match.
● The target feature is normally predicted based on the values of other features, such as:
 whether the team won the toss,
 number of spinners in the team,
 number of wins the team had in the tournament, etc.
● To evaluate the performance of the model, the number of correct
classifications or predictions made by the model has to be recorded.
● Based on the number of correct and incorrect classifications or
predictions made by a model, the accuracy of the model is calculated.
● e.g. if in 99 out of 100 games what the model has predicted is same as
what the outcome has been, then the model accuracy is said to be 99%
There are four possibilities with regard to the cricket match win/loss prediction:

● 1. The model predicted win and the team won (TP): the model has correctly classified the data instance as the class of interest — a true positive.
● 2. The model predicted win and the team lost (FP): the model has incorrectly classified the data instance as the class of interest — a false positive.
● 3. The model predicted loss and the team won (FN): the model has incorrectly classified the data instance as not the class of interest — a false negative.
● 4. The model predicted loss and the team lost (TN): the model has correctly classified the data instance as not the class of interest — a true negative.
A confusion matrix represents the prediction summary in matrix form. It shows how many predictions are correct and incorrect per class.
Model Accuracy
● For any classification model, model accuracy is given by the proportion of correct classifications out of the total:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

● Confusion Matrix: A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs and TNs is known as a confusion matrix.

Exercise: for a given confusion matrix, calculate the model accuracy.
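As a worked sketch of the accuracy formula, with made-up counts for the cricket example:

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct classifications: (TP + TN) / total."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts: 85 wins and 10 losses predicted
# correctly, 3 false "win" calls, 2 missed wins.
print(accuracy(tp=85, tn=10, fp=3, fn=2))   # → 0.95
```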
Error Rate:
● The percentage of misclassifications is indicated using the error rate, which is measured as

Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − accuracy

● The Kappa value of a model indicates the model accuracy adjusted for the agreement expected by chance. It is calculated as

κ = (p_o − p_e) / (1 − p_e),

where p_o is the observed accuracy and p_e is the accuracy expected by chance.
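Assuming the standard Cohen's kappa formula κ = (p_o − p_e)/(1 − p_e), a small sketch with made-up confusion-matrix counts:

```python
def kappa(tp, tn, fp, fn):
    """Cohen's kappa: accuracy adjusted for chance agreement."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n                          # observed accuracy
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)    # chance agreement on "win"
    p_no = ((tn + fn) / n) * ((tn + fp) / n)     # chance agreement on "loss"
    p_e = p_yes + p_no                           # total chance agreement
    return (p_o - p_e) / (1 - p_e)

# Same hypothetical counts as in the accuracy example: 95% raw accuracy,
# but the class imbalance makes much of that agreement expected by chance.
print(round(kappa(tp=85, tn=10, fp=3, fn=2), 3))   # → 0.771
```

The drop from 0.95 to about 0.77 shows why kappa is reported alongside accuracy on imbalanced data.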
Sensitivity
The sensitivity of a model measures the proportion of TP examples, or positive cases, which were correctly classified. It is measured as

Sensitivity = TP / (TP + FN)

Specificity
Specificity measures the proportion of negative cases correctly classified, and is another good measure of whether a model is excessively conservative or excessively aggressive. It is measured as

Specificity = TN / (TN + FP)

There are two other performance measures of a supervised learning model which are similar to sensitivity
and specificity.
These are precision and recall
Precision
● Precision gives the proportion of positive predictions which are truly positive:

Precision = TP / (TP + FP)

● Precision indicates the reliability of a model in predicting the class of interest. When the model is related to win/loss prediction in cricket, precision indicates how often a predicted win is actually a win.
● Recall: Recall indicates the proportion of correct positive predictions to the total number of actual positives:

Recall = TP / (TP + FN)
F-measure
● F-measure is another measure of model performance which combines precision and recall. It is the harmonic mean of precision and recall, calculated as

F = (2 × precision × recall) / (precision + recall)
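The classification measures above can be computed directly from TP/TN/FP/FN counts; the counts in this sketch are made up:

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)            # same quantity as recall

def specificity(tn, fp):
    return tn / (tn + fp)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(p, r):
    return 2 * p * r / (p + r)       # harmonic mean of precision and recall

tp, tn, fp, fn = 85, 10, 3, 2        # hypothetical confusion-matrix counts
p, r = precision(tp, fp), sensitivity(tp, fn)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))   # → 0.966 0.977 0.971
```

Because F-measure is a harmonic mean, it stays low unless both precision and recall are high, which is why it is preferred over a simple average.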
Receiver operating characteristic (ROC) curves
● Visualization is an easier and more effective way to understand model performance.
● It also helps in comparing the efficiency of two models.
● The Receiver Operating Characteristic (ROC) curve helps in visualizing the performance of a classification model.
 It shows the efficiency of a model.
 In the ROC curve, the FP rate is plotted on the horizontal axis against the TP rate on the vertical axis, at different classification thresholds.
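Each point on the ROC curve is the (FP rate, TP rate) pair obtained at one classification threshold; sweeping the threshold traces the whole curve. A sketch with made-up scores and labels:

```python
def roc_point(scores, labels, threshold):
    """One ROC-curve point: (FP rate, TP rate) at a given threshold.
    scores are the model's predicted probabilities; labels are 1/0."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # hypothetical "win" probabilities
labels = [1, 1, 0, 1, 0, 0]               # 1 = actual win, 0 = actual loss
fpr, tpr = roc_point(scores, labels, threshold=0.5)
print(fpr, tpr)                           # FPR = 1/3, TPR = 2/3 at this threshold
```

Lowering the threshold moves the point up and to the right; a good model's curve bows toward the top-left corner.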
Supervised learning – regression
● A good regression model is one for which the difference between the predicted and actual values is small.
● Consider a simple real-estate value prediction model where a linear regression model is applied.
● For a certain value of x, the value of y is predicted as ŷ, whereas the actual value of y is Y (say); the distance between the actual value and the fitted or predicted value, i.e. Y − ŷ, is known as the residual.
● If the residual values are small, the regression model fits well.
● R-squared is also a good measure to evaluate model fitness.
● It is also known as the coefficient of determination.
● The R-squared value lies between 0 and 1 (0%–100%), with a larger value representing a better fit.
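R-squared can be computed as 1 − SS_res/SS_tot, where SS_res is the sum of squared residuals and SS_tot the total sum of squares about the mean. The actual/predicted values in this sketch are made up:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))   # residuals
    ss_tot = sum((y - mean_y) ** 2 for y in actual)                 # total variance
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]        # hypothetical property values
predicted = [2.8, 5.2, 7.1, 8.9]     # model predictions with small residuals
print(round(r_squared(actual, predicted), 3))   # → 0.995
```

An R-squared near 1 means the model explains almost all of the variance in y; predicting the mean for every instance would score 0.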
Unsupervised learning - clustering
● Clustering, by nature, is very subjective, and whether a cluster is good or bad is open to interpretation.
● It has been noted that ‘clustering is in the eye of the beholder’.
● This stems from the two inherent challenges which lie in the process of clustering:
1. It is generally not known how many clusters can be formulated from a particular data set.
2. Even if the number of clusters is given, the same number of clusters can be formed with different groups of data instances.
