
ML-UNIT IV

SCHOOL OF COMPUTER ENGINEERING & TECHNOLOGY

Syllabus
Performance Analysis and Model Evaluation
Model Evaluation and selection, bias, variance,
ensemble classifiers, Bagging and Boosting,
Training vs. Testing samples, Positive vs. Negative class,
Confusion Matrix for Model Evaluation, Model Selection,
Implementation and evaluation using scikit-learn library,
Improving classification accuracy of Class Imbalanced Data

Model Evaluation and selection
• To properly evaluate your machine learning models and select the best one,
you need a good validation strategy and good evaluation metrics.
• A validation (evaluation) strategy is essentially how you split your data to
estimate future test performance. It can be as simple as a train-test split or
as involved as a stratified k-fold strategy.
What is model evaluation?
• Model evaluation is the process of assessing a model's performance on a
chosen evaluation setup.
• It is done by calculating quantitative performance metrics such as F1 score or
RMSE, or by having subject matter experts assess the results qualitatively.

Model Evaluation and selection
How to evaluate machine learning models and select the best one?
• Step 1: Choose a proper validation strategy.
• Step 2: Choose the right evaluation metrics. Calculate multiple metrics and base your
decisions on them. Sometimes you need to combine classic ML metrics with subject
matter expert evaluation, and that is fine.
• Step 3: Keep track of your experiment results. Whether you use a spreadsheet or a
dedicated experiment tracker, make sure to log all the important metrics, learning curves,
dataset versions, and configurations.
• Step 4: Compare experiments and pick a winner. Regardless of the metrics and validation
strategy you choose, at the end of the day you want to find the best model. No model is
ever truly the best, but some are good enough.

Model selection in machine learning
Resampling methods
• Resampling methods rearrange data samples to check whether the model performs well on data samples
it has not been trained on.
• In other words, resampling helps us understand whether the model will generalize well.
Random Split
• Random splits are used to randomly sample a percentage of data into training, testing, and, ideally,
validation sets (a short scikit-learn sketch is shown below).
Training set: A set of examples used for learning, that is, to fit the parameters of the classifier.
Validation set: A set of examples used to tune the hyperparameters of a classifier, for example to choose the
number of hidden units in a neural network. The validation dataset may also play a role in other forms of
model preparation, such as feature selection.
Test set: A set of examples used only to assess the performance of a fully specified classifier.
• The advantage of this method is that there is a good chance that the original population is well represented in
all three sets.
• In more formal terms, random splitting prevents biased sampling of the data.
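
A minimal sketch of such a random split using scikit-learn's train_test_split; the toy dataset and the 60/20/20 proportions are illustrative assumptions, not part of the original slides.

# Random split into train/validation/test sets with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve out the test set (20%), then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20% of the data
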
Training/Validation/Test Data in machine learning
• Training data. This type of data builds up the machine learning algorithm. The data
scientist feeds the algorithm input data, which corresponds to an expected output. The
model evaluates the data repeatedly to learn more about the data’s behavior and then
adjusts itself to serve its intended purpose.
• Validation data. During training, validation data infuses new data into the model that it
hasn’t evaluated before. Validation data provides the first test against unseen data,
allowing data scientists to evaluate how well the model makes predictions based on the
new data. Not all data scientists use validation data, but it can provide some helpful
information to optimize hyperparameters, which influence how the model assesses data.
• Test data. After the model is built, testing data once again validates that it can make
accurate predictions. If training and validation data include labels to monitor performance
metrics of the model, the testing data should be unlabeled. Test data provides a final,
real-world check of an unseen dataset to confirm that the ML algorithm was trained
effectively.

Model selection in machine learning
Time-Based Split
• There are some types of data where random splits are not possible.
• For example, if we have to train a model for weather forecasting, we cannot randomly divide
the data into training and testing sets.
• Doing so would jumble up the seasonal pattern. Such data is often referred to as time series data.
• In such cases, a time-wise split is used. For example, the training set can contain data for the last three years
plus ten months of the present year, while the last two months are reserved for the testing or
validation set.

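A minimal sketch of a time-based split with scikit-learn's TimeSeriesSplit, which keeps chronological order: each fold trains on the past and validates on the following block of samples. The synthetic 24-month data is an illustrative assumption.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # e.g. 24 consecutive months of a feature
y = np.arange(24)                  # target observed over the same period

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede the test indices in time.
    print("train:", train_idx[0], "-", train_idx[-1], "| test:", test_idx[0], "-", test_idx[-1])
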
Model selection in machine learning
K-Fold Cross-Validation
• The cross-validation technique works by randomly shuffling the dataset and then splitting it into
k groups (folds).
• Then, iterating over the folds, each fold in turn is treated as the test set while
all the other folds are combined into the training set.
• The model is evaluated on the held-out fold, and the process is repeated for all k folds.
• By the end of the process, we have k results from k different test folds. These are typically
averaged to estimate performance, and the best model (or configuration) is the one with the best
average score.

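A minimal sketch of k-fold cross-validation with scikit-learn; the classifier, the toy dataset, and k=5 are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy score per fold; the mean is the cross-validated estimate.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores, scores.mean())
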
Model selection in machine learning
Stratified K-Fold
• Similar to k-fold cross-validation, with one point of difference: unlike plain k-fold
cross-validation, stratified k-fold takes the values of the target variable into consideration.
• For example, if the target variable is categorical with 2 classes, stratified k-fold ensures that
each fold preserves approximately the same ratio of the two classes as the full dataset.
• This makes the model evaluation more accurate and the model training less biased.

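A minimal sketch of stratified k-fold: each fold keeps roughly the same class ratio as the full dataset. The toy imbalanced labels are an illustrative assumption.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)        # 80/20 class imbalance

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold contains ~4 samples of class 0 and 1 sample of class 1.
    print("test fold class counts:", np.bincount(y[test_idx]))
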
Model selection in machine learning
Bootstrap: The bootstrap is one of the most powerful ways to obtain a stabilized model.
• It is close to the random splitting technique since it follows the concept of random sampling.
• The first step is to select a sample size (usually equal to the size of the original
dataset).
• Then a data point is randomly selected from the original dataset and added
to the bootstrap sample.
• After being added, the data point is placed back into the original dataset, so it can be drawn again.
This process is repeated N times, where N is the sample size.
• The bootstrap is therefore a resampling technique that creates the bootstrap sample by sampling data
points from the original dataset with replacement. This means that the bootstrap sample can
contain multiple copies of the same data point.
• The model is trained on the bootstrap sample and then evaluated on all the data points that did
not make it into the bootstrap sample. These are called the out-of-bag samples.
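
A minimal sketch of building one bootstrap sample and evaluating on its out-of-bag points, here done with NumPy indexing; the decision tree model and the toy dataset are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

n = len(X)
boot_idx = rng.integers(0, n, size=n)            # draw n indices with replacement
oob_mask = ~np.isin(np.arange(n), boot_idx)      # points never drawn = out-of-bag

model = DecisionTreeClassifier(random_state=42).fit(X[boot_idx], y[boot_idx])
print("OOB accuracy:", accuracy_score(y[oob_mask], model.predict(X[oob_mask])))
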
How to evaluate ML models
Classification metrics: confusion matrix
A confusion matrix is a table that is often used to evaluate the performance of a classification model.
It displays the number of true positives, false positives, true negatives, and false negatives in a tabular
format. For a binary classification problem it has the following layout:

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

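A minimal sketch of computing a confusion matrix with scikit-learn; the hand-written label vectors are illustrative assumptions.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
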
How to evaluate ML models
Confusion matrix example: consider a pregnancy test, where an actually pregnant woman and a fat man
each consult a doctor. The test results fall into the four cases described below.

How to evaluate ML models
TP(True Positive): The woman is pregnant, and she is predicted as pregnant. Here P
represents positive prediction, and T shows that our prediction is actually true.
FP(False Positive): A fat man is predicted as pregnant, which is actually false. Here
P represents positive prediction, and F shows that our prediction is actually false.
This is also called a Type I error.
FN(False Negative): A woman who is actually pregnant is predicted as not pregnant.
Here N represents negative prediction, and F shows that our prediction is actually
false. This is also called a Type II error.
TN(True Negative): A fat man is predicted as not pregnant. Here N represents
Negative prediction, and T shows that our prediction is actually true.
Accuracy
Accuracy is the simplest metric and can be defined as the number of test cases correctly classified divided by the total number
of test cases:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

• It can be applied to most generic problems but is not very useful when it comes to unbalanced datasets.
• For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if
accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud, and that 99% accurate
model will be completely useless.
• If a model is so poorly trained that it predicts all of, say, 1000 data points as non-fraud, it will miss all 10
fraud data points. If accuracy is measured, it will show that the model correctly predicts 990 data points and thus
has an accuracy of (990/1000)*100 = 99%!
• This is why accuracy can be a false indicator of the model's health. For such a case, a metric is required that focuses
on the ten fraud data points which were completely missed by the model.
Precision
Precision is the proportion of true positive predictions among all positive predictions made by the classifier.
Precision = TP / (TP + FP)
Precision is the metric used to gauge the correctness of positive classifications.
• The higher the precision, the better the model's ability to correctly classify the positive class.
• In the problem of predictive maintenance (where one must predict in advance when a machine needs to be
repaired), precision comes into play.
• The cost of maintenance is usually high and thus, incorrect predictions can lead to a loss for the company.
• In such cases, the ability of the model to correctly classify the positive class and to lower the number of false
positives is paramount!
• The ideal value of precision is 1.0, which means that all of the positive predictions made by the model are
correct.
Recall & F1 score
Recall (also known as sensitivity or true positive rate, TPR) is the proportion of true positive predictions among all positive
instances in the dataset. It is calculated as: Recall = TP / (TP + FN)
True Positive Rate (TPR) = true positives / actual positives in the dataset.
In a fraud detection problem, recall is very useful because a high recall value indicates that a large share of the fraud
cases were identified out of the total number of frauds.
• The ideal value of recall is also 1.0, which means that the model correctly identifies all positive cases in the dataset.
F1 score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall and
balances out the strengths of each:
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
• It is useful in cases where both recall and precision are valuable, such as the identification of plane parts that might require
repair. Here, precision is required to save on the company's cost (because plane parts are extremely expensive) and
recall is required to ensure that the machinery is stable and not a threat to human lives.

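A minimal sketch of accuracy, precision, recall and F1 with scikit-learn; the label vectors mimic an imbalanced fraud-style problem and are illustrative assumptions.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0]*95 + [1]*5            # 5% positive (e.g. fraud) class
y_pred = [0]*95 + [1, 0, 0, 1, 0]  # model catches only 2 of the 5 positives

print("accuracy :", accuracy_score(y_true, y_pred))   # high despite the missed frauds
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))     # low: most frauds are missed
print("f1 score :", f1_score(y_true, y_pred))
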
TPR, FPR, TNR, FNR
Four rates can be derived from the confusion matrix: TPR, FPR, TNR, and FNR.
• True Positive Rate (TPR): true positives out of all actual positives in the dataset (this is recall).
True Positive Rate (TPR) = TP / (TP + FN)
• False Positive Rate (FPR): false positives out of all actual negatives in the dataset.
False Positive Rate (FPR) = FP / (FP + TN)
• False Negative Rate (FNR): false negatives out of all actual positives in the dataset.
False Negative Rate (FNR) = FN / (FN + TP)
• True Negative Rate (TNR): true negatives out of all actual negatives in the dataset.
True Negative Rate (TNR) = TN / (TN + FP)

For good performance, TPR and TNR should be high while FNR and FPR should be low.

AUC-ROC
• The ROC curve is a plot of the true positive rate (recall) against the false positive rate.
• AUC-ROC stands for Area Under the Receiver Operating Characteristic curve.
• The higher the area, the better the model performance. It is used only for binary classification.
If the curve lies near the 50% diagonal line, it suggests that the model predicts the output variable at random.
True Positive Rate (TPR) = TP / (TP + FN)        False Positive Rate (FPR) = FP / (TN + FP)

AUC-ROC
• The AUC-ROC score is the area under the ROC curve and ranges from 0 to 1, with a score of 0.5 indicating a random classifier.
• A score of 1.0 indicates a perfect classifier.
• A higher AUC-ROC score indicates better performance of the model in distinguishing between positive and negative classes.

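A minimal sketch of the ROC curve and AUC with scikit-learn. Note that roc_auc_score expects predicted scores or probabilities, not hard class labels. The dataset and the logistic regression model are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points along the ROC curve
print("AUC-ROC:", roc_auc_score(y_te, scores))
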
Regression metrics
Mean Squared Error (MSE)
• MSE is a simple metric that calculates the difference between each actual value and the corresponding predicted value (the error),
squares it, and then takes the mean of all these squared errors:

MSE = (1/n) * Σ (yᵢ − ŷᵢ)²

• MSE is very sensitive to outliers and will show a very high error value even if only a few outliers are present in otherwise
well-fitted model predictions.

Root Mean Squared Error (RMSE)

RMSE = √MSE

• RMSE is the square root of MSE and is useful because it brings the scale of the errors back to the scale of the actual
values, making it more interpretable.

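A minimal sketch of MSE and RMSE with scikit-learn and NumPy; the small value lists are illustrative assumptions.

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # RMSE is simply the square root of MSE
print("MSE :", mse)
print("RMSE:", rmse)
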
Regression metrics
R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable (y) that can be
explained by the independent variable(s) (x) in a linear regression model. It is a measure of how well the model fits the data.
• In other words, R-squared measures how well the linear regression model fits the data by comparing the variance of the
predicted values to the variance of the actual values.
• The R² value ranges from 0 to 1, with a value of 1 indicating that the model perfectly fits the data and a value of 0 indicating
that the model does not explain any of the variance in the data.
• An R² value of 0.5 indicates that 50% of the variability in the dependent variable is explained by the independent variable(s).
Limitation
• However, R-squared has some limitations. It assumes that the relationship between the independent and dependent
variables is linear, and it can be influenced by outliers in the data.
• Therefore, it should be used in conjunction with other metrics, such as MSE, to fully evaluate the performance of a regression
model.

Regression metrics
R-squared is calculated as follows:
R² = 1 - (SSres / SStot)

where SSres is the sum of squared residuals (the squared differences between the actual y values and the predicted y values, i.e. the errors),
and SStot is the total sum of squares (the squared differences between the actual y values and the mean of y).

Regression metrics
• Adjusted R-squared is a modified version of R-squared that takes into account the number of independent variables in a
linear regression model.
• While R-squared measures the proportion of variance in the dependent variable (y) that is explained by the independent
variables (x), it can be biased by the number of independent variables included in the model.
• As the number of independent variables increases, R-squared will generally increase, even if the additional variables do
not improve the model significantly.
Adjusted R-Square
• Adjusted R-squared solves this problem by penalizing the inclusion of unnecessary variables in the model.
• The adjusted R-squared formula is:
Adjusted R² = 1 - [(1 - R²)(n - 1) / (n - k - 1)]
• where n is the number of observations and k is the number of independent variables in the model.

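A minimal sketch of R-squared with scikit-learn, plus adjusted R-squared computed from the formula above; the synthetic regression data is an illustrative assumption.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = LinearRegression().fit(X, y)

r2 = r2_score(y, model.predict(X))
n, k = X.shape                      # n observations, k independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print("R^2:", r2, "Adjusted R^2:", adj_r2)
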
Bias and variance
Errors in Machine Learning
• Irreducible errors are errors which will always be present in a machine learning model, because of unknown variables, and
whose values cannot be reduced.
• Reducible errors are those errors whose values can be further reduced to improve a model. They are caused because our
model’s output function does not match the desired output function and can be optimized.

Bias and variance
What is Bias?
• Bias refers to the error that is introduced by approximating a real-world problem with a simplified model.
• A model with high bias is typically too simple and makes assumptions that do not hold true for the real-world problem.
• This can lead to underfitting, where the model is unable to capture the underlying patterns in the data and performs
poorly on both the training and testing data.
Bias is the difference between our actual and predicted values; it stems from the simplifying assumptions that our model makes
about the data in order to predict new data.

Bias and variance
• When bias is high, the assumptions made by our model are too basic and the model cannot capture the important features of our
data.
• This means that our model has not captured the patterns in the training data and hence cannot perform well on the testing data either.
• If this is the case, our model cannot perform well on new data.
• This situation, where the model cannot find patterns in our training set and hence fails on both seen and unseen data, is
called underfitting.
In a typical underfitting example, the model finds no patterns in the data and the line of
best fit is a straight line that does not pass through the data points. The model has failed to train properly on the data given
and cannot predict new data either.

Bias and variance
What is Variance?
• Variance, on the other hand, refers to the amount by which the predictions of a model would change if it were trained on a
different set of data.
• A model with high variance is able to capture the noise or random fluctuations in the training data and tends to overfit.
• This can result in good performance on the training data but poor performance on new, unseen data.

Bias and variance
What is Variance?

• Consider a model trained to recognize cats: it has learned the training data extremely well and identifies cats in it accurately.
• But when given new data, such as the picture of a fox, the model predicts it as a cat, because that is what it has learned.
• This happens when the variance is high: the model captures all the features of the data given to it, including the
noise, tunes itself to that data and predicts it very well, but cannot predict new data because it is too specific
to the training data.

Bias and variance
• Hence, the model performs really well on the training data and achieves high accuracy there, but fails to perform on new, unseen data.
• New data may not have exactly the same features, and the model cannot predict it very well. This is called overfitting.

Bias-Variance Tradeoff
• In general, the goal of model training is to achieve a balance between bias and variance.
• This is often referred to as the bias-variance tradeoff.

• When bias is high, the error on both the testing and the training set is also high.
• When variance is high, the model performs well on the training set (the training error is low) but gives
a high error on the testing set.
• There is a region in the middle where the error on both the training and testing sets is low and bias
and variance are in balance.
Bias-Variance Tradeoff

• A bull's eye graph is often used to explain the bias and variance tradeoff.
• The best fit is when the predictions are concentrated in the center, i.e. at the bull's eye. As we move farther and farther away
from the center, the error of our model increases.
• The best model is one where bias and variance are both low.

Bias-Variance Tradeoff
• A model with high bias and low variance may be too simple, fail to capture the complexity
of the underlying data, and underfit.
• A model with low bias and high variance may overfit the training data and fail to
generalize to new data.
• To improve the performance of a model, techniques such as regularization, cross-validation,
and ensemble methods can be used to reduce bias and variance and find the optimal balance
between the two.

Bias-Variance Tradeoff
• In machine learning, bias & variance are two sources of errors that can affect the performance of a
model.
• Bias refers to the error introduced by the model's assumptions about the data, while variance refers to the
error introduced by the model's sensitivity to fluctuations in the data.
To minimize bias and variance in a machine learning model, the following approaches can be taken:
• Increase the complexity of the model: If the model is too simple, it may have high bias and low
variance. Increasing the complexity of the model by adding more features or layers can help reduce bias.
• Regularization: Regularization techniques such as L1, L2, or dropout can be used to reduce the
complexity of the model and prevent overfitting.
• Feature selection: Feature selection techniques can be used to select the most relevant features for the
model, which can help reduce the noise in the data and improve the accuracy of the model.
Bias-Variance Tradeoff
• Cross-validation: Cross-validation can be used to evaluate the performance of the model and
identify the optimal hyperparameters.
• Ensemble methods: Ensemble methods such as bagging, boosting, or stacking can be used
to combine multiple models and reduce the variance of the model.
• Data augmentation: Data augmentation techniques can be used to increase the size &
diversity of the training data, which can help reduce overfitting and improve the accuracy of
the model.
Overall, minimizing bias and variance in a machine learning model requires a balance between
simplicity and complexity, as well as careful selection of features and regularization
techniques to prevent overfitting. A validation-curve sketch illustrating this tradeoff follows.
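
A minimal sketch using scikit-learn's validation_curve to observe bias vs. variance as model complexity grows: at low depth both scores are low (high bias), while at large depth the training score stays high but the validation score drops (high variance). The dataset, the decision tree model, and the depth grid are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = [1, 2, 3, 5, 8, 12, 16]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
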
Ensemble Method
[Diagram] Ensemble learning: multiple machine learning models (base learners L1, L2, L3, L4, ...) are trained on the
training data and combined into a single model L* to solve a particular problem.

Ensemble Method
▪ “A group of items viewed as a whole rather than individually.”
▪ An ensemble method does not depend on just one model or algorithm for its output;
rather, it considers many models at the same time.
▪ L1, L2, L3, L4, ..., Ln: base learners built with different classification algorithms such as
decision tree, KNN, naive Bayes, linear regression, logistic regression, etc.
▪ This situation is called a heterogeneous ensemble (heterogeneous learners).
▪ If all learners use the same algorithm, you can instead assign a different
dataset to each learner to obtain heterogeneous behavior.
▪ L*: the final strong learner or model.
▪ L*: the resulting classifier combines the output of all learners and has more predictive power, so it is
called a strong classifier.
Ensemble Method
[Diagram] Bagging workflow: from the original training data D, randomly chosen samples D1, D2, D3, ..., Dn are drawn;
a classifier (possibly with a different algorithm) C1, C2, C3, ..., Cn is trained on each sample; the classifiers are then
combined into the final ensemble classifier C*.

Bootstrap aggregation (Bagging)
▪ D: the original training data (contains many samples/records).
▪ D1: a new dataset made of records randomly chosen from D with replacement (any
record may appear multiple times in D1, D2, D3, etc., i.e. in multiple copies, or may appear in
only one of the datasets and in none of the others).
▪ These are also called bootstrap sample datasets; they are used to train the models C1, ..., Cn.
▪ C1: a classifier (model, weak learner), possibly built with a different algorithm.
▪ C*: the final ensemble classifier or model. It acts as a strong classifier because its accuracy and
predictive power are higher and its error rate is lower, so its results are more accurate and
precise.

Bagging
▪ Reduces overfitting (variance)

▪ Normally uses one type of classifier

▪ Decision trees are popular

▪ Easy to parallelize

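A minimal sketch of bagging with scikit-learn's BaggingClassifier, which uses decision trees as base learners by default; the dataset and number of estimators are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 50 trees, each trained on a bootstrap sample of the training data.
bag = BaggingClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
print("bagging accuracy:", accuracy_score(y_te, bag.predict(X_te)))
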
Bagging (Random Forest)
Random Forest Algorithm
• An ensemble method specifically designed for decision tree classifiers.
• A Random Forest grows many trees: an ensemble of unpruned decision trees.
• Each base classifier classifies a “new” vector of attributes built from the original
data.
• The final result when classifying a new instance is obtained by voting: the
forest chooses the classification result having the most votes (over all the
trees in the forest).

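A minimal sketch of a Random Forest with scikit-learn; the dataset and the number of trees are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy; each tree votes and the majority wins.
print("random forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
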
Boosting
• Boosting is an ensemble learning method that combines a set of weak learners into a strong
learner to minimize training errors.
• In boosting, a random sample of data is selected, fitted with a model, and the models are then trained sequentially.
• That is, each model tries to compensate for the weaknesses of its predecessor.
• With each iteration, the weak rules from each individual classifier are combined to form one strong prediction rule.

Boosting
• In boosting, the weak learners are typically decision trees, but they can also be
other types of classifiers, such as linear models or neural networks.
• During the training process, the weak learners are assigned weights based on
their performance.
• When predicting the class label of a new instance, the outputs of all the weak
learners are combined in a weighted sum, with the weights determined by their
performance during training.
• The final output is derived from the weighted sum; in AdaBoost-style boosting, for example, its sign gives the
predicted class and its magnitude reflects how confidently the instance is assigned to that class.

Boosting
Here's how the algorithm works:
• Step 1: The base algorithm reads the data and assigns equal weight to each
sample observation.

• Step 2: False predictions made by the base learner are identified. In the next
iteration, these samples are passed to the next base learner with a
higher weight placed on the incorrectly predicted samples.

• Step 3: Repeat Step 2 until the algorithm can correctly classify the output.

Boosting
Some popular boosting algorithms include
1. AdaBoost (Adaptive Boosting),
2. Gradient Boosting,
3. XGBoost.
• These algorithms differ in the way the weights are updated during training and
the loss function used to measure performance.
• Boosting has been shown to be highly effective in many classification and
regression tasks, and is widely used in practice.

AdaBoost (Adaptive Boosting)
• AdaBoost, short for Adaptive Boosting, is a machine learning algorithm used for classification
and regression tasks.
• It is an ensemble method that combines several "weak" learners (i.e., classifiers that perform
only slightly better than random guessing) into a "strong" ensemble classifier.

• The basic idea of AdaBoost is to iteratively train weak classifiers on the training data
while adjusting the weights of the training examples to emphasize the examples that were
misclassified by the previous weak classifier.
• The final strong classifier is a weighted sum of the weak classifiers, where each weak
classifier is assigned a weight proportional to its performance.

AdaBoost (Adaptive Boosting)
Here are the steps involved in AdaBoost:
1. Initialize the weights of the training examples to be equal.
2. Train a weak classifier on the training data. A weak classifier is a classifier that performs only slightly
better than random guessing.
3. Evaluate the performance of the weak classifier on the training data using a weighted error rate. The
weighted error rate is the sum of the weights of the misclassified examples divided by the sum of all
weights.
4. Update the weights of the training examples based on the performance of the weak classifier: increase
the weights of the misclassified examples and decrease the weights of the correctly classified examples.
5. Repeat steps 2-4 for a fixed number of iterations or until the training error reaches a certain threshold.
6. Combine the weak classifiers into a final strong classifier using a weighted sum of the weak classifiers,
where each weak classifier is assigned a weight proportional to its performance.

AdaBoost (Adaptive Boosting)
The most important parameters are
base_estimator, n_estimators, and learning_rate.

• base_estimator: the weak learner used to train the model.
AdaBoost uses DecisionTreeClassifier as the default weak learner for training.
You can also specify other machine learning algorithms.

• n_estimators: the number of weak learners to train iteratively.

• learning_rate: contributes to the weights of the weak learners. It uses 1 as the default value.

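A minimal sketch of AdaBoost with scikit-learn, using the default shallow decision tree as the weak learner (note that recent scikit-learn versions name the weak-learner parameter estimator rather than base_estimator). The dataset and hyperparameter values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_tr, y_tr)
print("AdaBoost accuracy:", accuracy_score(y_te, ada.predict(X_te)))
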
AdaBoost (Adaptive Boosting)
Pros
• AdaBoost is easy to implement.
• It iteratively corrects the mistakes of the weak classifier and improves accuracy by combining weak
learners.
• You can use many base classifiers with AdaBoost. AdaBoost is not prone to overfitting.

Cons
• AdaBoost is sensitive to noisy data.
• It is highly affected by outliers because it tries to fit each point perfectly.
• AdaBoost is slower compared to XGBoost.

Gradient Boosting
• Gradient Boosting is an ensemble learning technique that combines multiple
weak learners (usually decision trees) to build a strong model.

• The idea behind gradient boosting is to iteratively add new trees to the model,
each tree correcting the errors made by the previous ones.

• During each iteration, the algorithm fits a new tree to the residual errors of the
previous model.
• Gradient Boosting is particularly useful when dealing with complex datasets and
non-linear relationships between variables.

Gradient Boosting
Here are some of the most important parameters:
• n_estimators: This parameter controls the number of trees in the ensemble.
• learning_rate: This parameter controls the contribution of each tree to the final prediction.
• max_depth: This parameter controls the maximum depth of each tree in the ensemble.
Increasing this parameter can improve the model's performance, but may also increase the risk
of overfitting.
• loss: This parameter controls the loss function used to measure the quality of each split in the
trees.

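A minimal sketch of gradient boosting with scikit-learn, using the parameters listed above; the dataset and parameter values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=42).fit(X_tr, y_tr)
print("gradient boosting accuracy:", accuracy_score(y_te, gbm.predict(X_te)))
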
XGBoost (Extreme Gradient Boosting)
• XGBoost (Extreme Gradient Boosting) is a decision tree-based ensemble machine learning
algorithm that uses gradient boosting to iteratively improve the performance of a model.

• XGBoost has several advantages over traditional gradient boosting algorithms.

• It includes a number of regularization techniques to prevent overfitting, making it more accurate
and less prone to overfitting on training data.

• It also includes a more efficient implementation of gradient boosting, making it faster and more
scalable.

XGBoost (Extreme Gradient Boosting)
Here's how it works:
• Initialization: The algorithm starts with a single decision tree, which serves as the initial model. The
output of this tree is used to make predictions on the training data.
• Calculation of Residuals: The difference between the predicted values and actual values for the
training data is calculated, which is called the residuals. These residuals are used to train the next
decision tree in the ensemble.
• Tree Construction: A new decision tree is constructed to predict the residuals calculated in the
previous step. This tree is fitted on the residuals instead of the actual target variable.
The tree is constructed in a greedy manner by recursively splitting the data into smaller subsets based on
the feature that provides the most information gain.
• Update Model: The output of the new decision tree is combined with the output of the previous tree
to produce a new set of predictions. This updated model is used to calculate the residuals for the next
iteration.
• Iteration: Steps 3 and 4 are repeated iteratively until the residuals can no longer be reduced or a
pre-defined maximum number of trees is reached. The final model is the sum of all the individual tree
models.
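
A minimal sketch using the separate xgboost package (assumed installed, e.g. via pip install xgboost) and its scikit-learn-style XGBClassifier wrapper; the dataset and hyperparameter values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                    reg_lambda=1.0)          # L2 regularization term
xgb.fit(X_tr, y_tr)
print("XGBoost accuracy:", accuracy_score(y_te, xgb.predict(X_te)))
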
Comparison of Boosting Algorithms
[Comparison table of AdaBoost, Gradient Boosting, and XGBoost]

Stacking
• Stacking is a popular ensemble learning technique used in machine learning to improve the
predictive performance of models.
• The idea behind stacking is to combine several base models, also known as level-0 models, to
form a new model, known as a level-1 or meta model.
• The base models are trained on the same dataset, and their outputs are then used as input
features to the meta model.

Stacking
The stacking process involves the following steps:
• Splitting the training data into two or more folds.
• Training several base models on each fold of the
training data.
• Using the base models to make predictions on the
validation data, which was not used during training.
• Using the predictions from the base models as
features to train a meta model on the validation data.
• Repeating steps 2-4 for each fold of the training data.
• Once the meta model has been trained on the
validation data, it can be used to make predictions on
new, unseen data.

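A minimal sketch of stacking with scikit-learn's StackingClassifier: base (level-0) models feed their cross-validated predictions to a meta (level-1) model. The particular base learners chosen here are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)                                     # folds used to build the meta-features

print("stacking CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
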
Stacking
• Stacking can improve the predictive performance of models because it combines
the strengths of different models.
• For example, one base model may be good at predicting certain patterns in the data,
while another model may be better at predicting other patterns.
• By combining the predictions from multiple models, the meta model can make
more accurate predictions than any individual model.

One potential downside of stacking is that it can be computationally expensive,
especially if the base models are complex or if the dataset is large.
Additionally, if the base models are overfitting the training data, then the stacking
ensemble may also overfit the data.
To mitigate this, it is important to use a diverse set of base models and to tune the
hyperparameters of each model carefully.
Stacking for Deep Learning
[Diagram of a stacking ensemble applied to deep learning models]

Training Vs. Testing samples
• In machine learning, the process of building a predictive model involves dividing the available data into
two sets: a training set and a testing set.
• The training set is used to train the model, while the testing set is used to evaluate the performance of the
model.
• The training set is a subset of the available data that is used to train the model. The training data is used
to build the model, adjusting the parameters of the algorithm to find the best fit to the data. The goal is
to create a model that can generalize well to new data and make accurate predictions.

The testing set is a subset of the available data that is used to evaluate the performance of the model.
The testing data is used to evaluate how well the model can generalize to new data.
The goal is to measure the accuracy of the model's predictions on data that it has not seen before.
The testing data is kept separate from the training data to prevent the model from simply memorizing the
training data.

Positive vs. Negative class
• In machine learning, a binary classification problem involves dividing a dataset into two classes:
positive and negative.
• The positive class represents the target class that we want to identify or predict, while the
negative class represents all other classes or observations that are not of interest.
• For example, in a medical diagnosis problem, the positive class might represent patients who
have a particular disease, while the negative class represents patients who do not have the
disease.
• The positive and negative classes are often imbalanced, meaning that one class may have more
samples than the other.
• This can present a challenge in training a model, as the model may become biased towards the
majority class and not perform well on the minority class.

Positive vs. Negative class
• To address this issue, techniques such as oversampling the minority class,
undersampling the majority class, or using weighted loss functions can be
employed.
• It is also important to evaluate the model's performance using metrics such as
precision, recall, and F1-score, which take into account the class imbalance.
• In summary, the positive and negative classes represent the target and non-target
classes in a binary classification problem, and addressing class imbalance is an
important consideration when building a model.

Improving classification accuracy of Class Imbalanced Data
Improving classification accuracy of class imbalanced data is a challenging problem in
machine learning. When one class has a much smaller number of samples than the other
class, the model can become biased towards the majority class, resulting in poor
performance on the minority class.
Here are some techniques that can be used to improve the classification accuracy of class
imbalanced data:
• Resampling Techniques: This involves either oversampling the minority class by creating
synthetic samples, or undersampling the majority class by removing samples. This can be
done using methods such as random oversampling, the Synthetic Minority Oversampling
Technique (SMOTE), Adaptive Synthetic sampling (ADASYN), or Tomek Links, an undersampling method
that removes noisy and borderline majority-class examples.

Improving classification accuracy of Class Imbalanced Data
• Ensemble Techniques: Ensemble techniques such as Bagging, Boosting, and Stacking can
be used to combine multiple models to improve the classification accuracy. These techniques
can be particularly effective when the minority class is difficult to identify.
• Class Weighting: In this approach, a higher weight is given to the minority class during the
training process to make the model more sensitive to the minority class. This can be done by
adjusting the loss function or by setting the class weight parameter in the model.

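A minimal sketch of two of the options above: class weighting in scikit-learn and SMOTE oversampling from the separate imbalanced-learn package (assumed installed, e.g. via pip install imbalanced-learn). The synthetic imbalanced data and parameter values are illustrative assumptions.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("original class counts:", Counter(y))

# Option 1: give the minority class a larger weight in the loss function.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: oversample the minority class with synthetic examples (SMOTE).
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("resampled class counts:", Counter(y_res))
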
Improving classification accuracy of Class Imbalanced Data
• Algorithm Selection: Certain algorithms perform better on imbalanced datasets
than others. Algorithms such as Random Forest, Gradient Boosting, and Support
Vector Machines (SVMs) are often used in class imbalance problems.
• Cost-Sensitive Learning: This approach involves assigning different costs to
misclassification of the minority and majority classes. This can be useful when the
cost of misclassifying the minority class is much higher than the majority class.
Cost-sensitive learning is a subfield of machine learning that takes the costs of
prediction errors (and potentially other costs) into account when training a
machine learning model. It is a field of study that is closely related to the field of
imbalanced learning that is concerned with classification on datasets with a
skewed class distribution.
In summary, improving the classification accuracy of class imbalanced data
requires careful consideration of the problem and the available techniques. A
combination of techniques may be necessary to achieve the best results.