
MACHINE LEARNING ALGORITHMS

1) Logistic Regression:
• Logistic Regression is a classification algorithm used to predict the probability of
a specific outcome based on input features. For instance, in the context of
determining whether a student passes or fails an exam given the number of hours
spent studying, logistic regression analyses the relationship between the features
and the probability of passing.
• There are two common types of logistic regression: binary and multinomial. Binary
logistic regression classifies data into two categories, such as a tumour being
malignant or benign, while multinomial logistic regression handles multiple
categories, such as distinguishing between cats, dogs, and sheep.
a) Logit function:

The logistic regression model can be represented using the logit or log-odds function. The
odds signify the ratio of the probability of success to the probability of failure, odds = p / (1 - p),
and the logit is its logarithm:

logit(p) = log(p / (1 - p)) = β0 + β1x1 + ... + βnxn

b) Sigmoid:

By taking the inverse of the logit function, we get the sigmoid function, which produces an
S-shaped curve and always yields probability values between 0 and 1:

sigmoid(z) = 1 / (1 + e^(-z))

The algorithm transforms its raw linear output into a probability by passing it through this
logistic sigmoid function, which maps any real value to a value between 0 and 1.
c) Decision Boundary:

The decision boundary helps to differentiate between the positive and negative classes by using
a threshold value. If the predicted probability is above the threshold, the instance is classified
into the positive class; otherwise, it is assigned to the negative class.
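As a minimal illustration of the sigmoid and the decision boundary (the scores and the 0.5 threshold below are illustrative assumptions, not values from the text):

import numpy as np

def sigmoid(z):
    # Maps any real-valued score z = theta^T x to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.0, 1.5])       # hypothetical linear scores theta^T x
probs = sigmoid(scores)                    # approx. [0.12, 0.50, 0.82]
classes = (probs >= 0.5).astype(int)       # decision boundary at threshold 0.5 -> [0, 1, 1]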

d) Why not linear regression?

Here, p(x) denotes P(y = 1 | x), the probability that the outcome is positive given the features x.

Logistic regression is preferred over linear regression for classification tasks, especially when
dealing with outliers in the data. It ensures that the predicted probabilities lie between 0 and
1, unlike linear regression, which may produce values outside this range.

Unlike linear regression, which uses Ordinary Least Squares for parameter estimation, logistic
regression employs Maximum Likelihood Estimation, finding the regression coefficients that
maximize the likelihood of the observed data.

e) Cost function:

The cost function for logistic regression is the log-loss (cross-entropy), where h_θ(x) is the
sigmoid output and m is the number of training examples:

J(θ) = -(1/m) Σ [ y log(h_θ(x)) + (1 - y) log(1 - h_θ(x)) ]

The main goal of gradient descent is to minimize this cost value, i.e., min J(θ). Gradient
descent has a common analogy: imagine being left stranded and blindfolded at the top of a
mountain valley, with the objective of reaching the bottom of the hill. Feeling the slope of the
terrain around you is what anyone would do; this action is analogous to calculating the
gradient, and taking a step downhill is analogous to one iteration of the update to the
parameters.
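A minimal sketch of gradient descent on the logistic log-loss (the learning rate and iteration count below are illustrative assumptions):

import numpy as np

def cost(theta, X, y):
    # Cross-entropy cost J(theta) for logistic regression.
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (p - y) / len(y)   # "feeling the slope" of J(theta)
        theta -= lr * grad              # one step towards the bottom of the valley
    return theta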

f) Model evaluation

In model evaluation, null deviance represents the response predicted by a model with just an
intercept, while model deviance indicates the response predicted by a model with
independent variables. The accuracy of the model can also be assessed using a Confusion
Matrix.
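For example, the confusion matrix and accuracy can be obtained with scikit-learn (the labels below are hypothetical):

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1]   # hypothetical model predictions
print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))     # fraction of correct predictions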

g) Multi-classification

For multi-class problems, logistic regression can adopt a one-vs-all (one-vs-rest) approach. It
turns the problem into multiple binary classification problems, where each class is compared
against all the others. The class whose classifier reports the highest probability is chosen as
the final prediction.
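A minimal one-vs-rest sketch with scikit-learn (X_train, y_train, and X_test are assumed to exist and are not part of the original text):

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic regression is fitted per class; at prediction time the
# class whose classifier reports the highest probability wins.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
# ovr.fit(X_train, y_train)
# ovr.predict(X_test)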
h) Advantages of Logistic Regression:
• Simplicity and Interpretability: Logistic regression is a straightforward algorithm
with easily interpretable results. The model's coefficients provide insight into the
importance and direction of each feature's impact on the prediction.
• Efficient and Fast: Logistic regression is computationally efficient and can handle
large datasets with relative ease. Training and prediction times are typically fast,
making it suitable for real-time and online applications.
• Probabilistic Output: Logistic regression predicts the probability of an instance
belonging to a particular class, providing a clear indication of confidence in the
predictions.
• Low Risk of Overfitting: It is less prone to overfitting, especially when the
number of features is relatively small. Regularization techniques, such as L1 or L2
regularization, can be employed to further reduce overfitting.
• Handles Non-Linearities: Although logistic regression is a linear algorithm, it can
handle certain non-linear relationships through feature engineering and
polynomial transformations.
i) Disadvantages of Logistic Regression:
• Limited to Linear Boundaries: Logistic regression is a linear classifier, which
means it can only create linear decision boundaries. It may struggle with datasets
that require complex and non-linear decision boundaries for accurate
classification.
• Sensitivity to Outliers: Outliers in the data can significantly influence the model's
coefficients and predictions, leading to potential performance issues.
• Not Suitable for Multiclass Problems: While it can be extended to handle
multiclass problems using techniques like one-vs-rest, it may not be as effective as
other algorithms specifically designed for multiclass classification tasks.
• Assumption of Independence: Logistic regression assumes that the input
features are independent of each other. Violation of this assumption can lead to
biased and unreliable results.
• Feature Engineering Dependency: The performance of logistic regression heavily
depends on the quality of feature engineering and feature selection. Choosing
irrelevant or redundant features can negatively impact the model's performance.
• Imbalanced Data Handling: Logistic regression can struggle with imbalanced
datasets, where one class is significantly more prevalent than the others. It may
produce biased predictions towards the majority class.
2) DECISION TREES

• A decision tree is a hierarchical structure where each node symbolizes a feature (attribute),
each branch represents a decision (rule), and each leaf denotes an outcome (categorical or
continuous value).
• The tree is typically visualized upside down, with the root at the top. Each internal node
holds a condition that determines how the tree branches into edges (branches). At the end of
a branch, where no further splitting occurs, lies the decision (leaf). In the example referred to
here, the decision tree classifies passengers as either survivors or non-survivors.
• This methodology, known as learning decision trees from data, facilitates a clear
understanding of feature importance and relationships. The tree just described is a
classification tree, where the objective is to classify passengers as survivors or non-
survivors. Regression trees have the same representation, but they predict continuous
values, such as the price of a house.

What are the reasons for choosing Decision trees?


• Decision trees closely resemble human-level thinking, making it easier to comprehend the
data and derive meaningful interpretations.
• Decision trees provide a clear and transparent logic for data interpretation, unlike black
box algorithms such as SVM, NN, etc., which can be more challenging to interpret.

Black-box algorithms:
• "Black box algorithms" refer to machine learning models or techniques that are complex
and opaque in terms of their internal workings and decision-making processes.
• When you use a black box algorithm to make predictions, you may not easily understand
how the model arrives at its conclusions or how it relates input features to output
predictions.

Recursive Binary Splitting


• Recursive Binary Splitting is a method where all features are considered and various split
points are tested using a cost function. The split that yields the best (lowest) cost is chosen.
• In the scenario referred to here, with three features, three potential splits are evaluated.
The cost of each split is calculated, and the one with the least cost is selected.
• In this example, the split based on the passenger's sex proves to be the most cost-
effective.
• The algorithm is recursive because the groups formed can be further divided using the
same strategy. The approach is greedy because at each step it chooses the split that most
reduces the cost. As a result, the root node becomes the best predictor or classifier.

Cost of a split
a) For regression tasks,
• the cost of a split is calculated using the sum of squared differences between
the actual target values (y) and the predictions made for each group.
• The decision tree initiates the splitting process by considering each feature in
the training data.
• It calculates the mean of the responses for the inputs in a particular group
and treats it as the prediction for that group.
• The given function is then applied to all data points, and the cost is
computed for each potential split. Ultimately, the split with the lowest cost is
chosen as the optimal one.
• Alternatively, another cost function involves maximizing the reduction in
standard deviation (equivalently, minimizing the weighted standard deviation of the
resulting groups), which can provide more insight into the process. A minimal code
sketch of the squared-error cost appears below.
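A minimal sketch of the squared-error cost of a candidate split, assuming the two resulting groups are given as arrays of target values (an illustration, not the text's exact formulation):

import numpy as np

def regression_split_cost(y_left, y_right):
    # Sum of squared differences between the actual targets and the group mean,
    # which the tree uses as its prediction for each group.
    cost = 0.0
    for group in (np.asarray(y_left), np.asarray(y_right)):
        if group.size == 0:
            continue
        prediction = group.mean()
        cost += np.sum((group - prediction) ** 2)
    return cost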

b) For classification tasks,


• a Gini score is employed to assess the quality of a split based on the degree
of mixing of response classes within the groups created by the split.
• The proportion (pk) of inputs from the same class in a specific group is crucial
for this calculation.
• Perfect class purity is achieved when a group contains only inputs from the
same class, resulting in pk being either 1 or 0, leading to a Gini score (G) of 0.
• On the other hand, a node with a 50-50 split of classes within a group has the
worst purity, so for binary classification, it will have pk = 0.5 and G = 0.5. The
goal is to minimize the Gini score and obtain highly pure groups through the
split.
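The per-group Gini score described above can be sketched as follows, using G = sum over classes of p_k * (1 - p_k), which gives 0 for a pure group and 0.5 for a 50-50 binary split:

import numpy as np

def gini_of_group(labels):
    # 0 for a perfectly pure group, 0.5 for the worst binary split.
    labels = np.asarray(labels)
    gini = 0.0
    for c in np.unique(labels):
        p_k = np.mean(labels == c)
        gini += p_k * (1.0 - p_k)
    return gini

print(gini_of_group([1, 1, 1, 1]))   # 0.0 (pure)
print(gini_of_group([0, 1, 0, 1]))   # 0.5 (worst 50-50 split)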

When to Halt Splitting?


• Splitting is usually halted once a node reaches a minimum number of samples or the tree
reaches a maximum depth. To further enhance the performance of a decision tree, pruning can
be employed. Pruning involves the removal of branches that rely on features of low importance.
This process reduces the complexity of the tree, thus improving its predictive power and
mitigating overfitting.
• Pruning can commence at either the root or the leaves of the tree. The simplest pruning
method starts at the leaves: each node is replaced by its most popular class, and the change
is retained if it does not adversely affect accuracy. This technique is also referred to as
reduced error pruning.
• Alternatively, more sophisticated pruning methods can be used, such as cost complexity
pruning. This approach employs a learning parameter (alpha) to determine whether nodes
can be removed based on the size of the sub-tree. It is also known as weakest link
pruning.

Benefits of CART (Classification and Regression Trees):


• Easy to comprehend, interpret, and visualize: CART models provide clear and
straightforward decision rules, making them easy for users to understand and interpret.
The visual representation of decision trees further aids in grasping the model's logic.
• Implicit variable screening or feature selection: During the process of constructing the
decision tree, the algorithm implicitly evaluates the importance of features by placing the
most relevant ones near the root. This serves as a built-in feature selection mechanism.
• Handling of numerical and categorical data, as well as multi-output problems: CART
can efficiently deal with various data types, including both numerical and categorical
features. Moreover, it is adaptable to multi-output tasks, making it suitable for diverse
prediction scenarios.
• Low data preparation effort: Decision trees require relatively little data preprocessing
compared to some other machine learning algorithms. They can handle missing values and
outliers without extensive data transformation.
• Handles nonlinear relationships: nonlinear relationships between parameters do not
degrade tree performance, so CART can capture them effectively. This is advantageous
when dealing with real-world data, which often exhibits nonlinear patterns.

Drawbacks of CART (Classification and Regression Trees):


• Overfitting: Decision trees can create overly complex trees that may not generalize well
to new data, resulting in poor performance on unseen examples. Overfitting occurs when
the tree captures noise or random variations in the training data, leading to reduced
predictive power on test data.
• Unstable Trees: Small changes or variations in the data can lead to significantly different
decision trees being generated. This instability, known as variance, can result in less reliable
predictions. Techniques like bagging and boosting are used to lower the variance and
improve the stability of the model.
• Lack of Globally Optimal Solution: Greedy algorithms used in constructing decision trees
do not guarantee the globally optimal tree structure. As a result, the generated tree may
not be the most accurate or efficient one for the entire dataset.
• Bias in Class Imbalance: When some classes dominate the data, decision tree learners
may create biased trees that favour the majority class. To address this, it is advisable to
balance the dataset before fitting the decision tree, ensuring fair representation of all
classes.

ID3 (Iterative Dichotomiser 3)

ID3 is an algorithm that utilizes the Entropy function and Information Gain as metrics.

Entropy represents the degree of disorder or uncertainty in a dataset: H(Y) = -Σ p_i log2(p_i), where p_i is the proportion of class i.

Information Gain measures the reduction in this uncertainty regarding the target variable or
class when additional information (features or independent variables) is considered.

To calculate Information Gain, we subtract the entropy of the target variable (Y) given a
particular feature (X) from the entropy of the target variable (Y) on its own. This quantifies
the reduction in uncertainty about Y when we have extra information (X) about it.

In real-world scenarios with more than two features, the initial split is made based on the
most informative feature. Then, at each subsequent split, the information gain for each
additional feature needs to be recomputed, as the information gain from each feature in
isolation may differ. The entropy and information gain must be recalculated after one or
more splits have already been performed, leading to changing results. A decision tree
repeats this process, growing deeper until reaching a pre-defined depth or when no further
split can result in a higher information gain beyond a specified threshold, which can often be
set as a hyper-parameter.
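A minimal sketch of entropy and information gain, assuming the target labels and a single categorical feature are given as arrays (an illustration of the definitions above):

import numpy as np

def entropy(labels):
    # H(Y) = -sum over classes of p * log2(p).
    labels = np.asarray(labels)
    probs = np.array([np.mean(labels == c) for c in np.unique(labels)])
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, feature_values):
    # IG(Y, X) = H(Y) - H(Y | X): the reduction in uncertainty about the
    # target once the value of the feature X is known.
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    conditional = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        conditional += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - conditional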
3) RANDOM FOREST CLASSIFIER
Decision Tree: The Fundamental Concept
• A decision tree model takes input data and proceeds through a sequence of branching
steps, following a series of if-then rules until it arrives at one of the predefined output
values.
• It can be seen as an algorithm that aims to split the data into the most homogeneous
groups at each split, generating a tree-like set of rules for classification or regression
tasks.
• The recursive nature of decision trees involves splitting the dataset repeatedly until
reaching a predefined stopping condition, often based on a minimum count threshold
for entries in the dataset corresponding to a leaf node.
• When a decision tree is built using a dataset containing specific features, it generates
a set of rules that facilitate prediction. However, its performance is highly dependent
on the initial dataset, which can lead to limited accuracy when applied to real-world
scenarios.
• Overfitting is a potential concern, as the decision tree may excessively adapt to the
peculiarities of the initial dataset.

Why is it Called 'Random Forest'?


The name 'random forest' stems from its unique approach in creating multiple decision trees.
Each decision tree within the forest employs a random subset of features to form questions
and only has access to a random set of training data points. This deliberate randomness
introduces diversity into the forest, resulting in more robust and accurate overall predictions,
hence the name 'random forest'.

RANDOM FOREST

a) The random forest classifier is an ensemble algorithm, which means it combines
multiple algorithms of the same or different kinds for classifying objects. It creates a
set of decision trees from randomly selected subsets of the training set and then
aggregates the votes from different decision trees to make the final classification for a
test object. This ensemble approach helps reduce the impact of noise and increases
the accuracy of results compared to a single decision tree.
b) The random forest can apply the concept of weighting to consider the impact of the
results from each decision tree. Trees with high error rates are given lower weight
values, while those with low error rates have higher influence. This allows the random
forest to make more reliable decisions based on the collective wisdom of the individual
decision trees.
c) In the training (or fitting) phase of model building, a random forest, a supervised
machine learning model, learns to map data (such as temperature today and historical
average) to corresponding outputs (such as the max temperature tomorrow).
d) The model grasps the relationships between the data (referred to as features) and the
values it aims to predict (known as the target). Each decision tree within the random
forest contributes to this process by determining the most accurate questions to ask
in order to make precise estimates.
e) During the prediction phase, the random forest aggregates the individual decision tree
estimates by taking an average for regression tasks, like when predicting a continuous
value such as temperature.
f) For classification tasks, where the targets are discrete class labels like 'cloudy' or
'sunny,' the random forest performs a majority vote to determine the predicted class.

How does the random forest algorithm operate?


• Data Subset Creation: The random forest algorithm begins by creating subsets of the
dataset in two ways. Firstly, it randomly selects the features on which each tree will be
trained, known as random feature subspaces. Secondly, it draws a sample of the training
observations with replacement, called a bootstrap sample. (A short usage sketch with
scikit-learn follows this list.)
• Decision Tree Training: After dividing the dataset into subsets, decision trees are
trained on each of these subsets. Since the trees are independent of each other, the
training process can be easily parallelized, speeding up the overall training time.
• Result Aggregation: Each individual tree produces a result that depends on its specific
initial data. To eliminate the dependency on the initial data and enhance the accuracy
of the estimation, the output of all trees is combined into a single result. Various
methods can be used for aggregating the results.
• Model Validation: Finally, the performance of the random forest model is validated.
This step involves assessing the model's accuracy and generalization ability on a
separate test dataset to ensure its effectiveness in making predictions for new, unseen
data.
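As a minimal usage sketch with scikit-learn (the iris dataset and the hyperparameter values below are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                        # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators: number of trees; max_features: size of the random feature
# subspace considered at each split; bootstrap=True: sample with replacement.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))                           # validation on held-out data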

Advantages
• Cost-effective: Random Forest (RF) models are more affordable and quicker to train
compared to neural networks, while still maintaining a high level of accuracy. This
makes them suitable for applications in mobile devices, for instance.
• Robust against overfitting: Since RF consists of multiple uncorrelated trees, if one
tree makes an inaccurate prediction due to outliers, other trees can compensate for it.
This leads to better performance than individual trees taken separately.
• High coverage rates and low bias: The robustness of RF makes it well-suited for
scenarios with missing values in the dataset or when assessing the variability between
different data outputs, such as predicting whether college undergraduates will
complete their studies, proceed to a master's degree, or drop out.
• Applicable for classification and regression: RF demonstrates equally accurate
results for both classification and regression tasks.
• Handling missing values: RF can handle missing values in features without
introducing bias into its predictions.
• Easy interpretation: Each tree in the forest makes predictions independently, making
it straightforward to examine and understand the prediction process of any individual
tree.

Despite the benefits, random forest classifiers also come with some
challenges:
• Complexity: Random forests are more complex than decision trees, requiring
additional steps to combine and aggregate the results from multiple trees. It is not as
straightforward as following a single decision tree's path to decide.
• Slower Execution: The process of training and aggregating multiple decision trees can
make random forests slower compared to certain other types of machine learning
models. This aspect might limit their suitability for certain time-sensitive applications.
• Large Datasets and Adequate Training Data: Random forests perform optimally with
large datasets and when there is sufficient training data available. In scenarios with
limited data, the performance of the random forest model may not be as effective.

4) K-nearest neighbors (KNN)


a) What is K-Nearest Neighbors?

• Nonparametric: it makes no assumption about the underlying data distribution pattern.
• Lazy algorithm: KNN has no training step; all data points are used only at the time of
prediction. With no training step, the prediction step is costly.
• Used for both classification and regression.
• Uses feature similarity to predict the class (or value) that the new point will fall into.
b) What is K in K-Nearest Neighbors?
K is the number of similar neighbors considered for the new data point. KNN looks at the K
nearest neighbors to decide which class the new data point belongs to. This decision is
based on feature similarity.
c) How do we choose the value of K?

The choice of K has a drastic impact on the results we obtain from KNN. A common approach
is to take a held-out test set and plot the error rate (or F1 score) against different values of K.
Typically, the test error is high when K = 1, so the model overfits when K = 1, and for very
high values of K the F1 score starts to drop. In the example referenced here, the test set
reaches its minimum error rate at K = 5.
d) How does KNN work?
Step 1: Choose a value for K. K should be an odd number.
Step 2: Find the distance of the new point to each of the training data.
Step 3: Find the K nearest neighbors to the new data point.
Step 4: For classification, count the number of data points in each category among the k
neighbors. New data point will belong to class that has the most neighbors. For
regression, value for the new data point will be the average of the k neighbors.
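A minimal sketch of these steps with scikit-learn (the hours-studied data below is an invented example):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])  # hours studied (made up)
y_train = np.array([0, 0, 0, 1, 1, 1])                              # 0 = fail, 1 = pass

knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: K = 5, an odd value
knn.fit(X_train, y_train)                   # lazy: the data is simply stored
print(knn.predict([[9.0]]))                 # Steps 2-4: majority vote among the 5 nearest points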
e) How is the distance calculated?
Distance can be calculated using:
1) Euclidean distance

Euclidean distance is the square root of the sum of the squared differences between the
coordinates of two points. It is also known as the L2 norm.
2) Manhattan distance
Manhattan distance is the sum of the absolute values of the differences between two points.

3) Hamming Distance
Hamming distance is used for categorical variables. In simple terms it tells us if the two
categorical variables are same or not.

4) Minkowski Distance
Minkowski distance is a generalized metric for the distance between two points. When p = 1
it becomes the Manhattan distance, and when p = 2 it becomes the Euclidean distance.
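Minimal NumPy sketches of these four distances, following the definitions above:

import numpy as np

def euclidean(a, b):
    # L2 norm: square root of the sum of squared differences.
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    # L1 norm: sum of absolute differences.
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

def hamming(a, b):
    # Number of positions where two categorical vectors differ.
    return np.sum(np.asarray(a) != np.asarray(b))

def minkowski(a, b, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)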

f) Pros of K Nearest Neighbors:

• Simple algorithm and hence easy to interpret the prediction.


• Nonparametric, so makes no assumption about the underlying data pattern.
• Used for both classification and regression.
• Training step is much faster for nearest neighbor compared to other machine
learning algorithms.
g) Cons of K Nearest Neighbors:
• KNN is computationally expensive as it searches the nearest neighbors for the
new point at the prediction stage.
• High memory requirement as KNN must store all the data points.
• Prediction stage is very costly.
• Sensitive to outliers, accuracy is impacted by noise or irrelevant data.
h) Are KNN and K-Means related, or is there a difference between them?
• KNN is supervised machine learning algorithm whereas K-means is unsupervised
machine learning algorithm.
• KNN is used for classification as well as regression whereas K-means is used for
clustering.
• K in KNN is the number of nearest neighbors, whereas K in K-Means is the number of
clusters we are trying to identify in the data.

5) NAÏVE BAYES Algorithm

Bayes Theorem

Using Bayes' theorem, we can find the probability of A happening given that B has occurred:

P(A | B) = P(B | A) P(A) / P(B)

Here, B is the evidence and A is the hypothesis.

Naive Bayes Classifier


The Naive Bayes classifier calculates the probability of each outcome given the evidence and
then selects the outcome with the highest probability.
This classifier assumes the features (in the text-classification case, the input words) are
independent of each other, hence the word 'naive'. Even with this assumption, it is a powerful
algorithm used for
• Real time Prediction
• Text classification/ Spam Filtering
• Recommendation System
Example:
We classify whether the day is suitable for playing golf, given the features of the day. The
columns represent these features, and the rows represent individual entries. If we take the
first row of the dataset, we can observe that the day is not suitable for playing golf if the
outlook is rainy, the temperature is hot, the humidity is high, and it is not windy.
here, one as stated above we consider that these predictors are independent. That is, if the
temperature is hot, it does not necessarily mean that the humidity is high. Another
assumption made here is that all the predictors have an equal effect on the outcome. That is,
the day being windy does not have more importance in deciding to play golf or not.

Bayes' theorem can be rewritten for this problem as:

P(y | X) = P(X | y) P(y) / P(X)

The variable y is the class variable (play golf), which represents whether it is suitable to play
golf or not given the conditions. The variable X represents the parameters/features.
X is given as:

X = (x_1, x_2, ..., x_n)

Here x_1, x_2, ..., x_n represent the features, i.e. they can be mapped to outlook, temperature,
humidity, and windy. By substituting for X and expanding using the chain rule together with the
independence assumption, we get:

P(y | x_1, ..., x_n) = [ P(x_1 | y) P(x_2 | y) ... P(x_n | y) P(y) ] / [ P(x_1) P(x_2) ... P(x_n) ]

Now, the values for each term can be obtained from the dataset and substituted into the
equation. For all entries in the dataset, the denominator does not change; it remains static.
Therefore, the denominator can be removed and a proportionality introduced:

P(y | x_1, ..., x_n) ∝ P(y) Π P(x_i | y)

In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases
where the classification is multi-class. In either case, we need to find the class y with the
maximum probability:

y = argmax_y P(y) Π P(x_i | y)
Types of Naive Bayes Classifier

Multinomial Naive Bayes:

This is mostly used for document classification problems, i.e. whether a document belongs to
the category of sports, politics, technology, etc. The features/predictors used by the classifier
are the frequencies of the words present in the document.
Bernoulli Naive Bayes:
This is like the multinomial naive bayes, but the predictors are Boolean variables. The
parameters that we use to predict the class variable take up only values yes or no, for
example if a word occurs in the text or not.
Gaussian Naive Bayes:
When the predictors take continuous rather than discrete values, we assume that these
values are sampled from a Gaussian distribution. Since the way the values appear in the
dataset changes, the formula for the conditional probability changes to:

P(x_i | y) = (1 / sqrt(2π σ_y^2)) * exp(-(x_i - μ_y)^2 / (2 σ_y^2))

Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering,
recommendation systems, etc. They are fast and easy to implement, but their biggest
disadvantage is the requirement that the predictors be independent. In most real-life cases
the predictors are dependent, which hinders the performance of the classifier.
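A minimal scikit-learn sketch of two of these variants (the toy corpus and labels are invented for illustration; X_numeric and y_numeric are assumed to exist):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB

# Multinomial NB on word counts (the document-classification case).
docs = ["great match today", "election results announced"]   # toy corpus
labels = ["sports", "politics"]
X_counts = CountVectorizer().fit_transform(docs)
text_clf = MultinomialNB().fit(X_counts, labels)

# Gaussian NB for continuous features such as temperature or humidity.
# num_clf = GaussianNB().fit(X_numeric, y_numeric)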

Advantages
• It is easy and fast to predict the class of the test data set. It also performs well
in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and it needs less training data.
• It performs well in case of categorical input variables compared to numerical
variable(s). For numerical variable, normal distribution is assumed (bell curve, which is
a strong assumption).
Disadvantages
• If categorical variable has a category (in test data set), which was not observed in
training data set, then model will assign a 0 (zero) probability and will be unable to
make a prediction. This is often known as Zero Frequency. To solve this, we can use
the smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.
• On the other hand, Naive Bayes is also known to be a poor probability estimator, so its
probability outputs should not be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In
real life, it is almost impossible that we get a set of predictors which are completely
independent.

When to use
• Text Classification
• when dataset is huge
• When you have small training set

Applications
• Real-time Prediction: Naive Bayes is an eager learning classifier, and it is very fast.
Thus, it can be used for making predictions in real time.
• Multi class Prediction: This algorithm is also well known for multi class prediction
feature. Here we can predict the probability of multiple classes of target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are
widely used in text classification (owing to good results on multi-class problems under
the independence assumption) and often achieve a higher success rate than other
algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail)
and sentiment analysis (in social media analysis, to identify positive and negative
customer sentiments).
• Recommendation System: A Naive Bayes classifier and collaborative filtering together
build a recommendation system that uses machine learning and data mining techniques
to filter unseen information and predict whether a user would like a given resource or not.
6) Gradient boosting
Ensemble:

When we try to predict the target variable using any machine learning technique, the main
causes of the difference in actual and predicted values are noise, variance, and bias.
Ensemble helps to reduce these factors (except noise, which is an irreducible error).

An ensemble is just a collection of predictors which come together (e.g. mean of all
predictions) to give a final prediction. The reason we use ensembles is that many different
predictors trying to predict the same target variable will perform a better job than any single
predictor alone. Ensembling techniques are further classified into Bagging and Boosting.

Bagging:

Bagging is a simple ensembling technique in which we build many independent
predictors/models/learners and combine them using some model averaging technique (e.g.,
weighted average, majority vote, or simple average).
We typically take a random sub-sample/bootstrap of the data for each model, so that all the
models are a little different from each other. Each observation is chosen with replacement to be
used as input for each model. So, each model will have different observations based on the
bootstrap process. Because this technique takes many uncorrelated learners to make a final
model, it reduces error by reducing variance. Example of bagging ensemble is Random Forest
models.

Boosting:
Boosting is an ensemble technique in which the predictors are not made independently, but
sequentially.

This technique employs the logic in which the subsequent predictors learn from the mistakes
of the previous predictors. Therefore, the observations have an unequal probability of
appearing in subsequent models and the ones with the highest error appear most. (So, the
observations are not chosen based on the bootstrap process but based on the error). The
predictors can be chosen from a range of models like decision trees, regressors, classifiers,
etc. Because new predictors are learning from mistakes committed by previous predictors, it
takes less time/iterations to reach close to actual predictions. But we must choose the
stopping criteria carefully or it could lead to overfitting on training data. Gradient
Boosting is an example of boosting algorithm.

Bagging has many uncorrelated trees in the final model which helps in reducing variance.
Boosting will reduce variance in the process of building sequential trees. At the same time,
its focus remains on bridging the gap between the actual and predicted values by reducing
residuals, hence it also reduces bias.

Gradient Boosting Trees for Classification


Step 1: Initial Prediction — We start with a leaf which represents an initial prediction for
every individual. For classification, this is equal to the log(odds) of the dependent variable,
which we convert to a probability using the logistic function.
In the heart-disease example used here, with a probability threshold of 0.5, this means that our
initial prediction is that all the individuals have heart disease (this happens because more than
half of the training labels are positive, so the initial probability exceeds 0.5).

Step 2: Calculate Residuals — We will now calculate the residuals for each observation by
using the following formula:

Residual = Actual value - Predicted value

Step 3: Predict residuals — Our next step involves building a Decision Tree to predict the
residuals.
Step 4: Obtain new probability of having a heart disease — Now, let us pass each sample
in our dataset through the nodes of the newly formed decision tree. The predicted residuals
obtained for each observation will be added to the previous prediction, scaled by the learning
rate. Assuming a learning rate of 0.2:

new log(odds) = previous log(odds) + 0.2 * (predicted residual)

We then convert this new log(odds) into a probability value using the logistic function.

Step 5: Obtain new residuals — After obtaining the predicted probabilities for all the
observations, we will calculate the new residuals by subtracting these new predicted values
from the actual values.
Step 6: Repeat steps 3 to 5 until the residuals converge to a value close to 0 or the number
of iterations matches the value given as hyperparameter while running the algorithm.
Step 7: Final Computation — After we have calculated the output values for all the trees,
the final log(odds) prediction is the initial prediction plus the learning-rate-scaled output of
every tree.

Next, we need to convert this log(odds) prediction into a probability by plugging it into the
logistic function.
Using the common probability threshold of 0.5 for making classification decisions, an observation is predicted to have heart disease if its final probability exceeds 0.5.
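In scikit-learn, this procedure is available as GradientBoostingClassifier; a minimal sketch (the hyperparameter values are illustrative, and X_train, y_train, X_test are assumed to exist):

from sklearn.ensemble import GradientBoostingClassifier

# learning_rate plays the role of the 0.2 used above; n_estimators is the
# number of residual-fitting trees.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.2,
                                 max_depth=3, random_state=0)
# gbc.fit(X_train, y_train)
# gbc.predict_proba(X_test)[:, 1] > 0.5   # apply the 0.5 probability threshold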

Advantages
• It is a generalised algorithm which works for any differentiable loss function.
• It often provides predictive scores that are far better than other algorithms.
• It can handle missing data — imputation not required.
Disadvantages
• This method is sensitive to outliers. Outliers will have much larger residuals than
non-outliers, so gradient boosting will focus a disproportionate amount of its
attention on those points. Using Mean Absolute Error (MAE) to calculate the error
instead of Mean Square Error (MSE) can help reduce the effect of these outliers since
the latter gives more weight to larger differences. The parameter ‘criterion’ helps you
choose this function.
• It is prone to overfitting if the number of trees is too large. The
parameter ‘n_estimators’ can help in determining a good point to stop before the
model starts overfitting.
• Computation can take a long time. Hence, if you are working with a large dataset,
keep in mind to train on a sample of the dataset (keeping the class proportions of the
target variable the same).
Gradient Boosting algorithm
Gradient boosting is a machine learning technique for regression and classification problems,
which produces a prediction model in the form of an ensemble of weak prediction models,
typically decision trees.
The objective of any supervised learning algorithm is to define a loss function and minimize
it.

By using gradient descent and updating our predictions based on a learning rate, we can
find the values where MSE is minimum.
Intuition behind Gradient Boosting
The intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in
the residuals and strengthen a model with weak predictions, making it better. Once we reach a
stage where the residuals no longer have any pattern that could be modelled, we can stop
modelling residuals.

How does it work?
We first model data with simple models and analyse data for errors. These errors signify data
points that are difficult to fit by a simple model. Then for later models, we particularly focus
on those hard-to-fit data to get them right. In the end, we combine all the predictors by
giving some weights to each predictor.
Steps to fit a Gradient Boosting model.
Step 1: Fit a simple linear regressor or decision tree on the data (a decision tree is chosen in
the sketch that follows these steps) [call x the input and y the output].
Step 2: Calculate error residuals. Actual target value, minus predicted target value
[e1= y - y_predicted1].
Step 3: Fit a new model on error residuals as target variable with same input variables [call it
e1_predicted]

Step 4: Add the predicted residuals to the previous predictions


[y_predicted2 = y_predicted1 + e1_predicted]
Step 5: Fit another model on the residuals that are still left, i.e. [e2 = y - y_predicted2], and
repeat steps 2 to 5 until the model starts overfitting or the sum of the residuals becomes
constant. Overfitting can be controlled by consistently checking accuracy on validation data.
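A minimal sketch of this residual-fitting loop for regression (the base prediction is simplified to the mean of y, and the learning rate is an illustrative addition, not part of the original steps):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    prediction = np.full(len(y), y.mean())              # Step 1: simple initial model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                      # Step 2: error residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                          # Step 3: model the residuals
        prediction += learning_rate * tree.predict(X)   # Step 4: update predictions
        trees.append(tree)                              # Step 5: repeat on what is left
    return y.mean(), trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred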
AdaBoost (Adaptive Boosting)

AdaBoost is a boosting ensemble model that works especially well with decision trees. The
key to a boosting model is learning from previous mistakes, e.g., misclassified data points.
AdaBoost learns from these mistakes by increasing the weights of misclassified data points.

Step 0: Initialize the weights of the data points. If the training set has 100 data points, then each
point's initial weight should be 1/100 = 0.01.
Step 1: Train a decision tree.
Step 2: Calculate the weighted error rate (e) of the decision tree. The weighted error rate
(e) is the proportion of wrong predictions out of the total, where each wrong prediction
counts according to its data point's weight. The higher the weight, the more the
corresponding error contributes to (e).
Step 3: Calculate this decision tree's weight in the ensemble:
the weight of this tree = learning rate * log((1 - e) / e)
• The higher the weighted error rate of a tree, the less decision power the tree will be
given during the later voting.
• The lower the weighted error rate of a tree, the more decision power the tree will be
given during the later voting.
Step 4: Update the weight of each data point:
• If the model got this data point correct, the weight stays the same.
• If the model got this data point wrong, the new weight of this point = old weight
* exp(weight of this tree).
Step 5: Repeat Steps 1 to 4 until the number of trees we set out to train is reached.
Step 6: Make the final prediction.
AdaBoost makes a new prediction by adding up each tree's weight multiplied by that tree's
prediction, so the trees with higher weights have more influence on the final decision.
Note: The higher the weight of a tree (i.e., the more accurately it performs), the more boost
(importance) the data points misclassified by this tree will get. The weights of the data points
are normalized after all the misclassified points are updated.
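A minimal sketch of a single AdaBoost weight-update round, following the formulas above (the arrays of true and predicted labels are assumed inputs):

import numpy as np

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    wrong = (y_true != y_pred)
    # Step 2: weighted error rate of this tree.
    e = np.sum(weights[wrong]) / np.sum(weights)
    # Step 3: this tree's weight in the ensemble.
    tree_weight = learning_rate * np.log((1 - e) / e)
    # Step 4: boost the weights of the misclassified points, then normalize.
    new_weights = weights.copy()
    new_weights[wrong] *= np.exp(tree_weight)
    new_weights /= new_weights.sum()
    return tree_weight, new_weights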

7) K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm. It is used to identify natural
clusters within an unlabeled dataset, extracting insights from these patterns.
Clustering:
Clustering is a fundamental task in unsupervised learning which deals with finding structure in a
collection of unlabeled data. A basic explanation of clustering could be “organizing items into groups
sharing some form of similarity.” The way we judge similarity is through distance: when objects are
“close” based on a given distance (like geometric distance), they’re part of the same cluster. This is
known as distance-based clustering. Another type is conceptual clustering: if objects share a
common concept, they belong to the same cluster.

Introduction to K-Means Clustering:


It maintains k centroids to define clusters. A point belongs to a cluster if it’s closer to that cluster’s
centroid than others.
K-Means determines centroids by alternating between (1) assigning data points to clusters using
current centroids, and (2) selecting centroids (points at the cluster center) based on current data
point assignments to clusters.
Centroid — A centroid is a data point at the center of a cluster. K-Means is an iterative algorithm in
which the notion of similarity is derived from how close a data point is to the centroid of its cluster.
The algorithm requires the number of clusters K and the data set as input. The data set is a collection
of features for each data point. The algorithm starts with initial estimates for the K centroids. The
algorithm then iterates between two steps: -
Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. So, if c_i is a centroid in the set of centroids C,
then each data point is assigned to the cluster whose centroid is at the minimum Euclidean distance
from it.
Centroid update step:
In this step, the centroids are recomputed and updated. This is done by taking the mean of all data
points assigned to that centroid’s cluster.

The algorithm then iterates between step 1 and step 2 until a stopping criterion is met. Stopping
criteria means no data points change the clusters, the sum of the distances is minimized, or some
maximum number of iterations is reached. This algorithm is guaranteed to converge to a result.
The result may be a local optimum meaning that assessing more than one run of the algorithm with
randomized starting centroids may give a better outcome.
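A minimal NumPy sketch of the two alternating steps (random initialization; empty clusters are not handled in this simplified version):

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial estimates
    for _ in range(n_iters):
        # Data assignment step: nearest centroid by squared Euclidean distance.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Centroid update step: mean of the points assigned to each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels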

Choosing the value of K:


To find the number of clusters in the data, we need to run the K-Means clustering algorithm for
different values of K and compare the results. So, the performance of K-Means algorithm depends
upon the value of K.
The Elbow Method:

In the example referenced here, the elbow point occurs at K = 3. The Elbow method is used to
determine the optimal number of clusters in K-Means clustering. It uses what is called the Sum of
Squared Errors (SSE). To compute the SSE for a given K, we measure the Euclidean distance of each
point from its assigned cluster center, and we add up the squares of these distances over all points in
all K clusters to get the SSE for the corresponding K.
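A minimal sketch of the elbow method with scikit-learn (the random data below is a stand-in):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)              # stand-in data
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)             # SSE: sum of squared distances to the assigned centers
# Plot k (1..10) against sse and look for the "elbow" where the curve flattens.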
Applications:

• Market Segmentation: K-means clustering can be used to segment customers based on their
purchasing behaviour, demographics, or preferences. This information helps businesses tailor
their marketing strategies and target specific customer groups.
• Document Clustering: K-means clustering can group similar documents together, enabling
efficient document organization, information retrieval, and topic modelling. It has applications in
text mining, content recommendation, and document classification.
• Image Segmentation: K-means clustering can partition an image into distinct regions based on
colour similarity. This technique is used in image processing, computer vision, and object
recognition.
• Anomaly Detection: K-means clustering can identify outliers or anomalies in a dataset by
assigning them to separate clusters. This is useful in fraud detection, network intrusion
detection, and outlier analysis.

Drawbacks:

• Sensitive to Initial Centroids: K-means clustering starts by placing centroids randomly. The final
clusters and results can vary based on these initial positions. In some cases, poor initial
placement can lead to suboptimal or incorrect cluster assignments.
• Assumes Spherical Clusters with Equal Size: K-means assumes that clusters are spherical and
have roughly equal sizes. However, real-world data can have clusters of various shapes and sizes.
This can cause K-means to struggle with clusters that are elongated or irregularly shaped.
• Requires Predefined Number of Clusters (K): K-means requires you to specify the number of
clusters (K) beforehand. But in many cases, it’s not clear what the optimal number of clusters
should be. Selecting an incorrect value for K can lead to inaccurate results, and finding the
optimal K using techniques like the elbow method is not always straightforward.

a) Can you explain the K-Means algorithm briefly?


K-Means is an unsupervised clustering algorithm that aims to partition a dataset into a
predetermined number of clusters. It works by iteratively assigning data points to the nearest cluster
centroid and then recalculating the centroids based on the assigned points.

b) Why is K-Means a suitable choice for this customer segmentation task?


K-Means is well-suited because it’s efficient and can handle large datasets. It’s also easy to
implement and interpret. Since we want to group customers based on their purchasing behavior, K-
Means can help us identify distinct segments of customers with similar spending patterns.

c) How would you determine the optimal number of clusters for this task?
A common approach is the ‘elbow method.’ We run K-Means with different numbers of clusters and
calculate the sum of squared distances from each point to its assigned centroid. As the number of
clusters increases, this value tends to decrease. The ‘elbow point’ is where the rate of decrease slows
down, indicating a good trade-off between minimizing intra-cluster distance and avoiding over-
segmentation.

d) Could you walk us through the K-Means algorithm steps in this context?
• Choose the number of clusters (K) based on the elbow method.
• Initialize K cluster centroids randomly.
• Assign each customer to the nearest centroid based on their purchasing behavior.
• Recalculate the centroids as the mean of all data points assigned to each centroid.
• Repeat the assignment and update steps until convergence or until a maximum number of iterations is reached.

e) What are the potential challenges or limitations of K-Means?


K-Means has a few limitations. It assumes that clusters are spherical and equally sized, which might
not hold for all types of data. It’s also sensitive to the initial placement of centroids, which can lead
to different results. Additionally, K-Means might not work well if clusters have varying densities.
f) How would you interpret the results of the K-Means clustering for the retail company?
After running K-Means, we’ll have distinct clusters of customers. Each cluster represents a segment
of customers who share similar purchasing behaviors. For example, we might have clusters like ‘High-
Spending Young Adults,’ ‘Budget-Conscious Seniors,’ etc. This segmentation can guide the company in
tailoring marketing strategies for each group.

g) Can you mention any alternatives to K-Means that could be used for this task?
Certainly. Hierarchical clustering is an alternative that doesn’t require specifying the number of
clusters beforehand. It creates a tree-like structure of clusters, which can be useful for exploring
different levels of segmentation. Another option is Gaussian Mixture Models (GMM), which can
capture more complex cluster shapes and provide probabilistic cluster assignments.

h) How would you handle the scenario where some customers’ purchasing behaviour doesn’t fit
well into any cluster?
If we encounter such outliers or noise points, we might consider using techniques like DBSCAN
(Density-Based Spatial Clustering of Applications with Noise). DBSCAN is robust to noise and can
identify points that don’t belong to any cluster. We can set a minimum number of points required to
form a cluster and a distance threshold to distinguish noise.

Evaluation metrics:

• Accuracy:
Formula: (Number of Correct Predictions) / (Total Number of Predictions)
Accuracy measures the proportion of correctly predicted instances out of the total
predictions. It provides a general view of the model's overall performance but may not
be suitable for imbalanced datasets.
• Precision:
Formula: (True Positives) / (True Positives + False Positives)
Precision quantifies the proportion of true positive predictions among all positive
predictions. It measures the model's ability to avoid false positives.
• Recall (Sensitivity or True Positive Rate):
Formula: (True Positives) / (True Positives + False Negatives)
Recall calculates the proportion of true positive predictions among all actual positives. It
assesses the model's ability to identify all positive instances, i.e., to avoid false negatives.
• F1-Score:
Formula: 2 * (Precision * Recall) / (Precision + Recall)
The F1-Score is the harmonic mean of precision and recall. It balances the trade-off
between precision and recall, providing a single metric that considers both false positives
and false negatives.
• Specificity (True Negative Rate):
Formula: (True Negatives) / (True Negatives + False Positives)
Specificity measures the proportion of true negative predictions among all actual
negatives. It is useful when the cost of false positives is high.
• False Positive Rate (FPR):
Formula: (False Positives) / (False Positives + True Negatives)
FPR calculates the proportion of false positive predictions among all actual negatives. It's
the complement of specificity and is useful for imbalanced datasets.
• Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at
various threshold settings. AUC-ROC measures the area under this curve, indicating the
model's ability to distinguish between classes. A value of 0.5 indicates random guessing,
while higher values indicate better performance.

Threshold Settings:

• A threshold is a decision boundary that determines how the model classifies instances
into positive or negative categories.
• By adjusting the threshold, you can control the trade-off between true positives and
false positives.

ROC Curve:

• The ROC curve is created by plotting TPR (sensitivity) on the y-axis and FPR (1-specificity)
on the x-axis for different threshold settings.
• Each point on the ROC curve corresponds to a specific threshold setting.
• The curve starts at the point (0,0) where both TPR and FPR are zero, indicating that all
predictions are negative.
• As the threshold changes, the model's TPR and FPR values vary, resulting in a curve that
typically rises from (0,0) to (1,1).
• A curve that bows toward the upper-left corner indicates better model performance,
with higher TPR for a given FPR.

Area Under the ROC Curve (AUC-ROC):

• The AUC-ROC measures the area under the ROC curve. A higher AUC-ROC value (closer
to 1) indicates better discrimination between positive and negative classes.
• An AUC-ROC of 0.5 suggests random guessing, while an AUC-ROC of 1 represents perfect
classification.
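A minimal sketch of computing these metrics with scikit-learn (the labels and scores below are hypothetical):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                 # hypothetical actual labels
y_pred  = [0, 1, 1, 1, 0, 0]                 # hypothetical hard predictions
y_score = [0.2, 0.6, 0.7, 0.9, 0.1, 0.4]     # hypothetical predicted probabilities

print(precision_score(y_true, y_pred))       # TP / (TP + FP)
print(recall_score(y_true, y_pred))          # TP / (TP + FN)
print(f1_score(y_true, y_pred))              # harmonic mean of precision and recall
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, y_score))        # area under the ROC curve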
