ML-classification models
1) Logistic Regression:
• Logistic Regression is a classification algorithm used to predict the probability of
a specific outcome based on input features. For instance, in the context of
determining whether a student passes or fails an exam given the number of hours
spent studying, logistic regression analyses the relationship between the features
and the probability of passing.
• There are two common types of logistic regression: binary and multinomial.
Binary logistic regression classifies data into two categories, such as a tumour being
malignant or benign, while multinomial logistic regression handles multiple
categories, such as distinguishing between cats, dogs, and sheep.
a) Logit function:
The logistic regression model can be represented using the logit or log-odds function. The
odds signify the ratio of the probability of success to the probability of failure.
b) Sigmoid:
By taking the inverse of the logit function, we get the sigmoid function, which produces an
S-shaped curve and maps any real value to a probability between 0 and 1. Logistic regression
uses this sigmoid function to transform its raw output into a predicted probability.
c) Decision Boundary:
The decision boundary differentiates between the positive and negative classes using a
threshold value: if the predicted probability is above the threshold, the instance is classified
as the positive class; otherwise, it is classified as the negative class.
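A minimal NumPy sketch of the sigmoid and a threshold-based decision boundary; the intercept, coefficient, and five study hours below are made-up numbers used only for illustration.

```python
import numpy as np

def sigmoid(z):
    # Inverse of the logit: maps any real value z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear score for one student: intercept + coefficient * hours_studied
z = -4.0 + 1.2 * 5           # assumed coefficients, for illustration only
p_pass = sigmoid(z)          # predicted probability of passing

# Decision boundary: classify as "pass" if the probability exceeds the threshold
threshold = 0.5
prediction = "pass" if p_pass >= threshold else "fail"
print(p_pass, prediction)
```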
Logistic regression is preferred over linear regression for classification tasks, especially when
dealing with outliers in the data. It ensures that the predicted probabilities lie between 0 and
1, unlike linear regression, which may produce values outside this range.
d) Parameter estimation:
Unlike linear regression, which uses Ordinary Least Squares for parameter estimation, logistic
regression employs Maximum Likelihood Estimation, finding the regression coefficients that
maximize the probability of the observed data.
e) Cost function:
Logistic regression is trained by minimizing a cost function J(θ), typically the log-loss
(binary cross-entropy). The main goal of gradient descent is to minimize this cost value, i.e.,
min J(θ). A helpful analogy: imagine being stranded and blindfolded at the top of a mountain
valley, with the objective of reaching the bottom. Feeling the slope of the terrain around you
is analogous to calculating the gradient, and taking a step downhill is analogous to one
iteration of the update to the parameters.
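A short sketch of one way to minimize the log-loss cost with batch gradient descent; the toy pass/fail data, learning rate, and iteration count are assumptions, not values from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: hours studied (with a bias column) and pass/fail labels
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1], dtype=float)

theta = np.zeros(X.shape[1])   # start "at the top of the hill"
lr, n_iters = 0.1, 5000        # step size and number of steps (assumed)

for _ in range(n_iters):
    p = sigmoid(X @ theta)                  # current predicted probabilities
    grad = X.T @ (p - y) / len(y)           # gradient of the log-loss cost J(theta)
    theta -= lr * grad                      # one step downhill

p = sigmoid(X @ theta)
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(theta, cost)
```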
f) Model evaluation
In model evaluation, null deviance represents the response predicted by a model with just an
intercept, while model deviance indicates the response predicted by a model with
independent variables. The accuracy of the model can also be assessed using a Confusion
Matrix.
g) Multi-classification
For multi-class problems, logistic regression adopts a one-vs-all (one-vs-rest) approach. It
turns the problem into multiple binary classification problems, where each class is compared
against all the others. The class with the highest predicted probability relative to the rest is
chosen as the final prediction.
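A brief scikit-learn sketch of the one-vs-rest idea; the built-in iris dataset stands in for the cats/dogs/sheep example, and max_iter is an assumed setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # three classes stand in for cats/dogs/sheep
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Each column holds one binary classifier's probability of "this class vs the rest";
# the class with the highest probability is the final prediction.
print(ovr.predict_proba(X[:3]))
print(ovr.predict(X[:3]))
```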
h) Advantages of Logistic Regression:
• Simplicity and Interpretability: Logistic regression is a straightforward algorithm
with easily interpretable results. The model's coefficients provide insight into the
importance and direction of each feature's impact on the prediction.
• Efficient and Fast: Logistic regression is computationally efficient and can handle
large datasets with relative ease. Training and prediction times are typically fast,
making it suitable for real-time and online applications.
• Probabilistic Output: Logistic regression predicts the probability of an instance
belonging to a particular class, providing a clear indication of confidence in the
predictions.
• Low Risk of Overfitting: It is less prone to overfitting, especially when the
number of features is relatively small. Regularization techniques, such as L1 or L2
regularization, can be employed to further reduce overfitting.
• Handles Non-Linearities: Although logistic regression is a linear algorithm, it can
handle certain non-linear relationships through feature engineering and
polynomial transformations.
i) Disadvantages of Logistic Regression:
• Limited to Linear Boundaries: Logistic regression is a linear classifier, which
means it can only create linear decision boundaries. It may struggle with datasets
that require complex and non-linear decision boundaries for accurate
classification.
• Sensitivity to Outliers: Outliers in the data can significantly influence the model's
coefficients and predictions, leading to potential performance issues.
• Not Suitable for Multiclass Problems: While it can be extended to handle
multiclass problems using techniques like one-vs-rest, it may not be as effective as
other algorithms specifically designed for multiclass classification tasks.
• Assumption of Independence: Logistic regression assumes that the input
features are independent of each other. Violation of this assumption can lead to
biased and unreliable results.
• Feature Engineering Dependency: The performance of logistic regression depends
heavily on the quality of feature engineering and feature selection. Choosing
irrelevant or redundant features can negatively impact the model's performance.
• Imbalanced Data Handling: Logistic regression can struggle with imbalanced
datasets, where one class is significantly more prevalent than the others. It may
produce biased predictions towards the majority class.
2) DECISION TREES
• A decision tree is a hierarchical structure where each node symbolizes a feature (attribute),
each branch represents a decision (rule), and each leaf denotes an outcome (categorical or
continuous value).
• This tree is typically visualized upside down, with the root at the top. In a typical
illustration, the bold black text signifies a condition (internal node) that determines how the
tree branches into edges (branches). At the end of a branch, where no further splitting
occurs, lies the decision (leaf). In this example, the decision tree classifies passengers as
either survivors (green text) or non-survivors (red text).
• This methodology, known as learning decision trees from data, facilitates clear
understanding of feature importance and relationships. The tree in this example is a
Classification tree, where the objective is to classify passengers as survivors or non-
survivors. For Regression trees, the representation is the same, but they predict continuous
values, such as the price of a house.
Black-box algorithms:
• "Black box algorithms" refer to machine learning models or techniques that are complex
and opaque in terms of their internal workings and decision-making processes.
• When you use a black box algorithm to make predictions, you may not easily understand
how the model arrives at its conclusions or how it relates input features to output
predictions.
Cost of a split
a) For regression tasks,
• the cost of a split is calculated using the sum of squared differences between
the actual target values (y) and the predictions made for each group.
• The decision tree initiates the splitting process by considering each feature in
the training data.
• It calculates the mean of the responses for the inputs in a particular group
and treats it as the prediction for that group.
• The given function is then applied to all data points, and the cost is
computed for each potential split. Ultimately, the split with the lowest cost is
chosen as the optimal one.
• Alternatively, another cost function uses standard deviation reduction: the split
that reduces the standard deviation of the target within the resulting groups the
most is preferred, which can provide more insight into how homogeneous each group is.
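A small sketch of how the sum-of-squared-errors cost of a candidate split described above could be computed for a regression tree; the toy feature values and targets are assumed.

```python
import numpy as np

def split_cost(x, y, threshold):
    """Sum of squared differences from each group's mean prediction."""
    left, right = y[x <= threshold], y[x > threshold]
    cost = 0.0
    for group in (left, right):
        if len(group) > 0:
            cost += np.sum((group - group.mean()) ** 2)   # the group mean is the prediction
    return cost

# Toy data: one feature and a continuous target (e.g. house size vs. price)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0])

# Evaluate every candidate threshold and keep the cheapest split
candidates = (x[:-1] + x[1:]) / 2
best = min(candidates, key=lambda t: split_cost(x, y, t))
print(best, split_cost(x, y, best))
```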
Entropy measures the uncertainty regarding the target variable or class; Information Gain
measures the reduction in this uncertainty when additional information (features or
independent variables) is considered.
To calculate Information Gain, we subtract the entropy of the target variable (Y) given a
particular feature (X) from the entropy of the target variable (Y) on its own. This quantifies
the reduction in uncertainty about Y when we have extra information (X) about it.
In real-world scenarios with more than two features, the initial split is made based on the
most informative feature. Then, at each subsequent split, the information gain for each
additional feature needs to be recomputed, as the information gain from each feature in
isolation may differ. The entropy and information gain must be recalculated after one or
more splits have already been performed, leading to changing results. A decision tree
repeats this process, growing deeper until reaching a pre-defined depth or when no further
split can result in a higher information gain beyond a specified threshold, which can often be
set as a hyper-parameter.
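A sketch of computing entropy and the information gain IG(Y, X) = H(Y) - H(Y|X) for one categorical feature; the tiny outlook/play table is made up for illustration.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    # H(Y) minus the weighted entropy of Y within each value of X
    h_y = entropy(target)
    h_y_given_x = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        h_y_given_x += len(subset) / len(target) * entropy(subset)
    return h_y - h_y_given_x

# Toy example: does "outlook" reduce uncertainty about "play"?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))
```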
3) RANDOM FOREST CLASSIFIER
Decision Tree: The Fundamental Concept
• A decision tree model takes input data and proceeds through a sequence of branching
steps, following a series of if-then rules until it arrives at one of the predefined output
values.
• It can be seen as an algorithm that aims to split the data into the most homogeneous
groups at each split, generating a tree-like set of rules for classification or regression
tasks.
• The recursive nature of decision trees involves splitting the dataset repeatedly until
reaching a predefined stopping condition, often based on a minimum count threshold
for entries in the dataset corresponding to a leaf node.
• When a decision tree is built using a dataset containing specific features, it generates
a set of rules that facilitate prediction. However, its performance is highly dependent
on the initial dataset, which can lead to limited accuracy when applied to real-world
scenarios.
• Overfitting is a potential concern, as the decision tree may excessively adapt to the
peculiarities of the initial dataset.
RANDOM FOREST
Advantages
• Cost-effective: Random Forest (RF) models are more affordable and quicker to train
compared to neural networks, while still maintaining a high level of accuracy. This
makes them suitable for applications in mobile devices, for instance.
• Robust against overfitting: Since RF consists of multiple uncorrelated trees, if one
tree makes an inaccurate prediction due to outliers, other trees can compensate for it.
This leads to better performance than individual trees taken separately.
• High coverage rates and low bias: The robustness of RF makes it well-suited for
scenarios with missing values in the dataset or when assessing the variability between
different data outputs, such as predicting whether college undergraduates will
complete their studies, proceed to a master's degree, or drop out.
• Applicable for classification and regression: RF demonstrates equally accurate
results for both classification and regression tasks.
• Handling missing values: RF can handle missing values in features without
introducing bias into its predictions.
• Easy interpretation: Each tree in the forest makes predictions independently, making
it straightforward to examine and understand the prediction process of any individual
tree.
Despite the benefits, random forest classifiers also come with some
challenges:
• Complexity: Random forests are more complex than decision trees, requiring
additional steps to combine and aggregate the results from multiple trees. It is not as
straightforward as following a single decision tree's path to decide.
• Slower Execution: The process of training and aggregating multiple decision trees can
make random forests slower compared to certain other types of machine learning
models. This aspect might limit their suitability for certain time-sensitive applications.
• Large Datasets and Adequate Training Data: Random forests perform optimally with
large datasets and when there is sufficient training data available. In scenarios with
limited data, the performance of the random forest model may not be as effective.
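A brief scikit-learn sketch of a random forest built from many decision trees; the built-in breast-cancer dataset and the hyperparameters are placeholders, not part of the notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 trees is trained on a bootstrap sample with a random subset of features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

print(rf.score(X_test, y_test))                 # aggregated (majority-vote) accuracy
print(rf.estimators_[0].predict(X_test[:5]))    # any individual tree can be inspected on its own
```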
4) K-NEAREST NEIGHBORS (KNN)
The choice of K has a drastic impact on the results we obtain from KNN. We can take the
test set and plot the accuracy rate or F1 score against different values of K. We see a
high error rate on the test set when K=1, so we can conclude that the model overfits
when K=1. For a high value of K, the F1 score starts to drop. In this example, the test set
reaches its minimum error rate at K=5.
d) How does KNN work?
Step 1: Choose a value for K. K should be an odd number.
Step 2: Find the distance of the new point to each of the training data.
Step 3: Find the K nearest neighbors to the new data point.
Step 4: For classification, count the number of data points in each category among the K
neighbors; the new data point is assigned to the class with the most neighbors. For
regression, the value for the new data point is the average of the K neighbors' values.
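A compact from-scratch sketch of these four steps, using Euclidean distance on made-up 2-D points (the training data and K=3 are assumptions).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5):
    # Step 2: distance from the new point to every training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Step 3: indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the K neighbors (classification)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy training data: two clusters labelled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))   # Step 1: K chosen as 3 (odd)
```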
e) How is the distance calculated?
Distance can be calculated using:
1) Euclidean distance
Euclidean distance is the square root of the sum of squared distance between two points. It
is also known as L2 norm.
2) Manhattan distance
Manhattan distance is the sum of the absolute values of the differences between two points.
3) Hamming Distance
Hamming distance is used for categorical variables. In simple terms, it tells us whether two
categorical values are the same or not, counting the positions at which they differ.
4) Minkowski Distance
Minkowski distance is a generalised distance measure between two points. When p=1, it
becomes Manhattan distance, and when p=2, it becomes Euclidean distance.
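A short sketch of the four distance measures; the two example points are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 norm
manhattan = np.sum(np.abs(a - b))           # L1 norm
hamming   = np.sum(a != b)                  # count of positions that differ (categorical use)

def minkowski(a, b, p):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print(euclidean, manhattan, hamming)
print(minkowski(a, b, 1), minkowski(a, b, 2))
```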
5) NAIVE BAYES
Bayes Theorem
Using Bayes theorem, we can find the probability of A happening given that B has occurred:
P(A|B) = P(B|A) * P(A) / P(B)
Here, B is the evidence and A is the hypothesis.
The variable y is the class variable (play golf), which represents whether it is suitable to play
golf or not given the conditions. The variable X represents the parameters/features and is
given as
X = (x_1, x_2, ..., x_n)
Here x_1, x_2, ..., x_n represent the features, i.e., they can be mapped to outlook, temperature,
humidity and windy. By substituting for X, expanding using the chain rule, and applying the
naive independence assumption, we get
P(y | x_1, ..., x_n) = P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y) * P(y) / (P(x_1) * P(x_2) * ... * P(x_n))
Now, you can obtain the values for each by looking at the dataset and substitute them into
the equation. For all entries in the dataset, the denominator does not change, it remains
static. Therefore, the denominator can be removed, and a proportionality can be introduced.
In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases
where the classification involves more than two classes. In either case, we need to find the
class y with the maximum probability:
y = argmax_y P(y) * P(x_1 | y) * ... * P(x_n | y)
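A minimal sketch of this proportional form computed from a tiny made-up play-golf table; the rows and feature values are assumptions, not a real dataset.

```python
# Tiny made-up "play golf" table: (outlook, windy, play)
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"), ("rain", "yes", "no"),
        ("overcast", "no", "yes"), ("rain", "no", "yes"), ("sunny", "no", "no")]

def naive_bayes_score(x, y_value):
    rows = [d for d in data if d[-1] == y_value]
    prior = len(rows) / len(data)                         # P(y)
    likelihood = 1.0
    for i, feature_value in enumerate(x):                 # product of P(x_i | y)
        likelihood *= sum(1 for r in rows if r[i] == feature_value) / len(rows)
    return prior * likelihood                             # proportional to P(y | X)

x_new = ("sunny", "no")
scores = {y: naive_bayes_score(x_new, y) for y in ("yes", "no")}
print(max(scores, key=scores.get), scores)                # class with the maximum probability
```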
Types of Naive Bayes Classifier
Multinomial Naive Bayes:
This is mostly used for document classification problems, i.e., whether a document belongs to
the category of sports, politics, technology, etc. The features/predictors used by the classifier
are the frequencies of the words present in the document.
Bernoulli Naive Bayes:
This is like the multinomial naive bayes, but the predictors are Boolean variables. The
parameters that we use to predict the class variable take up only values yes or no, for
example if a word occurs in the text or not.
Gaussian Naive Bayes:
When the predictors take continuous values and are not discrete, we assume that these
values are sampled from a Gaussian distribution. Since the way the values are present in the
dataset changes, the formula for the conditional probability changes to
P(x_i | y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp(-(x_i - mu_y)^2 / (2 * sigma_y^2))
where mu_y and sigma_y are the mean and standard deviation of feature x_i for class y.
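A brief scikit-learn sketch of the three variants on placeholder data; all feature values and labels below are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features (e.g. temperature, humidity) -> Gaussian Naive Bayes
X_cont = np.array([[25.0, 80.0], [30.0, 65.0], [18.0, 90.0], [22.0, 70.0]])
y = np.array([0, 1, 0, 1])
print(GaussianNB().fit(X_cont, y).predict([[24.0, 75.0]]))

# Word counts -> Multinomial; word presence/absence -> Bernoulli
X_counts = np.array([[3, 0, 1], [0, 2, 2], [4, 1, 0], [0, 3, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))
print(BernoulliNB().fit((X_counts > 0).astype(int), y).predict([[1, 0, 1]]))
```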
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering,
recommendation systems, etc. They are fast and easy to implement, but their biggest
disadvantage is the requirement that the predictors be independent. In most real-life
cases, the predictors are dependent, which hinders the performance of the classifier.
Advantages
• It is easy and fast to predict the class of the test data set. It also performs well
in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
• It performs well with categorical input variables compared to numerical variables.
For numerical variables, a normal distribution is assumed (a bell curve, which is
a strong assumption).
Disadvantages
• If a categorical variable has a category in the test data set that was not observed in
the training data set, then the model will assign it a zero probability and will be unable to
make a prediction. This is often known as the zero-frequency problem. To solve it, we can
use a smoothing technique; one of the simplest smoothing techniques is Laplace
estimation (Laplace smoothing).
• On the other hand, Naive Bayes is also known to be a poor estimator, so its probability
outputs are not to be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In
real life, it is almost impossible that we get a set of predictors which are completely
independent.
When to use
• Text classification
• When the dataset is huge
• When you have a small training set
Applications
• Real-time Prediction: Naive Bayes is an eager learning classifier, and it is very fast.
Thus, it can be used for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here we can predict the probabilities of multiple classes of the target variable.
• Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are
mostly used in text classification (due to good results on multi-class problems and the
independence assumption) and achieve a high success rate compared to other algorithms.
As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment
analysis (in social media analysis, to identify positive and negative customer
sentiment).
• Recommendation System: A Naive Bayes classifier and Collaborative
Filtering together build a recommendation system that uses machine learning and
data mining techniques to filter unseen information and predict whether a user
would like a given resource or not.
6) Gradient boosting
Ensemble:
When we try to predict the target variable using any machine learning technique, the main
causes of the difference in actual and predicted values are noise, variance, and bias.
Ensemble helps to reduce these factors (except noise, which is an irreducible error).
An ensemble is just a collection of predictors which come together (e.g. mean of all
predictions) to give a final prediction. The reason we use ensembles is that many different
predictors trying to predict the same target variable will perform a better job than any single
predictor alone. Ensembling techniques are further classified into Bagging and Boosting.
Bagging:
Bagging (bootstrap aggregating) builds many predictors independently, each trained on a
random bootstrap sample of the data, and combines them (e.g., by averaging or majority
vote) into a final prediction. Random Forest is an example of a bagging algorithm.
Boosting:
Boosting is an ensemble technique in which the predictors are not made independently, but
sequentially.
This technique employs the logic in which the subsequent predictors learn from the mistakes
of the previous predictors. Therefore, the observations have an unequal probability of
appearing in subsequent models and the ones with the highest error appear most. (So, the
observations are not chosen based on the bootstrap process but based on the error). The
predictors can be chosen from a range of models like decision trees, regressors, classifiers,
etc. Because new predictors learn from the mistakes committed by previous predictors, it
takes fewer iterations to get close to the actual values. But we must choose the
stopping criteria carefully or it could lead to overfitting on the training data. Gradient
Boosting is an example of a boosting algorithm.
Bagging has many uncorrelated trees in the final model which helps in reducing variance.
Boosting will reduce variance in the process of building sequential trees. At the same time,
its focus remains on bridging the gap between the actual and predicted values by reducing
residuals, hence it also reduces bias.
Step 1: Make an initial prediction. For classification, the initial prediction is the log(odds) of
the positive class over the whole training set, converted into a probability with the logistic
function; this probability is used as the starting prediction for every observation.
Step 2: Calculate Residuals — We will now calculate the residuals for each observation by
using the following formula,
Residual = Actual value - Predicted value
Step 3: Predict residuals — Our next step involves building a Decision Tree to predict the
residuals.
Step 4: Obtain new probability of having a heart disease — Now, let us pass each sample
in our dataset through the nodes of the newly formed decision tree. The predicted residuals
obtained for each observation will be added to the previous prediction.
Step 5: Obtain new residuals — After obtaining the predicted probabilities for all the
observations, we will calculate the new residuals by subtracting these new predicted values
from the actual values.
Step 6: Repeat steps 3 to 5 until the residuals converge to a value close to 0 or the number
of iterations matches the value given as hyperparameter while running the algorithm.
Step 7: Final Computation — After we have calculated the output values for all the trees,
the final log(odds) prediction for each observation is obtained by adding the initial log(odds)
to the sum of the (learning-rate scaled) outputs of all the trees.
Next, we need to convert this log(odds) prediction into a probability by plugging it into the
logistic function.
Using the common probability threshold of 0.5 for making classification decisions, an
observation is classified as positive if this probability is above 0.5 and negative otherwise.
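This walkthrough maps onto scikit-learn's GradientBoostingClassifier; below is a hedged sketch where the built-in breast-cancer dataset stands in for the heart-disease example and the hyperparameters are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators caps the number of residual-fitting trees (the stopping condition in Step 6);
# learning_rate scales how much each tree's output shifts the running log(odds) prediction
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbc.fit(X_train, y_train)

proba = gbc.predict_proba(X_test)[:, 1]     # final log(odds) passed through the logistic function
print(gbc.score(X_test, y_test))
print((proba[:5] > 0.5).astype(int))        # 0.5 probability threshold (Step 7)
```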
Advantages
• It is a generalised algorithm which works for any differentiable loss function.
• It often provides predictive scores that are far better than other algorithms.
• It can handle missing data — imputation not required.
Disadvantages
• This method is sensitive to outliers. Outliers will have much larger residuals than
non-outliers, so gradient boosting will focus a disproportionate amount of its
attention on those points. Using Mean Absolute Error (MAE) to calculate the error
instead of Mean Square Error (MSE) can help reduce the effect of these outliers since
the latter gives more weights to larger differences. The parameter ‘criterion’ helps you
choose this function.
• It is prone to overfitting if the number of trees is too large. The
parameter ‘n_estimators’ can help in determining a good point to stop before our
model starts overfitting.
• Computation can take a long time. Hence, if you are working with a large dataset,
always keep in mind to take a sample of the dataset (keeping odds ratio for target
variable same) while training the model.
Gradient Boosting algorithm
Gradient boosting is a machine learning technique for regression and classification problems,
which produces a prediction model in the form of an ensemble of weak prediction models,
typically decision trees.
The objective of any supervised learning algorithm is to define a loss function and minimize
it.
By using gradient descent and updating our predictions based on a learning rate, we can
find the values where MSE is minimum.
Intuition behind Gradient Boosting
The intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in
the residuals and strengthen a model with weak predictions to make it better. Once we reach a
stage where the residuals no longer have any pattern that could be modelled, we can stop
modelling the residuals.
How does it work?
We first model the data with simple models and analyse the errors. These errors signify data
points that are difficult to fit with a simple model. Later models then focus particularly
on those hard-to-fit data points to get them right. In the end, we combine all the predictors by
giving some weight to each predictor.
Steps to fit a Gradient Boosting model.
Step 1: Fit a simple linear regressor or decision tree on data (I have chosen decision tree in
my code) [call x as input and y as output].
Step 2: Calculate error residuals. Actual target value, minus predicted target value
[e1= y - y_predicted1].
Step 3: Fit a new model on the error residuals as the target variable, with the same input
variables [call its predictions e1_predicted].
Step 4: Add the predicted residuals to the previous predictions
[y_predicted2 = y_predicted1 + e1_predicted].
Step 5: Fit another model on the residuals that are still left [e2 = y - y_predicted2] and repeat
steps 2 to 5 until the residuals stop decreasing or the model begins to overfit.
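A minimal sketch of these steps on synthetic regression data, fitting successive decision trees to the residuals; the sine-wave data, tree depth, learning rate, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.1, 100)     # toy target with noise

learning_rate = 0.3
prediction = np.full_like(y, y.mean())              # Step 1: start from a simple model
models = []

for _ in range(20):
    residuals = y - prediction                      # Step 2: error residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(x, residuals)   # Step 3: fit the residuals
    models.append(tree)
    prediction += learning_rate * tree.predict(x)   # Step 4: update the running prediction

print(np.mean((y - prediction) ** 2))               # MSE shrinks as rounds are added
```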
AdaBoost
AdaBoost is a boosting ensemble model and works especially well with decision trees.
The key to boosting models is learning from previous mistakes, e.g., misclassified data
points. AdaBoost learns from the mistakes by increasing the weights of misclassified data
points.
Step 0: Initialize the weights of the data points. If the training set has 100 data points, then each
point’s initial weight should be 1/100 = 0.01.
Step 1: Train a decision tree.
Step 2: Calculate the weighted error rate (e) of the decision tree. The weighted error rate
(e) is the proportion of wrong predictions out of the total, where each wrong prediction is
weighted by its data point’s weight. The higher the weight, the more the
corresponding error counts in the calculation of (e).
Step 3: Calculate this decision tree’s weight in the ensemble.
the weight of this tree = learning rate * log((1 - e) / e)
• The higher the weighted error rate of a tree, the less decision power the tree will be
given during the later voting.
• The lower the weighted error rate of a tree, the more decision power the tree will be
given during the later voting.
Step 4: Update the weight of each data point.
• If the model got this data point correct, its weight stays the same.
• If the model got this data point wrong, the new weight of this point = old weight
* np.exp(weight of this tree).
Step 5: Repeat Steps 1 to 4 (until the number of trees we set to train is reached).
Step 6: Make the final prediction.
AdaBoost makes a new prediction by adding up each tree’s weight multiplied by that tree’s
prediction. Obviously, a tree with a higher weight will have more influence on the final
decision.
Note: The higher the weight of the tree (the more accurately it performs), the more boost
(importance) the data points misclassified by this tree will get. The weights of the data points
are normalized after all the misclassified points are updated.
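A minimal scikit-learn sketch of AdaBoost; it uses the library's AdaBoostClassifier (whose default base learner is a depth-1 decision tree) on the built-in breast-cancer dataset, with assumed hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round trains a shallow tree, reweighting the data toward previously misclassified points
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)

print(ada.score(X_test, y_test))
print(ada.estimator_weights_[:5])   # per-tree voting weights derived from the weighted error rate
```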
K-Means Clustering
K-Means first chooses K initial centroids; step 1 assigns each data point to its nearest centroid,
and step 2 recomputes each centroid as the mean of the points assigned to it. The algorithm then
iterates between step 1 and step 2 until a stopping criterion is met. Stopping criteria can be that
no data points change clusters, that the sum of the distances is minimized, or that some maximum
number of iterations is reached. This algorithm is guaranteed to converge to a result, but the
result may be a local optimum, meaning that assessing more than one run of the algorithm with
randomized starting centroids may give a better outcome.
In this example, the elbow point occurs at K = 3. The elbow method is used to determine the
optimal number of clusters in K-means clustering and relies on the Sum of Squared Errors (SSE).
To compute the SSE for each cluster, we measure the Euclidean distance of each point from its
assigned cluster centre. We perform this operation for every point in each of the K clusters and
add up the squares of these distances to get the Sum of Squared Errors for the corresponding K.
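A brief sketch of the elbow method using KMeans and its inertia_ attribute (the SSE described above); the synthetic blob data and the range of K are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic data with 3 clusters

sse = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = km.inertia_       # sum of squared distances of points to their assigned centroid

for k, v in sse.items():
    print(k, round(v, 1))      # the drop in SSE flattens out (the "elbow") around K = 3
```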
Applications:
• Market Segmentation: K-means clustering can be used to segment customers based on their
purchasing behaviour, demographics, or preferences. This information helps businesses tailor
their marketing strategies and target specific customer groups.
• Document Clustering: K-means clustering can group similar documents together, enabling
efficient document organization, information retrieval, and topic modelling. It has applications in
text mining, content recommendation, and document classification.
• Image Segmentation: K-means clustering can partition an image into distinct regions based on
colour similarity. This technique is used in image processing, computer vision, and object
recognition.
• Anomaly Detection: K-means clustering can identify outliers or anomalies in a dataset by
assigning them to separate clusters. This is useful in fraud detection, network intrusion
detection, and outlier analysis.
Drawbacks:
• Sensitive to Initial Centroids: K-means clustering starts by placing centroids randomly. The final
clusters and results can vary based on these initial positions. In some cases, poor initial
placement can lead to suboptimal or incorrect cluster assignments.
• Assumes Spherical Clusters with Equal Size: K-means assumes that clusters are spherical and
have roughly equal sizes. However, real-world data can have clusters of various shapes and sizes.
This can cause K-means to struggle with clusters that are elongated or irregularly shaped.
• Requires Predefined Number of Clusters (K): K-means requires you to specify the number of
clusters (K) beforehand. But in many cases, it’s not clear what the optimal number of clusters
should be. Selecting an incorrect value for K can lead to inaccurate results, and finding the
optimal K using techniques like the elbow method is not always straightforward.
c) How would you determine the optimal number of clusters for this task?
A common approach is the ‘elbow method.’ We run K-Means with different numbers of clusters and
calculate the sum of squared distances from each point to its assigned centroid. As the number of
clusters increases, this value tends to decrease. The ‘elbow point’ is where the rate of decrease slows
down, indicating a good trade-off between minimizing intra-cluster distance and avoiding over-
segmentation.
d) Could you walk us through the K-Means algorithm steps in this context?
• Choose the number of clusters (K) based on the elbow method.
• Initialize K cluster centroids randomly.
• Assign each customer to the nearest centroid based on their purchasing behavior.
• Recalculate the centroids as the mean of all data points assigned to each centroid.
• Repeat steps 3 and 4 until convergence or a maximum number of iterations.
g) Can you mention any alternatives to K-Means that could be used for this task?
Certainly. Hierarchical clustering is an alternative that doesn’t require specifying the number of
clusters beforehand. It creates a tree-like structure of clusters, which can be useful for exploring
different levels of segmentation. Another option is Gaussian Mixture Models (GMM), which can
capture more complex cluster shapes and provide probabilistic cluster assignments.
h) How would you handle the scenario where some customers’ purchasing behaviour doesn’t fit
well into any cluster?
If we encounter such outliers or noise points, we might consider using techniques like DBSCAN
(Density-Based Spatial Clustering of Applications with Noise). DBSCAN is robust to noise and can
identify points that don’t belong to any cluster. We can set a minimum number of points required to
form a cluster and a distance threshold to distinguish noise.
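A tiny scikit-learn sketch of DBSCAN flagging a customer that fits no cluster; the 2-D points, eps, and min_samples are placeholder values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of customers plus one point that fits neither
X = np.array([[1, 1], [1.2, 0.9], [0.9, 1.1],
              [8, 8], [8.1, 7.9], [7.9, 8.2],
              [4, 15]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)    # points labelled -1 do not belong to any cluster (noise/outliers)
```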
Evaluation metrics:
• Accuracy:
Formula: (Number of Correct Predictions) / (Total Number of Predictions)
Accuracy measures the proportion of correctly predicted instances out of the total
predictions. It provides a general view of the model's overall performance but may not
be suitable for imbalanced datasets.
• Precision:
Formula: (True Positives) / (True Positives + False Positives)
Precision quantifies the proportion of true positive predictions among all positive
predictions. It measures the model's ability to avoid false positives.
• Recall (Sensitivity or True Positive Rate):
Formula: (True Positives) / (True Positives + False Negatives)
Recall calculates the proportion of true positive predictions among all actual positives. It
assesses the model's ability to identify all positive instances, i.e., to avoid false negatives.
• F1-Score:
Formula: 2 * (Precision * Recall) / (Precision + Recall)
The F1-Score is the harmonic mean of precision and recall. It balances the trade-off
between precision and recall, providing a single metric that considers both false positives
and false negatives.
• Specificity (True Negative Rate):
Formula: (True Negatives) / (True Negatives + False Positives)
Specificity measures the proportion of true negative predictions among all actual
negatives. It is useful when the cost of false positives is high.
• False Positive Rate (FPR):
Formula: (False Positives) / (False Positives + True Negatives)
FPR calculates the proportion of false positive predictions among all actual negatives. It's
the complement of specificity and is useful for imbalanced datasets.
• Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at
various threshold settings. AUC-ROC measures the area under this curve, indicating the
model's ability to distinguish between classes. A value of 0.5 indicates random guessing,
while higher values indicate better performance.
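A small sketch of the formulas above next to their scikit-learn equivalents; the label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# sklearn's confusion matrix unpacks as TN, FP, FN, TP for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", (tp + tn) / len(y_true),        accuracy_score(y_true, y_pred))
print("precision  ", tp / (tp + fp),                 precision_score(y_true, y_pred))
print("recall     ", tp / (tp + fn),                 recall_score(y_true, y_pred))
print("f1         ", 2 * tp / (2 * tp + fp + fn),    f1_score(y_true, y_pred))
print("specificity", tn / (tn + fp))
```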
Threshold Settings:
• A threshold is a decision boundary that determines how the model classifies instances
into positive or negative categories.
• By adjusting the threshold, you can control the trade-off between true positives and
false positives.
ROC Curve:
• The ROC curve is created by plotting TPR (sensitivity) on the y-axis and FPR (1-specificity)
on the x-axis for different threshold settings.
• Each point on the ROC curve corresponds to a specific threshold setting.
• The curve starts at the point (0,0) where both TPR and FPR are zero, indicating that all
predictions are negative.
• As the threshold changes, the model's TPR and FPR values vary, resulting in a curve that
typically rises from (0,0) to (1,1).
• A curve that bows toward the upper-left corner indicates better model performance,
with higher TPR for a given FPR.
• The AUC-ROC measures the area under the ROC curve. A higher AUC-ROC value (closer
to 1) indicates better discrimination between positive and negative classes.
• An AUC-ROC of 0.5 suggests random guessing, while an AUC-ROC of 1 represents perfect
classification.
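A sketch of sweeping thresholds to obtain the (FPR, TPR) points of the ROC curve and the AUC; the labels and scores are made-up probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print("AUC-ROC:", roc_auc_score(y_true, y_score))   # 0.5 = random guessing, 1.0 = perfect
```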