ML final
2. Explain Regression Line, Scatter Plot, Error in Prediction and Best
fitting line.
Here’s an explanation of Margin and Support Vector in the context of
Support Vector Machines (SVM):
2. Manhattan Distance: Measures the sum of the absolute differences
between coordinates:
d(A, B) = |x₂ − x₁| + |y₂ − y₁|
It is suitable for high-dimensional data or grid-like structures.
3. Cosine Similarity: Measures the cosine of the angle between two
vectors, focusing on their orientation. This is useful for text data
and when magnitude is less important than direction.
4. Jaccard Similarity: A measure used for categorical or binary
data, calculating the ratio of the intersection to the union of two
sets.
5. Mahalanobis Distance: Accounts for the correlation between
variables, providing a more accurate measure in datasets with
varying scales or feature dependencies.
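A minimal Python sketch of a few of these measures, using hypothetical example vectors and sets (the values are illustrative only):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # hypothetical vectors
b = np.array([2.0, 0.0, 3.0])

manhattan = np.sum(np.abs(a - b))                                # sum of |differences|
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))     # orientation only

set_a, set_b = {"red", "blue"}, {"blue", "green"}                # hypothetical sets
jaccard = len(set_a & set_b) / len(set_a | set_b)                # intersection / union

print(manhattan, cosine_sim, jaccard)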
Here are some common terms involved in logistic regression:
6. Explain the steps of developing Machine Learning applications.
1. Problem Definition
● Identify the Problem: The first step is to define the problem that
you want to solve using machine learning. This includes
understanding the objective, such as predicting an outcome
(regression) or classifying data into categories (classification).
● Business or Research Objective: Align the ML problem with
business goals or research objectives to ensure the results are
practical and useful.
2. Data Collection
3. Data Preprocessing
● Feature Engineering: Create new features that may enhance
model performance, such as extracting relevant information or
creating composite features.
● Data Splitting: Divide the dataset into training, validation, and
test sets to evaluate model performance without overfitting.
5. Model Training
● Train the Model: Use the training dataset to teach the model to
identify patterns in the data. The model adjusts its parameters
(e.g., weights in neural networks) to minimize the error using
techniques like gradient descent.
● Hyperparameter Tuning: Adjust the hyperparameters (e.g.,
learning rate, number of trees in a forest) to optimize model
performance. This can be done using techniques like grid
search or random search.
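As a rough illustration, a grid search over a small hyperparameter grid might look like the following scikit-learn sketch (the model, grid values, and synthetic data are assumptions, not prescribed by the text):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try every combination in the grid with 3-fold CV and keep the best one.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))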
6. Model Evaluation
● Evaluate the Model: Assess the trained model on the test set. For
classification, common metrics include accuracy, precision, recall,
and F1 score. For regression, metrics like Mean Squared Error (MSE) or
R-squared are used.
● Cross-Validation: Implement cross-validation techniques (e.g.,
k-fold cross-validation) to ensure the model is not overfitting to
the training data and generalizes well across different subsets of
data.
7. Model Optimization
8. Model Deployment
9. Model Maintenance
10. Feedback and Iteration
The Equation: The equation for a simple linear regression model is:
y = mx + b
where m is the slope, b is the y-intercept, x is the independent
variable, and y is the predicted value.
Steps in Linear Regression
Size (sq ft)    Price ($)
1000            200,000
1500            300,000
2000            400,000
2500            500,000
The goal is to predict the house price (Y) based on the size of the
house (X). By fitting a linear regression model, we find the equation:
Y = 100,000 + 200X
where 100,000 is the intercept and 200 is the slope (the increase in
price per additional square foot).
Prediction: For a house of 1,800 sq ft,
Y = 100,000 + 200(1800) = 100,000 + 360,000 = 460,000
Thus, the model predicts that the house will be worth $460,000.
● Multiple Classes: The target variable has more than two classes.
For example, classifying images of animals into categories like
"dog," "cat," and "bird" is a multiclass problem.
● Mutually Exclusive Classes: The classes are mutually exclusive,
meaning each data point belongs to exactly one class at a time.
A sample cannot belong to more than one class simultaneously.
● Apple
● Banana
● Orange
● Mango
3. Softmax Function (Used in Neural Networks):
○ For deep learning models, particularly neural networks, the
softmax function is used at the output layer to calculate the
probabilities of each class.
○ The softmax function converts the raw output values
(logits) into probabilities, ensuring that the sum of all class
probabilities is equal to 1. The class with the highest
probability is then chosen as the predicted label.
4. Decision Trees and Random Forests:
○ Decision trees can naturally handle multiclass
classification, as they can split the data based on feature
values to create distinct classes.
○ Random forests (an ensemble method) can also handle
multiclass problems by building multiple decision trees
and aggregating their predictions.
● Class Imbalance: Some classes may have far fewer examples than
others, which can bias the model toward the majority classes.
Techniques like class weighting or resampling (e.g., oversampling
underrepresented classes) can help mitigate this issue.
● Complexity in Decision Boundaries: As the number of classes
increases, the complexity of decision boundaries also increases.
This can make the learning task more difficult.
● Model Interpretability: Multiclass classification models,
particularly ensemble methods or deep learning models, may be
more complex to interpret compared to binary classifiers.
Each decision tree in the forest is trained on a
different subset of the data, and the model's final prediction is
made by aggregating the predictions from all the trees.
4. Random Feature Selection: In addition to random sampling of
data points, Random Forest also introduces randomness at the
feature level. When building each decision tree, it selects a
random subset of features (instead of using all the features) to
split the data at each node. This helps in creating diverse trees
and reduces correlation between them.
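A minimal scikit-learn sketch of these ideas (the synthetic dataset and settings are assumptions; bootstrap sampling and random feature selection correspond to the bootstrap and max_features arguments):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 100 trees, each trained on a bootstrap sample and a random subset of features.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=42)
forest.fit(X, y)
print(forest.predict(X[:5]))   # aggregated (majority-vote) predictions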
Disadvantages of Random Forest
2. Boosting
● Concept: Boosting involves training multiple classifiers
sequentially, where each classifier tries to correct the mistakes
made by the previous ones. Boosting algorithms focus more on
the examples that were misclassified by previous models, giving
them higher weights in subsequent rounds.
● How it works:
○ Models are trained one after another, and each new model
pays more attention to the errors made by the previous
models.
○ For classification, the final prediction is made by weighted
voting, where each classifier’s prediction is weighted by its
accuracy. More accurate classifiers have more influence.
○ In regression, the predictions of all models are combined
using a weighted average.
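A small sketch of this sequential idea using AdaBoost from scikit-learn (the dataset is synthetic and the settings are illustrative; by default the base learner is a shallow decision tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Weak learners are trained one after another; misclassified samples get
# higher weights, and each learner's vote is weighted by its accuracy.
boost = AdaBoostClassifier(n_estimators=50, random_state=1)
boost.fit(X, y)
print(boost.score(X, y))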
4. Voting
● Concept: Voting is a simple technique where the predictions
from multiple classifiers are combined through a majority vote
(for classification tasks) or average (for regression tasks).
● How it works:
○ In hard voting (majority voting), each classifier makes a
prediction, and the class that gets the most votes is the
final prediction.
○ In soft voting, classifiers output probabilities for each
class, and the class with the highest average probability
across all classifiers is chosen as the final prediction.
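A brief sketch of hard and soft voting with scikit-learn's VotingClassifier (the choice of base classifiers and the synthetic data are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
members = [("lr", LogisticRegression(max_iter=1000)),
           ("nb", GaussianNB()),
           ("dt", DecisionTreeClassifier(max_depth=3))]

hard = VotingClassifier(members, voting="hard").fit(X, y)   # majority vote
soft = VotingClassifier(members, voting="soft").fit(X, y)   # average probabilities
print(hard.score(X, y), soft.score(X, y))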
6. Bagged Boosting
11. Explain Expectation-Maximization algorithm.
2. E-step (Expectation Step):
○ This step involves calculating the posterior distribution of
the latent variables, given the observed data and the
current parameter estimates. This expectation is typically
calculated using the current parameter estimates and a
probabilistic model.
3. M-step (Maximization Step):
○ Update the parameters by maximizing the expected
complete log-likelihood, which is computed from the
E-step.
○ In the M-step, the algorithm updates the parameters by
optimizing the likelihood of the observed data, given the
estimated values of the missing data or latent variables
from the E-step.
4. Repeat: Repeat the E-step and M-step until the parameters
converge (i.e., the change in the parameters becomes very
small, or the likelihood reaches a maximum).
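In practice this E-step/M-step loop is what a Gaussian mixture model runs; a minimal scikit-learn sketch (the synthetic data and number of components are assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs with different means.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# fit() alternates E-steps (responsibilities) and M-steps (parameter updates)
# until the log-likelihood stops improving.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_, gmm.converged_)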
● Works with Incomplete Data: EM is specifically designed to
handle incomplete data or missing values by treating them as
latent variables.
● General Applicability: EM can be applied to a wide variety of
probabilistic models, including Gaussian mixtures, hidden
Markov models, and others.
● Theoretical Foundation: EM is based on maximizing the
likelihood function, making it a solid approach for many
statistical estimation problems.
1. Accuracy:
Accuracy is the simplest and most commonly used metric,
representing the proportion of correctly classified instances out
of all instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Pros: It is intuitive and easy to understand.
Cons: It can be misleading, especially in imbalanced datasets
where one class significantly outweighs the other.
2. Precision:
Precision measures the accuracy of positive predictions,
specifically the proportion of true positives (TP) out of all
predicted positives (TP + FP).
Precision = TP / (TP + FP)
Pros: It is useful when false positives are costly or undesirable
(e.g., email spam detection).
Cons: It doesn't account for false negatives.
3. Recall (Sensitivity or True Positive Rate):
Recall measures the ability of a model to identify all actual
positive instances, calculated as the proportion of true positives
out of the total actual positives (TP + FN).
Recall = TP / (TP + FN)
Pros: It’s crucial when false negatives have severe
consequences (e.g., medical diagnoses).
Cons: It may result in many false positives.
4. F1-Score:
The F1-score is the harmonic mean of precision and recall,
providing a balance between the two. It is particularly useful
when you need to balance false positives and false negatives.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Pros: It offers a single metric that considers both precision and
recall.
Cons: While informative, it may be less intuitive than precision
or recall alone.
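The four formulas above can be checked directly from confusion-matrix counts; a small sketch with made-up counts:

TP, TN, FP, FN = 40, 45, 10, 5          # hypothetical counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)   # 0.85, 0.8, ~0.889, ~0.842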
5. ROC Curve (Receiver Operating Characteristic Curve):
The ROC curve plots the True Positive Rate (Recall) against the
False Positive Rate (1 - Specificity) at various thresholds.
Pros: Provides a good graphical representation of model performance
across different thresholds.
Cons: It can be less informative in multi-class classification.
6. AUC (Area Under the ROC Curve):
AUC quantifies the overall ability of the model to distinguish
between classes. A higher AUC indicates better model performance.
Pros: It’s robust to class imbalance and provides a comprehensive
view of model performance.
Cons: AUC can be harder to interpret directly in some cases.
7. Confusion Matrix:
The confusion matrix displays the counts of TP, TN, FP, and FN.
It is a comprehensive tool to analyze the types of errors a model
makes.
Pros: It provides a clear breakdown of model performance.
Cons: It can become complicated in multi-class problems.
1. Standardize the Data:
PCA is sensitive to the scale of the data, so it is essential to
standardize the features (mean = 0, variance = 1) before applying
PCA. This ensures that all features contribute equally to the
analysis, especially when they are measured on different scales
(e.g., height in cm vs. weight in kg).
2. Compute the Covariance Matrix:
The covariance matrix captures the relationships between the
different features. It measures how much the features vary
together. For a dataset with features X₁, X₂, ..., Xₙ, the covariance
matrix Σ is computed as:
Σ = (1 / (n − 1)) XᵀX
where X is the data matrix after standardization.
3. Compute the Eigenvalues and Eigenvectors:
Eigenvalues represent the amount of variance captured by each
principal component, while eigenvectors represent the
directions of the principal components in the feature space.
Solving the equation Σv = λv gives the eigenvectors v and the
eigenvalues λ.
4. Sort the Eigenvalues and Eigenvectors:
The eigenvalues are sorted in descending order, and the
corresponding eigenvectors are arranged in the same order. The
top eigenvectors correspond to the directions with the most
significant variance in the data.
5. Select the Top k Principal Components:
The number of components to retain, k, is usually chosen
based on the cumulative sum of the eigenvalues. This is
typically determined by how much of the total variance you want
to retain. For example, you might choose to retain enough
components to capture 95% of the variance in the data.
6. Transform the Data:
The data is projected onto the selected principal components,
reducing the number of features. If k principal components are
chosen, the new data representation X_PCA is obtained by multiplying
the original data matrix X by the matrix of the top k eigenvectors:
X_PCA = X · V_k
where V_k is the matrix of the top k eigenvectors.
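A compact NumPy sketch that follows steps 1-6 above on a hypothetical data matrix:

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))        # hypothetical data

X_std = (X - X.mean(axis=0)) / X.std(axis=0)              # 1. standardize
cov = (X_std.T @ X_std) / (X_std.shape[0] - 1)            # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                    # 3. eigen-decomposition
order = np.argsort(eigvals)[::-1]                         # 4. sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2                                                     # 5. keep top-k components
X_pca = X_std @ eigvecs[:, :k]                            # 6. project the data
print(X_pca.shape, eigvals[:k] / eigvals.sum())           # explained-variance ratios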
Benefits of PCA
● Dimensionality Reduction: By selecting only the most significant
principal components, PCA reduces the number of features,
making the model simpler and faster to train.
● Noise Reduction: By discarding less significant components,
PCA can reduce noise in the data, which might improve the
model's performance.
● Visualization: PCA is often used for visualizing high-dimensional
data in 2D or 3D, helping to understand the structure and
relationships between data points.
● Improved Performance: Reducing dimensionality can prevent
overfitting and improve the generalization ability of machine
learning models.
Applications of PCA
Limitations of PCA
Unlike traditional clustering methods like K-means, DBSCAN can find
clusters based on density rather than distance, making it more robust
to noise and outliers. Here’s an overview of DBSCAN:
○ If a core point has enough neighbors, the algorithm
recursively expands the cluster by checking the neighbors
of the core points in the cluster. This process continues
until no more points can be added to the cluster.
3. Mark noise points:
○ Points that do not meet the density requirement (i.e., are
neither core nor border points) are considered noise and
are left out of any cluster.
4. Repeat the process:
○ The algorithm moves to the next unvisited point and
repeats the process until all points are either assigned to a
cluster or labeled as noise.
Advantages of DBSCAN
Disadvantages of DBSCAN
1. Sensitivity to Parameters (ε and MinPts):
○ The results depend heavily on the choice of ε and MinPts; poor
choices can lead to poor clustering results, such as over-segmentation
or under-segmentation of clusters.
2. Sensitivity to Varying Density:
○ DBSCAN may struggle with datasets where clusters have
varying densities. If one cluster has a significantly higher
density than another, DBSCAN may fail to identify the
lower-density cluster properly, or treat it as noise.
3. Not Suitable for High-Dimensional Data:
○ DBSCAN tends to perform poorly in high-dimensional
spaces due to the "curse of dimensionality," where the
concept of density becomes less meaningful as the
number of dimensions increases.
Example:
(1, 2), (2, 2), (3, 2), (8, 7), (8, 8), (25, 80)
● Points (1, 2), (2, 2), and (3, 2) are close enough to each other, so
they form a cluster.
● Points (8, 7) and (8, 8) are close enough to each other and form
another cluster.
● Point (25, 80) is too far from the others and does not meet the
density requirements, so it is classified as noise.
DBSCAN Result:
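A minimal scikit-learn sketch of this example; eps=2 and min_samples=2 are assumed values chosen so the three groups above emerge (a label of -1 marks noise):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (3, 2), (8, 7), (8, 8), (25, 80)])

labels = DBSCAN(eps=2, min_samples=2).fit_predict(X)
print(labels)   # e.g. [ 0  0  0  1  1 -1]: two clusters plus one noise point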
15. Explain how to choose the right algorithm for a Machine Learning
problem.
4. Evaluate Accuracy and Interpretability Needs
Additional Considerations
● Scalability: Some algorithms like k-NN don’t scale well with large
data. For large-scale applications, consider scalable methods
like Stochastic Gradient Descent (SGD).
● Handle Missing Data: Certain algorithms (e.g., k-NN) are
sensitive to missing data, while others (e.g., Random Forests)
handle missing data better.
● Avoiding Overfitting: Complex models (e.g., deep learning) can
overfit on small datasets. Regularization techniques or simpler
models might be preferable in these cases.
1. Objective:
○ LDA finds a linear projection of the features that maximizes the
distance between the means of different classes while minimizing the
variance within each class.
2. Assumptions:
○ Each class has a Gaussian distribution.
○ The classes have the same covariance matrix, making
them linearly separable in the transformed space.
○ The feature independence within each class is assumed for
better separation.
3. Mathematical Approach:
○ LDA calculates the within-class scatter matrix and
between-class scatter matrix based on class labels.
○ It maximizes the Fisher’s criterion, which is the ratio of the
variance between classes to the variance within classes,
helping to select the optimal projection directions.
4. Dimensionality Reduction:
○ LDA reduces the feature space's dimensionality, often to a
number equal to the number of classes minus one. This
makes it useful for visualizing data or reducing
computation in high-dimensional datasets.
5. Applications:
○ LDA is widely used in face recognition, medical diagnosis,
and image classification tasks, where distinguishing
between multiple classes is critical.
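A short scikit-learn sketch of LDA used for supervised dimensionality reduction (the bundled Iris dataset is used here purely as an illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)             # 3 classes, 4 features

# With 3 classes, LDA can project onto at most 3 - 1 = 2 directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                            # (150, 2)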
1. Accuracy
2. Precision
3. Recall (Sensitivity)
4. F1 Score
5. Confusion Matrix
Purpose: Primarily for binary classification | Can handle both binary
and multiclass classification
The False Positive Rate measures how often the model incorrectly
identifies negative instances as positive (False Positives).
The Area Under the Curve (AUC) is a numerical summary of the ROC
curve, representing the overall ability of the model to distinguish
between classes. An AUC value ranges from 0 to 1:
2. Minimal Spanning Tree (MST):
The MST is a subset of the graph that connects all nodes with the
minimal possible sum of edge weights, ensuring there are no cycles.
Algorithms like Kruskal’s and Prim’s are commonly used to construct
the MST.
The idea is that nodes within the same cluster are closer to each other
than nodes in different clusters, with the removed edges acting as
natural boundaries between clusters.
Effective for Outliers: Outliers, which often have high edge weights,
can be effectively isolated by MST, resulting in cleaner clusters.
5. Applications:
In biology and bioinformatics, MST clustering is often applied for
phylogenetic tree construction and analyzing gene expression data,
where clusters represent related species or similar gene expressions.
6. Challenges:
1. Overfitting
2. Underfitting
● Definition: Underfitting occurs when a model is too simple to
capture the underlying structure of the data. It fails to learn the
relationships between features and target variables effectively,
leading to poor performance on both the training and test sets.
● Example: A linear model applied to a highly non-linear dataset
may not capture the data's complexity, resulting in underfitting.
● Indicators: Low accuracy on both the training and test datasets
is a typical sign of underfitting.
● Solutions: Using a more complex model, increasing the number
of features, or decreasing regularization can help the model
learn the data more accurately.
3. Bias-Variance Trade-off
● Achieving Balance: Techniques like cross-validation, adjusting
model complexity, and using ensemble methods (e.g., bagging
or boosting) can help manage the bias-variance trade-off to
create a model that generalizes well.
Types of Regression:
1. Linear Regression:
○ Simple Linear Regression: This involves a relationship
between one independent variable and the dependent
variable. The model follows the equation y = mx + b, where m is the
slope and b is the y-intercept.
○ Multiple Linear Regression: Involves more than one
independent variable. The equation becomes
y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ, where each xₙ
represents a predictor variable.
2. Polynomial Regression:
○ A type of regression where the relationship between the
independent and dependent variables is modeled as an
nth-degree polynomial. This is useful when the relationship
is nonlinear but still follows a recognizable pattern. The
model equation might look like y = b₀ + b₁x + b₂x² + ... + bₙxⁿ.
3. Ridge Regression (L2 Regularization):
○ A variation of linear regression that includes a penalty term
in the cost function to prevent overfitting. This
regularization term shrinks the coefficients to reduce the
impact of irrelevant variables.
4. Lasso Regression (L1 Regularization):
○ Similar to ridge regression, but the penalty term is based
on the absolute values of the coefficients, which can lead
to some coefficients being exactly zero. It is useful for
feature selection as it helps in reducing the number of
predictors.
5. Elastic Net Regression:
○ Combines the penalties of both ridge and lasso regression.
It is useful when there are multiple features correlated with
each other, combining the strengths of both L1 and L2
regularization.
6. Logistic Regression:
○ Despite its name, logistic regression is used for
classification tasks, not regression. It models the
probability of a binary outcome (e.g., 0 or 1) and uses the
logistic function to squeeze the output between 0 and 1.
7. Stepwise Regression:
○ This is a method where the choice of predictor variables is
carried out by an automated process of adding or
removing predictors based on certain criteria (like the AIC
or BIC), helping to build a simplified model.
8. Quantile Regression:
○ Focuses on predicting specific quantiles (like the median)
of the dependent variable, rather than the mean. It is useful
when the data has outliers or a skewed distribution.
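A brief sketch contrasting ridge and lasso on synthetic data (the alpha values are illustrative assumptions; note how the L1 penalty can drive some coefficients exactly to zero):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty can zero them out
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))        # uninformative features typically become 0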
1. Overfitting Prevention:
○ Overfitting occurs when a model performs well on training
data but poorly on unseen data. This happens when the
model learns the noise or details in the training data rather
than generalizing from it.
○ Cross-validation helps to detect overfitting by evaluating
the model on multiple subsets of the data, providing a
more reliable estimate of its performance on unseen data.
2. Model Evaluation:
○ Cross-validation gives a better estimate of a model's
accuracy by evaluating it on different data splits rather
than just a single train-test split. This is important for
understanding the model's behavior and reliability across
various subsets of the data.
3. Hyperparameter Tuning:
○ When tuning the model’s hyperparameters,
cross-validation provides a more robust performance
metric by evaluating different configurations of
hyperparameters on various data splits, helping to select
the optimal set of hyperparameters.
4. Small Dataset Usage:
○ For small datasets, using only a single train-test split might
lead to an unreliable model evaluation. Cross-validation
utilizes all available data for both training and testing,
allowing every data point to contribute to both training and
testing phases.
5. Bias-Variance Tradeoff:
○ Cross-validation helps in identifying the bias-variance
tradeoff. If a model is underfitting (high bias) or overfitting
(high variance), cross-validation helps to understand the
error more accurately by providing insights into how the
model performs across different splits of the data.
6. Model Comparison:
○ When comparing different models, cross-validation
provides a fair comparison by using the same train-test
data splits. This ensures that the comparison is unbiased
and based on the same evaluation criteria.
K-fold cross-validation is a specific form of cross-validation that
divides the data into K equal-sized folds and iteratively uses each fold
as a test set while the remaining K-1 folds are used for training. This
technique allows for a comprehensive assessment of the model's
performance and reduces the variance in performance estimation.
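A minimal sketch of 5-fold cross-validation with scikit-learn (the model and synthetic data are assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 5 folds is used once as the test set; the rest train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())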
1. Reduces Overfitting:
○ By training and testing on different subsets, K-fold
cross-validation ensures that the model is less likely to
overfit the training data.
2. Efficient Use of Data:
○ Every data point is used for both training and testing,
maximizing the use of available data, which is especially
important for smaller datasets.
3. More Reliable Results:
○ Averaging the results from multiple folds reduces the
impact of a poor train-test split, making the evaluation
more robust and reliable.
Disadvantages:
1. Computational Cost:
○ K-fold cross-validation requires training the model K times,
which can be computationally expensive, especially with
large datasets or complex models.
2. Not Ideal for Time Series:
○ For time-series data, where temporal relationships exist,
K-fold cross-validation is not ideal because it doesn't
respect the time-ordering of data. In such cases, other
methods like time-series cross-validation are preferred.
1. Root Node: The tree starts at the root node, which represents
the entire dataset. The root node is split into branches based on
a feature that best separates the data according to a specific
criterion (like information gain or Gini impurity for
classification).
2. Splitting: The dataset is recursively split into subsets based on
the values of different features. The goal is to partition the data
in a way that the subsets become more homogeneous in terms
of the target variable (class label or value).
3. Stopping Criteria: The tree-building process continues until one
of the stopping conditions is met:
○ A predefined maximum tree depth is reached.
○ Further splits do not improve the homogeneity of the data.
○ Each subset contains fewer than a certain number of data
points.
4. Leaf Node: Once the tree is fully grown, the leaf nodes contain
the predicted class label (in classification) or predicted value (in
regression) for new data points that reach that leaf.
Example:
● The root node might split the data based on age (e.g., Age < 30
or Age >= 30).
● Further splits may occur based on income, resulting in different
branches for different age and income combinations.
● The leaf nodes would contain the final prediction, such as "Buy"
or "Don't Buy."
Advantages:
Disadvantages:
Support Vector Machines (SVMs) are a powerful machine learning
algorithm used for classification and regression tasks. At its core,
SVM is a constrained optimization problem. This means we're seeking
to find the optimal solution to a problem while adhering to specific
constraints.
The Constraints:
The data points must be correctly classified on the correct side of the
hyperplane. This constraint can be expressed as:
yᵢ(w · xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ)
Where:
● w is the weight (normal) vector of the hyperplane, b is the bias
term, xᵢ is the i-th data point, and yᵢ ∈ {−1, +1} is its class label.
To solve this constrained optimization problem, we introduce
Lagrange multipliers αi for each constraint. The Lagrangian function
is then defined as:
L(w, b, α) = (1/2)‖w‖² − Σᵢ αi [yᵢ(w · xᵢ + b) − 1], with αi ≥ 0.
The kernel trick is a technique used in Support Vector Machines
(SVM) to handle non-linearly separable data. It allows SVM to operate
in higher-dimensional spaces without explicitly transforming the data,
making it computationally efficient while enabling the algorithm to
find a non-linear decision boundary.
Several kernel functions are commonly used in SVM, each suited for
different types of data:
1. Linear Kernel:
○ K(xᵢ, xⱼ) = xᵢ · xⱼ
○ This is the standard inner product, and it is used when the
data is already linearly separable in the input space.
2. Polynomial Kernel:
○ K(xᵢ, xⱼ) = (xᵢ · xⱼ + c)^d
○ This kernel maps the data into a higher-dimensional space
where polynomial decision boundaries can be formed.
3. Radial Basis Function (RBF) Kernel (Gaussian Kernel):
○ K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
○ This kernel maps the data into an infinite-dimensional
space and is widely used when the data is not linearly
separable. It creates a highly flexible decision boundary.
4. Sigmoid Kernel:
○ K(xᵢ, xⱼ) = tanh(α xᵢ · xⱼ + c)
○ This kernel is related to the neural network activation
function and can be useful in certain cases.
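A sketch comparing these kernels with scikit-learn's SVC on a non-linear toy dataset (the data and settings are assumptions):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # non-linear data

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(kernel, clf.score(X, y))   # the RBF kernel typically fits this data best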
1. Feature Selection
● Filter Methods: These rank features using statistical measures
such as the Chi-square test, ANOVA F-test, and correlation
coefficients. Features with higher relevance are selected.
● Wrapper Methods: These methods evaluate subsets of features
by training and testing a machine learning model on them.
Techniques like Recursive Feature Elimination (RFE) are
examples of wrapper methods, where features are iteratively
removed based on model performance.
● Embedded Methods: These methods perform feature selection
during the model training process. Algorithms like Lasso
regression (L1 regularization) or Decision Trees automatically
perform feature selection by penalizing or splitting on the most
relevant features.
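A compact sketch of a filter method (ANOVA F-test) and a wrapper method (RFE) in scikit-learn; the synthetic dataset and the choice of keeping 5 features are assumptions:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)                          # filter
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)  # wrapper
print(filt.get_support().nonzero()[0])   # feature indices kept by the F-test
print(wrap.get_support().nonzero()[0])   # feature indices kept by RFE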
Advantages:
2. Feature Extraction
Advantages:
● Reduces dimensionality while retaining important information.
● Helps to deal with multicollinearity by transforming correlated
features.
● Improves the efficiency of machine learning models.
K-Means Algorithm
The K-Means algorithm is one of the most popular and widely used
unsupervised learning algorithms for clustering. Its primary goal is to
partition a dataset into K distinct, non-overlapping clusters based on
the similarity of the data points. The algorithm minimizes the variance
within each cluster to create cohesive groups of similar data points.
1. Initialization:
○ Choose the number of clusters K, which is a
user-defined parameter.
○ Randomly initialize K centroids. These centroids
represent the center of each cluster.
2. Assigning Data Points to Clusters:
○ Each data point is assigned to the nearest centroid based
on a distance metric (typically Euclidean distance).
○ The data points that are closer to a particular centroid are
grouped into that cluster.
3. Update Centroids:
○ After all data points are assigned to clusters, the centroids
are recalculated by taking the mean of all the points
assigned to each cluster.
○ This new centroid represents the updated center of the
cluster.
4. Repeat Steps 2 and 3:
○ The steps of assigning data points to clusters and
updating centroids are repeated iteratively until the
centroids no longer change or the changes are minimal
(i.e., convergence).
5. Termination:
○ The algorithm stops when the centroids have stabilized
and no longer move significantly between iterations, or
after a fixed number of iterations.
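A minimal scikit-learn sketch of these steps (K = 3 and the synthetic blobs are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# fit() repeats the assign-and-update steps until the centroids stop moving.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])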
Advantages of K-Means:
Limitations:
2. Healthcare
● Medical Imaging: ML models analyze X-rays, MRIs, and CT
scans to detect anomalies like tumors, fractures, or diseases.
For example, ML models can identify early-stage cancer or
detect heart disease from imaging data.
● Predictive Analytics: ML models are trained on patient data
(such as medical history, demographics, and lab results) to
predict health risks like heart attacks, diabetes, or strokes.
These predictive models can help doctors make timely
interventions, improving patient outcomes.
● Drug Discovery: ML is used in drug development to analyze
biological data, identify potential drug candidates, and simulate
how drugs interact with the human body, significantly
accelerating the process of drug discovery.
3. Autonomous Vehicles
● Computer Vision: Deep learning models process camera data for object
detection, lane detection, pedestrian recognition, and traffic sign
recognition. These models help the vehicle "see" its
environment and make real-time driving decisions.
● Sensor Fusion: Autonomous vehicles use various sensors like
LiDAR, radar, and cameras. ML algorithms combine data from
these sensors to create a comprehensive understanding of the
vehicle’s surroundings, ensuring safe navigation.
● Path Planning and Control: Machine learning is used for
decision-making, allowing the vehicle to plan its path, make lane
changes, avoid obstacles, and follow traffic rules while
interacting with other road users.
4. Retail and E-commerce
● Recommendation Systems: E-commerce and streaming platforms use
recommendation systems to suggest relevant products, movies,
or videos to users based on their past interactions.
● Inventory Management: ML is used for demand forecasting and
inventory optimization. By analyzing historical sales data,
seasonal trends, and external factors, ML models predict the
demand for products, helping businesses maintain optimal
stock levels and reduce waste.
● Dynamic Pricing: ML algorithms analyze factors like customer
demand, competitor pricing, and market conditions to adjust
prices dynamically. This is commonly seen in industries such as
airlines, hotels, and e-commerce platforms to maximize profit
while remaining competitive.
Where:
● ε is the error term (the difference between the predicted
and actual value of Y).
Example:
○ Normality of errors: The error terms should be normally
distributed.
4. Multicollinearity: This occurs when two or more independent
variables are highly correlated with each other. It can cause
instability in the model, making it difficult to estimate the
individual effect of each feature.
5. Overfitting: If the model is too complex (with too many
variables), it might fit the training data very well but perform
poorly on new, unseen data. This is called overfitting, and
techniques like regularization (e.g., Ridge or Lasso) are used to
prevent it.
How Bagging Works:
Example:
Advantages of Bagging:
Disadvantages of Bagging:
● Does not reduce bias: If the base model is biased (e.g., if the
decision tree is very simple), bagging won't make it better.
● Computationally expensive: Since many models are trained, it
can be resource-intensive.
2. Boosting
Example:
Advantages of Boosting:
Disadvantages of Boosting:
● Multiple Linear Regression: Involves two or more independent
variables.
2. The Equation:
y = mx + b
where m is the slope, b is the y-intercept, x is the independent
variable, and y is the dependent variable.
Key Concepts:
● Evaluation Metrics: Common metrics for evaluating linear
regression models include:
○ Mean Squared Error (MSE)
○ Root Mean Squared Error (RMSE)
○ Mean Absolute Error (MAE)
○ R-squared
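These metrics can be computed directly with scikit-learn; a small sketch with hypothetical true and predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # hypothetical targets
y_pred = np.array([2.5, 5.5, 6.5, 9.5])      # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse),                      # MSE and RMSE
      mean_absolute_error(y_true, y_pred),    # MAE
      r2_score(y_true, y_pred))               # R-squared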
1. Class Separation:
○ It calculates the mean and covariance matrix for each
class.
○ The goal is to find a projection that maximizes the distance
between the means of different classes while minimizing
the variance within each class.
2. Finding Optimal Projection:
Limitations of LDA:
● Assumption of Gaussian Distribution: LDA assumes that the
data within each class follows a Gaussian distribution. If this
assumption is violated, the performance of LDA may degrade.
● Small Sample Size: LDA can be sensitive to the number of
samples in each class. If the sample size is small, the estimated
covariance matrices may be unreliable.
1. Accuracy: The proportion of correctly classified instances out of
the total instances.
○ Example: If a model correctly classifies 90 out of 100
samples, its accuracy is 90%.
2. Precision: The ratio of true positive predictions (correctly
predicted positive cases) to the total predicted positive cases
(true positives + false positives).
○ Example: If a model predicts 40 positives and 30 are
correct, precision is 30/40 = 75%.
3. Recall (Sensitivity or True Positive Rate): The ratio of true
positives to the total actual positive cases (true positives + false
negatives).
○ Example: If there are 50 actual positives and the model
correctly predicts 30, recall is 30/50 = 60%.
4. F1-Score: The harmonic mean of precision and recall, providing
a balance between the two. It is useful when there is an uneven
class distribution.
○ Example: If precision is 75% and recall is 60%, F1-score is
2 * (0.75 * 0.60) / (0.75 + 0.60) = 0.67.
5. AUC-ROC (Area Under Curve - Receiver Operating
Characteristic): Measures the model's ability to distinguish
between classes. A higher AUC indicates better model
performance.
○ Example: An AUC of 0.9 means the model has a 90%
chance of ranking a randomly chosen positive instance
higher than a randomly chosen negative instance.
The Gini index ranges from 0 to 1, where 0
represents perfect equality (everyone has the same income), and 1
represents perfect inequality (one person has all the income).
Example:
● Person 1: $10,000
● Person 2: $20,000
● Person 3: $30,000
● Person 4: $40,000
● Person 5: $50,000
To calculate the Gini index, first compute each person's cumulative
share of the population and of total income ($150,000):

Person   Income    Cumulative % of population   Cumulative % of income
1        $10,000   20%                          6.7%
2        $20,000   40%                          20%
3        $30,000   60%                          40%
4        $40,000   80%                          66.7%
5        $50,000   100%                         100%
4. Compare to the line of perfect equality:
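A short sketch that computes the Gini index for the five incomes above using the mean-absolute-difference formula G = Σᵢ Σⱼ |xᵢ − xⱼ| / (2n²·mean); for this data it comes out to roughly 0.27:

import numpy as np

incomes = np.array([10_000, 20_000, 30_000, 40_000, 50_000])

# Gini = (sum of absolute differences over all pairs) / (2 * n^2 * mean income)
diffs = np.abs(incomes[:, None] - incomes[None, :])
gini = diffs.sum() / (2 * len(incomes) ** 2 * incomes.mean())
print(round(gini, 3))   # ~0.267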
Ensemble learning refers to combining multiple individual models to
produce a stronger, more accurate model. Bagging (Bootstrap
Aggregating) and Boosting are two popular ensemble methods that
improve the performance of machine learning models, but they work
in fundamentally different ways. Below is a comparison of the two
methods:
● Disadvantages:
○ May not work well with weak learners, as the method does
not focus on correcting errors made by individual models.
2. Boosting
● Disadvantages:
○ Can overfit if the number of models is too large.
○ Training is sequential, so boosting is less parallelizable
compared to bagging.
○ Sensitive to noisy data and outliers because it gives more
weight to misclassified instances.
Computational Efficiency: Bagging is more efficient and can be
parallelized, while boosting builds models sequentially and is
less efficient.
37. Consider the use case of Email spam detection. Identify and
explain the suitable machine learning technique for this task.
There are various algorithms that can be used for spam detection, but
some of the most common and effective ones are:
● How it Works: Naive Bayes estimates the probability that an email
is spam given the words in
the email. It uses the frequencies of terms in spam and
non-spam emails to estimate these probabilities.
● Why Suitable:
○ Efficiency: Naive Bayes performs well with a small amount
of training data and is computationally efficient, making it
ideal for real-time spam detection in large email systems.
○ Simplicity: It is simple to implement and can handle a large
number of features (words) efficiently.
○ Effectiveness: Despite the "naive" assumption
(independence of features), it often works surprisingly well
in practice for text classification tasks.
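A toy sketch of this idea with scikit-learn (the example emails and labels are made up; a real system would train on a large labeled corpus):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "meeting agenda for monday",
          "cheap pills free offer", "project report attached"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam (hypothetical)

# Word counts -> per-class word frequencies -> posterior spam probability.
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(emails, labels)
print(model.predict(["free prize offer"]))   # most likely classified as spam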
3. Logistic Regression
4. Decision Trees
● How it Works: A decision tree splits the data on feature values and
uses a criterion like Gini impurity or entropy to decide the best
feature to split on at each step.
● Why Suitable:
○ Interpretability: Decision trees are easy to understand and
interpret, making them useful for debugging or explaining
predictions.
○ Handling Non-linearities: Decision trees can handle
non-linear relationships between features, which may be
present in the spam detection problem.
○ Feature Selection: Decision trees naturally perform feature
selection by choosing the most informative features for
splitting.
5. Random Forests
38. Explain the Ensemble Learning Algorithm Random forest and its
use cases in real world applications.
In classification tasks, the final prediction is the majority vote
from all the decision trees. In regression tasks, the final
prediction is the average of the predictions made by all the
trees.
● Stock Market Prediction: It is used for predicting stock prices
and market trends by analyzing historical data, trading volumes,
and other relevant financial indicators.
5. Agriculture
● Crop Recommendation: Random Forest helps recommend suitable crops
by analyzing
environmental data, soil conditions, and historical crop
performance.
● Yield Prediction: It can be applied to forecast crop yield by
analyzing weather conditions, soil type, and previous yield data,
which helps farmers make informed decisions about resource
allocation.
39. Explain the Dimensionality reduction technique Linear
Discriminant Analysis and its real-world applications.
1. Face Recognition (Computer Vision):
2. Medical Diagnosis:
4. Speech Recognition:
5. Fraud Detection:
● Problem: Fraud detection systems often deal with large datasets
that involve customer transactions, account details, and
behavior patterns. Identifying fraud among legitimate
transactions requires distinguishing complex patterns.
● How LDA Helps: LDA helps by reducing the number of features
involved in fraud detection while preserving those features that
are most useful for differentiating between fraudulent and
non-fraudulent transactions.
Advantages of LDA:
Limitations of LDA:
1. Hyperplane:
○ The hyperplane is the decision boundary defined by the equation
w · x + b = 0, where:
○ w is the normal vector (weights) to the hyperplane.
○ x represents the input features (data points).
○ b is the bias term, which controls the offset of the
hyperplane from the origin.
2. Support Vectors:
● Definition: Support vectors are the data points that are closest
to the hyperplane and are critical in defining the optimal
hyperplane. These points are the most important for the SVM, as
they directly influence the position and orientation of the
hyperplane.
● Role in SVM: SVM only depends on these support vectors to
determine the decision boundary. All other points, which are
farther away from the hyperplane, do not affect the model. If the
support vectors are removed or changed, the hyperplane could
shift, thus altering the classifier.
● Intuition: Support vectors are the "marginal" points that lie on
the boundary of the margin. In a 2D plot, these points are those
closest to the hyperplane and are often shown as circles or
specific markers.
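After fitting an SVM, the support vectors can be inspected directly; a minimal scikit-learn sketch on synthetic data (the large C value roughly mimics a hard margin):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1000).fit(X, y)
print(clf.support_vectors_)                     # the points that define the margin
print(len(clf.support_vectors_), "support vectors out of", len(X), "points")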
3. Hard Margin:
● Definition: A hard margin requires every point to be classified
correctly with no points inside the margin; any violation of
this strict separation is not allowed. This is typically used when
the data is linearly separable (i.e., the two classes can be
perfectly separated by a linear hyperplane).
● Mathematical Formulation: The margin is defined as the distance
between the hyperplane and the closest support vectors, and
SVM aims to maximize this margin. This constraint is
represented as yᵢ(w · xᵢ + b) ≥ 1 for all training points (with
equality for the support vectors).
4. Soft Margin:
● Definition: A soft margin allows some points to violate the margin
or be misclassified, controlled by a penalty parameter (C). This makes
SVM usable when the data is not perfectly linearly separable.
5. Kernel:
● Definition: A kernel function implicitly computes the similarity
(inner product) between
data points in the higher-dimensional space without needing to
explicitly compute the transformation.
● Common Types of Kernels:
○ Linear Kernel: This is used when the data is already
linearly separable.
○ Polynomial Kernel: This kernel maps the data into a
higher-dimensional space where polynomial decision
boundaries are possible.
○ Radial Basis Function (RBF) Kernel: The RBF kernel is
commonly used for non-linear data and helps map the data
into an infinite-dimensional space.
○ Sigmoid Kernel: This kernel is based on the hyperbolic
tangent function, which is used for certain types of
non-linear classification tasks.
41. What is Density based clustering? Explain the steps used for
clustering task using Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) algorithm.
Density-Based Clustering
Steps involved in DBSCAN:
1. Parameter Setting:
Key Points:
● It can effectively identify noise points in the data.
● The choice of ε and MinPts parameters is crucial for the
performance of DBSCAN.
Example:
● Anomaly detection
● Outlier detection
● Customer segmentation
● Image analysis
1. Definition:
○ Supervised Learning: This is a type of machine learning
where the model is trained on a labeled dataset, meaning
each input has a corresponding output. The model learns
by mapping inputs to the correct outputs based on these
labels.
○ Unsupervised Learning: In this approach, the model is
given data without any labeled outcomes. The goal is to
identify patterns or groupings within the data.
2. Data Labeling:
○ Supervised Learning: Requires labeled data, where each
training example is paired with an output label.
○ Unsupervised Learning: Works with unlabeled data,
allowing the model to independently find structure in the
data.
3. Purpose:
○ Supervised Learning: Primarily used for tasks where
predictions or classifications are needed, such as
predicting house prices or classifying emails as spam or
not spam.
○ Unsupervised Learning: Typically used for discovering
hidden patterns or groupings, like customer segmentation
or clustering similar images.
4. Algorithms:
○ Supervised Learning: Common algorithms include linear
regression, decision trees, and support vector machines
(SVM).
○ Unsupervised Learning: Examples include k-means
clustering, hierarchical clustering, and principal
component analysis (PCA).
5. Performance Measurement:
○ Supervised Learning: The model's performance can be
measured by comparing predictions to known labels, using
metrics like accuracy, precision, and recall.
○ Unsupervised Learning: Since there are no labels,
performance is often evaluated by the quality of the
discovered patterns or clusters, often using silhouette
scores or other cluster validation methods.
6. Output:
○ Supervised Learning: Produces outputs in the form of
predictions or classifications based on labeled data.
○ Unsupervised Learning: Results in clusters or
associations, which are insights about the structure of the
data.
7. Examples:
○ Supervised Learning: Image recognition, where images are
labeled as ‘cat’ or ‘dog.’
○ Unsupervised Learning: Market basket analysis to find
items frequently bought together without predefined
categories.
1. Definition:
○ Machine Learning (ML): A branch of artificial intelligence
focused on building models that allow computers to learn
from data and make predictions or decisions without
explicit programming. It is an iterative process where
models improve over time with new data.
○ Data Mining: A process of discovering patterns,
correlations, and insights within large datasets. It involves
analyzing data to extract meaningful information but does
not necessarily involve model training or predictive
capabilities.
2. Purpose:
○ ML: Primarily used to predict outcomes or automate
decisions based on past data. For instance, predicting
stock prices or recognizing images.
○ Data Mining: Aims to explore and understand existing data,
often for insights that can inform business or research
decisions, like identifying customer buying habits or fraud
detection patterns.
3. Data and Output:
○ ML: Typically relies on labeled datasets (especially in
supervised learning) to train models that produce
predictive outcomes.
○ Data Mining: Works with both labeled and unlabeled data
to find patterns. The output is usually a set of patterns or
relationships rather than predictions.
4. Process:
○ ML: Involves iterative model training, tuning, and validation
to improve prediction accuracy over time. Algorithms learn
and adjust based on performance.
○ Data Mining: Involves steps such as data cleaning,
transformation, and exploratory analysis to reveal insights;
it doesn’t focus on continuous learning or model
improvement.
5. Techniques Used:
○ ML: Employs algorithms like regression, neural networks,
decision trees, and support vector machines, often
requiring specialized tuning.
○ Data Mining: Uses techniques such as clustering,
association rule mining, and anomaly detection to find
patterns without necessarily building predictive models.
6. Application Examples:
○ ML: Self-driving cars (predicting obstacles), speech
recognition, personalized recommendations.
○ Data Mining: Market basket analysis, discovering fraud
patterns, segmenting customer demographics.
7. Scope of Automation:
○ ML: Highly automated, as models can continuously learn
and adapt from new data, leading to systems that can make
decisions in real time.
○ Data Mining: Less automated, often requiring human
interpretation of the results to draw conclusions from the
data patterns.