ML
Definition of Linear Regression
Linear Regression is a statistical method used to model the relationship between a dependent
variable y and one or more independent variables x.
• Simple Linear Regression: Involves one independent variable x and models the
relationship as a straight line:
y=β0+β1x+ϵ
Where:
• y: Dependent variable
• x: Independent variable
• β0: Intercept
• β1: Slope (change in y for a unit change in x)
• ϵ: Error term
• Multiple Linear Regression: Involves two or more independent variables x1,x2,...,xk and
models the relationship as:
y=β0+β1x1+β2x2+…+βkxk+ϵ
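As a brief illustration, simple linear regression can be fit by ordinary least squares. The sketch below uses NumPy on synthetic data; the data values and variable names are assumed purely for the example.

```python
import numpy as np

# Toy data assumed for illustration: y is roughly 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

# Design matrix with a column of ones for the intercept beta0
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: solve for [beta0, beta1]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept:", beta[0], "slope:", beta[1])
```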
Definition of Polynomial Regression
Polynomial Regression is a form of regression analysis where the relationship between the
independent variable (x) and the dependent variable (y) is modeled as an n-degree polynomial. It is
a generalization of linear regression and allows for capturing non-linear relationships.
The polynomial regression equation is:
y=β0+β1x+β2x2+β3x3+…+βnxn+ϵ
Where:
• y: Dependent variable
• x,x2,x3,…,xn: Independent variable terms
• β0,β1,…,βn: Coefficients
• ϵ: Error term
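As a hedged sketch, a degree-3 polynomial regression can be fit with NumPy's polyfit; the synthetic data below is assumed for illustration only.

```python
import numpy as np

# Synthetic non-linear data assumed for illustration
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**3 - x + rng.normal(0, 1, size=x.shape)

# Fit a cubic polynomial: coefficients are returned from highest to lowest degree
coeffs = np.polyfit(x, y, deg=3)
y_pred = np.polyval(coeffs, x)
print("coefficients (beta3..beta0):", coeffs)
```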
When to Use Polynomial Regression?
1. When the data shows a non-linear trend.
2. When linear regression fails to provide a good fit for the data.
Advantages of Polynomial Regression
1. Captures non-linear relationships effectively.
2. Can approximate complex patterns in data.
Limitations
1. Overfitting with high-degree polynomials.
2. Sensitive to outliers.
Definition of Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the cost function by iteratively
adjusting parameters (weights) in the direction of the negative gradient of the cost function. It is
widely used in machine learning and deep learning for optimizing models.
Mathematical Formulation
1. Cost Function
For linear regression, the cost function is the Mean Squared Error (MSE):
J(θ) = (1/2m) ∑i=1^m (hθ(xi) − yi)^2
Where:
• m: Number of data points
• hθ(xi)=θ0+θ1xi: Hypothesis (model prediction)
• yi: Actual output
2. Gradient Descent Update Rule
θj := θj − α · ∂J(θ)/∂θj
where α is the learning rate.
Derivation:
Substituting the MSE cost function and differentiating with respect to θj gives the update
θj := θj − α · (1/m) ∑i=1^m (hθ(xi) − yi) · xij
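A from-scratch sketch of batch gradient descent implementing this update rule in NumPy; the toy data and learning rate below are assumed for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression with the MSE cost."""
    m = len(y)
    Xb = np.column_stack([np.ones(m), X])            # add intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        gradient = (1 / m) * Xb.T @ (Xb @ theta - y)  # dJ/dtheta for MSE
        theta -= alpha * gradient                     # update rule
    return theta

# Toy usage with assumed data
rng = np.random.default_rng(2)
X = rng.uniform(0, 5, size=(100, 1))
y = 4 + 2.5 * X[:, 0] + rng.normal(0, 0.5, size=100)
print(gradient_descent(X, y, alpha=0.05, n_iters=2000))
```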
Advantages
1. Simple to implement and computationally efficient for small datasets.
2. Versatile for various cost functions and models.
Disadvantages
1. May converge slowly or get stuck in local minima.
2. Highly sensitive to the learning rate (α).
Overfitting and Underfitting: Bias-Variance Tradeoff
1. Definition
Overfitting
• The model learns the noise and details in the training data to such an extent that it performs
poorly on unseen data.
• It captures both the patterns and the randomness in the data, leading to high variance.
• Symptoms: High accuracy on training data but low accuracy on test data.
Underfitting
• The model is too simple to capture the underlying patterns in the data.
• It cannot learn enough from the training data, leading to high bias.
• Symptoms: Low accuracy on both training and test data.
2. Bias-Variance Tradeoff
Bias
• Bias represents the error due to overly simplistic assumptions in the model.
• High bias causes the model to miss relevant relations between features and target labels.
• Associated with underfitting.
Variance
• Variance represents the error due to sensitivity to small fluctuations in the training data.
• High variance causes the model to perform well on training data but poorly on new, unseen
data.
• Associated with overfitting.
Tradeoff
• A balance between bias and variance is crucial for optimal model performance.
• Low Bias + Low Variance: The ideal scenario where the model generalizes well to unseen
data.
3. Accuracy and Performance
Scenario | Training Accuracy | Test Accuracy | Bias | Variance
Overfitting | High | Low | Low | High
Underfitting | Low | Low | High | Low
Balanced Model | Moderate/High | High | Moderate | Moderate
4. Visual Representation
1. Overfitting: The model is too complex and fits the noise in the data, resulting in poor
generalization.
• Graph: Training error is near zero, but test error is high.
2. Underfitting: The model is too simple and fails to capture the underlying structure of the
data.
• Graph: Both training and test errors are high.
3. Ideal Fit: The model captures the underlying structure without overfitting or underfitting.
• Graph: Both training and test errors are low and close.
5. Methods to Address Overfitting and Underfitting
To Reduce Overfitting:
1. Simplify the model: Reduce the number of features or decrease the complexity of the
model.
2. Regularization: Add penalties (e.g., L1 or L2 regularization) to discourage overly complex
models.
3. Increase training data: Provide more examples for the model to generalize better.
4. Early stopping: Halt training when performance on the validation set stops improving.
5. Dropout (for Neural Networks): Randomly deactivate neurons during training to prevent
reliance on specific features.
To Reduce Underfitting:
1. Increase model complexity: Use a more sophisticated model or add more features.
2. Decrease regularization: Reduce the regularization term to allow the model to learn more
complex patterns.
3. Increase training time: Train the model for more epochs to allow it to learn better.
4. Feature engineering: Provide more relevant features to help the model capture underlying
patterns.
Key Takeaways
1. Overfitting:
• High variance, low bias.
• Focus on generalizing better.
2. Underfitting:
• High bias, low variance.
• Focus on making the model more complex or learning better.
3. Ideal Model:
• Balance between bias and variance.
• Results in the highest accuracy on both training and test data.
Lasso & Ridge Regression
Lasso and Ridge are linear regression techniques used to reduce overfitting and improve model
performance, particularly when dealing with multicollinearity or high-dimensional data.
1. Lasso Regression
• L1 Regularization: Adds the absolute value of the coefficients as a penalty to the loss
function.
Cost Function:
J(θ)=MSE+λ∑i=1^n∣θi∣
Key Features:
1. Performs feature selection by shrinking some coefficients to exactly zero.
2. Useful when we suspect that some features are irrelevant.
Pros:
• Can produce sparse models (few features with non-zero coefficients).
• Great for high-dimensional data.
Cons:
• May underperform when all predictors are relevant.
2. Ridge Regression
• L2 Regularization: Adds the squared value of the coefficients as a penalty to the loss
function.
Cost Function:
J(θ)=MSE+λ∑i=1^n θi^2
Key Features:
1. Shrinks coefficients towards zero but does not eliminate them entirely.
2. Useful for handling multicollinearity.
Pros:
• Reduces model complexity without eliminating predictors.
• Works well when all features are relevant.
Cons:
• Does not perform feature selection.
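A brief sketch using scikit-learn's Lasso and Ridge, where the alpha argument plays the role of λ; the synthetic data is assumed for illustration (only the first three of ten features matter).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Assumed synthetic data: only the first 3 of 10 features are relevant
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: many coefficients become exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: coefficients shrink but stay non-zero

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```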
3. Difference Between Lasso & Ridge Regression
Feature | Lasso (L1) | Ridge (L2)
Penalty Term | λ∑i=1^n ∣θi∣ | λ∑i=1^n θi^2
Effect on Coefficients | Shrinks coefficients of irrelevant features to exactly zero | Shrinks coefficients close to zero
Feature Selection | Yes | No
Use Case | Sparse models | Handles multicollinearity effectively
Logistic Regression: Classification Despite the Name
Logistic Regression is a classification algorithm that predicts probabilities for a binary outcome
using the logistic (sigmoid) function, despite the word "regression" in its name.
1. Key Idea
• Instead of predicting a continuous value like in linear regression, logistic regression predicts
the probability of a sample belonging to a certain class.
• Outputs a value between 0 and 1 using the sigmoid function.
2. Sigmoid Function
The sigmoid function converts linear regression output into probabilities:
hθ(x) = 1 / (1 + e^(−θ^T x))
3. Decision Boundary:
• Probability >0.5: Predict 1 (positive class).
• Probability ≤0.5: Predict 0 (negative class).
4. Differences Between Regression and Classification
Aspect | Linear Regression | Logistic Regression
Output | Continuous values | Probabilities (between 0 and 1)
Purpose | Predicts a numerical outcome | Predicts a binary class
Model Equation | y = θ^T x | y = 1 / (1 + e^(−θ^T x))
Loss Function | Mean Squared Error | Log Loss (Cross-Entropy Loss)
5. Extension to Multi-Class Classification
For multi-class problems, logistic regression is extended using:
1. One-vs-Rest (OvR): Train separate binary classifiers for each class.
2. Softmax Regression: Generalizes the sigmoid function to predict probabilities for multiple
classes.
Linear Regression to Logistic Regression: Mathematical Transition
The transition from Linear Regression to Logistic Regression involves modifying the linear
regression equation to predict probabilities for a classification problem, rather than continuous
values. This is achieved using the sigmoid function.
1. Linear Regression Equation
The equation for Linear Regression is:
y=θ^Tx=θ0+θ1x1+θ2x2+⋯+θnxn
Where:
• y: The predicted output (continuous value).
• θ: Model coefficients (parameters).
• x: Feature values.
2. Limitations of Linear Regression for Classification
Linear regression predicts continuous values, but classification requires probabilities bounded
between 0 and 1. Linear regression doesn't ensure this because:
1. y can take any value from −∞ to +∞
2. It is not interpretable as a probability.
Comparison: Linear vs Logistic Regression
Aspect | Linear Regression | Logistic Regression
Equation | y = θ^T x | P(y=1∣x) = 1 / (1 + e^(−θ^T x))
Output | Continuous value (−∞,∞) | Probability (0,1)
Purpose | Regression (continuous prediction) | Classification (binary or multi-class)
Cost Function | Mean Squared Error (MSE) | Log Loss (Cross-Entropy Loss)
Interpretability | Direct prediction of y | Probability of y=1 (classification)
Sigmoid Function
The sigmoid function, also known as the logistic function, is a mathematical function that maps
any real-valued number into the range (0,1). This makes it ideal for predicting probabilities in
binary classification tasks, as it transforms linear output into probabilities.
Mathematical Definition:
The sigmoid function is defined as:
σ(z) = 1 / (1 + e^(−z))
Where:
• z: Input value (often a linear combination of features in machine learning, e.g., θTx).
• e: Euler's number (approximately 2.71828).
Graph of the Sigmoid Function
The graph of the sigmoid function shows how the output behaves as the input z varies:
• When z→+∞,σ(z)→1
• When z→−∞,σ(z)→0
• At z=0, σ(0)=0.5
Sigmoid Function in Machine Learning
In Logistic Regression, the sigmoid function is applied to a linear combination of input features
x and model parameters θ to predict the probability of a sample belonging to a certain
class.
Logistic Regression Model:
The model can be expressed as:
P(y=1∣x) = 1 / (1 + e^(−(θ0+θ1x1+⋯+θnxn)))
Where:
• θ0,θ1,…,θn are the parameters (weights).
• x1,x2,…,xn are the feature values.
The sigmoid function transforms the linear combination of features into a probability, which is then
used to make a binary classification decision.
Decision Rule:
• If P(y=1∣x)>0.5, predict class 1.
• If P(y=1∣x)≤0.5, predict class 0.
Gradient of the Sigmoid Function
The derivative (gradient) of the sigmoid function is important for optimization algorithms like
gradient descent, which is used to train models in machine learning.
The derivative of the sigmoid function with respect to z is:
σ′(z)=σ(z)(1−σ(z))
This property is crucial when applying gradient descent to optimize the cost function, as it allows
the model to adjust the weights effectively.
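A small sketch of the sigmoid and its derivative in NumPy; the function names are my own, chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25
```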
Use Cases of Sigmoid Function:
1. Binary Classification: Predict the probability of an event occurring (e.g., spam or not spam,
disease or no disease).
2. Neural Networks: Often used as an activation function in hidden layers or output layers for
binary classification.
3. Probabilistic Predictions: Used to model probabilities in various machine learning
algorithms.
Linear Regression vs Logistic Regression
Both Linear Regression and Logistic Regression are foundational models in machine learning, but
they are used for different types of tasks and have different characteristics.
Here’s a detailed comparison:
1. Purpose
• Linear Regression:
• Purpose: Predict a continuous dependent variable (e.g., house price, stock price).
• Output: A continuous value.
• Logistic Regression:
• Purpose: Predict a categorical dependent variable, typically for binary classification
(e.g., spam vs. non-spam, disease vs. no disease).
• Output: A probability value between 0 and 1 (for binary classification), often
mapped to class labels (0 or 1).
2. Mathematical Equation
• Linear Regression:
• The model predicts a continuous value y based on the linear combination of input
features x and parameters θ:
y=θ0+θ1x1+θ2x2+⋯+θnxn
Where:
• y: Predicted output (continuous value).
• x1,x2,…,xn: Feature values.
• θ0,θ1,…,θn: Model parameters (weights).
• Logistic Regression:
• The model predicts a probability using the sigmoid function:
P(y=1∣x) = 1 / (1 + e^(−(θ0+θ1x1+⋯+θnxn)))
Where:
• P(y=1∣x): The probability of the positive class (1).
• The sigmoid function maps the output to a probability between 0 and 1.
• The output is typically used for binary classification.
3. Output
• Linear Regression:
• The output is continuous, meaning it can take any real value from −∞ to ∞.
• Example: Predicting the price of a house.
• Logistic Regression:
• The output is a probability between 0 and 1, which is then used to classify into two
classes (0 or 1).
• Example: Predicting whether an email is spam (1) or not spam (0).
When to Use
• Linear Regression:
• When the task involves predicting a continuous variable.
• Example: Predicting house prices, stock prices, temperature, etc.
• Logistic Regression:
• When the task involves binary classification (two possible outcomes).
• Example: Classifying whether an email is spam or not, predicting whether a
customer will buy a product or not, etc.
Visual Comparison
Aspect | Linear Regression | Logistic Regression
Problem Type | Regression (predict continuous value) | Classification (binary classification)
Output | Continuous value (y) | Probability between 0 and 1
Cost Function | Mean Squared Error (MSE) | Log Loss (Cross-Entropy Loss)
Interpretation | Direct prediction | Probability for class prediction
Use Case | Predicting continuous values (e.g., house price) | Classifying into two classes (e.g., spam or not)
Logistic Regression (Def.)
Logistic Regression is a type of regression analysis used for binary classification problems. It
models the probability of a binary response based on one or more predictor variables (features). The
output of logistic regression is a probability value between 0 and 1, which is then used to classify
the observation into one of the two classes (typically 0 or 1).
Classification (Def.)
Classification is a supervised machine learning technique used to categorize data into discrete
classes or labels. Unlike regression, where the output is continuous, classification predicts a
categorical outcome. It can be binary (two classes) or multiclass (more than two classes).
Key Differences Between Logistic Regression and Classification
Aspect | Logistic Regression | Classification
Type of Algorithm | A specific classification algorithm used for binary classification problems. | A broader category of algorithms that includes logistic regression, decision trees, support vector machines, and more.
Output | Probabilities between 0 and 1 for binary outcomes. | Discrete classes or categories.
Use Case | Mainly binary classification tasks. | Binary and multiclass classification problems.
Modeling Approach | Models the relationship between features and the log-odds of a binary outcome. | Can use various algorithms (e.g., decision trees, random forests, KNN) to classify into categories.
Decision Threshold | Typically uses a threshold of 0.5 to decide between two classes (0 or 1). | Decision rules vary depending on the algorithm used.
Examples | Email spam detection, disease classification. | Spam detection (binary), handwriting recognition (multiclass).
Classification: Supervised Learning
Classification is a supervised learning technique in machine learning where the goal is to predict
the categorical label (or class) of an input based on labeled training data. Supervised learning means
that the model is trained on a dataset that includes both the input features and the corresponding
output labels.
Steps in Classification using Supervised Learning
1. Data Collection: Collect labeled data for training the model.
• For example, a dataset containing emails (features) and their labels (spam or not
spam).
2. Data Preprocessing: Clean the data and prepare it for training.
• This could involve handling missing values, encoding categorical variables, scaling
numerical features, etc.
3. Choosing a Model: Select a classification algorithm (e.g., Logistic Regression, Decision
Trees, Random Forests, K-Nearest Neighbors).
4. Model Training: Train the model using the labeled training data.
• During training, the model learns to map input features to output labels.
5. Model Evaluation: Evaluate the model's performance using metrics like accuracy,
precision, recall, F1-score, etc.
• Accuracy: Percentage of correct predictions out of total predictions.
• Precision: Proportion of true positives out of all predicted positives.
• Recall: Proportion of true positives out of all actual positives.
6. Prediction: Once the model is trained, use it to make predictions on new, unseen data.
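A minimal end-to-end sketch of these steps with scikit-learn, using its bundled breast-cancer dataset as stand-in data; the pipeline and parameter choices are illustrative, not prescriptive.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1. Labeled data (scikit-learn's bundled breast-cancer set as stand-in)
X, y = load_breast_cancer(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 3-4. Preprocess (scale features) and train a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# 5-6. Evaluate, then predict on unseen data
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```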
Common Algorithms for Classification (Supervised Learning)
Here are some of the most popular supervised learning algorithms used for classification tasks:
• Logistic Regression: A statistical method for binary classification. It outputs probabilities
and applies a threshold to make predictions.
• Decision Trees: A tree-like structure used for classification tasks. It splits data based on
feature values and assigns labels to leaves.
• Random Forest: An ensemble method that combines multiple decision trees to improve
classification performance.
• Support Vector Machine (SVM): A method that finds the hyperplane that best separates
different classes in a high-dimensional feature space.
• K-Nearest Neighbors (KNN): A simple algorithm that classifies a sample based on the
majority class of its nearest neighbors.
• Naive Bayes: A probabilistic classifier based on Bayes’ theorem, assuming independence
between features.
K-Nearest Neighbors (KNN) Algorithm for Classification
The K-Nearest Neighbors (KNN) algorithm is a non-parametric and lazy supervised learning
algorithm used for both classification and regression tasks. For classification, it works by finding
the K nearest data points to a given input and predicting the label based on a majority vote from
those K neighbors.
Steps to Perform KNN Classification
1. Choose the number of neighbors (K): Select the value of K, which is the number of
nearest neighbors to consider when making a prediction.
2. Calculate the distance: For a new input point, calculate the distance between the input and
each point in the training dataset using a distance metric (commonly Euclidean distance).
3. Sort the distances: Sort the training data points by their distance from the input point.
4. Select the K nearest neighbors: Select the K points with the smallest distances.
5. Classify by majority vote: Assign the most common label among the K neighbors as the
predicted label.
6. Return the predicted label: The output of the model is the predicted class label for the
input data.
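A from-scratch sketch of these steps (Euclidean distance, majority vote); the function name and toy data are assumed for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 2. Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 3-4. Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # 5. Majority vote among those neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage with assumed data
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```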
Naive Bayes Classification
Naive Bayes is a simple and efficient classification algorithm based on Bayes' Theorem, which is
particularly useful for large datasets. It assumes that the features used for classification are
independent (this is the "naive" assumption). Despite this assumption often being unrealistic,
Naive Bayes can perform surprisingly well in many real-world applications.
Bayes' Theorem
Bayes' Theorem is the foundation of Naive Bayes. It describes the probability of an event occurring,
given the probability of other related events. The formula for Bayes' Theorem is:
P(A∣B)=P(B∣A)P(A)/P(B)
Where:
• P(A∣B) is the posterior probability: the probability of event A occurring given that event B
has occurred.
• P(B∣A) is the likelihood: the probability of event B occurring given that event A has
occurred.
• P(A) is the prior probability: the initial probability of event A.
• P(B) is the evidence or normalizing constant: the total probability of event B.
In the context of Naive Bayes, P(A) corresponds to the class label, and P(B∣A) represents the
probability of observing the features given the class.
Naive Bayes Classification Steps
1. Calculate Prior Probabilities: The prior probability P(C) is the probability of each class in
the dataset. This is usually calculated by dividing the number of occurrences of each class by
the total number of samples.
2. Calculate Likelihood: For each class, calculate the likelihood P(X∣C) of the features
X=(X1,X2,...,Xn) given the class C. Naive Bayes assumes the features are independent, so:
P(X∣C)=P(X1∣C)⋅P(X2∣C)⋅⋯⋅P(Xn∣C)
3. Calculate Posterior Probability: The posterior probability P(C∣X) is calculated using
Bayes' Theorem:
P(C∣X)=P(C)⋅P(X∣C)/P(X)
Where:
• P(C) is the prior probability of the class.
• P(X∣C) is the likelihood of the features given the class.
• P(X) is the probability of the features (evidence), which is a constant across all
classes, so it does not affect the final classification.
4. Choose the Class with the Maximum Posterior Probability: For each class, calculate
P(C∣X), and the class with the highest posterior probability is assigned as the predicted
class.
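As a hedged sketch, scikit-learn's GaussianNB carries out these steps (priors, per-class likelihoods, posterior comparison) under a Gaussian assumption for the feature likelihoods; the data below is toy data invented for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data assumed for illustration: two features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
              [5.0, 6.2], [5.1, 5.8], [4.9, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB()
clf.fit(X, y)                            # learns class priors and per-class feature means/variances
print(clf.predict([[1.0, 2.0]]))         # class with the maximum posterior
print(clf.predict_proba([[1.0, 2.0]]))   # posterior probabilities P(C|X)
```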
Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression tasks. However, it is primarily used for classification problems. SVM
works by finding the optimal hyperplane that best separates the data points of different classes. The
key idea is to maximize the margin (the distance between the hyperplane and the nearest data point
of any class).
Types of SVM
Linear SVM
Used when the data is linearly separable, i.e., the classes can be separated by a single straight hyperplane.
Non-Linear SVM
In real-world datasets, data points are often not linearly separable. To handle this, we use Non-
Linear SVM, which transforms the original feature space into a higher-dimensional space where
the data may become linearly separable.
This transformation is done using a technique called the kernel trick.
Classification with SVM
SVM is used for binary classification, but it can be extended to multi-class problems using
strategies like One-vs-One or One-vs-All.
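A brief sketch with scikit-learn's SVC using the RBF kernel (an application of the kernel trick); the ring-shaped toy data is assumed for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data assumed for illustration: a ring of one class around another
rng = np.random.default_rng(0)
inner = rng.normal(0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(0, 0.2, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# The RBF kernel lets the SVM separate classes that are not linearly separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0, 0], [3, 0]]))  # expected: [0 1]
```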
Advantages of SVM
• SVM performs well in high-dimensional spaces and is effective when there is a clear margin
of separation.
• The kernel trick allows SVM to efficiently handle non-linear classification.
• SVM is robust to overfitting, especially in high-dimensional space.
Disadvantages of SVM
• SVMs are computationally intensive, especially for large datasets.
• Choosing the right kernel and tuning the hyperparameters can be challenging.
Decision Tree Classification
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It works by recursively splitting the data into subsets based on the value of input
features, aiming to make the output variable as homogeneous as possible in each leaf node. The
decision tree is structured like a flowchart, where each internal node represents a feature or
attribute, each branch represents a decision rule, and each leaf node represents the outcome or class
label.
Entropy in Decision Trees
Entropy is a measure of impurity or disorder in a set of data. In the context of decision trees, it is
used to quantify the uncertainty or unpredictability of a class label. The goal of decision tree
algorithms is to reduce entropy at each node, which means making decisions that lead to more
homogeneous subgroups.
Algorithm for Decision Tree (Classification)
1. Input: Dataset D with features F={f1,f2,...,fn}, a target variable Y, and a stopping criterion
(e.g., minimum samples per leaf).
2. Output: A decision tree T.
Steps:
1. Start with the root node containing all the data points.
2. If the stopping criterion is met (e.g., all data points belong to the same class, or the tree
reaches the maximum depth), return the leaf node with the majority class label.
3. Calculate the entropy of the dataset.
4. For each feature fi∈F, calculate the information gain from splitting the data based on
feature fi.
5. Select the feature that gives the highest information gain and split the dataset into subsets.
6. Recursively repeat steps 3-5 for each child node created by the split, until the stopping
criterion is met.
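A small sketch of entropy and the information gain of a candidate split, computed with NumPy; the function names and toy labels are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy reduction achieved by splitting parent into left/right subsets."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) + \
               (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
print(entropy(parent))                                   # 1.0 bit
print(information_gain(parent, parent[:3], parent[3:]))  # 1.0 (a perfect split)
```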
Advantages of Decision Trees
• Easy to understand and interpret.
• Can handle both numerical and categorical data.
• Non-linear relationships between features do not affect tree performance.
Disadvantages of Decision Trees
• Prone to overfitting, especially with deep trees.
• Sensitive to noisy data.
• May not perform well with unstructured data like text.
Clustering: Definition
Clustering is an unsupervised machine learning technique that groups similar data points into
clusters or groups, based on certain characteristics. The goal of clustering is to organize data points
in such a way that data points within a cluster are more similar to each other than to those in other
clusters.
Clustering can be used in various applications, such as customer segmentation, anomaly detection,
image compression, and pattern recognition.
Types of Clustering Algorithms
1. K-Means Clustering: It divides the data into k clusters by minimizing the variance
within each cluster.
2. Hierarchical Clustering: It creates a tree-like structure (dendrogram) to represent data
points and their hierarchy.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): It groups data
based on the density of data points in a region, and can handle noise (outliers).
4. Gaussian Mixture Models (GMM): It assumes that the data is generated from a mixture of
several Gaussian distributions and tries to estimate the parameters of these distributions.
Differences between Clustering and Classification
Clustering is an unsupervised learning technique and classification is a supervised learning
technique; they differ significantly in terms of approach and application.
Aspect | Clustering | Classification
Learning Type | Unsupervised Learning | Supervised Learning
Purpose | Group similar data points into clusters | Predict a categorical label for data points
Output | Clusters of similar data points | Predicted class label for each data point
Training Data | No labeled data, only the features of the data | Requires labeled data with features and corresponding labels
Methodology | Data is grouped based on similarity or distance metrics | Models are trained on input-output pairs (feature-label)
Examples | K-Means, DBSCAN, Hierarchical Clustering, GMM | Logistic Regression, Decision Trees, SVM, K-NN
Use Cases | Customer segmentation, anomaly detection, market basket analysis | Email classification, image recognition, sentiment analysis
Output Structure | Sets of data points in clusters, no predefined categories | Specific categories or classes
K-Means Algorithm (Clustering)
The K-Means algorithm is one of the most popular unsupervised learning algorithms for clustering.
It partitions a dataset into k clusters, where each cluster is represented by its centroid (the mean of
the data points in the cluster). The algorithm works iteratively to find the optimal centroids and
assign data points to the nearest cluster.
Steps in the K-Means Algorithm
1. Initialization:
• Choose k initial centroids randomly or using some method (e.g., KMeans++).
2. Assignment Step:
• Assign each data point to the nearest centroid. This is usually done by calculating the
Euclidean distance between each data point and each centroid. The point is assigned
to the cluster with the closest centroid.
3. Update Step:
• After all points have been assigned, update the centroids of each cluster by
calculating the mean of all the points in that cluster.
4. Repeat:
• Repeat the assignment and update steps until convergence (i.e., when the centroids
no longer change or the assignments do not change).
5. Convergence:
• The algorithm stops when the centroids stabilize, and no further changes occur in the
cluster assignments.
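A minimal sketch with scikit-learn's KMeans (which uses k-means++ initialization by default); the 2-D toy data is assumed for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data assumed for illustration: two obvious groups
X = np.array([[1, 2], [1, 3], [2, 2],
              [8, 8], [8, 9], [9, 8]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # assignment + update steps, iterated to convergence
print("labels:", labels)
print("centroids:", kmeans.cluster_centers_)
```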
K-Medoids Algorithm (PAM)
K-Medoids is a clustering algorithm that is similar to K-Means, but instead of using the mean of
data points to represent the centroid of a cluster, it uses an actual data point, called the medoid. A
medoid is the data point that minimizes the total distance to all other points in the cluster. This
makes K-Medoids more robust to noise and outliers compared to K-Means.
PAM (Partitioning Around Medoids) is the most popular implementation of the K-Medoids
algorithm.
K-Medoids (PAM) Algorithm Steps
1. Initialization:
• Select k initial medoids randomly or using some other heuristic.
2. Assignment Step:
• Assign each data point to the nearest medoid. The distance between a point and a
medoid is calculated using a distance metric (Manhattan distance is commonly used).
3. Update Step:
• For each medoid, try to replace it with another point in the same cluster. Compute the
total distance for both the current medoid and the new candidate point. If the new
candidate minimizes the total distance, replace the old medoid with the new one.
4. Repeat:
• Repeat the assignment and update steps until convergence (i.e., when the medoids
no longer change).
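A simplified k-medoids style sketch in NumPy, alternating the assignment and medoid-update steps with Manhattan distance; this is an illustrative approximation, not the full PAM swap search, and the function name and toy data are assumptions.

```python
import numpy as np

def simple_kmedoids(X, k, n_iters=100, seed=0):
    """Simplified k-medoids: alternate assignment and medoid update (L1 distance)."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iters):
        # Manhattan distance of every point to every current medoid
        d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # pick the member minimizing total L1 distance to the other members
            intra = np.abs(X[members][:, None, :] - X[members][None, :, :]).sum(axis=(1, 2))
            new_idx[c] = members[intra.argmin()]
        if np.array_equal(new_idx, medoid_idx):   # converged: medoids unchanged
            break
        medoid_idx = new_idx
    return medoid_idx, labels

# Toy usage with assumed 2-D data
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
print(simple_kmedoids(X, k=2))
```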
Manhattan Distance (L1 Norm)
Manhattan distance is a metric used to measure the distance between two points in a grid-like path,
as opposed to Euclidean distance, which measures the shortest straight-line distance. The formula
for Manhattan distance between two points P(x1,y1) and Q(x2,y2) in 2D space is:
Manhattan Distance(P,Q) = ∣x2−x1∣ + ∣y2−y1∣
In an n-dimensional space, the Manhattan distance between two points P=(x1,x2,...,xn) and
Q=(y1,y2,...,yn) is:
Manhattan Distance(P,Q)=∑i=1^n∣xi−yi∣
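A tiny sketch of this formula in NumPy; the function name and sample points are illustrative.

```python
import numpy as np

def manhattan(p, q):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return np.abs(np.asarray(p) - np.asarray(q)).sum()

print(manhattan([1, 2], [4, 6]))  # |4-1| + |6-2| = 7
```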
DBSCAN Algorithm (Density-Based Spatial Clustering of Applications with
Noise)
DBSCAN is a density-based clustering algorithm that identifies clusters based on the density of
data points. It is particularly useful for datasets with noise and clusters of varying shapes.
Key Concepts:
1. Core Points: A data point that has at least a specified minimum number of neighboring
points (MinPts) within a given radius (epsilon, ϵ).
2. Border Points: A point that has fewer than MinPts neighbors but is within the epsilon
distance of a core point.
3. Noise Points (Outliers): A point that is neither a core point nor a border point. These points
do not belong to any cluster.
4. Epsilon (ϵ): The radius around a point used to search for neighboring points.
5. MinPts: The minimum number of points required to form a dense region (cluster).
DBSCAN Algorithm Steps:
1. Select a random point and check if it has at least MinPts points within ϵ. If it does, it is a
core point.
2. Form a cluster: If the point is a core point, retrieve all neighboring points (within ϵ) and
add them to the cluster.
3. Expand the cluster: For each point in the cluster, check if they have neighboring points. If a
neighboring point has MinPts neighbors, add those points to the cluster.
4. Handle noise: If a point does not meet the requirements for a core or border point, mark it
as noise (outlier).
5. Repeat until all points are processed.
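A short sketch with scikit-learn's DBSCAN on toy data; the eps and min_samples values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data assumed for illustration: two dense blobs plus one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [50.0, 50.0]])

db = DBSCAN(eps=0.5, min_samples=3)   # eps = neighbourhood radius, min_samples = MinPts
labels = db.fit_predict(X)
print(labels)   # cluster ids; -1 marks noise points (the outlier)
```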
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is a general framework used for finding
maximum likelihood estimates of parameters in probabilistic models, especially when the data has
missing or incomplete information. It works iteratively to optimize parameters.
Key Steps:
1. E-Step (Expectation Step): Given the current parameter estimates, compute the expected
value of the log-likelihood function, accounting for missing or latent data.
2. M-Step (Maximization Step): Maximize the expected log-likelihood found in the E-step
with respect to the parameters.
The EM Algorithm is widely used for Gaussian Mixture Models (GMMs), where the data is
assumed to come from a mixture of several Gaussian distributions.
Example (Gaussian Mixture Model):
1. E-Step: Calculate the posterior probabilities that each data point belongs to each Gaussian
component.
2. M-Step: Update the parameters (means, variances, and mixing coefficients) of the Gaussian
components based on the posterior probabilities.
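A brief sketch with scikit-learn's GaussianMixture, which runs EM internally; the 1-D toy mixture data is assumed for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data assumed for illustration: a mixture of two Gaussians
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)                                   # alternates E-steps and M-steps until convergence
print("means:", gmm.means_.ravel())
print("weights:", gmm.weights_)              # mixing coefficients
print("responsibilities:", gmm.predict_proba(X[:3]))  # E-step posteriors for a few points
```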
Multilayer Perceptron (MLP)
MLP (Multilayer Perceptron) is a class of feedforward artificial neural network models. It
consists of multiple layers of neurons and is used for supervised learning tasks such as classification
and regression.
Key Features:
• Input Layer: The layer where the data is input into the network.
• Hidden Layers: Layers between the input and output layers. They perform transformations
of the input data.
• Output Layer: The final layer that produces the network's output.
• Activation Functions: MLP uses nonlinear activation functions like sigmoid, ReLU, tanh,
etc., to introduce non-linearity to the model.
Training Process:
• Forward Propagation: Input is passed through the network to calculate the output.
• Loss Function: A loss function (like Mean Squared Error for regression or Cross-
Entropy for classification) measures how well the model's output matches the true labels.
• Backpropagation: The error is propagated back through the network to adjust the weights
using optimization algorithms like Gradient Descent.
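A compact sketch with scikit-learn's MLPClassifier (one hidden layer of ReLU units, trained by backpropagation) on its bundled digits dataset; the layer size and iteration count are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One hidden layer of 64 ReLU units; weights are fit by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(64,), activation="relu", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```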
Principal Component Analysis (PCA)
PCA is a statistical technique used for dimensionality reduction. It transforms a set of possibly
correlated variables into a smaller number of uncorrelated variables called principal components.
Steps in PCA:
1. Standardize the data: Normalize the dataset (zero mean, unit variance).
2. Compute the covariance matrix: This matrix represents how the features co-vary with
each other.
3. Find the eigenvalues and eigenvectors of the covariance matrix: Eigenvectors represent
the principal components, and eigenvalues represent the variance explained by each
principal component.
4. Sort the eigenvalues: Sort the eigenvectors by the corresponding eigenvalues in descending
order.
5. Select top k eigenvectors: Choose the top k eigenvectors to reduce the dimensionality.
6. Transform the data: Multiply the original data by the selected eigenvectors to get the
reduced-dimensional data.
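A short sketch of the same steps in NumPy (standardize, covariance, eigendecomposition, projection); the function name and random data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    # 1. Standardize (zero mean, unit variance)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(Xs, rowvar=False)
    # 3. Eigenvalues / eigenvectors (eigh: the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort components by eigenvalue, descending
    order = np.argsort(eigvals)[::-1]
    # 5. Keep the top-k eigenvectors
    components = eigvecs[:, order[:k]]
    # 6. Transform the data
    return Xs @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, k=2).shape)   # (100, 2)
```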
Applications of PCA:
• Data compression: Reduce the dimensions of data while retaining as much information as
possible.
• Visualization: Reducing high-dimensional data to 2D or 3D for visualization.
• Noise reduction: By discarding components with low variance, noise can be reduced.