MACHINE LEARNING
MODULE – 5
COURSE CODE BCS602
TOPICS
• Introduction to Clustering Approaches
• Proximity Measures
• Hierarchical Clustering Algorithms
• Partitional Clustering Algorithm
• Density-based Methods
• Grid-based Approach
• Reinforcement Learning: Overview of Reinforcement Learning
• Scope of Reinforcement Learning
• Reinforcement Learning as Machine Learning
• Components of Reinforcement Learning
• Markov Decision Process
• Multi-Armed Bandit Problem and Reinforcement Problem Types
• Model-based Learning
• Model-Free Methods
• Q-Learning
• SARSA Learning.
Introduction to Clustering Approaches
What is Clustering?
• Clustering is part of unsupervised learning — meaning, you
are given data without any labels (no correct answers) and
your goal is to find structure or patterns in that data.
• Clustering means grouping similar data points together into
clusters.
• For example, if you have pictures of animals and no labels,
clustering might group dogs together, cats together, etc.,
based on their features (like size, color, etc.).
How Does Clustering Work?
• You are given a dataset of objects or samples (e.g., measurements, images,
or records).
• Each object has features (like height, width, etc.).
• The goal is to group similar objects into clusters, so that:
• Items in the same cluster are similar to each other.
• Items in different clusters are very different.
(a) Top graph:
• Each dot is a data point with two features (plotted as Sample vs. Value).
• At this stage, no clear grouping is shown.
(b) Bottom graph:
• Ellipses are drawn around groups of dots that are close to each other.
• These represent manually identified clusters.
What If There Are Many Features?
• If your data has many features (e.g., 100 or more), then you can’t visualize it
easily like in 2D.
• You need automated clustering algorithms like:
• K-Means
• DBSCAN
• Hierarchical Clustering
What is a Centroid?
• A centroid is the center of a cluster (like the average of all points in that
group).
• Example:
• Suppose the data points are (3,3), (2,6), and (7,9). The centroid is the per-coordinate mean: ((3+2+7)/3, (3+6+9)/3) = (4, 6).
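As a quick check, here is a minimal sketch of the same centroid computation; NumPy is just one convenient choice, any array tool works:

```python
import numpy as np

# The three example points from above.
points = np.array([[3, 3], [2, 6], [7, 9]])

# The centroid is the mean of each coordinate across all points.
centroid = points.mean(axis=0)
print(centroid)  # [4. 6.]
```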
Important Points:
• Clusters should not overlap — each data point should belong to one cluster.
• Clustering is done using trial and error, since there are no labels to guide the
algorithm.
• Once clusters are found, they can be given labels later, if needed (e.g., "this group is dogs").
Challenges of Clustering Algorithms
• High-Dimensional Data:
• When the data has many features (called high dimensions), clustering becomes
hard.
• Example: If each object has 100 or more features, it’s tough to decide what makes
them similar or different.
• Scaling Issues:
• Some clustering algorithms work well with small data but struggle with large or high-
dimensional data.
• Also, values of different scales (like cm vs kg vs dollars) can confuse the algorithm if
not handled properly.
• Units of Measurement:
• If one feature is in kilograms and another in pounds, this can affect the distance calculations. You must make sure all data is on a common scale (normalization); a small scaling sketch is shown after this list.
• Proximity (Distance) Measure:
• Clustering uses distance to decide how similar two points are.
• Choosing the right way to measure this distance (called a proximity measure) is a big
challenge.
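As mentioned under Units of Measurement, features should be brought to a common scale before computing distances. A minimal min-max scaling sketch (the feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical feature matrix: column 0 in kg, column 1 in dollars.
X = np.array([[60.0, 30000.0],
              [75.0, 52000.0],
              [90.0, 41000.0]])

# Min-max normalization rescales every feature to the range [0, 1],
# so no single unit (kg, dollars, ...) dominates the distance calculation.
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)
print(X_scaled)
```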
Advantages of Clustering:
• Can handle missing values and outliers.
• Helps label unlabelled data in semi-supervised learning.
• Easy to understand and implement.
• Clustering is one of the oldest and simplest techniques.
Disadvantages of Clustering:
• Sensitive to the order and starting point of data.
• The number of clusters needs to be given in advance.
• Scaling the data properly is difficult.
• Choosing the right distance measure is hard.
Proximity Measures
When we do clustering, we want to group similar items
together. But how do we measure similarity or difference
between two objects?
We use proximity measures:
• Similarity → how close two items are.
• Dissimilarity (distance) → how far apart two items are.
In clustering, we usually use distance (dissimilarity), which
is a number showing how different two objects are. The
more the distance, the more different they are.
Types of Distance Measures (for numerical data)
Minkowski Distance
Minkowski distance is a general formula for measuring the
distance between two objects (or points), and it can be
customized depending on the value of a parameter called r.
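A small sketch of the Minkowski distance, assuming the usual formula d(x, y) = (Σᵢ |xᵢ − yᵢ|^r)^(1/r); r = 1 gives the Manhattan distance and r = 2 gives the Euclidean distance:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance: (sum_i |x_i - y_i|^r) ** (1/r)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, r=1))  # 7.0  (Manhattan distance)
print(minkowski(a, b, r=2))  # 5.0  (Euclidean distance)
```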
Binary Attributes (0 or 1)
When we have objects with binary features (values are only 0 or 1),
we need a special way to measure how similar or different they are.
We do this using a contingency table.
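A minimal sketch of the contingency-table idea for two binary objects: count how often both values are 1 (f11), both are 0 (f00), and how often they disagree (f10, f01), then combine those counts into a similarity score. The Simple Matching and Jaccard coefficients used below are common choices; whether the textbook uses exactly these is an assumption.

```python
def binary_similarity(x, y):
    """Build the 2x2 contingency counts for two binary vectors and
    return the Simple Matching and Jaccard coefficients."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # both 1
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))  # both 0
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))  # x is 1, y is 0
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))  # x is 0, y is 1
    smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # matches / all attributes
    jaccard = f11 / (f11 + f10 + f01)             # ignores 0-0 matches
    return smc, jaccard

print(binary_similarity([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # (0.6, 0.5)
```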
Hierarchical Clustering
Hierarchical clustering is a way to group data into a tree-like structure (called
a dendrogram), where similar data points are combined into clusters step by
step.
There are two types:
1. Agglomerative (Bottom-Up): Start with each point as its own cluster and merge them.
2. Divisive (Top-Down): Start with all points in one big cluster and split them.
Agglomerative Clustering (Bottom-Up)
• Start with each point in its own cluster.
• If there are N data points, you have N clusters to begin with.
• Repeat the following steps until only one cluster is left:
• Find the two clusters that are most similar (closest).
• Merge those two clusters into one.
• This reduces the number of clusters by 1 each time.
• When only one big cluster remains, you’re done.
You can cut the tree at any level to get different numbers of clusters.
Example:
If you have 4 points: A, B, C, D
• Start with 4 clusters: {A}, {B}, {C}, {D}
• Merge the closest two → {A}, {B}, {C, D}
• Merge again → {A}, {B, C, D}
• Merge one last time → {A, B, C, D} (final cluster)
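A minimal sketch of this merging process on four made-up 2-D points labelled A–D, using SciPy; the coordinates are assumptions chosen so that C and D merge first, as in the example above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for A, B, C, D; C and D are closest.
points = np.array([[0.0, 0.0],   # A
                   [6.0, 0.0],   # B
                   [9.0, 0.0],   # C
                   [9.5, 0.0]])  # D

# Each row of Z records one merge: the two clusters joined and their distance.
Z = linkage(points, method='single')
print(Z)

# Cut the tree to get, e.g., 2 clusters.
print(fcluster(Z, t=2, criterion='maxclust'))
```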
Single Linkage Clustering
• This is a hierarchical clustering method.
• It merges clusters based on the smallest distance (minimum) between any two points across clusters.
• The distance between two clusters is the minimum of all pairwise distances between their members.
• You repeat merging the closest clusters until only one cluster remains.
Example: refer to the textbook.
Complete Linkage (also called MAX or Clique)
In Complete Linkage, the distance between two clusters is defined as the maximum distance between any
pair of points, where one point belongs to the first cluster and the other to the second cluster.
This method ensures that all points in the resulting merged cluster are close to each other, making the
resulting clusters tighter and more compact.
Dendrogram (Figure 13.3):
• The dendrogram shows that clusters are merged only when even their
farthest members are not too far apart.
• In the figure, merging happens at higher levels because Complete Linkage
waits until even the maximum distances between clusters are tolerable.
• This leads to smaller, more uniform clusters.
Average Linkage
In Average Linkage, the distance between two clusters is calculated as the
average distance of all possible pairs of points, one from each cluster.
• This approach finds a balance between single and complete linkage by
considering all pairwise distances — not just the closest or farthest points.
Dendrogram (Figure 13.4):
• Shows a more balanced merging strategy.
• Clusters are merged when their overall (average) distances are small.
• This often leads to better handling of noisy data compared to complete
linkage.
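Single, complete, and average linkage differ only in how the cluster-to-cluster distance is computed. In SciPy, for example, this is just the method argument of linkage; the points below are the same made-up values as in the earlier sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [6.0, 0.0], [9.0, 0.0], [9.5, 0.0]])

# Only the cluster-to-cluster distance rule changes between the three methods.
for method in ('single', 'complete', 'average'):
    Z = linkage(points, method=method)
    print(method, 'final merge distance:', Z[-1, 2])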
Mean-Shift Clustering Algorithm
Mean-Shift is a non-parametric and hierarchical clustering method that
does not require prior knowledge of the number of clusters.
It is also known as:
• Mode seeking algorithm
• Sliding window algorithm
Imagine placing a "window" (kernel) on the data points and sliding it toward
the region with the highest data density (like climbing a hill of data density).
Over time, the window "shifts" to areas of high density — which are the modes
(centers) of clusters.
How It Works :
Step 1: Design a window
• Choose a window shape and size (usually a Gaussian kernel is used).
• The window radius is called bandwidth — this determines how close data
points must be to influence the window.
Step 2: Place the window on the data points
• Initially, center the window at a random data point (or iterate over all data
points).
Step 3: Compute the mean of all points under the window
• Find all points within the window (within a given distance — i.e., inside the
kernel's radius).
• Compute the mean (centroid) of these points.
Step 4: Move the window to this mean
• Shift the center of the window to this new mean.
• This shift is done using the mean shift vector:
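For reference, one common form of the mean shift vector, assuming a kernel K centered at the current window position x and N(x) denoting the points inside the window, is:

```latex
m(x) = \frac{\sum_{x_i \in N(x)} K(x_i - x)\, x_i}{\sum_{x_i \in N(x)} K(x_i - x)} - x
```

In words: the window moves from x to the kernel-weighted mean of the points it currently covers.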
Step 5: Repeat steps 3–4 until convergence
• Once the window's movement is negligible, convergence is achieved.
• The final location of the window is considered a cluster center (a mode).
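A minimal sketch of these steps using scikit-learn's MeanShift; the sample data and the bandwidth settings are assumptions:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Two hypothetical dense regions of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))])

# The bandwidth (window radius) can be estimated from the data.
bandwidth = estimate_bandwidth(X, quantile=0.3)

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
print('cluster centers (modes):', ms.cluster_centers_)
```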
Applications:
• Image segmentation
• Object tracking
• Pattern recognition
• Computer vision
Advantages of Mean-Shift Clustering
1. No Model Assumptions
• Unlike K-means or Gaussian Mixture Models, Mean-Shift makes no assumptions about the shape or number of clusters in the data.
• It’s non-parametric: you don’t need to define the number of clusters ahead of time.
2. Suitable for All Non-Convex Shapes
• Works well with clusters that have arbitrary or non-convex shapes (e.g., crescent moon, spiral).
• K-means would struggle in such cases due to its spherical bias.
3. Only One Parameter (Bandwidth)
• The only key parameter is the bandwidth (radius of the window/kernel).
• Simple configuration compared to algorithms that require tuning multiple hyperparameters.
4. Robust to Noise
• Since it’s based on density estimation, sparse outlier points don’t significantly affect the cluster centers.
• Naturally filters out noise points not falling into any dense region.
5. No Local Minima or Premature Termination
• The method does not rely on a loss function to minimize, so it avoids getting trapped in local minima.
• The shift continues until true convergence is achieved (no more movement).
Disadvantages of Mean-Shift Clustering
1. Selecting the Bandwidth is Challenging
• The bandwidth is the critical parameter that controls the result:
• If the bandwidth is too large, multiple clusters may merge into one (under-clustering).
• If the bandwidth is too small, the algorithm may create too many clusters (over-clustering) or fail to converge efficiently.
2. Number of Clusters Cannot Be Specified
• The user has no direct control over the number of clusters.
• This is determined entirely by the data distribution and the bandwidth.
• May be problematic in applications where a specific number of clusters is needed.
K-Means Clustering
• K-means is a clustering algorithm that groups your data into K clusters.
• The goal is to group similar data points together.
• You need to choose the number of clusters (K) before starting.
Step 1: Choose how many clusters you want (K).
Step 2: Randomly pick K points from your data. These act as the initial
centroids (centers of clusters).
Step 3:
• Assign each data point to the closest centroid.
• Use Euclidean distance (or another measure) to decide "closeness".
Step 4:
• Calculate a new centroid for each cluster.
→ It's the average (mean) of all the points in that cluster.
Step 5:
• Repeat Steps 3 and 4 until the centroids don’t change anymore (i.e., the
clusters are stable).
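A minimal from-scratch sketch of these steps in NumPy; the data and the value of K are assumptions, and in practice a library implementation such as scikit-learn's KMeans would normally be used:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),    # one blob
               rng.normal([4, 4], 0.5, (30, 2))])   # another blob
K = 2  # Step 1: choose the number of clusters.

# Step 2: randomly pick K data points as the initial centroids.
centroids = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # Step 3: assign each point to its closest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 4: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Step 5: stop when the centroids no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print('final centroids:', centroids)
```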