Week 9. Unsupervised Learning

The document provides an overview of machine learning, focusing on unsupervised learning and clustering algorithms such as K-Means and Agglomerative Clustering. It explains the differences between supervised and unsupervised learning, the importance of dimensionality reduction, and methods for choosing the optimal number of clusters. Additionally, it covers techniques like PCA for feature extraction and the challenges of evaluating unsupervised learning outcomes.

Machine Learning:

Clustering Algorithms

Instructor: Sabina Mammadova


Agenda

• Unsupervised Learning

• Agglomerative Clustering

• K-Means Algorithm

• Choosing K – Elbow method & Silhouette Analysis

• Dimensionality Reduction
Machine Learning Algorithms

• Supervised Learning
  – Regression: Linear Regression, Polynomial Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
  – Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Random Forest, Naïve Bayes
• Unsupervised Learning
  – Clustering: K-Means, Hierarchical, DBSCAN
  – Association Analysis: Apriori, FP-Growth
  – Dimensionality Reduction: PCA, LDA
• Reinforcement Learning: Q-Learning, Deep Q-Networks…
Difference between Supervised and Unsupervised Learning

Supervised Learning:
• Input data is labelled
• There is a training phase
• Data is modelled based on the training dataset
• Known number of classes (for classification)

Unsupervised Learning:
• Input data is unlabeled
• There is no training phase
• Uses properties of the given data for clustering
• Unknown number of classes
What is Unsupervised Learning?
• Unsupervised learning is a type of machine learning where an
algorithm learns patterns and structures from data without
labeled outputs. Unlike supervised learning, where models are
trained using labeled input-output pairs (e.g., images labeled as
"cat" or "dog"), unsupervised learning works with unlabeled
data and tries to discover hidden patterns or relationships within
it.
• It is mainly used for data exploration, feature learning, and
preprocessing in machine learning pipelines.
• Hard to Evaluate: No ground truth to compare results with.
• Interpretability: The patterns found by the algorithm may not
always be meaningful or useful.
Unsupervised Learning Algorithms

• Dimensionality Reduction — the task of reducing the number of input features in a dataset.
• Anomaly Detection — the task of detecting instances that are very different from the norm.
• Clustering — the task of grouping similar instances into clusters.
Unsupervised Learning Algorithms
• Unsupervised learning includes transformations, clustering, and
anomaly detection algorithms, each with real-world applications.
• Transformations like dimensionality reduction help in
bioinformatics for gene expression analysis, while topic
extraction is used in news categorization and social media
monitoring.
• Clustering groups similar data points, enabling customer
segmentation in marketing and facial recognition in social
media.
• Anomaly detection algorithms identify unusual patterns, making
them essential for fraud detection in banking, cybersecurity
threat detection, and fault detection in manufacturing, helping
to recognize deviations from normal behavior.
Unsupervised Learning Algorithms
• One of the biggest challenges in unsupervised learning is evaluating whether the
algorithm has learned something useful. Since there are no labels in the data, we don’t
have a correct answer to compare the model’s output against. For example, imagine
using a clustering algorithm to group customers based on their purchasing behavior.
The algorithm might group people who buy luxury items separately from those who
buy everyday essentials. While this is a valid way to categorize customers, it may not
be what we were expecting—perhaps we wanted to group them based on shopping
frequency instead. However, because there are no predefined labels, we cannot
directly tell the algorithm what we want. The only way to assess the results is through
manual inspection.
• Due to this challenge, unsupervised learning is mostly used for exploratory data
analysis, helping data scientists uncover hidden patterns in the data rather than
making final decisions in automated systems. Another important use of unsupervised
learning is preprocessing for supervised learning. For example, dimensionality
reduction can simplify complex data, making it easier for supervised algorithms to
work efficiently while also improving their accuracy. Additionally, techniques like
scaling and normalization, which adjust data values to a consistent range, are also
considered unsupervised because they don’t rely on labeled data. These preprocessing
steps are crucial for improving the performance of machine learning models.
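As a small illustration of the preprocessing point above, here is a minimal sketch of scaling with scikit-learn's StandardScaler; the tiny data matrix is made up for illustration. The scaler learns only the feature means and variances, so no labels are involved.

```python
# Minimal sketch: scaling as an unsupervised preprocessing step.
# The data matrix here is made up for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# StandardScaler learns only feature means and standard deviations (no labels).
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)  # each column now has mean 0 and unit variance
```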
Agglomerative Clustering
Agglomerative Clustering
• Agglomerative clustering refers to a collection of clustering
algorithms that all build upon the same principles: the algorithm
starts by declaring each point its own cluster, and then merges
the two most similar clusters until some stopping criterion is
satisfied. The stopping criterion implemented in scikit-learn is the
number of clusters, so similar clusters are merged until only the
specified number of clusters are left. There are several linkage
criteria that specify how exactly the “most similar cluster” is
measured. This measure is always defined between two existing
clusters.
• Not appropriate for large datasets
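A minimal usage sketch with scikit-learn follows; the make_blobs data and the choice of n_clusters=3 are assumptions for illustration (any feature matrix would do).

```python
# Sketch: agglomerative clustering in scikit-learn.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Merging continues until only n_clusters clusters remain (the stopping criterion).
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(labels[:10])
```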
Computing Distance Matrix
• The default choice, ward, picks the two clusters to merge such that the variance within all clusters increases the least. This often leads to clusters that are relatively equally sized.
• Average linkage merges the two clusters that have the smallest average
distance between all their points.
• Complete linkage (also known as maximum linkage) merges the two clusters
that have the smallest maximum distance between their points.
• Single linkage: Merges the two clusters that have the smallest minimum
distance between any of their points. Sensitive to noise and outliers.
• Centroid linkage: Merges clusters based on the distance between their centroids
(mean points). Less sensitive to outliers than single linkage but can cause
inversion (clusters merging in unexpected ways).
• Median Linkage: Similar to centroid linkage but uses the median instead of the
mean when computing cluster distances.
• Ward works on most datasets. If the clusters have very dissimilar numbers of members (if one is much bigger than all the others, for example), average or complete linkage might work better; the sketch below compares the linkage options on the same data.
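The following sketch compares the four linkage options available in scikit-learn's AgglomerativeClustering (centroid and median linkage are not offered there); the synthetic data and cluster count are assumptions for illustration.

```python
# Sketch: comparing linkage criteria on the same synthetic data.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for linkage in ["ward", "average", "complete", "single"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    sizes = [int((labels == k).sum()) for k in range(3)]
    print(f"{linkage:>8}: cluster sizes {sizes}")
```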
What is Dendrogram?
• A Dendrogram is a diagram that represents the hierarchical relationship between objects. The
Dendrogram is used to display the distance between each pair of sequentially merged objects.
• These are commonly used in studying hierarchical clusters before deciding the number of
clusters significant to the dataset.
• The distance at which the two clusters combine is referred to as the dendrogram distance.
• The primary use of a dendrogram is to work out the best way to allocate objects to clusters.
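A dendrogram can be drawn with SciPy, for example; this sketch assumes matplotlib is available and uses a small make_blobs sample so the leaves stay readable.

```python
# Sketch: building and plotting a dendrogram with SciPy.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# linkage() records the full merge history; "ward" matches scikit-learn's default.
Z = linkage(X, method="ward")
dendrogram(Z)  # the y-axis shows the distance at which clusters merge
plt.xlabel("Sample index")
plt.ylabel("Dendrogram distance")
plt.show()
```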
Hierarchical Agglomerative
Clustering
• It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Unlike flat clustering, hierarchical clustering provides a structured way to group data. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Hierarchical Divisive Clustering
• It is also known as the top-down approach. This algorithm also does not require us to prespecify the number of clusters. Top-down clustering requires a method for splitting a cluster that contains the whole data set and proceeds by splitting clusters recursively until individual data points have been split into singleton clusters.
K-Means Algorithm
What is K-Means Algorithm?
• K-Means Clustering is an Unsupervised Learning algorithm, which groups
the unlabeled dataset into different clusters. Here K defines the number
of pre-defined clusters that need to be created in the process, as if K=2,
there will be two clusters, and for K=3, there will be three clusters, and so
on.
• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the cluster assignments stop changing. The value of k should be predetermined in this algorithm.
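A minimal sketch with scikit-learn's KMeans; the make_blobs data and K=3 are assumptions for illustration.

```python
# Sketch: K-Means in scikit-learn. K (n_clusters) must be chosen in advance.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Centroids:\n", kmeans.cluster_centers_)
print("Inertia:", kmeans.inertia_)
```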
What is K-Means Algorithm?
The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.
How does the K-Means Algorithm
Work?
• Step 1: Select the number K to decide the number of clusters.
• Step 2: Select K random points as the initial centroids (they do not have to come from the input dataset).
• Step 3: Assign each data point to its closest centroid, which forms the predefined K clusters.
• Step 4: Calculate the variance and place a new centroid for each cluster.
• Step 5: Repeat Step 3, i.e., reassign each data point to the new closest centroid of its cluster.
• Step 6: If any reassignment occurred, go back to Step 4; otherwise go to FINISH.
• Step 7: The model is ready. (A from-scratch sketch of this loop follows below.)
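The following is a from-scratch sketch of that loop in NumPy, mainly to make the steps concrete; the function name kmeans and its parameters are illustrative, empty-cluster handling is omitted, and in practice scikit-learn's KMeans is the sensible choice.

```python
# From-scratch sketch of the K-Means loop above (illustrative only).
import numpy as np

def kmeans(X, k, max_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3/5: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: place the new centroid of each cluster at the mean of its points
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```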
How does the K-Means Algorithm
Work?

[Figure: ten numbered panels illustrating successive K-Means iterations.]
Choosing K – Elbow Method & Silhouette Analysis
Elbow Method – optimal value of K
• Perform K-Means clustering on the dataset for a range of values of K (e.g., K = 1
to 10).
• As K increases, inertia (the sum of squared distances between data points and their nearest centroid) will decrease, because adding more clusters reduces the distance between data points and their centroids.
• The "elbow" point in the plot is where the rate of decrease sharply slows down.
This point suggests a good balance between the number of clusters and the
amount of variance explained.
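A sketch of the elbow method, assuming matplotlib and synthetic make_blobs data; inertia_ is the attribute scikit-learn's KMeans exposes for the sum of squared distances.

```python
# Sketch: elbow method — run K-Means for a range of K and plot inertia.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```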
Silhouette Analysis
• Silhouette analysis can be used to study the separation distance between the
resulting clusters. The silhouette plot displays a measure of how close each
point in one cluster is to points in the neighboring clusters and thus provides a
way to assess parameters like number of clusters visually. This measure has a
range of [-1, 1].
• Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
• To calculate the silhouette coefficient of a point i, we define the mean distance from the point to all other points in its own cluster, a(i), and the mean distance from the point to all points in the nearest neighboring cluster, b(i). The coefficient is then s(i) = (b(i) − a(i)) / max(a(i), b(i)); a sketch of the computation follows below.
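The sketch uses scikit-learn's silhouette_score, which returns the average of s(i) over all points; the make_blobs data and the range of K values are assumptions for illustration.

```python
# Sketch: average silhouette score for several candidate values of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    score = silhouette_score(X, labels)  # mean s(i) over all samples
    print(f"For n_clusters = {k} the average silhouette_score is: {score:.3f}")
```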
Silhouette Analysis

• For n_clusters = 2 The average silhouette_score is : 0.7049787496083262


• For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
• For n_clusters = 4 The average silhouette_score is : 0.650518663272943
• For n_clusters = 5 The average silhouette_score is : 0.561464362648773
• For n_clusters = 6 The average silhouette_score is : 0.4857596147013469

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Dimensionality Reduction
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of
features (variables) in a dataset while preserving as much important
information as possible. In real-world data, many features can be
correlated or redundant, making models more complex and harder to
interpret. By reducing dimensions, we can simplify the data, improve
computational efficiency, and enhance visualization.
• Imagine you have a dataset with 100 different features describing
customer behavior in an online store. Many of these features might
carry overlapping information—like "total money spent" and "average
purchase value." Instead of analyzing all 100 features, dimensionality
reduction methods help find the most informative ones or create new
features that summarize the essential patterns in the data.
Dimensionality Reduction
• There are two main approaches to dimensionality reduction:
• Feature Selection – Choosing a subset of the most important original
features based on certain criteria (e.g., removing low-variance or highly
correlated features).
• Feature Extraction – Creating new, fewer features that capture the essential
patterns of the data. Techniques like Principal Component Analysis (PCA)
and t-SNE fall into this category.
• A major benefit of dimensionality reduction is that it helps in data
visualization. If a dataset has 50 or 100 features, it's impossible to plot
directly. But by reducing it to two or three dimensions, we can create
scatter plots that reveal meaningful clusters and relationships.
• However, reducing dimensions also has risks—some details might be
lost, and the transformed features may not always have clear
interpretations. Therefore, it's important to balance simplicity with
accuracy, choosing the right number of dimensions based on the
specific problem and dataset.
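A small sketch contrasting the two approaches with scikit-learn; the random data, the near-constant column, and the thresholds are assumptions for illustration.

```python
# Sketch: feature selection keeps a subset of the original columns,
# feature extraction (PCA) builds new combined features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 3] = 0.001 * rng.normal(size=200)  # a nearly constant, uninformative feature

X_selected = VarianceThreshold(threshold=0.01).fit_transform(X)  # drops column 3
X_extracted = PCA(n_components=2).fit_transform(X)               # two new features

print(X.shape, X_selected.shape, X_extracted.shape)  # (200, 10) (200, 9) (200, 2)
```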
What is PCA?
• Principal Component Analysis (PCA) is a technique used to
reduce the dimensionality of a dataset while preserving as
much variance (information) as possible. It transforms
correlated features into a new set of uncorrelated features
called principal components.
• PCA is widely used in:
  – Data compression – reducing storage needs.
  – Feature selection – removing less important features.
  – Noise reduction – eliminating redundant information.
  – Visualization – plotting high-dimensional data in 2D or 3D.
• Dimensionality is reduced by selecting only the top k principal components instead of using all original features.
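A minimal PCA sketch with scikit-learn, assuming the Iris dataset and k = 2 components; standardizing the features first is common practice.

```python
# Sketch: PCA keeping the top 2 principal components of the Iris data.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 4 original features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)       # 2 new, uncorrelated features

print("Reduced shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```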
Eigenvectors and Eigenvalues
• Eigenvectors are the new directions or axes that we use to transform our
data. In PCA, they represent the directions of maximum variance (the
most important features of the data).
• Eigenvalues tell us how much variance (or "information") is captured by
each eigenvector. A higher eigenvalue means the corresponding
eigenvector (principal component) is more important because it captures
more variance.
• In PCA:
• We first calculate the covariance matrix of our data.
• We find eigenvectors and eigenvalues of the covariance matrix.
• The eigenvectors are the new axes for the transformed data (principal
components).
• The eigenvalues tell us how much each principal component explains the
variance in the data.
• We pick the principal components with the largest eigenvalues to reduce the
number of dimensions while keeping most of the data's information.
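The same steps can be sketched with NumPy alone to make the eigen-decomposition view concrete; the random data and k = 2 are assumptions for illustration.

```python
# Sketch: PCA via the covariance matrix and its eigen-decomposition.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_centered = X - X.mean(axis=0)

# 1. Covariance matrix of the centered data.
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors (principal directions) and eigenvalues (variance captured).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort by decreasing eigenvalue and keep the top k components.
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]

# 4. Project the data onto the new axes.
X_reduced = X_centered @ components
print("Explained variance ratio:", eigenvalues[order[:k]] / eigenvalues.sum())
```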
Thank you!
