Datascience Notes Unit-5
Unsupervised Learning:
Unsupervised learning is a type of machine learning that analyzes and models data without
labelled responses or predefined categories. Unlike supervised learning, where the
algorithm learns from input-output pairs, unsupervised learning algorithms work solely with
input data and aim to discover hidden patterns, structures or relationships within the
dataset independently, without any human intervention or prior knowledge of the data's
meaning.
The image shows a set of animals, like elephants, camels and cows, that represents the raw data the unsupervised learning algorithm will process.
The "Interpretation" stage signifies that the algorithm doesn't have predefined labels or
categories for the data. It needs to figure out how to group or organize the data based on
inherent patterns.
The output shows the results of the unsupervised learning process. In this case, the
algorithm might have grouped the animals into clusters based on their species (elephants,
camels, cows).
Working of Unsupervised Learning:
1. Provide Unlabeled Data
The model receives raw input data with no labels or predefined categories.
2. Select an Algorithm
Choose a suitable unsupervised algorithm such as clustering (e.g., K-Means), association rule learning (e.g., Apriori) or dimensionality reduction (e.g., PCA), based on the goal.
3. Discover Patterns
The algorithm looks for similarities, relationships or hidden structures within the data and organizes it into groups (clusters), rules or lower-dimensional forms without human input.
Example: It may group similar animals together or extract key patterns from large datasets.
4. Interpret the Results
Analyze the discovered groups, rules or features to gain insights or use them for further tasks like visualization, anomaly detection or as input for other models.
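As a concrete illustration of these steps, here is a minimal sketch; the three-blob toy data and the choice of K = 3 are assumptions made only for illustration, not part of the original notes.

import numpy as np
from sklearn.cluster import KMeans

# Step 1: unlabeled input data (three assumed 2-D blobs)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

# Step 2: select an algorithm (clustering with K-Means, K = 3 assumed)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Step 3: the algorithm discovers groups without any labels
labels = kmeans.fit_predict(X)

# Step 4: interpret the results, e.g. inspect the size of each group
print(np.bincount(labels))   # roughly [50, 50, 50]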
1. Clustering Algorithms
Clustering is an unsupervised machine learning technique that groups unlabeled data into
clusters based on similarity. Its goal is to discover patterns or relationships within the data
without any prior knowledge of categories or labels.
Works purely from the input data without any output labels.
K-means Clustering: Groups data into K clusters based on how close the points are to each
other.
Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats scattered points
as noise.
Mean-Shift Clustering: Discovers clusters by moving points toward the most crowded areas.
Spectral Clustering: Groups data by analyzing connections between points using graphs.
Efficient Tree-based Algorithms: Scales to handle large datasets by organizing data in tree
structures.
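For instance, here is a minimal sketch of density-based clustering with scikit-learn's DBSCAN; the toy data and the eps/min_samples values are assumptions chosen only for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should be treated as noise
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 2)),
               rng.normal(4, 0.3, size=(40, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids; -1 marks points labelled as noise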
3. Dimensionality Reduction
Dimensionality reduction is the process of decreasing the number of features or variables in
a dataset while retaining as much of the original information as possible. This technique
helps simplify complex data, making it easier to analyze and visualize. It also improves the
efficiency and performance of machine learning algorithms by reducing noise and
computational cost.
It reduces the dataset’s feature space from many dimensions to fewer, more meaningful
ones.
Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class separability
for classification tasks.
Non-negative Matrix Factorization (NMF): Breaks data into non-negative parts to simplify
representation.
Locally Linear Embedding (LLE): Reduces dimensions while preserving the relationships
between nearby points.
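As a brief example of dimensionality reduction, here is a sketch using PCA from scikit-learn; the random 10-feature dataset is an assumption for illustration only.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))        # 200 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 most informative directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component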
Unsupervised learning has diverse applications across industries and domains. Key
applications include:
Anomaly Detection: Identifies unusual patterns in data, aiding fraud detection, cybersecurity
and equipment failure prevention.
Image and Text Clustering: Groups similar images or documents for tasks like organization,
classification or content recommendation.
Social Network Analysis: Detects communities or trends in user interactions on social media
platforms.
Advantages:
No need for labeled data: Works with raw, unlabeled data, saving time and effort on
data annotation.
Discovers hidden patterns: Finds natural groupings and structures that might be missed by
humans.
Handles complex and large datasets: Effective for high-dimensional or vast amounts of data.
Clustering is an unsupervised machine learning technique that groups similar data points
together into clusters based on their characteristics, without using any labeled data. The
objective is to ensure that data points within the same cluster are more similar to each other
than to those in different clusters, enabling the discovery of natural groupings and hidden
patterns in complex datasets.
How: Data points are assigned to clusters based on similarity or distance measures.
Types of Clustering
1. Hard Clustering: In hard clustering, each data point strictly belongs to exactly one cluster,
no overlap is allowed. This approach assigns a clear membership, making it easier to
interpret and use for definitive segmentation tasks.
Example: If clustering customer data into 2 segments, each customer belongs fully to
either Cluster 1 or Cluster 2 without partial memberships.
To see the difference: for the same data point, hard clustering outputs a single cluster assignment (e.g., "Cluster 1"), while soft clustering outputs a degree of membership for each cluster (e.g., 0.9 for Cluster 1 and 0.1 for Cluster 2).
2. Soft Clustering: Soft clustering assigns each data point a probability or degree of
membership to multiple clusters simultaneously, allowing data points to partially belong to
several groups.
Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2,
reflecting uncertainty or overlap in group characteristics.
Use cases: Situations with overlapping class boundaries, fuzzy categories like
customer personas or medical diagnosis.
Clustering methods can be classified on the basis of how they form clusters:
1. Centroid-based Clustering
Centroid-based clustering organizes data points around central prototypes called centroids, where each cluster is represented by the mean (or medoid) of its members. The number of clusters is specified in advance and the algorithm allocates points to the nearest centroid, making this technique efficient for spherical and similarly sized clusters but sensitive to outliers and initialization.
Algorithms:
K-means: Assigns each point to the nearest cluster mean (centroid) and iteratively updates the centroids.
K-medoids: Similar to K-means but uses actual data points (medoids) as centers, making it more robust to outliers.
2. Density-based Clustering
Density-based clustering forms clusters in regions where points are densely packed and treats sparse, scattered points as noise, so the number of clusters does not need to be fixed in advance.
Algorithms:
DBSCAN: Groups points that lie in dense neighbourhoods and labels isolated points as noise (outliers).
3. Hierarchical (Connectivity-based) Clustering
Hierarchical clustering builds a tree of nested clusters by progressively merging or splitting groups (covered in detail later in these notes).
Approaches:
Agglomerative (Bottom-up): Start with each point as its own cluster; iteratively merge the closest clusters.
Divisive (Top-down): Start with one cluster; iteratively split into smaller clusters.
4. Distribution-based Clustering
Distribution-based clustering assumes the data is generated from a mixture of probability distributions (for example Gaussians) and assigns each point to the distribution it most likely came from.
Algorithm:
Gaussian Mixture Model (GMM): Fits data as a weighted mixture of Gaussian
distributions; assigns data points based on likelihood.
Cons:
Sensitive to initialization.
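A minimal GMM sketch with scikit-learn, using assumed toy data; the soft memberships returned by predict_proba also illustrate the soft-clustering idea described earlier.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two assumed 2-D blobs as toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# Fit a mixture of two Gaussian distributions
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm.predict(X)              # most likely component per point
soft_memberships = gmm.predict_proba(X)   # probability of each component
print(soft_memberships[:3].round(3))      # e.g. rows like [0.98, 0.02]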
5. Fuzzy Clustering
Fuzzy clustering extends traditional methods by allowing each data point to belong to
multiple clusters with varying degrees of membership. This approach captures ambiguity
and soft boundaries in data and is particularly useful when the clusters overlap or
boundaries are not clear-cut.
Algorithm:
Fuzzy C-Means: Similar to K-means but with fuzzy memberships updated iteratively.
Use Cases
Image Segmentation: Dividing images into meaningful parts for object detection,
medical diagnostics or computer vision tasks.
Recommendation Systems: Clustering user preferences to recommend movies,
products or content tailored to different groups.
Distance measures are the backbone of clustering algorithms. They are mathematical functions that determine how similar or different two data points are, and the choice of measure directly influences the shape, structure and quality of the resulting clusters. A well-chosen distance measure can lead to meaningful clusters that reveal hidden patterns in the data, while a poorly chosen measure can result in clusters that are misleading or irrelevant.
The performance of the clustering method and its result can be strongly impacted by the choice of distance measure.
It affects the formation of clusters and may have an impact on the validity and interpretability of the clusters.
There are several types of distance measures, each with its strengths and weaknesses. Here
are some of the most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It calculates
the straight-line distance between two points in n-dimensional space. The formula for
Euclidean distance is:
$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
where,
p and q are two points in n-dimensional space, and $p_i$ and $q_i$ are their i-th coordinates.
import numpy as np
import matplotlib.pyplot as plt

point1 = np.array([2, 3])
point2 = np.array([5, 7])

# Euclidean (straight-line) distance between the two points
euclidean_distance = np.linalg.norm(point1 - point2)
print("Euclidean Distance:", euclidean_distance)

# Plot the two points and the straight line joining them
plt.plot([point1[0], point2[0]], [point1[1], point2[1]], 'ro--')
plt.title('Euclidean Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city-block distance, is the total of the absolute differences between the Cartesian coordinates of two points. Envision manoeuvring across a city grid in which your only moves are horizontal and vertical. The Manhattan distance, which computes the total distance travelled along each dimension to reach the other data point, represents this movement. When it comes to categorical data, this metric can be more effective than the Euclidean distance since it is less susceptible to outliers. The formula is:
$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
Implementation in Python
# Manhattan (L1 / city-block) distance between the same two points
manhattan_distance = np.sum(np.abs(point1 - point2))
print("Manhattan Distance:", manhattan_distance)

# Plot the points and the dashed grid path between them
plt.scatter(*point1, color='red')
plt.scatter(*point2, color='blue')
plt.plot([point1[0], point2[0]], [point1[1], point1[1]], 'k--')  # horizontal leg
plt.plot([point2[0], point2[0]], [point1[1], point2[1]], 'k--')  # vertical leg
plt.title('Manhattan Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Output:
Manhattan Distance: 7
The two points are represented by the red and blue points in the plot. The grid-line-based
method used to determine the Manhattan distance is depicted by the dashed black lines.
3. Cosine Similarity
Instead of concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data vectors, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis, where the order of features (words in a sentence) might not be as crucial as their presence. It is used to determine how similar the vectors are, irrespective of their magnitude.
$\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$
Example in Python
# Cosine similarity between the two points treated as vectors
norm1 = np.linalg.norm(point1)
norm2 = np.linalg.norm(point2)
cosine_similarity = np.dot(point1, point2) / (norm1 * norm2)
print("Cosine Similarity:", cosine_similarity)

# For Cosine Similarity, we will plot the vectors originating from the origin
origin = [0, 0]
plt.quiver(*origin, *point1, angles='xy', scale_units='xy', scale=1, color='r')
plt.quiver(*origin, *point2, angles='xy', scale_units='xy', scale=1, color='b')
plt.xlim(0, 8)
plt.ylim(0, 8)
plt.title('Cosine Similarity')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Output:
4. Minkowski Distance
The Minkowski distance generalizes the Euclidean and Manhattan distances through a parameter p (p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance):
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
# Minkowski distance with p = 3 between the same two points
p = 3
minkowski_distance = np.sum(np.abs(point1 - point2) ** p) ** (1 / p)
print("Minkowski Distance (p=3):", minkowski_distance)

# For Minkowski with p=3, the visualization isn't straightforward like Euclidean or Manhattan
# We will plot the same line for illustration purposes
plt.plot([point1[0], point2[0]], [point1[1], point2[1]], 'go--')
plt.title('Minkowski Distance')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Output:
In the plot, the visualization is similar to Euclidean distance when p=3, but the distance
calculation formula changes.
5. Jaccard Index
This measure is ideal for binary data, where features can only take values of 0 or 1. It
calculates the ratio of the number of features shared by two data points to the total number
of features. Jaccard Index measures the similarity between two sets by comparing the size of
their intersection and union.
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Jaccard Index Example in Python
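A minimal sketch using two assumed binary feature sets A and B (not from the original notes), computed with plain Python sets:

# Hypothetical sets of features present (value 1) in two items
A = {1, 2, 3, 5}
B = {2, 3, 5, 7, 9}

jaccard_index = len(A & B) / len(A | B)
print("Jaccard Index:", jaccard_index)   # 3 shared / 6 total = 0.5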
Output:
Key Considerations
The type of data and the particulars of the clustering task determine which distance metric is best. Here are some things to think about:
Data Type: Different distance metrics may be needed for binary, categorical or numerical data.
Scale Sensitivity: The scale of the data affects some distance measures, such as the Euclidean distance. Standardizing the data can help resolve this problem.
Interpretability: For the specified application, the selected measure should yield findings that are both meaningful and comprehensible.
Computational Efficiency: Take the computational complexity into account, particularly when working with big datasets.
Existence of Outliers: Outliers have a big impact on distance-based metrics. If outliers are an issue, use metrics that are less susceptible to them.
Clustering Algorithm: Various clustering methods may require a specific distance metric.
The choice of distance measure depends on the nature of the data and the clustering algorithm being used. Here are some general guidelines:
Euclidean distance is suitable for continuous numerical data where the geometric (straight-line) distance between points is meaningful.
Manhattan distance is suitable for data with a uniform distribution or when the dimensions are not equally important.
Minkowski distance is suitable when you want to generalize the Euclidean and Manhattan distances.
Cosine similarity is suitable for text data or when the angle between vectors is more important than the magnitude.
Jaccard similarity is suitable for categorical data or when the intersection and union of sets are more important than the individual elements.
Clustering is a technique in Machine Learning that is used to group similar data points. While
the algorithm performs its job, helping uncover the patterns and structures in the data, it is
important to judge how well it functions. Several metrics have been designed to evaluate
the performance of these clustering algorithms.
Clustering Metrics:
There are multiple clustering algorithms, and each has its way of grouping data points.
Clustering metrics are used to evaluate all these algorithms. Let us take a look at some of the
most commonly used clustering metrics:
1. Silhouette Score
The Silhouette Score is a way to measure how good the clusters are in a dataset. It helps us
understand how well the data points have been grouped. The score ranges from -1 to 1.
A score close to 1 means a point fits really well in its group (cluster) and is far from other groups.
A score close to 0 means the point is on the border between two clusters.
A score close to -1 means the point has likely been placed in the wrong cluster.
$S(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$
where,
a(i) is the average distance from i to other data points in the same cluster.
b(i) is the smallest average distance from i to data points in a different cluster.
2. Davies-Bouldin Index
The Davies-Bouldin Index (DBI) helps us measure how good the clustering is in a dataset. It
looks at how tight each cluster is (compactness), and how far apart the clusters are
(separation).
$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{R_{ii} + R_{jj}}{R_{ij}} \right)$
where,
k is the number of clusters, $R_{ii}$ is the average distance of the points in cluster i to its centroid (compactness) and $R_{ij}$ is the distance between the centroids of clusters i and j (separation).
3. Calinski-Harabasz Index
The Calinski-Harabasz Index (also known as the variance ratio criterion) measures how good the clusters are in a dataset.
It looks at:
Between-cluster separation: how far apart the cluster centers are from each other.
Within-cluster compactness: how tightly the points in each cluster sit around their own center.
A higher score is better, as it means the clusters are tight and well-separated. It helps determine the ideal number of clusters.
$CH = \frac{\operatorname{tr}(B_K) / (K - 1)}{\operatorname{tr}(W_K) / (N - K)}$
where,
$\operatorname{tr}(B_K)$ is the between-cluster dispersion, $\operatorname{tr}(W_K)$ is the within-cluster dispersion, K is the number of clusters and N is the total number of data points.
4. Adjusted Rand Index
The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by comparing it to the true labels (ground truth). It checks whether the same pairs of points are grouped together in both the real and the predicted clusters.
$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$
where RI is the Rand Index (the fraction of point pairs on which the two labelings agree) and $E[RI]$ is its expected value under random labeling. A score of 1 means a perfect match, while values near 0 mean the agreement is no better than chance.
5. Mutual Information
Mutual Information measures how much two variables are related or connected. In clustering, it compares how much the true cluster labels match the predicted labels. It shows how much knowing one variable helps us predict the other. The more agreement there is, the higher the score.
$MI(y, z) = \sum_{i} \sum_{j} p(y_i, z_j) \cdot \log\left( \frac{p(y_i, z_j)}{p(y_i)\, p(z_j)} \right)$
where,
$y_i$ is a true label.
$z_j$ is a predicted label.
These clustering metrics help in evaluating the quality and performance of clustering
algorithms, allowing for informed decisions when selecting the most suitable clustering
solution for a given dataset.
Let's consider an example using the Iris dataset and the K-Means clustering algorithm. We
will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, and
Adjusted Rand Index to evaluate the clustering.
Import Libraries
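A minimal set of imports for this example, assuming scikit-learn provides the dataset, the K-Means implementation and the metric functions:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             mutual_info_score)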
Load or generate your dataset for clustering. Iris dataset consists of 150 samples of iris
flowers. There are three species of iris flower: setosa, versicolor, and virginica with four
features: sepal length, sepal width, petal length, and petal width.
iris = load_iris()
X = iris.data
Perform Clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
mi = mutual_info_score(iris.target, kmeans.labels_)
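With the fitted model, the remaining scores described above can be computed along the same lines (a sketch using scikit-learn's metric functions; the internal metrics use the data and predicted labels, while ARI compares against the true species labels):

sil = silhouette_score(X, kmeans.labels_)
db = davies_bouldin_score(X, kmeans.labels_)
ch = calinski_harabasz_score(X, kmeans.labels_)
ari = adjusted_rand_score(iris.target, kmeans.labels_)

print("Silhouette Score:", sil)
print("Davies-Bouldin Index:", db)
print("Calinski-Harabasz Index:", ch)
print("Adjusted Rand Index:", ari)
print("Mutual Information:", mi)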
Output:
The K-Means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-Means.
In this algorithm, the data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and the centroid is as small as possible. It is to be understood that less variation within a cluster leads to more similar data points within the same cluster.
We can understand the working of the K-Means clustering algorithm with the help of the following steps −
Step 1 − First, specify the number of clusters, K, that need to be generated by this algorithm.
Step 2 − Next, randomly select K data points as the initial centroids and assign each data point to its nearest centroid. In simple words, classify the data into K groups based on these initial centroids.
Step 3 − Now compute the centroid of each cluster as the mean of the data points assigned to it.
Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of data points to the clusters no longer changes −
4.1 − First, the sum of squared distances between the data points and the centroids is computed.
4.2 − Now, each data point is assigned to the cluster whose centroid is closest.
4.3 − At last, compute the centroids of the clusters by taking the average of all data points in each cluster.
While working with the K-Means algorithm, its limitations need to be kept in mind. The algorithm is straightforward and efficient and can handle large datasets, but it is sensitive to the initial centroids, tends to converge to local optima, and assumes equal variance for all clusters.
Python has several libraries that provide implementations of various machine learning
algorithms, including K-Means clustering. Let's see how to implement the K-Means
algorithm in Python using the scikit-learn library.
Example:
This is a simple example to understand how K-Means works. In this example, we generate 300 random data points with two features and apply the K-Means algorithm to group them into clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
To test the K-Means algorithm, we need to generate some sample data. In this example, we
will generate 300 random data points with two features. We will visualize the data also.
X = np.random.rand(300, 2)

plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()
Output:
Next, we need to initialize the K-Means algorithm by specifying the number of clusters (K)
and the maximum number of iterations.
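The initialization itself might look like the line below; K = 3 and the iteration cap of 300 are assumptions made for illustration.

# Initialize K-Means with the number of clusters and a maximum iteration count
kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=42)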
After initializing the K-Means algorithm, we can train the model by fitting the data to the
algorithm.
kmeans.fit(X)
To visualize the clusters, we can plot the data points and color them based on their assigned
cluster.
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', color='red')
plt.show()
Output:
The output of the above code will be a plot with the data points colored based on their
assigned cluster, and the centroids marked with an 'x' symbol in red color.
Spectral Clustering
Spectral Clustering is a variant of clustering that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a matrix derived from the data to project the data into a lower-dimensional space, in which the points are then clustered. It is based on the idea of a graph representation of data, where the data points are represented as nodes and the similarity between data points is represented by weighted edges.
Projecting the data onto a lower Dimensional Space: This step is done to account for the
possibility that members of the same cluster may be far away in the given dimensional
space. Thus the dimensional space is reduced so that those points are closer in the reduced
dimensional space and thus can be clustered together by a traditional clustering algorithm.
It is done by computing the Graph Laplacian Matrix.
To compute it, first the degree of a node needs to be defined. The degree of the i-th node is given by $d_i = \sum_{j \,:\, (i, j) \in E} w_{ij}$, where $w_{ij}$ is the weight of the edge between nodes i and j.
import numpy as np
# Adjacency matrix of the example graph
A = np.array([
[0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
The degree matrix is defined as follows: $D_{ij} = d_i$ if $i = j$, and $D_{ij} = 0$ if $i \neq j$.
D = np.diag(A.sum(axis=1))
print(D)
Thus the Graph Laplacian Matrix is defined as:-
L=D−A
L = D-A
print(L)
This matrix is then normalized for mathematical efficiency. To reduce the dimensions, first the eigenvalues and the respective eigenvectors are calculated. If the number of clusters is k, then the first k eigenvalues and their eigenvectors are taken and stacked into a matrix such that the eigenvectors are the columns.
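A short sketch of this step for the adjacency matrix A above, continuing from the Laplacian L computed earlier; the normalization step is skipped and k = 3 clusters is an assumption.

from sklearn.cluster import KMeans

k = 3  # assumed number of clusters

# Eigen-decomposition of the symmetric Laplacian (eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(L)

# Use the eigenvectors of the k smallest eigenvalues as a low-dimensional embedding
U = eigvecs[:, :k]

# Cluster the rows of the embedding with a traditional algorithm (here K-Means)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(labels)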
Example:
Credit Card Data Clustering Using Spectral Clustering
The below steps demonstrate how to implement Spectral Clustering using Sklearn. The data
for the following steps is the Credit Card Data which can be downloaded from Kaggle
Step 1: Importing the required libraries
We will first import all the libraries that are needed for this project
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
Step 2: Loading and Cleaning the Data
# Changing the working location to the location of the data
cd "C:\Users\Dev\Desktop\Kaggle\Credit_Card"
# Loading the data
X = pd.read_csv('CC_GENERAL.csv')
# Dropping the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handling the missing values if any
X.fillna(method='ffill', inplace=True)
X.head()
Output:
Step 3: Preprocessing the Data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_normalized = normalize(X_scaled)
X_normalized = pd.DataFrame(X_normalized)

pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
In the steps below, two different Spectral Clustering models are built with different values for the parameter 'affinity'. You can read about the parameters in the scikit-learn documentation of the SpectralClustering class.
Step 4: Building the Clustering Models
a) affinity = 'rbf'
spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')
labels_rbf = spectral_model_rbf.fit_predict(X_principal)

# Visualizing the clustering with two colours
colours = {}
colours[0] = 'b'
colours[1] = 'y'
cvec = [colours[label] for label in labels_rbf]
plt.scatter(X_principal.iloc[:, 0], X_principal.iloc[:, 1], c=cvec)
plt.show()
Output:
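The evaluation step below also uses a second set of labels, labels_nn; here is a sketch of that model, assuming 'nearest_neighbors' is the second affinity setting being compared:

# b) affinity = 'nearest_neighbors'
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels_nn = spectral_model_nn.fit_predict(X_principal)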
Step 5: Evaluating the Performance
# Affinity settings being compared (second value assumed above)
affinity = ['rbf', 'nearest_neighbors']

s_scores = []
s_scores.append(silhouette_score(X, labels_rbf))
s_scores.append(silhouette_score(X, labels_nn))
print(s_scores)
Step 6: Comparing the performances
plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.show()
Output:
1. Scalability: Spectral clustering can handle large datasets and high-dimensional data,
as it reduces the dimensionality of the data before clustering.
3. Robustness: Spectral clustering can be more robust to noise and outliers in the data,
as it considers the global structure of the data, rather than just local distances
between data points.
The algorithm builds clusters step by step either by progressively merging smaller clusters or
by splitting a large cluster into smaller ones. The process is often visualized using
a dendrogram, which helps to understand data similarity.
Imagine we have four fruits with different weights: an apple (100g), a banana (120g), a
cherry (50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own
group.
Merge the closest items: the grape (30g) and the cherry (50g) are grouped first because their weights are closest.
Next, the apple (100g) and the banana (120g) are merged, since they are closer to each other than to the grape-cherry group.
Finally, all the fruits are merged into one large group, showing how hierarchical clustering progressively combines the most similar data points.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group and as we move
up, similar groups are combined. The lower the merge point, the more similar the groups
are. It helps us see how things are grouped step by step.
At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
As we move up, the closest points are merged into a single group.
The lines connecting the points show how they are progressively merged based on
similarity.
The height at which they are connected shows how similar the points are to each other; the shorter the line, the more similar they are (see the short sketch below).
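As an illustration, here is a minimal SciPy sketch that draws a dendrogram for the fruit-weight example given earlier; single-linkage on the raw weights is an assumption made only to keep the sketch short.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Fruit weights from the example: apple, banana, cherry, grape
weights = [[100], [120], [50], [30]]
names = ['Apple', 'Banana', 'Cherry', 'Grape']

# Agglomerative (bottom-up) merging based on weight differences
linked = linkage(weights, method='single')

dendrogram(linked, labels=names)
plt.ylabel('Distance (g)')
plt.show()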
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
1. Start with individual points: Each data point is its own cluster. For example if we
have 5 data points we start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the
two data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging we now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until we have only one cluster left.
Repeat merging until the desired number of clusters or one cluster remains. The dendrogram visualizes these merges as a tree, showing cluster relationships and distances.
Implementation
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Sample data (the original dataset is not shown in the notes)
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering.fit_predict(X)

# Refit with distance_threshold=0 so the full merge tree (distances_) is kept
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X)

def plot_dendrogram(model, **kwargs):
    # Count the samples under each node of the merge tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1                      # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=labels)
ax1.set_title("Agglomerative Clustering")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")
plt.sca(ax2)
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Sample index")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
Output :
1. Start with all data points in one cluster: Treat the entire dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically done by finding the two most dissimilar points in the cluster and using them to separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process: choose the cluster with the most dissimilar points and split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every data point is its own cluster or a stopping condition (such as a predefined number of clusters) is met.
Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(50, 2)   # sample data with two features

def split_cluster(clusters, cluster_to_split):
    # Remove the chosen cluster and split it into two using 2-means
    clusters.remove(cluster_to_split)
    kmeans = KMeans(n_clusters=2, random_state=42).fit(cluster_to_split)
    cluster1 = cluster_to_split[kmeans.labels_ == 0]
    cluster2 = cluster_to_split[kmeans.labels_ == 1]
    clusters.extend([cluster1, cluster2])
    return clusters

# One divisive step: start with all points in one cluster, then split it
clusters = split_cluster([X], X)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for cluster in clusters:
    plt.scatter(cluster[:, 0], cluster[:, 1])
plt.subplot(1, 2, 2)
linked = linkage(X, method='ward')   # full hierarchy for the dendrogram
dendrogram(linked, orientation='top',
           distance_sort='descending', show_leaf_counts=True)
plt.show()
Output: