Chatgpt Unit - 4
3. Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
1. Overview:
o Groups data points into clusters based on regions of high density.
o Identifies noise as points that do not belong to any cluster.
2. Key Parameters:
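o Epsilon (ε): Radius of the neighborhood searched around each point.
o MinPts: Minimum number of points that must lie within a point's ε-neighborhood for that point to count as part of a dense region (a core point).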
3. Advantages:
o Detects clusters of arbitrary shapes.
o Handles noise effectively.
4. Disadvantages:
o Sensitive to the choice of ε and MinPts.
o Not suitable for datasets with varying densities, because a single ε and MinPts pair is applied to the whole dataset.
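The way ε and MinPts interact can be seen in a minimal pure-Python sketch of DBSCAN. This is an illustration only, not a production implementation: the function names (region_query, dbscan), the sample points, and the convention that a point counts as its own neighbor are assumptions made here, and the brute-force neighbor search would be too slow for large datasets.

```python
import math

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be relabelled as a border point
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

# Two tight groups of four points plus one isolated point (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these settings the two groups become clusters 0 and 1 and the isolated point is labelled -1 (noise). Shrinking eps below 1.0 would turn every point into noise, which illustrates the parameter sensitivity noted above.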
Gaussian Mixture Model (GMM):
1. Overview:
o Represents clusters as a mixture of Gaussian distributions.
o Uses the Expectation-Maximization (EM) algorithm to estimate parameters.
2. Advantages:
o Handles overlapping clusters well.
o Probabilistic model provides soft clustering (each point gets a membership probability for every cluster).
3. Disadvantages:
o Requires the number of clusters in advance.
o Computationally expensive for large datasets.
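The EM loop and the idea of soft clustering can be sketched for the simplest case, a one-dimensional mixture. The sketch below is an assumption-laden illustration: the function names (gaussian_pdf, em_gmm_1d), the fixed iteration count, the small variance floor, and the sample data are all chosen here for the example and are not part of the notes.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; the number of components is len(means)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (soft assignment).
        resp = []
        for x in data:
            dens = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance from the responsibilities.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)   # floor keeps the variance from collapsing to 0
    return means, variances, weights, resp

# Made-up data: one group near 1.0 and one near 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means, weights)
```

Each row of resp holds the membership probabilities of one point and sums to 1, which is what "soft clustering" means here. The number of components has to be supplied through the initial means, matching the disadvantage listed above.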
4. Hierarchical Clustering
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
1. Overview:
o Incrementally builds a hierarchical clustering tree (CF tree).
o Suitable for large datasets.
2. Advantages:
o Scalable and memory-efficient.
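o The number of leaf subclusters follows from the tree's threshold and branching factor rather than being fixed in advance, although a final global clustering step may still need a target number of clusters.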
3. Disadvantages:
o Assumes spherical clusters.
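The entries of a CF tree are clustering features: a triple (N, LS, SS) holding the point count, the per-dimension linear sum, and the sum of squared norms. Two features merge by simple addition, which is what makes the incremental build cheap. The sketch below shows only that idea in a flat list, without the tree itself; the class and function names (CF, build_subclusters), the threshold value, and the sample points are assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int        # number of points summarised
    ls: tuple     # linear sum of the points, per dimension
    ss: float     # sum of squared norms of the points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merged(self, other):
        """CF additivity: the summary of a union is the sum of the summaries."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(a / self.n for a in self.ls)

    def radius(self):
        """Root-mean-square distance of the summarised points from the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

def build_subclusters(points, threshold):
    """Flat simplification: absorb each point into the nearest CF if the radius allows."""
    subclusters = []
    for p in points:
        new = CF.from_point(p)
        if subclusters:
            idx = min(range(len(subclusters)),
                      key=lambda i: math.dist(subclusters[i].centroid(), p))
            candidate = subclusters[idx].merged(new)
            if candidate.radius() <= threshold:
                subclusters[idx] = candidate
                continue
        subclusters.append(new)      # too far from every CF: start a new subcluster
    return subclusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
subclusters = build_subclusters(points, threshold=1.0)
print([(cf.n, cf.centroid()) for cf in subclusters])
```

Because each point is seen once and only the summaries are stored, memory use depends on the number of subclusters rather than the number of points. The radius test is also where the spherical-cluster assumption enters.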
Agglomerative Hierarchical Clustering:
1. Overview:
o Bottom-up approach where each data point starts as a single cluster.
o Merges clusters iteratively based on similarity.
2. Linkage Methods:
o Single Linkage: Minimum distance between any two points, one from each cluster.
o Complete Linkage: Maximum distance between any two points, one from each cluster.
o Average Linkage: Average distance over all pairs of points, one from each cluster.
3. Advantages:
o Does not require the number of clusters in advance; the resulting hierarchy can be cut at any level afterwards.
4. Disadvantages:
o Computationally expensive: the standard algorithm needs O(n²) memory for pairwise distances and up to O(n³) time.
o Sensitive to noise and outliers.
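The bottom-up merge loop and the three linkage methods can be sketched in a few lines of pure Python. This is a naive illustration under stated assumptions: the names (pairwise, LINKAGES, agglomerative), the n_clusters stopping parameter, and the sample points are invented for the example, and every step recomputes all distances, so it is far slower than a real implementation.

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

LINKAGES = {
    "single": lambda a, b: min(pairwise(a, b)),
    "complete": lambda a, b: max(pairwise(a, b)),
    "average": lambda a, b: sum(pairwise(a, b)) / (len(a) * len(b)),
}

def agglomerative(points, n_clusters=1, linkage="single"):
    """Merge the two closest clusters until n_clusters remain.

    Returns the clusters and the list of merge distances, in merge order.
    """
    dist = LINKAGES[linkage]
    clusters = [[p] for p in points]         # every point starts as its own cluster
    merges = []
    while len(clusters) > max(n_clusters, 1):
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append(dist(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                      # j > i, so index i is unaffected
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerative(points, n_clusters=3, linkage="single")
print(clusters)
print(merges)
```

Leaving n_clusters at 1 runs the merges to the end, and the recorded merge distances then describe the full hierarchy; choosing where to cut it is a separate decision made afterwards.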
Divisive Hierarchical Clustering:
1. Overview:
o Top-down approach where all points start in one cluster.
o Splits clusters iteratively.
2. Advantages:
o Provides a global perspective of the dataset.
3. Disadvantages:
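o Computationally expensive: a cluster of n points can be split in two in 2^(n-1) - 1 ways, so practical methods rely on heuristics to choose each split.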
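A top-down split loop can be sketched as follows. The notes do not say how a cluster should be split, so the rule used here is purely an assumption for illustration: the cluster with the largest diameter is chosen, its two farthest points act as seeds, and every point goes to the nearer seed. The names (diameter, split, divisive) and the sample points are likewise hypothetical.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest distance between any two points of the cluster (0.0 for one point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def split(cluster):
    """Assumed heuristic: seed with the two farthest points, assign to the nearer seed."""
    a, b = max(combinations(cluster, 2), key=lambda pq: math.dist(*pq))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right

def divisive(points, n_clusters):
    """Start from one cluster and keep splitting the widest one."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        widest = max(clusters, key=diameter)
        if diameter(widest) == 0.0:      # nothing left that can be split
            break
        clusters.remove(widest)
        clusters.extend(split(widest))
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters = divisive(points, n_clusters=3)
print(clusters)
```

The first split looks at the whole dataset at once, which is the global perspective mentioned above; an early poor split cannot be undone by later steps.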
Conclusion
Clustering is a fundamental unsupervised learning technique that helps uncover hidden patterns in data. By
understanding and applying various clustering algorithms and evaluating their results, one can effectively segment
data and gain valuable insights.