Comprehensive Survey On Hierarchical Clustering Algorithms and The Recent Developments
https://doi.org/10.1007/s10462-022-10366-3
Xingcheng Ran1,2 · Yue Xi1 · Yonggang Lu1 · Xiangwen Wang1 · Zhenyu Lu1
Abstract
Data clustering is a data processing technique commonly used in many fields, which divides objects into different clusters according to some similarity measure between data points. Compared to partitioning clustering methods, which give a flat partition of the data, hierarchical clustering methods can give multiple consistent partitions of the same data at different levels without rerunning the clustering, so they can be used to better analyze the complex structure of the data. There are usually two kinds of hierarchical clustering methods: divisive and agglomerative. For divisive clustering, the key issues are how to select a cluster for the next splitting procedure according to a dissimilarity measure and how to divide the selected cluster. For agglomerative hierarchical clustering, the key issue is the similarity measure that is used to select the two most similar clusters for the next merge. Although both types of methods produce a dendrogram of the data as output, the clustering results may be very different depending on the dissimilarity or similarity measure used, and different types of methods should be selected according to the type of data and the application scenario. Therefore, in this work we comprehensively review various hierarchical clustering methods, especially the most recently developed ones. Because the similarity measure plays a crucial role during the hierarchical clustering process, we also review the different types of similarity measures used in hierarchical clustering. More specifically, different types of hierarchical clustering methods are comprehensively reviewed from six aspects, and their advantages and drawbacks are analyzed. The application of some methods in real life is also discussed. Furthermore, we include some recent works that combine deep learning techniques and hierarchical clustering, which deserve serious attention and may improve hierarchical clustering significantly in the future.
* Yonggang Lu
[email protected]
Xingcheng Ran
[email protected]
1 School of Information Science and Engineering, Lanzhou University, No. 222 South Tianshui Road, Lanzhou 730000, Gansu, China
2 Center of Information Technology, Hexi University, No. 846 North Ring Road, Zhangye 734000, Gansu, China
1 Introduction
Data clustering, also known as clustering analysis, is the process of dividing objects into sets (clusters) such that objects from the same set are similar to each other, while objects from different sets are dissimilar. It is widely used in data mining (Judd et al. 1998), machine learning, pattern recognition (Carpineto and Romano 1996), and many other fields. Data clustering is a difficult unsupervised learning task because the data labels are not available during learning, and different clusters may have different shapes and sizes (Jain et al. 2000).
There are two different types of clustering techniques: partitional clustering and hierarchical clustering (Frigui and Krishnapuram 1999). Partitional clustering divides a set of data (or objects) into a specified number of non-overlapping clusters, so that each element belongs to exactly one cluster (Jain and Dubes 1988). Most partitional algorithms, such as K-means (Forgy 1965) and EM (Redner and Walker 1984), need the number of clusters K to be specified beforehand, but K is usually unknown in practice (Omran et al. 2007). Hierarchical clustering obtains a set of nested clusters that are organized as a cluster tree called a dendrogram. Each leaf node of the tree usually contains only one object, while each inner node represents the union of its children nodes (Tan et al. 2019). From different levels of the dendrogram, consistent clusters can be obtained at different granularities, so hierarchical clustering can be used to detect and analyze the complex structures of the data. Hierarchical clustering methods can be divided into two types: divisive and agglomerative (Everitt et al. 2001).
In this paper, the hierarchical clustering methods, including the recently developed
methods, are introduced in the following order:
(1) Divisive hierarchical clustering methods: The clustering starts with one cluster contain-
ing all objects, and then recursively divides it (or each sub-cluster) into two sub-clusters
according to the dissimilarity measure until each sub-cluster includes a single object
(Turi 2001).
(2) Agglomerative hierarchical clustering methods: The clustering begins with each cluster
containing one object and recursively merges the two most similar clusters in terms of
the similarity measure until all objects are included in a single cluster (Turi 2001).
(3) Graph-based hierarchical clustering methods: In some applications, the relationship between objects can be better represented by a graph or hypergraph, and the task of hierarchical clustering can be converted into the construction and partitioning of the graph or hypergraph (von Luxburg 2007).
(4) Density-based hierarchical clustering methods: Methods that construct the hierarchy based on density estimation have aroused the interest of researchers. This type of method not only retains the advantages of density-based clustering, but also reveals the hierarchical structure among clusters, so that more information can be obtained from the clustering result.
(5) Combination of the hierarchical clustering methods: Some hierarchical clustering meth-
ods are improved by combining the agglomerative and divisive hierarchical clustering
techniques, or by combining the clustering method and other advanced techniques.
(6) Improving the efficiency of the hierarchical clustering methods: Hierarchical clustering usually has an expensive computational cost (Reddy and Vinzamuri 2013), so improving the efficiency of hierarchical clustering methods is important for analyzing large-scale datasets.
Finally, the developing trend of hierarchical clustering is analyzed based on the above
survey.
The rest of this article is organized as follows. In Sect. 2, the generic algorithm scheme
and the distance-based similarities commonly used in hierarchical clustering are intro-
duced. In Sects. 3 and 4, the divisive and agglomerative hierarchical clustering methods are
surveyed respectively. Section 5 introduces the graph based hierarchical clustering methods
including simple graph based and hypergraph based. Section 6 surveys the density based
hierarchical clustering methods. An overview on the combination of the hierarchical clus-
tering methods is given in Sect. 7. The methods of improving the efficiency of the hierar-
chical clustering methods are surveyed in Sect. 8. Section 9 concludes this article with a
summary of the contributions and the direction for future work.
2 Hierarchical clustering
The basic idea of hierarchical clustering algorithms is to construct the hierarchical relation-
ship among data according to a dissimilarity/similarity measure between clusters (Johnson
1967). Hierarchical clustering methods are divided into two types: divisive and agglomerative (Everitt et al. 2001), as shown in Fig. 1.
The divisive hierarchical clustering method first puts all data points into one initial cluster, then divides the initial cluster into several sub-clusters, and iteratively partitions these sub-clusters into smaller ones until each cluster contains only one data point or the data points within each cluster are similar enough (Turi 2001). The left branch in Fig. 1 demonstrates the procedure in detail. When selecting a cluster to bisect, the dissimilarities need to be computed according to a certain criterion. The dissimilarity is a key function that strongly influences the subsequent operations and the resulting clusters.
Contrary to divisive clustering, agglomerative hierarchical clustering begins with each cluster containing only one data point, and then iteratively merges clusters into larger ones until all data points are in one cluster or some conditions are satisfied (Turi 2001). The right branch in Fig. 1 demonstrates the procedure in detail. When selecting two clusters to merge, the similarities need to be computed according to a certain criterion. How the similarity is calculated exerts a strong influence on the subsequent operations and the resulting clusters.
Definition 2 The similarity of data X is defined as a function that meets the following conditions (Xu and Wunsch II 2005; Murtagh and Contreras 2012, 2017):
The four conditions above formulate the structure of the clustering tree, i.e. the dendrogram. Condition 1 ensures that each object forms a cluster initially. By condition 2, when t is large enough, all objects are contained in a single cluster. Condition 3 guarantees the nested structure of the dendrogram. The last condition shows that the partition is stable when there exists a small
perturbation ε. θ is a scale parameter of the dendrogram and reflects the height of its different levels. An example of a dendrogram is shown in Fig. 2 (Sisodia et al. 2012).
Both divisive and agglomerative hierarchical clustering generate a dendrogram of the relationships among data points and terminate quickly; other advantages are as follows (Kotsiantis and Pintelas 2004; Berkhin 2006):
However, there are several disadvantages of hierarchical clustering (Kotsiantis and Pintelas 2004; Berkhin 2006):
• Once a merging or division is done on one level of the hierarchy, it cannot be undone
later.
• It is computationally expensive in time and memory, especially for large-scale problems. Generally, the time complexity of hierarchical clustering is quadratic in the number of data points clustered.
• Termination criteria are ambiguous.
So, various hierarchical clustering algorithms have been designed to keep the above advantages or to reduce the influence of the above disadvantages.
During splitting in the divisive hierarchical clustering methods or merging in the agglomerative hierarchical clustering methods, the dissimilarity or similarity measure directly affects the resulting clusters, and it is usually determined according to the type of the data and the application scenario.
Divisive hierarchical clustering starts from a cluster (root) containing all objects; the cluster with the largest dissimilarity is iteratively selected and divided into two sub-clusters that form its children nodes, until each cluster includes only one object. The dissimilarity is usually computed in terms of different types of distance measures. Many divisive methods have been proposed based on such dissimilarity measures or on newly designed dissimilarities.
A divisive hierarchical clustering can usually be divided into three steps: computing the dissimilarity, splitting the selected cluster, and determining the node level of the new clusters (Roux 2018). For the divisive method, the node levels in the dendrogram cannot be obtained naturally as in the agglomerative method, so a special method needs to be designed to determine the node levels.
the continuously good quality of the clusters. Bisecting k-Means is thus an excellent algorithm for clustering large-scale document data.
Other methods for evaluating the sub-cluster to be split have also been proposed. A modified definition of SSE, called ΔSSE, is used:
$$\Delta SSE(X, j) = \frac{|s_j|\,|s_{j+1}|}{|s_j| + |s_{j+1}|}\,\bigl\|\hat{m}_j - \hat{m}_{j+1}\bigr\|^2,$$
and the sub-cluster with the maximal ΔSSE is divided in each step (Gracia and Binefa 2011). This method achieves high accuracy and a low over-segmentation rate; it is suitable for clustering large video sequences and can quickly provide responses for real-time work, but it requires the final number of clusters as prior knowledge (Gracia and Binefa 2011).
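The selection rule above can be sketched in a few lines of Python. The fragment below is only an illustration under the assumption that the two sub-clusters $s_j$ and $s_{j+1}$ come from a trial 2-means bisection of each candidate cluster; the function and variable names are hypothetical, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def delta_sse(sub_a, sub_b):
    """Delta-SSE of a candidate bisection: |A||B|/(|A|+|B|) * ||mean(A)-mean(B)||^2."""
    na, nb = len(sub_a), len(sub_b)
    diff = sub_a.mean(axis=0) - sub_b.mean(axis=0)
    return (na * nb) / (na + nb) * float(diff @ diff)

def pick_cluster_to_split(clusters):
    """Return the index of the cluster whose trial 2-means bisection
    yields the largest Delta-SSE (hypothetical selection rule)."""
    scores = []
    for points in clusters:
        if len(points) < 2:
            scores.append(-np.inf)
            continue
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
        scores.append(delta_sse(points[labels == 0], points[labels == 1]))
    return int(np.argmax(scores))

# toy usage with two synthetic clusters
rng = np.random.default_rng(0)
clusters = [rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (40, 2))]
print(pick_cluster_to_split(clusters))
```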
A divisive hierarchical multi-criteria clustering algorithm based on PROMETHEE (Brans and Vincke 1985; Brans et al. 1986) has been proposed by Ishizaka et al. (2020). The algorithm takes advantage of Stochastic Multiobjective Acceptability Analysis (SMAA) and cluster ensemble methods to reduce the uncertainty and imprecision of the clustering solution. The matrix of preference degrees in PROMETHEE is used as the similarity matrix in Ishizaka et al. (2020). When dividing a cluster, the pair of actions with the highest preference degree in the similarity matrix is selected as the centroids of the two new sub-clusters. Each alternative is then assigned to the sub-cluster whose centroid it has the smallest preference degree to, i.e., the division is accomplished by comparing the preference degree between the action to classify and each centroid. SMAA generates a large number of solutions by randomly changing the PROMETHEE parameters, and ensemble clustering is then used to obtain consistent solutions. The resulting algorithm is more stable and robust, and can select the most appropriate number of preference clusters for clustering US banks.
To sum up, the dissimilarity measure in divisive hierarchical clustering is determined by the type of dataset, the objective of the clustering task, the ease of computation, and so forth; it is problem dependent.
In the splitting procedure, the main problem is how to divide a selected partition into sub-partitions. Macnaughton-Smith et al. (1964) have proposed a divisive method by defining a dissimilarity measure between any two data points or sub-clusters. The dissimilarity between a data point (or sub-cluster) and a sub-cluster is the average of the dissimilarities between this data point and the data points in the sub-cluster. In the divisive process, the method first selects the data point farthest from the target cluster as an initial seed, and then grows the seed group by adding data points that are closer to the seed group than to the other data points in the current cluster. This dissimilarity measure is suitable for measuring the dissimilarity of variables and attributes. Hubert (1973) has further developed the idea by using a pair of data points as initial sub-clusters: he chose the two most dissimilar data points, and then created two new sub-clusters based on the distances from the other data points to the two initial sub-clusters. Kaufman and Rousseeuw (1990) reused this idea in the program DIANA.
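The splinter-group idea can be illustrated with a short sketch. The code below is an assumption-laden illustration working on a precomputed dissimilarity matrix, not the DIANA implementation; the function name and the termination rule are chosen for illustration only.

```python
import numpy as np

def diana_split(D, members):
    """Split one cluster into (splinter, remainder) in the spirit of
    Macnaughton-Smith et al.: seed the splinter group with the point of
    largest average dissimilarity to the rest, then move points that are
    on average closer to the splinter group than to the remainder.
    D: full pairwise dissimilarity matrix; members: list of point indices."""
    members = list(members)
    # seed: the point farthest (on average) from the other points
    avg = [D[i, [j for j in members if j != i]].mean() for i in members]
    splinter = [members[int(np.argmax(avg))]]
    remainder = [i for i in members if i not in splinter]
    moved = True
    while moved and len(remainder) > 1:
        moved = False
        for i in list(remainder):
            d_rem = D[i, [j for j in remainder if j != i]].mean()
            d_spl = D[i, splinter].mean()
            if d_spl < d_rem:          # closer to the splinter group: move it
                remainder.remove(i)
                splinter.append(i)
                moved = True
    return splinter, remainder
```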
Many other theories have been used in divisive hierarchical clustering to handle the dif-
ferent kinds of problems, such as Spectral Clustering (von Luxburg 2007), density peaks
(Zhou et al. 2018), rough set theory (Parmar et al. 2007), adaptive resonance theory (Yam-
ada et al. 2020), etc., and some variations of the divisive methods have also been proposed
by combining the above theories.
Vidal et al. (2014) have put forward a divisive hierarchical clustering method that can identify clusters with arbitrary shapes. The method divides a larger cluster into two smaller sub-clusters using Spectral Clustering (von Luxburg 2007) associated with a random walk. The Laplacian matrix is defined as $L = P^{-1}W$, where $P$ is the diagonal matrix of node degrees and $W$ is the similarity matrix among nodes. The similarity, named RBF-PKNNG, is calculated by
$$S_{pknng}(x_i, x_j, \theta) = e^{-d^{pknng}(x_i, x_j)^2 / (2\sigma^2)},$$
where the PKNNG metric $d^{pknng}(x_i, x_j)$ is the geodesic distance between points $x_i$ and $x_j$; it is very effective for finding clusters with arbitrary shapes in high-dimensional data. The method can find a good number of clusters automatically and works well under different scales of clusters because it uses the structure of the tree.
In the hierarchical clustering based on density peaks (HCDP) method (Zhou et al. 2018), the local k-nearest density is computed to construct a tree in which the data points are nodes and edges are generated between a point and its k-nearest neighbors with higher density values. The root of the tree has the maximal density, and children points have lower density than their parent point. All the edges in the subtree are sorted in descending order and are iteratively removed according to the specified number of children. A modified stability is proposed to extract and evaluate the suitable clusters. Finally, a clustering tree containing all partitions is produced. By using k-NN when calculating the local density, unnecessary parameter selections are avoided. The new stability is
$$S(C_i) = \sum_{x_j \in C_i} \left( \frac{1}{\delta_{exclude}(x_j, C_i)} - \frac{1}{\delta_{emerge}(C_i)} \right),$$
where $\delta_{exclude}(x_j, C_i)$ is the maximal weight among the removed edges when excluding $x_j$ from the cluster $C_i$, and $\delta_{emerge}(C_i)$ is the maximal weight among the removed edges when the cluster $C_i$ emerges.
Each time an edge is cut from the tree, the current number of clusters is determined by checking whether the stability computed before splitting exceeds the sum of the stabilities of the clusters obtained after splitting.
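The split-acceptance rule described above can be written compactly. The following sketch only illustrates the decision; the inputs for the edge-weight quantities $\delta_{exclude}$ and $\delta_{emerge}$ are hypothetical values supplied by the caller.

```python
def stability(delta_exclude, delta_emerge):
    """S(C_i) = sum_j (1/delta_exclude(x_j, C_i) - 1/delta_emerge(C_i)).
    delta_exclude: iterable with one value per point in the cluster;
    delta_emerge: a single value for the cluster."""
    return sum(1.0 / d - 1.0 / delta_emerge for d in delta_exclude)

def accept_split(stability_parent, stability_children):
    """HCDP-style rule as described above: keep the split only if the summed
    stability of the child clusters exceeds the parent's stability."""
    return sum(stability_children) > stability_parent

# toy example with made-up stability values
print(accept_split(1.8, [1.2, 0.9]))  # True: the split is kept
```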
Nasiriani et al. (2019) have proposed a divisive hierarchical clustering method to discover possible discrimination upon selection of the high-impact feature set. In the splitting phase, a sub-cluster with a high-impact feature set, i.e., the set of all high-impact attributes, is selected and divided into two parts. The divisive process is performed recursively until homogeneous clusters are found. This method can not only identify several instances of discrimination, but can also cluster discrimination data even when little relevant information is provided.
The min-min-roughness (MMR) algorithm is a divisive hierarchical clustering algorithm based on rough set theory (Parmar et al. 2007). The mean roughness, min-roughness, and min-min-roughness are defined based on the roughness, which is calculated from the lower and upper approximations in an information system and can represent how strongly an object belongs to a cluster. The min-roughness (MR) is the minimum among the mean roughness values, and the min-min-roughness (MMR) is the minimum of the min-roughness. MMR reflects the closeness of a cluster: the smaller the MMR is, the crisper the cluster is. A cluster is selected for binary splitting in terms of its MMR. MMR is capable of better dealing with the uncertainty when clustering categorical attributes and only requires the number of desired clusters as input. Later, an improved method named MMeR modified MMR to handle not only categorical data but also numerical data (Kumar and Tripathy 2009). SDR (standard deviation roughness) is a further improvement of MMeR: the standard deviation of the equivalence class is used as the similarity in the splitting phase, and the total purity ratio, which is the sum over clusters of the ratio between the number of data points belonging to both a cluster and its corresponding class and the number of data points in the dataset, is used to evaluate the results. SDR is capable of clustering heterogeneous data while handling the uncertainty during clustering, and its efficiency and purity ratio are also higher than those of MMR and MMeR (Tripathy and Ghosh 2011a). The Standard Deviation of Standard Deviation Roughness (SSDR) algorithm further optimizes SDR (Tripathy and Ghosh 2011b); it keeps the advantages of the above methods and achieves the best efficiency and total purity ratio. Min-Mean-Mean-Roughness (MMeMeR) is another improvement of SDR: the mean mean roughness, which is the mean of the mean roughness when the same attribute takes different values, is used as the similarity measure, and its efficiency is the highest among MMR, MMeR, and SDR (Tripathy et al. 2017).
Li et al. (2014) have proposed an attribute selection and node splitting method named MTMDP for categorical data. The mean distribution precision (MDP) and the cohesion degree (CD) are proposed; they are used, respectively, to select the partition attribute based on probabilistic rough set theory and to determine which child node should be selected for further clustering. The total mean distribution precision is the sum of the MDP of all attributes in the attribute set. MDP reflects the coupling between the equivalence classes: the larger the MDP is, the smaller the coupling between equivalence classes is. The clustering starts with the attribute with the maximal total MDP, and the child node with the minimal CD is selected for bi-partitioning. In the paper of Qin et al. (2014), the mean gain ratio (MGR) is used to select the clustering attribute, and the entropy of clusters is used to determine which equivalence class of the selected attribute is output as a cluster. The value of MGR reflects how much the partition induced by an attribute shares information with the partitions induced by the remaining attributes. The larger the MGR is, the closer the partition defined by the attribute is and the more it should be kept; in the partition with the highest closeness, the entropy between objects is the lowest. Therefore, the attribute with the largest MGR value
is selected as the clustering attribute to define the partitions and equivalence classes, and the equivalence class with the smallest entropy among those defined by the selected attribute is output as a cluster and removed from the dataset. This operation is repeated for each attribute in the dataset until the termination criterion is met. MGR can cluster categorical data with or without specifying the number of clusters, has better clustering accuracy, stability, efficiency, and scalability, and can be applied to small- or large-scale datasets with imbalanced or balanced class distributions (Qin et al. 2014).
Hierarchical fast topological CIM-based ART (HFTCA) is a divisive hierarchical clustering algorithm based on adaptive resonance theory-based clustering (Yamada et al. 2020). During clustering, a network is first generated from all the data points by FTCA; then the data points assigned to the nodes of the previous layer are divided by running FTCA independently in the next layer with a new shared similarity threshold, and this procedure is recursively performed until the desired number of layers is reached or the network becomes stable. HFTCA is able to produce a proper number of nodes and to independently manage the growth of the hierarchy structure based on the data distribution. However, its main shortcoming is that the similarity threshold of every layer has to be set to a different value.
The divisive hierarchical clustering methods can be utilized to deal with the different types
of data, e.g. community data, document sets, categorical data, and geospatial datasets.
The Principal Direction Divisive Partitioning (PDDP) method (Boley 1998) is proposed
for clustering document sets. It uses Vector Space Model (VSM) (Salton 1975) to represent
documents. In the process of clustering, the method introduces Singular Value Decomposi-
tions (SVD) (Golub and Loan 1996), and calculates the principal direction of each cluster
of each level of the hierarchical clustering. Then it uses a hyperplane, that passes through
the origin and is perpendicular to the principal direction, to divide the cluster into two
parts. Girvan and Newman (2002) have proposed a divisive hierarchical clustering method to detect communities within real networks. The betweenness of all edges is defined, and the final clusters are obtained by repeatedly removing the edge with the highest betweenness and recomputing the betweenness of the edges affected by the removal. The authors applied the method to two real-world networks, a collaboration network of scientists and a food web of marine organisms, and the results were very much in line with expectations. In the DHCC algorithm based on multiple correspondence analysis (MCA), Xiong et al. (2012) view clustering categorical data as an optimization problem. In the splitting procedure, the method uses MCA (Abdi and Valentin 2007) to implement the initial splitting, where the Chi-square distance between a single object and a group of objects is used as the dissimilarity measure computed on the indicator matrix. In the refinement phase, the objective is to improve the splitting quality by minimizing the sum of Chi-square errors (SCE); the algorithm iteratively selects one cluster to split into two sub-clusters until no cluster can be split or the clustering quality can no longer be improved. DHCC is fully automated and parameterless, and is independent of the data processing order. In addition, its time complexity is linear in the number of clustering objects and in the total number of categorical values. DHCC has been applied to detect the natural structure of data on four real datasets, and high-quality clustering is
obtained. However, DHCC consumes more memory, cannot be optimized globally, and may be affected by abnormal objects in the data. Li et al. (2017) have proposed a Cell-Dividing Hierarchical Clustering (CDHC) method to handle geospatial datasets. The method uses the global spatial context to identify noise points and multi-density points, and uses a boundary retraction structure to split the dataset into two parts iteratively until the terminating condition is met. CDHC has been used to perform spatial analysis of retail agglomeration in the business circles of Wuhan, and the analysis results reflect well the current situation of commercial development in Wuhan.
A lot of agglomerative hierarchical methods have been developed by designing different types of similarity measures, e.g., distance-based similarity measures, nearest-neighbor-based similarity measures, measures based on the quality evaluation of the partition after merging, and other similarity measures.
Many agglomerative clustering algorithms have been proposed, differing in the way the similarity measure is defined based on various distance measures. Single linkage (Anderberg 1973) uses the shortest distance between data points from two different clusters to define the similarity measure and works well for small datasets; complete linkage (Sneath and Sokal 1975) uses the largest distance between data points from two different clusters; group average linkage (Jain and Dubes 1988) uses the average distance between data points from two different clusters; centroid linkage (Lewis-Beck et al. 2003) uses the distance between the centroids of the two clusters. Ward's method (Murtagh and Legendre 2014) uses the smallest increase of the sum of squared errors when two clusters are merged. The update of the similarity measure in all of the aforementioned methods is conveniently formulated by the Lance–Williams dissimilarity update formula (Lance and Williams 1967), where the parameters define the different agglomerative criteria (Reddy and Vinzamuri 2013; Murtagh and Contreras 2012) and their generalizations (Wishart 1969; Batagelj 1988; Jambu et al. 1989; Székely and Rizzo 2005). The Lance–Williams dissimilarity update formula is as follows:
$$d(i \cup j, k) = \alpha_i d(i, k) + \alpha_j d(j, k) + \beta d(i, j) + \gamma\,|d(i, k) - d(j, k)|,$$
where i and j are the objects or clusters to be agglomerated into cluster i ∪ j, and $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$ are the coefficients defining the agglomerative criterion. The coefficient values for the different methods surveyed above are listed in Table 2.
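Since Table 2 is not reproduced in this excerpt, the sketch below uses the standard textbook Lance–Williams coefficients for a few common linkage criteria; these coefficient values are an assumption, not the table's contents.

```python
def lance_williams(d_ik, d_jk, d_ij, n_i, n_j, n_k, method="average"):
    """One Lance-Williams update: distance from the merged cluster (i u j) to k.
    Standard coefficients per linkage are assumed; Ward's coefficients apply
    to squared Euclidean distances."""
    if method == "single":
        a_i = a_j = 0.5; beta = 0.0; gamma = -0.5      # yields min(d_ik, d_jk)
    elif method == "complete":
        a_i = a_j = 0.5; beta = 0.0; gamma = 0.5       # yields max(d_ik, d_jk)
    elif method == "average":
        a_i = n_i / (n_i + n_j); a_j = n_j / (n_i + n_j); beta = 0.0; gamma = 0.0
    elif method == "ward":
        tot = n_i + n_j + n_k
        a_i = (n_i + n_k) / tot; a_j = (n_j + n_k) / tot
        beta = -n_k / tot; gamma = 0.0
    else:
        raise ValueError(method)
    return a_i * d_ik + a_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

# example: single linkage reduces to the minimum of the two distances
print(lance_williams(2.0, 5.0, 3.0, n_i=4, n_j=6, n_k=3, method="single"))  # 2.0
```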
CURE (Guha et al. 1998) uses the shortest distance between the representative data
points from two different clusters to define the similarity measure.
Müllner (2011) has proposed a new agglomerative hierarchical clustering algorithm, called Generic_Linkage, which is suitable for any distance update scheme and achieves better performance for the "centroid" and "median" clustering schemes; it can also handle inversions in the dendrogram. In the Generic_Linkage algorithm, a priority queue of nearest neighbors and minimal distances is maintained, and a pair of globally closest nodes is found in each iteration. In the same experimental environment, the MST-based algorithm performs best with single linkage, while the NN-chain algorithm achieves the best performance with the complete, average, weighted, and Ward schemes (Müllner 2011). More specifically, Rohlf's MST-LINKAGE algorithm for single linkage and Murtagh's NN-CHAIN-LINKAGE algorithm for the "complete", "average", "weighted" and "Ward" scenarios are the most efficient among algorithms based on distance dissimilarities (Müllner 2011).
Gullo et al. (2008) have proposed a centroid-linkage-based agglomerative hierarchical algorithm for clustering uncertain objects. They define multivariate and univariate uncertain-prototype distances between cluster prototypes based on the Bhattacharyya coefficient, and the means of these distances are used as the similarity measure.
Sharma et al. (2019) have compared single linkage, complete linkage, and Ward's method for agglomerative hierarchical clustering, and they have concluded that complete linkage and Ward's method find more accurate similarities than single linkage during merging.
In a new agglomerative hierarchical clustering algorithm called HC-OT, hierarchical clustering is combined with optimal transport (OT) (Rabin et al. 2014; Courty et al. 2017) for the first time, and an OT-based distance is proposed as the similarity measure used in HC-OT; this new distance considers the intra-cluster distribution of the data (Chakraborty et al. 2020).
Cho et al. have utilized a new agglomerative clustering scheme to implement feature matching and deformable object matching (Cho et al. 2009). In this agglomerative clustering, a comprehensive pairwise dissimilarity function is defined based on a photometric dissimilarity, which is the Euclidean distance between objects, and a geometric dissimilarity, which is the mean of the conditional geometric dissimilarities between two feature correspondences. However, this dissimilarity cannot handle the matching of deformable objects well, because the deformation of objects causes varying compactness in the geometric dissimilarity. Based on this dissimilarity, improved dissimilarity measures, the kNN linkage model and the adaptive kNN linkage model (AP-link), are then proposed (Cho et al. 2009). The kNN linkage
model enhances the robustness to outliers and the compactness by considering the k nearest linkages. In AP-link, k is adaptively determined according to a computed threshold, so straggling compactness is effectively avoided when the number of matching pairs between two images is larger than k. The proposed bottom-up clustering framework based on AP-link can effectively cluster multi-class tasks with many distracting outliers.
Galdino and Maciel (2019) have proposed hierarchical clustering for interval-valued data using the width of the range Euclidean distance: the widths from the range Euclidean distance matrix between pairs of objects are used to produce the dissimilarity matrix. They have found that different combinations of representative interval-valued distance measures and linkage methods are best for clusters of particular shapes, and that different linkage algorithms can yield totally different results on the same dataset because of their specific properties.
In summary, in this type of method, the similarity is calculated directly or indirectly based on a certain distance measure according to the type of dataset. For numerical data, the similarity between objects is computed using distance measures directly, such as the Euclidean distance or the geodesic distance. For data whose attributes can be converted into numerical metrics, the similarity between objects can also be computed based on distance measures after the conversion, e.g., the cosine distance. When merging, the two clusters with maximal similarity are selected and merged, and the similarity matrix is then updated based on the merged clusters using the same distance measure and used in the next merging step; a minimal usage example based on SciPy is given below.
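For reference, the snippet below shows SciPy's implementation of distance-based agglomerative clustering with several of the linkage criteria discussed above; the dataset and the number of flat clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two well-separated synthetic groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                     # agglomerative merging (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 flat clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per linkage
```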
The nearest neighbors of an object can reflect the affinity around it, and a variety of methods have been proposed that use nearest neighbors as the similarity measure during agglomerative clustering.
Nazari et al. have proposed an agglomerative hierarchical clustering algorithm in which the nearest neighbor of each data point is found to form a pair, and all the pairs that share intersection points are repeatedly joined to form a cluster until the desired number of clusters is reached; the intersection points between the nearest-neighbor pairs are interpreted as the similarity measure between two clusters (Nazari and Kang 2018). The method runs with high efficiency, with a time complexity of O(n²), and yields clusters of good quality in most cases regardless of the dimensionality and the number of data points.
A reciprocal-nearest-neighbors supported clustering algorithm called RSC has been developed (Cheng et al. 2019b). RSC is based on the simple hypothesis that reciprocal nearest objects should be in the same cluster. RSC first constructs sub-clustering trees (SCTs) according to the chain in which each data point links to its nearest neighbor, until all data points are in SCTs; then the data points whose path length from the root in the SCTs is larger than a calculated threshold are pruned and linked to artificial roots. After these construction and pruning processes, each obtained SCT is treated as a node, and the construction and pruning processes are repeated to obtain a higher-level clustering until a single cluster is formed. Because reciprocal nearest neighbors are adopted to construct and prune the SCTs, the efficiency and effectiveness of RSC are improved: the time complexity is O(n log n) and the memory complexity is O(n). RSC also has strong interpretability owing to the hierarchy tree of the result.
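A minimal sketch of the reciprocal-nearest-neighbor pairing that seeds such methods is given below; it is an illustration only, not the RSC implementation, and the brute-force distance computation is chosen for brevity.

```python
import numpy as np
from scipy.spatial.distance import cdist

def reciprocal_nn_pairs(X):
    """Return index pairs (i, j) such that j is i's nearest neighbor and
    i is j's nearest neighbor (reciprocal nearest neighbors)."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)                 # ignore self-distances
    nn = D.argmin(axis=1)                       # nearest neighbor of each point
    return [(i, int(j)) for i, j in enumerate(nn) if nn[j] == i and i < j]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(reciprocal_nn_pairs(X))                   # [(0, 1), (2, 3)]
```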
Ros et al. (2020) have proposed a novel clustering algorithm called KdMutual, which combines the mutual neighborhood and agglomerative hierarchical methods using mutual-neighbor similarities based on single linkage. KdMutual first merges the data points
by their mutual neighborhood to identify potential core clusters. The core clusters then grow through a constrained hierarchical merging process, similar to single linkage, to handle noise. In the last phase, the resulting clusters are obtained by ranking the core clusters according to a new similarity measure, which accounts for the cluster size, the compactness of the cluster, and the separability between clusters.
In summary, methods based on nearest neighbors have higher efficiency, lower memory requirements, and robustness to noise. However, for large-scale datasets their computational cost becomes expensive owing to the huge number of samples, and they are affected by the curse of dimensionality in high-dimensional spaces.
4.1.3 The methods based on the quality evaluation of the partition after merging
Because the results of hierarchical clustering cannot be undone once a merging or division is accomplished, the quality of each merging or division significantly influences the resulting clustering. Validity evaluation indices are usually used to assess the validity of a clustering method and, in general, to determine the optimal number of clusters; examples include the Mean Square Error (MSE) (Han et al. 2011; Tsekouras et al. 2008), the Clustering Dispersion Index (CDI) (Han et al. 2011; Tsekouras et al. 2008; Duran and Odell 2013), and the within-cluster sum of squares to between-cluster variation ratio (WCBCR) (Tsekouras et al. 2008). According to whether the ground-truth labels are known or not, clustering validity evaluation indices can be divided into two types: external evaluation indices and internal evaluation indices.
4.1.3.1 External evaluation indices Given the ground-truth class assignments and the assignments produced by the clustering algorithm for the same samples, we can evaluate the quality of the clustering results using external evaluation indices. Commonly used external evaluation indices are listed in Table 3 (Xu and Tian 2015).
4.1.3.2 Internal evaluation indices When the ground-truth class assignments are unknown and only the assignments produced by the clustering algorithm are available, we usually use the features of the original data to assess the clustering results. Some commonly used internal evaluation indices are shown in Table 4 (Hubert and Arabie 1985).
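Several of the indices in Tables 3 and 4 have off-the-shelf implementations in scikit-learn; the snippet below illustrates one external index (ARI) and two internal indices (silhouette and Calinski–Harabasz) on synthetic data, with illustrative parameter values.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score, calinski_harabasz_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# external index: compares the predicted labels with the ground truth
print("ARI:", adjusted_rand_score(y_true, y_pred))
# internal indices: use the data and the predicted labels only
print("Silhouette:", silhouette_score(X, y_pred))
print("Calinski-Harabasz:", calinski_harabasz_score(X, y_pred))
```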
4.1.3.3 The measures for directly evaluating the dendrogram quality Besides using external and internal evaluation indices to assess the quality of clusters, the quality of a cluster tree can also be evaluated directly using a more holistic measure, such as the dendrogram purity (Kobren et al. 2017). Suppose the ground truth of a dataset $X = \{x_i\}_{i=1}^{N}$ is known and the data points are assigned to K clusters $C^* = \{C_k^*\}_{k=1}^{K}$. Let $P^* = \{(x_i, x_j) \mid C^*(x_i) = C^*(x_j)\}$ denote the pairs of points that are in the same cluster in the ground truth. For a cluster tree T generated by a hierarchical clustering algorithm, the dendrogram purity of T is defined as:
$$DP(T) = \frac{1}{|P^*|} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k^*} pur\bigl(lvs(LCA(x_i, x_j)),\, C_k^*\bigr),$$
where $LCA(x_i, x_j)$ is the least common ancestor of $x_i$ and $x_j$ in T, $lvs(z) \subseteq X$ is the set of leaves under any internal node z in T, and $pur(S_1, S_2) = |S_1 \cap S_2| / |S_1|$. The larger DP(T) is, the higher the quality of the clustering result.
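A small sketch of this computation is given below, assuming the cluster tree is represented by parent pointers; the representation and names are chosen for illustration only.

```python
from itertools import combinations

def dendrogram_purity(parent, leaf_label):
    """Dendrogram purity of a cluster tree given as parent pointers.
    parent: dict node -> parent (root maps to None); leaf_label maps each
    leaf node to its ground-truth class."""
    # collect the set of leaves under every node
    leaves_below = {}
    for leaf in leaf_label:
        node = leaf
        while node is not None:
            leaves_below.setdefault(node, set()).add(leaf)
            node = parent.get(node)

    def ancestors(n):
        out = []
        while n is not None:
            out.append(n)
            n = parent.get(n)
        return out

    def lca(a, b):
        anc = set(ancestors(a))
        for n in ancestors(b):
            if n in anc:
                return n

    # group leaves by ground-truth class
    classes = {}
    for leaf, c in leaf_label.items():
        classes.setdefault(c, []).append(leaf)

    num, total = 0.0, 0
    for members in classes.values():
        mset = set(members)
        for xi, xj in combinations(members, 2):
            lvs = leaves_below[lca(xi, xj)]
            num += len(lvs & mset) / len(lvs)   # pur(lvs(LCA(xi, xj)), C_k*)
            total += 1
    return num / total if total else 1.0

# toy tree: root r with children a (leaves 1, 2) and b (leaves 3, 4)
parent = {"r": None, "a": "r", "b": "r", 1: "a", 2: "a", 3: "b", 4: "b"}
labels = {1: "x", 2: "x", 3: "y", 4: "y"}
print(dendrogram_purity(parent, labels))  # 1.0 for this perfectly pure tree
```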
Table 3 External evaluation indices
• Rand indicator: RI = (TP + TN) / (TP + FP + FN + TN), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
• Fowlkes–Mallows indicator: FM = sqrt( TP/(TP + FP) · TP/(TP + FN) ).
• Jaccard indicator: J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN).
• F indicator: F_β = (β² + 1) · P · R / (β² · P + R), where P = TP/(TP + FP) and R = TP/(TP + FN).
• Adjusted Rand Index (ARI): ARI = (RI − E[RI]) / (max(RI) − E[RI]), where RI is computed over the C(N, 2) pairs of samples.
Table 4 Internal evaluation indices
• Silhouette coefficient: S = (1/n) Σ_{i=1}^{n} (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance between a sample and all other points in the same class, b_i is the mean distance between a sample and all other points in the next nearest cluster, and n is the number of samples.
• Calinski–Harabasz index: s = [tr(B_k) / tr(W_k)] × [(n_E − k) / (k − 1)], where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, with W_k = Σ_{q=1}^{k} Σ_{x ∈ C_q} (x − c_q)(x − c_q)^T and B_k = Σ_{q=1}^{k} n_q (c_q − c_E)(c_q − c_E)^T, where c_q and n_q are the center and size of cluster C_q, c_E is the center of the whole dataset E, and n_E is its size.
• Dunn's index: DVI = min_{1≤i<j≤n} d(i, j) / max_{1≤k≤n} d′(k), where d(i, j) is the distance between clusters i and j, and d′(k) is the intra-cluster distance of cluster k.
4.1.3.4 The methods based on the quality evaluation of the partition after merging To improve the accuracy of agglomerative hierarchical clustering, some methods have been developed that assess the quality of the partition using an evaluation index after each merging.
Heller and Ghahramani (2005) have proposed a Bayesian agglomerative hierarchical clustering algorithm that evaluates the marginal likelihoods of a probabilistic model of the data and uses a model-based criterion to determine which clusters should be merged. Bayesian hypothesis testing, rather than a distance measure, is used as the similarity measure to decide which merges are advantageous and to output the recommended depth of the tree. Compared to traditional hierarchical clustering, this similarity measure provides a guide for choosing the correct number of desired clusters and answers how good a clustering is.
Rocha and Dias (2013) have proposed that the evaluation of the quality of the partition after merging two clusters be used as the similarity measure in agglomerative hierarchical clustering. The quality of the partition is the ratio of pairs of alternatives that are indifferent, comparable, and consistent with the decision maker's preference model.
In the paper of Kaur et al. (2015), the performance of agglomerative hierarchical methods is evaluated from the viewpoint of cluster quality. The cohesion measurement, the elapsed time, and the silhouette index, called quality indices for simplicity hereafter, are adopted to assess the cluster quality. For a given dataset, the agglomerative hierarchical clustering algorithm whose quality indices are the best on that dataset has the best fitness for the dataset among the alternative algorithms.
PERCH (Sander et al. 2003) uses dendrogram purity to evaluate the quality of the cluster tree. The measure assumes that the ground truth of the dataset is given: the larger the dendrogram purity, the higher the quality of the tree and the better the clustering result.
This type of method must calculate the evaluation index after every merging to determine whether the merging is good or not, so the computational time complexity is very high for large-scale datasets. Besides, the chosen evaluation index directly affects the quality of the resulting clustering.
Distance (EuD) between objects; it not only considers the neighborhood and the weights of objects, but also avoids the situation where EuD equals zero by adding COS. This similarity can be robustly applied to generate the dendrogram of bipartite networks with complex structures.
The density of a data region is also used as a similarity measure in hierarchical clustering for clustering data with complex structures and noise. Many agglomerative hierarchical clustering methods based on density estimation have been developed to deal with different kinds of problems.
Lu et al. have used estimated density values to assist hierarchical clustering: an edge-weighted tree is first constructed from the data using the distance matrix and the density values, and the hierarchical clustering result is then efficiently obtained using the edges sorted by their weights (Lu and Wan 2013). The proposed method, PHA, can handle non-spherical clusters, overlapping clusters, and clusters containing noise, and runs fast while producing high-quality results.
Some agglomerative hierarchical clustering methods are based on dividing points into core points with higher densities and border points with lower densities. Cheng et al. (2019b) have proposed a local-cores-based hierarchical clustering algorithm called H-CLORE, which first divides the dataset into several clusters by finding local cores, i.e., the points with the locally maximal density among their local neighbors. The lower-density points are then temporarily removed so that the boundaries between clusters become clearer, and the clusters with the highest similarity are merged. Finally, the removed points are assigned to the same clusters as the local cores they belong to. The similarity measure between clusters is defined as a function of the inter-connectivity and the closeness of the clusters to be merged. The algorithm is effective and efficient for finding clusters with complex structures.
H-CLORE initially divides the dataset into many smaller clusters by searching for local cores instead of iteratively optimizing the partition as k-means does, which reduces the total computing cost to O(N log N) (Cheng et al. 2019b). A local core has the maximal density among its neighboring points; since the lower-density points are removed temporarily and the boundaries between clusters become clear, H-CLORE is insensitive to outliers (Cheng et al. 2019b). H-CLORE can also discover clusters with complex structures efficiently and effectively, and can be applied in many fields such as pattern recognition, data analysis in 3D reconstruction, and image segmentation (Cheng et al. 2019b).
In the paper of Cheng et al. (2019a), a novel hierarchical clustering algorithm based on noise removal (HCBNR) has been proposed. HCBNR marks the points with lower density as noise using a density curve and removes them from the dataset. A saturated neighbor graph (SNG) is then constructed, and the connected sub-graphs are divided into initial clusters using M-Partition according to the number of clusters. The initial clusters are repeatedly merged using a newly defined similarity measure between clusters until the desired number of clusters is obtained. The similarity measure simultaneously takes into account the connectivity and the closeness of the sub-clusters to be merged, as in the Chameleon algorithm. HCBNR can not only better handle noisy data points and clusters with arbitrary shapes, but also runs faster than DBSCAN, Chameleon, and CURE.
The border-peeling clustering method iteratively peels the border points of the clusters until only core points remain (Averbuch-Elor et al. 2020). During the peeling process, associations between the peeled points and points in the inner layer are created by estimating the local density of the data points. After the peeling process terminates, the remaining core points are merged with the closely reachable neighborhoods of the points according to the associations.
The border points at iteration t are the set $\mathcal{X}_B^{(t)} = \{x_i \in \mathcal{X}^{(t)} : B_i^{(t)} = 1\}$, where $\mathcal{X}^{(t)}$ is the set of unpeeled points at the start of the t-th iteration, and $B_i^{(t)}$ is the border classification value of the point, which is 1 if $x_i$ is a border point and 0 otherwise. $B_i^{(t)}$ is determined by the reverse k nearest neighbors of $x_i$, i.e., $RN_k^{(t)}(x_i) = \{x_j \mid x_i \in N_k^{(t)}(x_j)\}$, and a pairwise relationship function f between points $x_i$ and $x_j$ (Averbuch-Elor et al. 2020). The set of unpeeled points for the next iteration is $\mathcal{X}^{(t+1)} = \mathcal{X}^{(t)} \setminus \mathcal{X}_B^{(t)}$.
The association between each identified border point $x_i \in \mathcal{X}_B^{(t)}$ and a neighboring non-border point $\rho_i \in \mathcal{X}^{(t+1)}$ is computed as follows (Averbuch-Elor et al. 2020):
$$l_i = \min\Bigl\{ \sum_{x_j \in N_{NB,k}^{(t)}(x_i)} C_k\, \xi(x_i, x_j),\ \lambda \Bigr\},$$
Deep learning is a popular technique in machine learning and has been applied to clustering analysis. Features are extracted by a deep learning method and then used to compute more accurate similarities during clustering, which can improve the accuracy of the clustering results. This idea has also been utilized in agglomerative hierarchical clustering.
In the paper of Zeng et al. (2020), the HCT method combines hierarchical clustering with a hard-batch triplet loss for person re-identification. The hierarchical clustering is used to generate pseudo labels by iteratively agglomerating the two samples with the minimal unweighted average linkage distance; the PK sampling algorithm then produces a new dataset using the pseudo labels for fine-tuning the CNN. The similarity among images in the target domain is thus fully exploited by hierarchical clustering.
The hard-batch triplet loss with PK can efficiently increase the distance between dissimilar
samples while reducing the distance between similar samples, so that the hard samples are
better distinguished.
The unweighted average linkage distance between clusters is defined as
$$D_{ab} = \frac{1}{n_a n_b} \sum_{i \in C_a,\, j \in C_b} D(C_{ai}, C_{bj}),$$
where $C_{ai}$ and $C_{bj}$ are samples in the clusters $C_a$ and $C_b$ respectively, $n_a$ and $n_b$ are the numbers of data points in $C_a$ and $C_b$ respectively, and $D(\cdot)$ denotes the Euclidean distance. This distance considers all the pairwise distances between two clusters and assigns them the same weight when merging.
Zhao et al. combine bottom-up hierarchical clustering and a CNN for person re-identification (Zhao et al. 2020). The proposed non-locally enhanced feature network can sufficiently extract the features of images by embedding non-local blocks into the CNN. The similarity between clusters is determined using an inter-cluster distance called the intermediate distance (IMD) and an intra-cluster distance called the compactness degree (CPD) simultaneously. IMD is the mean of the farthest distance and the nearest distance between two clusters, which can avoid some wrong merges, while CPD measures the compactness via the mean intra-cluster distance. After training with the proposed CNN, the two clusters with the largest similarity are merged; training and agglomerative hierarchical clustering are performed iteratively, so that images of the same person can be grouped into one cluster.
To avoid the false merging due to the minimum distance criterion of BUC (Lin et al. 2019),
the proposed intermediate distance (IMD) is defined as follows:
$$IMD(A, B) = \frac{1}{2}\Bigl( \min_{x_a \in A,\, x_b \in B} d(x_a, x_b) + \max_{x_a \in A,\, x_b \in B} d(x_a, x_b) \Bigr),$$
wherein d(xa , xb ) indicates the Euclidean distance between the feature embeddings of two
images.
The compactness degree (CPD), which evaluates the intra-cluster distance, is defined as
$$CPD(A) = \frac{1}{n} \sum_{i, j \in A} d(x_i, x_j),$$
wherein n is the number of samples in cluster A, d(xi , xj ) indicates the Euclidean distance
between the feature embeddings of two images in cluster A.
Considering the inter-cluster and intra-cluster distances simultaneously, the final distance between clusters A and B is formulated as
$$D(A, B) = IMD(A, B) + \lambda\,(CPD(A) + CPD(B)),$$
wherein λ is a parameter balancing the effects of IMD and CPD.
Clusters with a small D(A, B) should be merged during hierarchical clustering. The proposed method efficiently reduces incorrect merges in the early stages thanks to the newly proposed measure composed of IMD and CPD, which improves the quality of the bottom-up clustering.
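A sketch of the IMD/CPD-based merging score is given below. The value of λ and the handling of the CPD normalization simply follow the formulas as written above and are illustrative only, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def imd(A, B):
    """Intermediate distance: mean of the closest and farthest cross-cluster
    distances between feature embeddings."""
    D = cdist(A, B)
    return 0.5 * (D.min() + D.max())

def cpd(A):
    """Compactness degree: sum of pairwise intra-cluster distances divided by
    the cluster size, following the formula as written."""
    n = len(A)
    return pdist(A).sum() / n if n > 1 else 0.0

def merge_score(A, B, lam=0.5):
    """D(A, B) = IMD + lambda * (CPD(A) + CPD(B)); smaller means merge first.
    lam = 0.5 is a hypothetical value of the trade-off parameter."""
    return imd(A, B) + lam * (cpd(A) + cpd(B))

rng = np.random.default_rng(0)
A, B = rng.normal(0, 1, (5, 16)), rng.normal(0.5, 1, (7, 16))
print(merge_score(A, B))
```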
The above deep-learning-based methods all adopt ResNet50 as the backbone to learn the feature space of the re-ID objects. ResNet50 is initialized with weights pre-trained on ImageNet, the dropout rate is set to 0.5, and SGD with a momentum of 0.9 is used to optimize the model.
Other variations of the agglomerative hierarchical clustering methods that use mixed strategies during merging have also been proposed.
Takumi and Miyamoto (2012) have defined two asymmetric similarity measures and their no-reversal criteria: one is a probabilistic model using a top-down scheme that first defines the similarity measure between clusters, with the similarity between objects as a special case; the other is an extended updating formula using a bottom-up scheme that first defines the similarity between objects and then defines the similarity measure between clusters from it. An asymmetric dendrogram based on the Chi-square test has been proposed for agglomerative hierarchical clustering using the aforesaid asymmetric similarity measures.
In a new hierarchical clustering method proposed by Jalalat-evakilkandi and Mirzaei (2010), the dendrograms of base hierarchical clustering methods, such as single linkage and complete linkage, are converted into matrices, and these matrices are then combined in a weighted procedure to produce a final description matrix based on scatter matrices and the nearest-neighbor criterion. The proficiency and robustness of hierarchical clustering are increased by this combination of base hierarchical clustering methods. Zhao et al. have extended instance-level constraints to hierarchical constraints in agglomerative hierarchical clustering, called ordering constraints, by which hierarchical side information can be captured and hierarchical knowledge can be encoded into agglomerative hierarchical algorithms. An ordering constraint sets merge preferences of objects but does not change the similarities during clustering (Zhao and Qi 2010).
Mao et al. have proposed a new hierarchical clustering method in which sequence data are first divided into groups using a new landmark-based active hierarchical divisive (AHDC) method; the ESPRIT-Tree clustering method (Cai and Sun 2011), a hierarchical agglomerative method with quasi-linear time complexity, is then applied to each group individually to find the correct hierarchy of the group; finally, the hierarchy of the data is obtained by assembling the hierarchies from the above steps (Mao et al. 2015). This method scales to tens or even hundreds of millions of sequences by using high-performance parallel computing, and its time complexity is linear in the number of input sequences (Mao et al. 2015). The method, named HybridHC, is clearly a hybrid hierarchical clustering with two stages, i.e., AHDC division and ESPRIT-Tree agglomeration.
In the rough-set-based agglomerative hierarchical clustering algorithm (RAHCA) (Chen et al. 2006), the data are mapped to a decision table (DT) in light of rough set theory (RST). An attribute membership matrix (AMM) is built from the DT and then used to accomplish the merging of the corresponding items. The similarity is a numerical measure of the categorical data based on the Euclidean distance over the DT. The consistent degree is used to measure the quality of a cluster, and the agglomerate degree is used as the stopping criterion of the algorithm. The clustering level is defined as a combination of the consistent degree and the agglomerate degree. The cluster with the minimal clustering level value (Dmin) and the cluster with the highest similarity to Dmin among the remaining clusters are merged by updating the AMM.
Varshney et al. have proposed a Probabilistic Intuitionistic Fuzzy Hierarchical Clustering (PIFHC) algorithm that uses intuitionistic fuzzy sets (IFSs) to handle the uncertainty in the data and leverages the probabilistic-weighted Euclidean distance measure (PEDM) to compute the weights between the data points as the similarity; the clusters are then formed in an agglomerative way (Varshney et al. 2022). PIFHC can better identify
uncertain data points using IFSs; however, its computational cost is higher because a probabilistic weight has to be computed for each data point, a suitable membership function depends on the particular problem, and the parameter α is determined experimentally (Varshney et al. 2022). The authors argue that the uncertainty in the data may also be represented using fuzzy set variants such as Pythagorean fuzzy sets, interval-valued intuitionistic fuzzy sets, and type-2 fuzzy sets (T2 FSs); therefore, designing new hierarchical clustering algorithms based on these variants is worth studying further (Varshney et al. 2022).
Incremental agglomerative hierarchical clustering methods have also been proposed. COBWEB (Fisher 1987) is one of the most prominent algorithms; it incrementally clusters the objects to form a conceptual category tree. The quality of a cluster is measured by the category utility, a probabilistic description of how likely an object in a parent-level cluster is to be assigned to a cluster at the child level. COBWEB maximizes the category utility when a new object is inserted. Sahoo et al. (2006) have proposed an improvement of COBWEB by changing its underlying assumption that the observations follow a Normal distribution to Katz's distribution, which makes the method suitable for hierarchically clustering text documents.
BIRCH (Zhang et al. 1996) uses the natural closeness of points to incrementally cluster the observations. It first constructs the CF-Tree of the dataset; the sparse clusters are then deleted and the dense clusters in the leaf nodes are merged. BIRCH is especially suitable for very large datasets, handles convex or spherical clusters of uniform size very well, and deals effectively with noise.
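For reference, scikit-learn provides an implementation of BIRCH; the snippet below shows a typical (illustrative) parameterization.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# threshold bounds the radius of each CF subcluster; branching_factor limits
# the number of CF subclusters per node; n_clusters triggers a final global
# clustering of the leaf subclusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), np.bincount(labels))
```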
The key properties under which incremental clustering can detect the structure of clusters have been studied, including niceness, perfectness, and refinement (Ackerman and Dasgupta 2014). A nice clustering is one in which any data point is closer to the points in its own cluster than to the points in other clusters; in other words, the separability between clusters is always larger than the compactness within a cluster. A perfect clustering means that the minimal separability between clusters is larger than the maximal diameter of the clusters. Refinement requires that each cluster in the clustering result be contained in some cluster, and it usually allows additional clusters to be produced.
Zhang et al. (2013) have proposed a graph-structural agglomerative clustering algorithm that defines the path integral as a structural descriptor of a cluster on the graph. The path integral is the sum of all the paths within the cluster on the directed graph. The similarity measure between two clusters is therefore defined as the incremental path integral when merging them, which measures the structural change of the clusters after merging and is calculated in a closed-form exact solution. Different types of clustering methods are suited to clusters with distinct structures.
Kobren et al. (2017) have proposed an incremental hierarchical clustering algorithm called PERCH that can non-greedily handle a large number of data points as well as a large number of clusters. Under a separability assumption, PERCH maintains the quality of the growing tree by performing rotation operations whenever masking occurs after a new point is inserted at a leaf node. PERCH achieves higher dendrogram purity and speeds up the clustering procedure regardless of the order in which items arrive, and it scales to large datasets and large numbers of clusters.
In the paper of Shimizu and Sakurai (2018), a parallel and distributed hierarchical clustering method based on the Actor Model (Agha 1990), called ABIRCH, has been proposed. In this method an added point is incrementally received and processed by the behavior of an actor according to the BIRCH algorithm (Zhang et al. 1996), and a divisive hierarchical clustering is represented by an actor. An actor acts as a node, and a set of nodes can be summarized by a value called the clustering feature (CF), where each node in the tree maintains a CF value. The CF value is updated when a point is added to a node, and the radius of the CF value expands accordingly. When the radius becomes greater than a threshold, the node is split.
The incremental clustering method proposed by Narita et al. (2018, 2020) can update the partial clusters without re-clustering when a new point is inserted, which saves execution time. The center and radius of each cluster are defined. If the distance between the point to be inserted and the center of a cluster is less than or equal to the radius, the clustering result is updated by inserting the point; the centers and radii of the affected clusters are then updated from the root of the cluster tree towards the leaves until the insertion condition is no longer met.
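A minimal sketch of this kind of center-and-radius test is given below. The Node class, the running-mean center update, and the radius-growing rule are illustrative assumptions rather than the exact procedure of the cited method; only the untouched subtrees being left as-is reflects the idea described above.

```python
import numpy as np

class Node:
    """A cluster summarized by a center, a radius, and the number of points it covers."""
    def __init__(self, center, radius, n=1, children=None):
        self.center = np.asarray(center, dtype=float)
        self.radius = float(radius)
        self.n = n
        self.children = children or []

def insert(node, x):
    """Walk from the root towards the leaves while the new point x stays inside a
    cluster's radius, updating the affected centers and radii as running means;
    subtrees that do not contain x are left untouched (no re-clustering)."""
    x = np.asarray(x, dtype=float)
    if np.linalg.norm(x - node.center) > node.radius:
        return False                                  # x does not belong to this partial cluster
    while True:
        node.n += 1
        node.center += (x - node.center) / node.n     # incremental mean update
        node.radius = max(node.radius, float(np.linalg.norm(x - node.center)))
        inside = [c for c in node.children
                  if np.linalg.norm(x - c.center) <= c.radius]
        if not inside:
            return True                               # insertion condition fails below this level
        node = min(inside, key=lambda c: np.linalg.norm(x - c.center))

root = Node([0.0, 0.0], radius=5.0, n=10,
            children=[Node([1.0, 1.0], 1.5, n=5), Node([-2.0, -2.0], 1.5, n=5)])
print(insert(root, [0.8, 1.2]), insert(root, [10.0, 10.0]))
```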
Motivated by a notion of separability of clusterings under linkage functions, the GRINCH method uses generic similarity functions, which can measure any affinity between two point sets, to non-greedily perform hierarchical clustering on large-scale datasets (Monath et al. 2019). GRINCH mainly consists of a rotate subroutine and a graft subroutine, which rearrange the generated hierarchical tree locally and globally, respectively, when new data items arrive (Monath et al. 2019). It can produce clustering results that are as consistent as possible with the ground truth when the linkage function is consistent with the ground truth, regardless of the arrival order of the data, and it can recover clusters with complex structures (Monath et al. 2019). Based on graph theory, model-based separation is defined by characterizing the relationship between a linkage function and a dataset, and GRINCH efficiently produces a cluster tree with higher dendrogram purity in this separated setting (Monath et al. 2019).
The Sub-Cluster Component algorithm (SCC) (Monath et al. 2021) is an agglomerative hierarchical clustering method that scales to billions of data points and can produce hierarchies containing an optimal flat partition. It uses a series of increasing distance thresholds to determine which sub-clusters should be merged in a given round, under a separability assumption and a non-parametric DP-means objective. SCC has been applied to cluster large-scale web queries, where it achieves results competitive with state-of-the-art incremental agglomerative hierarchical clustering while requiring less running time.
In the Internet of Things (IoT), data collected from IoT sensors are annotated with the Resource Description Framework (RDF), classified and analyzed by agents, and subsequently represented as data streams (DS). The data patterns in the DS with the minimum distance between them are merged. A nearest-neighbor-chain based incremental hierarchical clustering is used to cluster the streaming data (Núñez-Valdéz et al. 2020). When a new DS arrives, it is processed starting from any node in the hierarchical tree until a pair of data samples forming Reciprocal Nearest Neighbors (RNN) is found, and these data samples are then aggregated. The same process is continued with RNNs over the hierarchical tree of previously annotated objects (Núñez-Valdéz et al. 2020). The distance can be calculated with any distance function, such as the Manhattan, Euclidean or Minkowski distance in D-dimensional space (Núñez-Valdéz et al. 2020).
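The reciprocal-nearest-neighbor chain mechanics themselves can be sketched in a few lines. The version below operates on a static distance matrix with a single-linkage update (a reducible linkage, for which the chain strategy is valid) rather than on streaming RDF-annotated data, so it only illustrates the chain-and-merge idea, not the cited system.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def nn_chain(X):
    """Agglomerative clustering via the nearest-neighbor chain / RNN strategy.
    Returns the merges as (cluster_a, cluster_b, distance) with single-linkage updates."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)
    active = set(range(len(X)))
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))          # start a new chain anywhere
        a = chain[-1]
        b = min((j for j in active if j != a), key=lambda j: d[a, j])
        if len(chain) > 1 and b == chain[-2]:
            # a and b are reciprocal nearest neighbors: merge b into a.
            merges.append((a, b, float(d[a, b])))
            new = np.minimum(d[a, :], d[b, :])        # single-linkage distance update
            d[a, :] = new
            d[:, a] = new
            d[a, a] = np.inf
            d[b, :] = np.inf
            d[:, b] = np.inf
            active.discard(b)
            chain.pop()
            chain.pop()
        else:
            chain.append(b)
    return merges

X = np.random.RandomState(1).randn(10, 3)
for a, b, dist in nn_chain(X):
    print(f"merge {a} and {b} at distance {dist:.3f}")
```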
An incremental, supervised-learning based clustering method, called dynamic clustering of data streams considering concept drift (DCDSCD), has also been developed. In this method, the data stream is clustered automatically in a supervised manner, and clusters whose values decrease over time are identified and eliminated (Nikpour and Asadi 2022). Moreover, the generated clusters can be used to classify unlabeled data.
Each chunk is clustered independently and automatically in a supervised manner that
presents timely results without obtaining the number of clusters from the user. Centroids
as a candidate for the cluster's label name. At this point, the hierarchy of clusters is generated (Dixit 2022). This method keeps a separate state for outlier/unassigned documents so that they can be re-evaluated in the next data iteration (Dixit 2022). Moreover, it does not require rebuilding the clustering from scratch for the previous data; instead, only a new FI update is needed.
The clusters of a hierarchical clustering are obtained by cutting the dendrogram at a certain height, so how to extract clusters from a given dendrogram is worth studying. Several methods address this problem through global or local cuts, as described below; a minimal example of the baseline global cut is shown first.
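For reference, this is what a plain global cut looks like with SciPy's hierarchy utilities; the synthetic data and the thresholds are arbitrary, and the methods that follow go beyond this single-threshold scheme.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).randn(50, 2)
Z = linkage(X, method="average")                            # build the dendrogram

# Global cut: every link above the chosen height is removed, and the remaining
# connected components become the flat clusters.
labels_by_height = fcluster(Z, t=1.5, criterion="distance")
labels_by_count = fcluster(Z, t=3, criterion="maxclust")    # or request k clusters directly
print(labels_by_height, labels_by_count)
```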
One method can convert a dendrogram into a reachability plot, and vice versa, and can automatically extract the important clusters from the reachability plot, using local maxima of the reachability values to separate clusters (Sander et al. 2003). This makes hierarchical clustering a practical preprocessing tool in data mining tasks whose downstream steps are based on clustering results (Sander et al. 2003).
In the paper of Campello et al. (2013), HDBSCAN constructs a cluster tree composed of clusters generated by DBSCAN*, an improved version of DBSCAN. A mutual reachability graph is constructed by assigning each edge a weight equal to the mutual reachability distance between two objects in the dataset; an MST is computed from the mutual reachability graph, and an extended MSText is then obtained by adding a self-loop to each vertex of the MST. The HDBSCAN hierarchy is extracted from MSText by iteratively removing its edges in descending order of edge weight. A tree of significant clusters is generated from the simplified HDBSCAN hierarchy, and the significant clusters are then produced by optimal local cuts.
Using the HDBSCAN hierarchy and the proposed stability measure, globally optimal significant clusters can be obtained, and the time complexity can be reduced from O(dn²) to O(n²); however, the space complexity increases from O(dn) to O(n²), where d is the dimensionality of the data points and n denotes the number of data points.
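The mutual reachability construction at the heart of this family of methods can be sketched directly. The snippet below computes mutual reachability distances, an MST over them, and the corresponding single-linkage hierarchy; it is a simplification that omits HDBSCAN's condensed tree and stability-based cluster extraction, and the parameter names are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.cluster.hierarchy import linkage

def mutual_reachability(X, min_pts=5):
    """Pairwise mutual reachability distances:
    max(core_k(a), core_k(b), d(a, b)), with core_k = distance to the k-th neighbor."""
    d = squareform(pdist(X))
    core = np.sort(d, axis=1)[:, min_pts]   # k-th nearest neighbor distance (column 0 is the self-distance)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

X = np.random.RandomState(0).randn(100, 2)
mreach = mutual_reachability(X, min_pts=5)

# An MST of the mutual reachability graph; removing its edges in descending weight
# order is equivalent to single linkage on the mutual reachability distances.
mst = minimum_spanning_tree(mreach).toarray()
Z = linkage(squareform(mreach, checks=False), method="single")
print(int((mst > 0).sum()), "MST edges;", Z.shape[0], "merges in the hierarchy")
```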
Kernel-based methods are another active topic in clustering, since they help algorithms detect complex, non-linearly separable clusters. Therefore, many kernel-based hierarchical clustering algorithms have been proposed to reveal more complex structures in the data.
In the paper of Qin et al. (2003), comparative experiments show that the same hierarchical clustering algorithm can produce different hierarchical trees with different kernel functions, and that the results of kernel hierarchical clustering are not better than those of standard hierarchical clustering according to an internal twofold cross-validation evaluation and an external PPR evaluation. Moreover, the study shows that kernelization usually increases the dimensionality of the data; SVMs and related large-margin algorithms combined with kernels do not suffer from the curse of dimensionality, but combining other methods with kernelization should be done with caution.
Endo et al. (2004) combine several agglomerative hierarchical clustering methods with kernel functions to construct new kernel clustering methods; these hierarchical clustering methods use the squared Euclidean norm between two items in the feature space as the dissimilarity when merging. The new methods are respectively called the Kernel Centroid Method, Kernel Ward's Method, Kernel Nearest Neighbor Method (K-AHC-N), Kernel Furthest Neighbor Method (K-AHC-F), Kernel Average Linkage between the Merged Groups (K-AHC-B), and Kernel Average Linkage within the Merged Group (K-AHC-I). Because recomputing the dissimilarities and kernel function after each merge is costly, they reduce the cost by reusing the dissimilarities that were already computed before the merge, so the computational cost is much lower than that of a complete recalculation after every merge.
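The kernel trick underlying these variants is that feature-space squared distances can be obtained from the Gram matrix alone, ||φ(x) − φ(y)||² = k(x, x) − 2k(x, y) + k(y, y), and then fed to any agglomerative scheme. The sketch below uses an RBF kernel and SciPy's average linkage rather than the specific centroid/Ward updates of the cited work, so it is an illustration of the idea, not their algorithm.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def rbf_kernel_matrix(X, gamma=1.0):
    """RBF Gram matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

def kernel_feature_space_distances(K):
    """Squared distances in the kernel-induced feature space:
    ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    diag = np.diag(K)
    return np.clip(diag[:, None] - 2.0 * K + diag[None, :], 0.0, None)

X = np.vstack([np.random.RandomState(0).randn(30, 2),
               np.random.RandomState(1).randn(30, 2) + 4.0])
K = rbf_kernel_matrix(X, gamma=0.5)
D2 = kernel_feature_space_distances(K)

# Agglomerative clustering on the feature-space dissimilarities; the kernel values
# are computed once up front and reused across all merges.
Z = linkage(squareform(np.sqrt(D2), checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```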
Hierarchical Kernel Spectral Clustering (Alzate and Suykens 2012) makes use of kernel spectral clustering to discover the underlying cluster hierarchy in a dataset. Pairs of clustering parameters (k, 𝜎²), the number of clusters and the RBF kernel parameter respectively, are found by training a clustering model for each (k, 𝜎²) and assessing the Fisher criterion on a validation set of out-of-sample data: for each k, the maximum value of the BLF criterion across the given range of 𝜎² values is computed, and if this maximum is larger than the BLF threshold 𝜃, the corresponding optimal (k, 𝜎²) pair is added to a set. During this process, the memberships of all objects are created, and the hierarchical structure is then built by merging the two closest clusters according to a specified linkage criterion. Because the clustering model is trained in a learning setting, it has good generalization ability, but the enumerative model training over (k, 𝜎²) may consume considerable running time.
Multiple kernel clustering and fusion methods may lose advantageous clustering details when going from the kernels or graphs to the partition matrix, due to the sudden drop in dimensionality (Liu et al. 2021). Hierarchical Multiple Kernel Clustering (HMKC) (Liu et al. 2021) gradually groups the samples into fewer and fewer clusters, generating a sequence of intermediate matrices of gradually decreasing size. The consensus partition is learned at the same time and in turn guides the construction of the intermediate matrices. A newly designed optimization algorithm jointly optimizes the intermediate matrices and the consensus partition by alternately performing forward and backward propagation to update the variables. The experimental results show that larger intermediate matrices can preserve more informative details of the clusters, and that performance improves as the number of layers of intermediate matrices increases, but the computational complexity increases accordingly. The authors therefore advise using the HMKC-2 model, which contains two layers of intermediate matrices, for clustering tasks, to improve stability and keep the complexity low.
In conclusion, kernel hierarchical clustering can discover more complex, non-linearly separable clusters in the dataset; however, it increases the computational cost due to the introduction of the kernel.
Different agglomerative hierarchical clustering methods have also been developed to deal
with different types of datasets.
ROCK (Guha et al. 2000) uses a link-based Goodness Measure between two clusters to define the similarity measure for categorical data. The two clusters with the maximal Goodness Measure are merged at each iterative step. ROCK has been applied to the Congressional Votes, Mushroom, and U.S. Mutual Fund datasets to discover the inherent distribution of the data, and the results are encouraging. Squeezer (He et al. 2002) hierarchically clusters categorical data by reading each tuple of the dataset in sequence; a tuple is either assigned to an existing cluster or used to create a new (initially singleton) cluster according to a similarity defined as the sum, over the attributes of the tuple, of their support within the cluster. Because it reads the dataset only once, both the efficiency and the quality of the clustering results are high. d-Squeezer, a variant of Squeezer, has also been proposed to handle large databases by directly writing the cluster identifier of each tuple back to a file instead of holding the clusters in memory (He et al. 2002). Both Squeezer and d-Squeezer are suited to clustering data streams.
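To make the link-based criterion concrete, here is a small sketch of ROCK-style links and the goodness measure, using Jaccard similarity with a threshold θ to define neighbors and f(θ) = (1 − θ)/(1 + θ); the toy records (treated as item sets) and the brute-force search over cluster pairs are illustrative assumptions, not the original implementation.

```python
import numpy as np
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def rock_best_merge(data, clusters, theta=0.3):
    """Pick the cluster pair with the highest ROCK goodness measure:
    links(Ci, Cj) / ((ni + nj)^(1+2f) - ni^(1+2f) - nj^(1+2f)), f = (1-theta)/(1+theta)."""
    n = len(data)
    neigh = np.array([[jaccard(data[i], data[j]) >= theta for j in range(n)]
                      for i in range(n)])
    np.fill_diagonal(neigh, False)
    link = neigh.astype(int) @ neigh.astype(int)    # link(p, q) = number of common neighbors
    f = (1.0 - theta) / (1.0 + theta)
    best = None
    for (a, ca), (b, cb) in combinations(enumerate(clusters), 2):
        cross = sum(link[p, q] for p in ca for q in cb)
        ni, nj = len(ca), len(cb)
        expected = (ni + nj) ** (1 + 2 * f) - ni ** (1 + 2 * f) - nj ** (1 + 2 * f)
        g = cross / expected
        if best is None or g > best[0]:
            best = (g, a, b)
    return best                                     # (goodness, index of Ci, index of Cj)

records = [("a1=r", "a2=x"), ("a1=r", "a2=y"), ("a1=b", "a2=x"),
           ("a1=b", "a2=z"), ("a1=g", "a2=z")]
clusters = [[0], [1], [2], [3], [4]]
print(rock_best_merge(records, clusters, theta=0.3))
```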
A hierarchical clustering framework for categorical data based on Multinomial and Bernoulli mixture models has also been proposed, in which the Bhattacharyya and Kullback–Leibler distances are used as similarity measures between clusters (Alalyan et al. 2019). This method has been applied to text document clustering and computer vision applications. In these two tasks, the objects are described as binary vectors and the similarities are evaluated using the Bhattacharyya and Kullback–Leibler distances. The accuracy of the clustering results is improved significantly compared with methods based on Gaussian probability models and Bayesian methods. The experiments show that it can be applied to many other applications involving a hierarchical structure and count or binary data (Alalyan et al. 2019).
Lerato and Niesler (2015) have proposed a multi-stage agglomerative hierarchical clustering approach (MAHC) aimed at large datasets of speech segments and based on an iterative divide-and-conquer strategy. The dynamic time warping (DTW) algorithm (Myers et al. 1980; Yu et al. 2007) is used to compute the similarity measure between two segments (Lerato and Niesler 2015). In the first stage, the dataset is divided into several subsets and AHC is applied separately to each subset. In the second stage, the average points computed in the previous stage are clustered using AHC. In this way, the storage requirement is reduced and the clustering procedure can be implemented in parallel, so MAHC can easily be extended to large-scale datasets. MAHC has been applied to speech segments, and the results show that its performance reaches and often exceeds that of AHC and that it performs well in parallel computing.
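As a reminder of the dissimilarity involved, the following is a plain dynamic-programming DTW between two one-dimensional sequences; the cited work applies DTW to multi-dimensional acoustic feature sequences, so this is only a minimal sketch of the distance itself, not of the multi-stage clustering.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences
    (quadratic time, no warping-window constraint)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]

# Two segments with similar shape but different timing end up close under DTW.
s1 = np.sin(np.linspace(0, 2 * np.pi, 40))
s2 = np.sin(np.linspace(0, 2 * np.pi, 55))
print(dtw_distance(s1, s2), dtw_distance(s1, -s2))
```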
Rahman et al. have proposed a highly robust hierarchical clustering algorithm (RHC) for metabolomics data that uses the covariance matrix of a two-stage generalized S-estimator in the presence of cell-wise and case-wise outliers (Rahman et al. 2018). RHC successfully reveals the original pattern in metabolomics analysis and performs better than traditional HC in the presence of cell-wise and case-wise outliers. Another method clusters protocol feature words according to the longest common subsequence between order sequences carrying byte position information, which serves as the similarity in the merging process (Li et al. 2019). This hierarchical clustering based method can extract the feature words of unknown protocols and has a higher recall compared with PoKE (Li et al. 2019).
One method has been proposed to deal with noise in single-linkage hierarchical clustering from two aspects (Ros and Guillaume 2019). First, the single-link criterion takes the local density into account to ensure that the distance involves a core point of each group. Second, once the representative clusters that exceed a minimum size have been determined, the hierarchical algorithm prohibits merging them.
For omics analysis, Hulot et al. (2020) have proposed the mergeTrees method, which uses agglomerative hierarchical clustering to merge several trees that have common leaf nodes, at the same heights, into a consensus tree. This method does not need the number of clusters to be specified in advance, only requires centering and standardization as preprocessing, and its time complexity is O(nq log(n)) for large datasets, where n is the number of leaves and q is the number of trees to be merged. The set of q trees can be obtained from the data with any HC method, and the consensus tree based on T = T1, ..., Tq is denoted by C(T). The analysis shows that mergeTrees is robust to the presence of empty data tables.
Mulinka et al. (2020) have built the HUMAN method, which uses density-based hierarchical clustering techniques to detect anomalies in multi-dimensional data and to analyze the dependencies and relationships among the data hierarchies, in order to interpret the potential causes behind the detected behaviors with minimal guidance and no ground truth.
An agglomerative hierarchical clustering method has been proposed by Fouedjio for multivariate geostatistical data; the method is model-free and based on a dissimilarity measure derived from a non-parametric kernel estimator that can reflect the multivariate spatial dependence structure of the data (Fouedjio 2016). The method can cluster irregularly spaced data and obtain spatially contiguous clusters without any geometrical constraints. For sparse or small-scale datasets, however, the kernel estimator cannot correctly estimate the similarity among multivariate spatial data, and because the method is model-free it cannot give the degree of membership of a point to a specified cluster (Fouedjio 2016). D'Urso and Vitale (2020) have proposed a robust similarity measure, an exponential transformation of the kernel estimator of Fouedjio (2016), for agglomerative hierarchical clustering. The measure is insensitive to noise, model-free, and suitable for clustering data indexed by geographical coordinates. The agglomerative clustering method using the proposed measure has been applied to a georeferenced data set of sampled locations and topsoil heavy metal concentrations, which it clusters successfully.
A new agglomerative hierarchical clustering algorithm has been proposed and applied to cluster geochemical data (Yang et al. 2019). After preprocessing, the geochemical data approximately follow a multivariate normal distribution, and the symmetric Kullback–Leibler divergences between the distributions are used as the dissimilarity measure during merging in the hierarchical clustering. The proposed method not only provides a tool for revealing the relationships between geological objects according to geochemical data, but also shows that DKLS and its two components can characterize geochemical differences from different angles. These measures are expected to enhance methods for identifying geochemical patterns.
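For reference, a symmetrized Kullback–Leibler divergence between two multivariate normal distributions can be computed from the cluster means and covariances as below. This is a generic sketch of such a dissimilarity; the exact normalization and decomposition of DKLS used in the cited work may differ.

```python
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for two multivariate normal distributions."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = np.asarray(mu1) - np.asarray(mu0)
    term = np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff - d
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (term + logdet1 - logdet0)

def symmetric_kl(mu0, cov0, mu1, cov1):
    """Symmetrized KL divergence used as a dissimilarity between two Gaussian clusters."""
    return kl_gauss(mu0, cov0, mu1, cov1) + kl_gauss(mu1, cov1, mu0, cov0)

rng = np.random.RandomState(0)
A, B = rng.randn(200, 3), rng.randn(200, 3) * 0.5 + 2.0
dis = symmetric_kl(A.mean(0), np.cov(A.T), B.mean(0), np.cov(B.T))
print(dis)   # such pairwise dissimilarities could feed an agglomerative scheme
```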
Graphs are receiving more and more attention in clustering analysis because of their natural ability to represent the structure of relationships among objects. Moreover, some operations on graphs can be used to address problems in clustering. Usually the clustering problem can be converted into the construction and partitioning of a simple graph in low-dimensional space or of a hypergraph in high-dimensional space. This idea can also be used in hierarchical clustering and yields better results.
Some simple graph-based methods have been proposed, in which the task of hierarchical clustering is converted into the construction and partitioning of a graph, e.g., the clustering methods based on the MST (Zahn 1971; Guan and Du 1998), CLICK (Sharan and Shamir 2000), spectral clustering (von Luxburg 2007), improvements of spectral clustering (Chen et al. 2011; Cai and Chen 2015; He et al. 2019; Huang et al. 2020), and the Kernighan–Lin (KL) method (Kernighan and Lin 1970).

Table 5 Comparison of the state-of-the-art agglomerative hierarchical clustering methods

Methods | Similarity measure | Type of data | Time complexity | Type of output
Generic_Linkage | Any distance update scheme | Numerical data | O(n³) | Stepwise dendrogram
HC-OT | OT based on distance | Numerical data, images | O(n³ log(n)) | Dendrogram
Zazari et al. | The nearest neighbors | Numerical data | O(n²) | Dendrogram
RSC | Reciprocal nearest neighbors | Numerical data | O(n log n) | Dendrogram
KdMutual | Mutual neighborhood | Numerical data | O(n³) | Dendrogram
Heller et al. | Bayesian hypothesis testing | Numerical data, text | O(p(n)) | Dendrogram
Bubble clustering | Associator | Numerical data | O(n log(n)) | Dendrogram
HA | Density | Numerical data | O(n²) | Dendrogram
According to the similarity model used to construct the graph, graph-based agglomerative hierarchical clustering methods can be divided into two categories: methods based on a static similarity model and methods based on a dynamic similarity model.
Most of the agglomerative clustering methods introduced above are based on static interconnectivity models (Karypis et al. 1999b). Chameleon, a multi-stage hierarchical clustering algorithm, is based on a dynamic interconnectivity model that considers both the aggregate interconnectivity and the closeness of the data points when defining the similarity measure (Karypis et al. 1999b). It first constructs a graph from the k-nearest neighbors of the data points, then uses a graph partitioning algorithm to divide the objects into many relatively small sub-clusters so that the objects in each sub-cluster are highly similar. In the final and most important stage, Chameleon uses the interconnectivity and closeness measures to repeatedly merge the sub-clusters until some criteria are met or only one cluster remains (Karypis et al. 1999b).
The relative interconnectivity RI(Ci, Cj) of two clusters Ci and Cj is the absolute interconnectivity between them normalized with respect to their internal interconnectivities:

$$RI(C_i, C_j) = \frac{\left|EC(C_i, C_j)\right|}{\tfrac{1}{2}\left(\left|EC(C_i)\right| + \left|EC(C_j)\right|\right)},$$

where EC(Ci, Cj) denotes the sum of the weights of the edges that straddle Ci and Cj, and EC(Ci) (or EC(Cj)), the internal interconnectivity, is the minimum weight sum of the edges that must be cut to divide the cluster Ci (or Cj) into two roughly equal parts.
The relative closeness RC(Ci, Cj) of two clusters Ci and Cj is the absolute closeness between Ci and Cj normalized with respect to their internal closeness:

$$RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC}(C_j)},$$

where $\bar{S}_{EC}(C_i, C_j)$ is the average weight of the edges that connect Ci and Cj. Similarly, $\bar{S}_{EC}(C_i)$ (or $\bar{S}_{EC}(C_j)$) is the average weight of the edges of the cut that divides Ci (or Cj) into two roughly equal parts.
Chameleon chooses pairs of clusters with both high RI and high RC to merge, so a natural way is to take their product; that is, a pair of clusters is selected such that RI(Ci, Cj) × RC(Ci, Cj) is maximal. In this formula, RI and RC have equal importance, but sometimes one wants to give higher preference to one of the two measures. Chameleon's similarity measure is therefore defined as

$$S(C_i, C_j) = RI(C_i, C_j) \times RC(C_i, C_j)^{\alpha},$$

where 𝛼 is a user-specified parameter: if 𝛼 > 1, Chameleon gives more weight to the relative closeness, and if 𝛼 < 1, it prefers to merge the two clusters with higher relative interconnectivity.
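The following sketch evaluates S(Ci, Cj) = RI × RC^α from a weighted adjacency matrix of a kNN graph. It is a simplified re-implementation under stated assumptions: the min-weight bisection of each sub-cluster is approximated with a Fiedler-vector sign split instead of the hMETIS-style partitioner used in the original Chameleon, and the toy graph is random.

```python
import numpy as np

def bisector_edges(W):
    """Approximate min-weight bisection of a cluster subgraph via its Fiedler vector;
    returns (total weight, number) of the cut edges."""
    if W.shape[0] < 2:
        return 0.0, 0
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    side = vecs[:, 1] >= np.median(vecs[:, 1])        # split into two roughly equal halves
    cut = W[np.ix_(side, ~side)]
    return float(cut.sum()), int((cut > 0).sum())

def chameleon_similarity(W, ci, cj, alpha=2.0):
    """S(Ci, Cj) = RI(Ci, Cj) * RC(Ci, Cj)^alpha from a weighted adjacency matrix W
    and two index lists ci, cj."""
    cross = W[np.ix_(ci, cj)]
    ec_ij, n_ij = float(cross.sum()), int((cross > 0).sum())
    ec_i, n_i = bisector_edges(W[np.ix_(ci, ci)])
    ec_j, n_j = bisector_edges(W[np.ix_(cj, cj)])
    eps = 1e-12
    ri = ec_ij / (0.5 * (ec_i + ec_j) + eps)
    s_ij = ec_ij / max(n_ij, 1)                       # average weight of cross edges
    s_i, s_j = ec_i / max(n_i, 1), ec_j / max(n_j, 1)
    wi = len(ci) / (len(ci) + len(cj))
    wj = len(cj) / (len(ci) + len(cj))
    rc = s_ij / (wi * s_i + wj * s_j + eps)
    return ri * rc ** alpha

# Toy usage on a small symmetric similarity graph with two candidate sub-clusters.
rng = np.random.RandomState(0)
W = rng.rand(8, 8); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
print(chameleon_similarity(W, [0, 1, 2, 3], [4, 5, 6, 7], alpha=2.0))
```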
Zhao et al. have applied a similar Chameleon-style method to cluster documents, called the constrained agglomerative algorithm, which restricts certain documents to belonging to the same cluster: a distance threshold-based graph is first constructed from the document dataset, the graph is partitioned to obtain the constraint clusters, and the final hierarchical solution is then constructed from them using the UPGMA agglomerative scheme (Zhao and Karypis 2002). The constrained method leads to better-quality clusters because the constraints improve the quality of the neighborhood of each document. Later, Zhao et al. (2005) proposed a Chameleon-like algorithm to cluster documents (high-dimensional data): first a kNN graph is constructed from the dataset, then the graph is divided into k clusters using a min-cut partitioning algorithm, and finally an agglomerative hierarchical clustering is obtained from the clusters of the previous phase using a single-link similarity measure.
Barton et al. have proposed the MoCham algorithm, which is very similar to the Chameleon method: a graph is constructed using a kd-tree in the first phase, and a multi-objective priority queue is then used to determine the similarity measure between the sub-clusters during the merging phase (Barton et al. 2016). Cao et al. have introduced an optimized Chameleon method based on local features and a grid structure: first an adaptive neighbor graph is established according to the distances between data points, then the graph is partitioned based on the local point density to produce many sub-clusters, and in the merging phase the sub-clusters are repeatedly merged according to the local point density (Cao et al. 2018).
Dong et al. have proposed an improved Chameleon method, which introduces recursive dichotomy, the flood fill method, and the quotient 𝛾 of cluster density, and proposes a cutoff method that automatically selects the best clustering result (Dong et al. 2018). Barton et al. have proposed the Chameleon 2 algorithm, in which flood fill is used to refine the partition and the measure of internal cluster quality is modified to improve the merging (Barton et al. 2019). Chameleon 2 can also be regarded as a general clustering framework, which is suitable for a wide range of complex clustering problems and performs well on the data sets commonly used in the clustering literature (Barton et al. 2019). In overlapping hierarchical clustering (OHC), the dendrogram is constructed from a graph to which edges are gradually added according to an increasing distance threshold 𝛿, so that the size of the formed clusters gradually grows from 0 (Jeantet et al. 2020). The graph density, defined as the ratio of the number of edges in the subgraph constructed at threshold 𝛿 to the number of edges of the connected graph, is used as the merging condition. After each increase of 𝛿, the constructed subgraph is added to the corresponding level of the dendrogram. As 𝛿 increases, edges are gradually added to the graph until all vertices are linked. In the resulting dendrogram some nodes have more than one parent node, so this structure is called a quasi-dendrogram.
In the paper of Toujani and Akaichi (2018), a method that combines bottom-up and top-down hierarchical clustering has been discussed for social media networks. The social media networks are preprocessed to generate a weighted, directed graph in which each node is a community. The graph is partitioned randomly, or with a dedicated partitioning technique, to obtain the initial community partition. The degree of the influential users is defined as the average of the covariance between the edges of the graph and the Jaccard coefficient between nodes. The opinion-leader-based modularity function (QOPL) is defined based on the influential users' degree. The reproduction probability of each cluster (Pri) is defined as the ratio of its QOPL to the QOPL of all the other clusters in the graph. In the genetic hierarchical bottom-up algorithm (GBUA), the two clusters with the highest Pri are selected to merge. In the genetic top-down algorithm (GTDA), the cluster with the lowest Pri is selected. The genetic hybrid hierarchical algorithm (GHHA) applies GBUA and GTDA alternately until the resulting community structure is the same whether GHHA starts from GBUA or from GTDA.
The generic scheme of graph-based clustering algorithms is thus to construct a graph from relationships discovered by data mining techniques and then to obtain the clusters or the dendrogram by partitioning the graph with a graph partitioning method. In general, the methods based on a dynamic similarity model produce clusters of better quality than those based on a static similarity model.
However, the Chameleon method and the improvements described above are all based on simple graphs, which represent the pairwise similarities between data points. Simple graphs can only represent pairwise relationships in the data and cannot represent higher-order relationships in high-dimensional space. Therefore, hypergraphs can be used to overcome this shortcoming of simple graphs when analyzing high-dimensional data.
Methods that construct the hierarchy based on density estimation have also attracted the interest of researchers. This type of method not only retains the advantages of density-based clustering, but also reveals the hierarchical structure among clusters, so more information can be obtained from the clustering result.
RNG-HDBSCAN* improves the strategy for computing multiple density-based clustering hierarchies by replacing the Mutual Reachability Graph with a Relative Neighborhood Graph and by incrementally leveraging the solutions of HDBSCAN* across a series of values of mpts (Neto et al. 2021). RNG-HDBSCAN* is about 60 times faster than HDBSCAN*, because the complete graph that HDBSCAN* builds for every mpts is replaced with an equivalent smaller graph for a larger mpts, and it scales better to larger datasets. In some cases, however, it cannot detect all of the cluster structures in the data within a single hierarchy that uses a single value of mpts (Neto et al. 2021).
The Hierarchical Quick Shift (HQuick-Shift) algorithm (Altinigneli et al. 2020) constructs a Mode Attraction Graph (MAG) to overcome drawbacks of Quick Shift (QShift), namely that the hierarchy information of the flat clusters is not reflected and that all flat clusters obey the assumption that groups correspond to modes at an invariant density threshold. An RNN constructs clusters with a hierarchy from the density estimation function, and the MAG is used to find the optimal neighborhood parameter 𝜏. Solving the constrained optimization problem yields the globally optimal solution for a specific parameter 𝜏 in an unsupervised manner, and multi-level mode sets are extracted automatically. HQuick-Shift can cope with clusters of arbitrary shape, size and density, can recognize noise, and can select the optimal parameter to prevent the clustering result from being under- or over-fragmented. It can also learn the quasi-temporal relationships of the objects with a general extractor with an RNN back-end during processing.
Zhu et al. (2022) have introduced density-connected hierarchical density-peak clustering (DC-HDP), which can obtain clusters with varied densities from the viewpoint of the hierarchical structure of clusters rather than a flat clustering, by defining two types of clusters, i.e., 𝜂-linked clusters and 𝜂-density-connected clusters, where 𝜂 denotes the nearest neighbour with higher density. It gives formal cluster definitions that did not exist in previous work, and remedies the shortcomings of DP and DBSCAN, namely that DBSCAN is not suitable for clustering datasets with varied densities, and that DP fails to find 𝜂-linked clusters with varied densities as well as non-𝜂-linked clusters. Two clusters are merged from bottom to top only if their modes are density-connected at the current level. DC-HDP has three parameters to be set: k ∈ {2, 3, ..., 50}, 𝜖 ∈ {0.1%, 0.2%, ..., 99.9%}, and 𝜏 = 1.
DC-HDP makes use of the respective advantages of DBSCAN and DP: it enhances the ability to recognize clusters of arbitrary shape and different densities, and provides more information about the hierarchical structure of the clusters, while consuming only the same computational time as DP (Zhu et al. 2022). DC-HDP can therefore be widely used in various applications from this new viewpoint.
As can be seen from the above, density-based hierarchical clustering methods still use density as the similarity measure, can identify clusters of arbitrary shape and size, and construct hierarchical relationships among the clusters, so they convey richer information than traditional density-based clustering.
Some hierarchical clustering methods combine bottom-up and top-down strategies, or combine clustering with other advanced techniques, to produce better hierarchical clustering results.
Hierarchical self-organizing maps (SOM) (Kohonen 2001) use an artificial neural network architecture to cluster data. A SOM has an input layer and a hidden computational layer in which each node represents a desired cluster and a topology exists between the clusters. Once a centroid is determined, the data points closest to it are assigned to the corresponding cluster. The centroid is subsequently updated, and the centroids surrounding it are updated accordingly. SOM provides data visualization and an elegant topology diagram.
Geng and Ali (2005) have considered the influence of all clusters when selecting the clusters to be merged in their stochastic message passing clustering (SMPC) method, which is based on an ensemble probability distribution. SMPC can undo a sub-cluster whose objects do not have a good probability distribution, which improves the clustering performance.
Zeng et al. (2009) have proposed a feature selection method based on the long-tailed distribution of words in documents (LTD-Selection) together with a hierarchical clustering algorithm. The hierarchical clustering algorithm uses LTD-Selection to select the feature words in the documents and computes the frequency of each word; the first k feature words, sorted in descending order of frequency, are then clustered by k-Means to obtain the topic set, and the parent–child relations between clusters at neighboring levels are established. The next k feature words are then selected to extend the tree-like structure, until all feature words are contained in one tree.
Muhr et al. (2010) have developed an algorithm that uses a growing k-Means method to cluster the documents in a list into clusters that satisfy certain criteria; the documents are then treated as the children of these clusters. According to the specified constraints, the clusters are split using growing k-Means or merged according to the highest similarity between clusters. By performing these steps recursively, the hierarchy is generated.
is improved by LSH-based data partitioning, which discovers local clusters in each data node in Hadoop. The algorithm is a two-stage parallel clustering method that integrates a subspace clustering algorithm with a conventional agglomerative hierarchical clustering algorithm.
In the paper of Cheung and Zhang (2019), they have proposed the growing multilayer topology training algorithm, called GMTT, which trains a collection of seed points that are linked in a growing, hierarchical manner and can represent the data distribution. The topology trained by GMTT can accelerate the similarity computation between data points and guide the merging. A DL linkage, a type of density-based metric defined on the topology trained by GMTT, is then proposed and used as the similarity measure during merging. Many sub-MSTs are formed according to the topology and the DL linkage, and an MST is then formed by adding links between the subsets according to the sub-MSTs. Finally, the MST is transformed into a dendrogram according to the corresponding topology. An incremental version of GMTT, called IGMTT, has also been proposed to process streaming data. Both GMTT and IGMTT improve the time complexity of agglomerative hierarchical clustering without losing accuracy.
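The last step, turning an MST into a dendrogram, can be illustrated independently of GMTT's topology and DL metric: processing the MST edges in increasing order of weight with a union-find structure yields a single-linkage-style merge sequence. The sketch below uses ordinary Euclidean distances and is only a simplified stand-in for that final step, not the cited algorithm.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_to_dendrogram(X):
    """Build an MST over pairwise distances, then convert it into a merge sequence
    by processing MST edges in increasing weight order with union-find
    (this reproduces a single-linkage dendrogram)."""
    d = squareform(pdist(X))
    mst = minimum_spanning_tree(d).tocoo()
    edges = sorted(zip(mst.data, mst.row, mst.col))       # (weight, i, j)

    parent = list(range(len(X)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]                 # path compression
            i = parent[i]
        return i

    merges = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            merges.append((ri, rj, float(w)))             # clusters merged at height w
            parent[rj] = ri
    return merges

X = np.random.RandomState(0).randn(12, 2)
for a, b, h in mst_to_dendrogram(X):
    print(f"merge cluster {a} with cluster {b} at height {h:.3f}")
```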
9 Conclusion
Funding This work is supported by the National Key R&D Program of China (Grant No. 2017YFE0111900,
2018YFB1003205), the Higher Education Innovation Fund of Gansu Province (Grants No. 2020B-214), and
the Natural Science Foundation of Gansu Province (Grants No. 20JR10RG304).
Declarations
Conflict of interest The authors declare that they have no competing interests.
References
Abdi H, Valentin D (2007) Multiple correspondence analysis. Encycl Meas Stat 2(4):651–657
Ackerman M, Dasgupta S (2014) Incremental clustering: the case for extra clusters. In: Advances in neu-
ral information processing systems 27: annual conference on neural information processing systems
2014, December 8–13 2014, Montreal, QC. pp 307–315
Agarwal S, Lim J, Zelnik-Manor L et al (2005) Beyond pairwise clustering. In: 2005 IEEE computer society
conference on computer vision and pattern recognition (CVPR 2005), 20–26 June 2005, San Diego,
CA. IEEE Computer Society, pp 838–845. https://doi.org/10.1109/CVPR.2005.89
Agha GA (1990) ACTORS—a model of concurrent computation in distributed systems. MIT Press
series in artificial intelligence. MIT Press, Cambridge
Alalyan F, Zamzami N, Bouguila N (2019) Model-based hierarchical clustering for categorical data. In:
28th IEEE international symposium on industrial electronics, ISIE 2019, Vancouver, BC, June
12–14, 2019. IEEE, pp 1424–1429. https://doi.org/10.1109/ISIE.2019.8781307
Altinigneli MC, Miklautz L, Böhm C et al (2020) Hierarchical quick shift guided recurrent clustering.
In: 2020 IEEE 36th international conference on data engineering (ICDE). pp 1842–1845. https://
doi.org/10.1109/ICDE48307.2020.00184
Alzate C, Suykens JA (2012) Hierarchical kernel spectral clustering. Neural Netw 35(C):21–30. https://
doi.org/10.1016/j.neunet.2012.06.007
Anderberg MR (1973) Chapter 6–hierarchical clustering methods, probability and mathematical statis-
tics: a series of monographs and textbooks, vol 19. Academic Press, Cambridge. https://doi.org/
10.1016/B978-0-12-057650-0.50012-0
Averbuch-Elor H, Bar N, Cohen-Or D (2020) Border-peeling clustering. IEEE Trans Pattern Anal Mach
Intell 42(7):1791–1797. https://doi.org/10.1109/TPAMI.2019.2924953
Barton T, Bruna T, Kordík P (2016) Mocham: robust hierarchical clustering based on multi-objective
optimization. In: IEEE international conference on data mining workshops, ICDM workshops
2016, December 12–15, 2016, Barcelona. IEEE Computer Society, pp 831–838. https://doi.org/10.
1109/ICDMW.2016.0123
Barton T, Bruna T, Kordík P (2019) Chameleon 2: an improved graph-based clustering algorithm. ACM
Trans Knowl Discov Data 13(1):10. https://doi.org/10.1145/3299876
Batagelj V (1988) Generalized ward and related clustering problems. Classification and related methods
of data analysis. Jun: 67–74
Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas CK, Teboulle M
(eds) Grouping multidimensional data–recent advances in clustering. Springer, Berlin, pp 25–71.
https://doi.org/10.1007/3-540-28349-8_2
Boley D (1998) Principal direction divisive partitioning. Data Min Knowl Disc 2(4):325–344
Bouguettaya A, Yu Q, Liu X et al (2015) Efficient agglomerative hierarchical clustering. Expert Syst
Appl 42(5):2785–2797
Brans JP, Vincke P (1985) Note—a preference ranking organisation method: (the PROMETHEE method
for multiple criteria decision-making). Manag Sci 31(6):647–656
Brans JP, Vincke P, Mareschal B (1986) How to select and how to rank projects: the PROMETHEE
method. Eur J Oper Res 24(2):228–238
Cai D, Chen X (2015) Large scale spectral clustering via landmark-based sparse representation. IEEE
Trans Cybern 45(8):1669–1680. https://doi.org/10.1109/TCYB.2014.2358564
Cai Q, Liu J (2020) Hierarchical clustering of bipartite networks based on multiobjective optimization.
IEEE Trans Netw Sci Eng 7(1):421–434. https://doi.org/10.1109/TNSE.2018.2830822
Cai Y, Sun Y (2011) ESPRIT-tree: hierarchical clustering analysis of millions of 16s rRNA pyrose-
quences in quasilinear computational time. Nucleic Acids Res 39(14):e95–e95. https://doi.org/10.
1093/nar/gkr349
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density
estimates. In: Pei J, Tseng VS, Cao L et al (eds) Advances in knowledge discovery and data min-
ing. Springer, Berlin, Heidelberg, pp 160–172
Cao X, Su T, Wang P et al (2018) An optimized chameleon algorithm based on local features. In: Pro-
ceedings of the 10th international conference on machine learning and computing, ICMLC 2018,
Macau, February 26–28, 2018. ACM, pp 184–192
Carpineto C, Romano G (1996) A lattice conceptual clustering system and its application to browsing
retrieval. Mach Learn 24(2):95–122. https://doi.org/10.1007/BF00058654
Chakraborty S, Paul D, Das S (2020) Hierarchical clustering with optimal transport. Stat Probab Lett
163(108):781. https://doi.org/10.1016/j.spl.2020.108781
Chen D, Cui DW, Wang CX et al (2006) A rough set-based hierarchical clustering algorithm for cat-
egorical data. Int J Inf Technol 12(3):149–159
Chen W, Song Y, Bai H et al (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pat-
tern Anal Mach Intell 33(3):568–586. https://doi.org/10.1109/TPAMI.2010.88
Cheng Q, Liu Z, Huang J et al (2012) Hierarchical clustering based on hyper-edge similarity for com-
munity detection. In: 2012 IEEE/WIC/ACM international conferences on web intelligence, WI
2012, Macau, December 4–7, 2012. IEEE Computer Society, pp 238–242. https://doi.org/10.1109/
WI-IAT.2012.9
Cheng D, Zhu Q, Huang J et al (2019a) A hierarchical clustering algorithm based on noise removal. Int J
Mach Learn Cybern 10(7):1591–1602. https://doi.org/10.1007/s13042-018-0836-3
Cheng D, Zhu Q, Huang J et al (2019b) A local cores-based hierarchical clustering algorithm for data
sets with complex structures. Neural Comput Appl 31(11):8051–8068. https://doi.org/10.1007/
s00521-018-3641-8
Cherng J, Lo M (2001) A hypergraph based clustering algorithm for spatial data sets. In: Proceedings
of the 2001 IEEE international conference on data mining, 29 November–2 December 2001, San
Jose, CA. IEEE Computer Society, pp 83–90. https://doi.org/10.1109/ICDM.2001.989504
Cheung Y, Zhang Y (2019) Fast and accurate hierarchical clustering based on growing multilayer
topology training. IEEE Trans Neural Netw Learn Syst 30(3):876–890. https://doi.org/10.1109/
TNNLS.2018.2853407
Cho M, Lee J, Lee KM (2009) Feature correspondence and deformable object matching via agglomera-
tive correspondence clustering. In: IEEE 12th international conference on computer vision, ICCV
2009, Kyoto, September 27–October 4, 2009. IEEE Computer Society, pp 1280–1287. https://doi.
org/10.1109/ICCV.2009.5459322
Courty N, Flamary R, Tuia D et al (2017) Optimal transport for domain adaptation. IEEE Trans Pattern
Anal Mach Intell 39(9):1853–1865. https://doi.org/10.1109/TPAMI.2016.2615921
Day WH, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods.
J Classif 1(1):7–24
Dixit V (2022) GCFI++: embedding and frequent itemset based incremental hierarchical clustering with
labels and outliers. In: CODS-COMAD 2022: 5th joint international conference on data science &
management of data (9th ACM IKDD CODS and 27th COMAD), Bangalore, January 8–10, 2022.
ACM, pp 135–143. https://doi.org/10.1145/3493700.3493727
Dong Y, Wang Y, Jiang K (2018) Improvement of partitioning and merging phase in chameleon clus-
tering algorithm. In: 2018 3rd international conference on computer and communication systems
(ICCCS). IEEE, pp 29–32
Duran BS, Odell PL (2013) Cluster analysis: a survey, vol 100. Springer Science & Business Media,
Berlin
D’Urso P, Vitale V (2020) A robust hierarchical clustering for georeferenced data. Spat Stat 35(100):407.
https://doi.org/10.1016/j.spasta.2020.100407
Endo Y, Haruyama H, Okubo T (2004) On some hierarchical clustering algorithms using kernel func-
tions. In: 2004 IEEE international conference on fuzzy systems (IEEE Cat. No.04CH37542), vol
3. pp 1513–1518. https://doi.org/10.1109/FUZZY.2004.1375399
Estivill-Castro V, Lee I (2000) AMOEBA: hierarchical clustering based on spatial proximity using
delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling.
Beijing, pp 1–16
Everitt B, Landau S, Leese M (2001) Cluster analysis. Arnold, London
Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–
172. https://doi.org/10.1007/BF00114265
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifica-
tions. Biometrics 21:768–769
Fouedjio F (2016) A hierarchical clustering method for multivariate geostatistical data. Spat Stat
18:333–351. https://doi.org/10.1016/j.spasta.2016.07.003
Fränti P, Virmajoki O, Hautamäki V (2006) Fast agglomerative clustering using a k-nearest neighbor
graph. IEEE Trans Pattern Anal Mach Intell 28(11):1875–1881. https://doi.org/10.1109/TPAMI.
2006.227
Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer
vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465. https://doi.org/10.1109/34.765656
Galdino SML, Maciel PRM (2019) Hierarchical cluster analysis of interval-valued data using width of
range Euclidean distance. In: IEEE Latin American conference on computational intelligence, LA-
CCI 2019, Guayaquil, Ecuador, November 11–15, 2019. IEEE, pp 1–6. https://doi.org/10.1109/
LA-CCI47412.2019.9036754
Geng H, Ali HH (2005) A new clustering strategy with stochastic merging and removing based on ker-
nel functions. In: Fourth international IEEE computer society computational systems bioinformat-
ics conference workshops & poster abstracts, CSB 2005 workshops, Stanford, CA, August 8–11,
2005. IEEE Computer Society, pp 41–42. https://doi.org/10.1109/CSBW.2005.10
Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad
Sci USA 99(12):7821–7826
Golub GH, Loan CFV (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Govindu VM (2005) A tensor decomposition for geometric grouping and segmentation. In: 2005 IEEE com-
puter society conference on computer vision and pattern recognition (CVPR 2005), 20–26 June 2005,
San Diego, CA. IEEE Computer Society, pp 1150–1157. https://doi.org/10.1109/CVPR.2005.50
Gracia C, Binefa X (2011) On hierarchical clustering for speech phonetic segmentation. In: Proceedings
of the 19th European signal processing conference, EUSIPCO 2011, Barcelona, August 29–Sep-
tember 2, 2011. IEEE, pp 2128–2132
Guan X, Du L (1998) Domain identification by clustering sequence alignments. Bioinformatics
14(9):783–788. https://doi.org/10.1093/bioinformatics/14.9.783
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas
LM, Tiwary A (eds) SIGMOD 1998, proceedings ACM SIGMOD international conference on
management of data, June 2–4, 1998, Seattle, Washington. ACM Press, pp 73–84. https://doi.org/
10.1145/276304.276312
Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf
Syst 25(5):345–366. https://doi.org/10.1016/S0306-4379(00)00022-3
Gullo F, Ponti G, Tagarelli A et al (2008) A hierarchical algorithm for clustering uncertain data via an
information-theoretic approach. In: Proceedings of the 8th IEEE international conference on data
mining (ICDM 2008), December 15–19, 2008, Pisa. IEEE Computer Society, pp 821–826. https://
doi.org/10.1109/ICDM.2008.115
Guo JF, Zhao YY, Li J (2007) A multi-relational hierarchical clustering algorithm based on shared near-
est neighbor similarity. In: 2007 international conference on machine learning and cybernetics.
IEEE, pp 3951–3955. https://doi.org/10.1109/ICMLC.2007.4370836
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst
17(2–3):107–145. https://doi.org/10.1023/A:1012801612483
Han J, Kamber M, Pei J (2011) Data mining concepts and techniques: third edition. Morgan Kaufmann
Ser Data Manag Syst 5(4):83–124
Han X, Zhu Y, Ting KM et al (2022) Streaming hierarchical clustering based on point-set kernel. In:
KDD ’22: the 28th ACM SIGKDD conference on knowledge discovery and data mining, Wash-
ington, DC, August 14–18, 2022. ACM, pp 525–533. https://doi.org/10.1145/3534678.3539323
He Z, Xu X, Deng S (2002) Squeezer: an efficient algorithm for clustering categorical data. J Comput
Sci Technol 17(5):611–624. https://doi.org/10.1007/BF02948829
He L, Ray N, Guan Y et al (2019) Fast large-scale spectral clustering via explicit feature mapping. IEEE
Trans Cybern 49(3):1058–1071. https://doi.org/10.1109/TCYB.2018.2794998
Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Raedt LD, Wrobel S (eds)
Machine learning, proceedings of the twenty-second international conference (ICML 2005), Bonn,
August 7–11, 2005, ACM international conference proceeding series, vol 119. ACM, pp 297–304.
https://doi.org/10.1145/1102351.1102389
Huang D, Wang C, Wu J et al (2020) Ultra-scalable spectral clustering and ensemble clustering. IEEE
Trans Knowl Data Eng 32(6):1212–1226. https://doi.org/10.1109/TKDE.2019.2903410
Hubert L (1973) Monotone invariant clustering procedures. Psychometrika 38(1):47–62
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Hulot A, Chiquet J, Jaffrézic F et al (2020) Fast tree aggregation for consensus hierarchical clustering.
BMC Bioinform 21(1):120. https://doi.org/10.1186/s12859-020-3453-6
Ishizaka A, Lokman B, Tasiou M (2020) A stochastic multi-criteria divisive hierarchical clustering algo-
rithm. Omega 103:102370
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Hoboken
Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal
Mach Intell 22(1):4–37. https://doi.org/10.1109/34.824819
Jalalat-evakilkandi M, Mirzaei A (2010) A new hierarchical-clustering combination scheme based on
scatter matrices and nearest neighbor criterion. In: 2010 5th international symposium on telecom-
munications, IEEE, pp 904–908. https://doi.org/10.1109/ISTEL.2010.5734151
Jambu M, Tan SH, Stern D (1989) Exploration informatique et statistique des données. Dunod, Paris
Jeantet I, Miklós Z, Gross-Amblard D (2020) Overlapping hierarchical clustering (OHC). In: Advances
in intelligent data analysis XVIII—18th international symposium on intelligent data analysis, IDA
2020, Konstanz, April 27–29, 2020, proceedings, lecture notes in computer science, vol 12080.
Springer, pp 261–273. https://doi.org/10.1007/978-3-030-44584-3_21
Jeon Y, Yoon S (2015) Multi-threaded hierarchical clustering by parallel nearest-neighbor chaining.
IEEE Trans Parallel Distrib Syst 26(9):2534–2548. https://doi.org/10.1109/TPDS.2014.2355205
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32(3):241–254
Judd D, McKinley PK, Jain AK (1998) Large-scale parallel data clustering. IEEE Trans Pattern Anal
Mach Intell 20(8):871–876. https://doi.org/10.1109/34.709614
Karypis G, Aggarwal R, Kumar V et al (1999a) Multilevel hypergraph partitioning: applications in VLSI
domain. IEEE Trans Very Large Scale Integr Syst 7(1):69–79. https://doi.org/10.1109/92.748202
Karypis G, Han E, Kumar V (1999b) Chameleon: hierarchical clustering using dynamic modeling. Com-
puter 32(8):68–75. https://doi.org/10.1109/2.781637
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hobo-
ken. https://doi.org/10.1002/9780470316801
Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344.
Wiley, Hoboken
Kaur PJ et al (2015) Cluster quality based performance evaluation of hierarchical clustering method. In:
2015 1st international conference on next generation computing technologies (NGCT). IEEE, pp
649–653
Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J
49(2):291–307. https://doi.org/10.1002/j.1538-7305.1970.tb01770.x
Kobren A, Monath N, Krishnamurthy A et al (2017) A hierarchical algorithm for extreme clustering. In:
Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data
mining, Halifax, NS, August 13–17, 2017. ACM, pp 255–264. https://doi.org/10.1145/3097983.
3098079
Kohonen T (2001) Self-organizing maps, third edition. Springer series in information sciences. Springer,
Cham. https://doi.org/10.1007/978-3-642-56927-2
Kotsiantis S, Pintelas P (2004) Recent advances in clustering: a brief survey. WSEAS Trans Inf Sci Appl
1(1):73–81
Kumar P, Tripathy B (2009) MMeR: an algorithm for clustering heterogeneous data using rough set theory.
Int J Rapid Manuf 1(2):189–207
Kumar T, Vaidyanathan S, Ananthapadmanabhan H et al (2018) Hypergraph clustering: a modularity maxi-
mization approach. CoRR. arXiv:abs/1812.10869
Lance GN, Williams WT (1967) A general theory of classificatory sorting strategies: 1. Hierarchical sys-
tems. Comput J 9(4):373–380. https://doi.org/10.1093/comjnl/9.4.373
Lerato L, Niesler T (2015) Clustering acoustic segments using multi-stage agglomerative hierarchical clus-
tering. PLoS ONE 10(10):e0141756
Lewis-Beck M, Bryman AE, Liao TF (2003) The Sage encyclopedia of social science research methods.
Sage Publications, Thousand Oaks
Li M, Deng S, Wang L et al (2014) Hierarchical clustering algorithm for categorical data using a probabilis-
tic rough set model. Knowl Based Syst 65:60–71. https://doi.org/10.1016/j.knosys.2014.04.008
Li S, Li W, Qiu J (2017) A novel divisive hierarchical clustering algorithm for geospatial analysis. ISPRS
Int J Geo-Inf 6(1):30. https://doi.org/10.3390/ijgi6010030
Li Y, Hong Z, Feng W et al (2019) A hierarchical clustering based feature word extraction method. In: 2019
IEEE 3rd advanced information management, communicates, electronic and automation control con-
ference (IMCEC). IEEE, pp 883–887
Lin Y, Dong X, Zheng L et al (2019) A bottom-up clustering approach to unsupervised person re-identifica-
tion. In: The thirty-third AAAI conference on artificial intelligence, AAAI 2019, the thirty-first inno-
vative applications of artificial intelligence conference, IAAI 2019, the ninth AAAI symposium on
educational advances in artificial intelligence, EAAI 2019, Honolulu, Hawaii, January 27–February 1,
2019. AAAI Press, pp 8738–8745. https://doi.org/10.1609/aaai.v33i01.33018738
Liu H, Latecki LJ, Yan S (2015) Dense subgraph partition of positive hypergraphs. IEEE Trans Pattern Anal
Mach Intell 37(3):541–554. https://doi.org/10.1109/TPAMI.2014.2346173
Liu J, Liu X, Yang Y et al (2021) Hierarchical multiple kernel clustering. In: Thirty-fifth AAAI conference
on artificial intelligence. AAAI, pp 2–9
Lu Y, Wan Y (2013) PHA: a fast potential-based hierarchical agglomerative clustering method. Pattern Rec-
ognit 46(5):1227–1239. https://doi.org/10.1016/j.patcog.2012.11.017
Lu Y, Hou X, Chen X (2016) A novel travel-time based similarity measure for hierarchical clustering. Neu-
rocomputing 173:3–8. https://doi.org/10.1016/j.neucom.2015.01.090
Ma X, Dhavala S (2018) Hierarchical clustering with prior knowledge. CoRR. arXiv:abs/1806.03432
Qin H, Ma X, Herawan T et al (2014) MGR: an information theory based hierarchical divisive clustering
algorithm for categorical data. Knowl Based Syst 67:401–411. https://doi.org/10.1016/j.knosys.2014.
03.013
Rabin J, Ferradans S, Papadakis N (2014) Adaptive color transfer with relaxed optimal transport. In: 2014
IEEE international conference on image processing, ICIP 2014, Paris, October 27–30, 2014. IEEE, pp
4852–4856. https://doi.org/10.1109/ICIP.2014.7025983
Rahman MA, Rahman MM, Mollah MNH et al (2018) Robust hierarchical clustering for metabolomics
data analysis in presence of cell-wise and case-wise outliers. In: 2018 international conference on
computer, communication, chemical, material and electronic engineering (IC4ME2). IEEE, pp 1–4.
https://doi.org/10.1109/IC4ME2.2018.8465616
Reddy CK, Vinzamuri B (2013) A survey of partitional and hierarchical clustering algorithms. In: Aggar-
wal CC, Reddy CK (eds) Data clustering: algorithms and applications. CRC Press, Boca Raton, pp
87–110
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev
26(2):195–239
Rocha C, Dias LC (2013) MPOC: an agglomerative algorithm for multicriteria partially ordered clustering.
4OR 11(3):253–273. https://doi.org/10.1007/s10288-013-0228-1
Ros F, Guillaume S (2019) A hierarchical clustering algorithm and an improvement of the single linkage
criterion to deal with noise. Expert Syst Appl 128:96–108
Ros F, Guillaume S, Hajji ME et al (2020) KdMutual: a novel clustering algorithm combining mutual neigh-
boring and hierarchical approaches using a new selection criterion. Knowl Based Syst 204:106220.
https://doi.org/10.1016/j.knosys.2020.106220
Roux M (2018) A comparative study of divisive and agglomerative hierarchical clustering algorithms. J
Classif 35(2):345–366. https://doi.org/10.1007/s00357-018-9259-9
Sabarish B, Karthi R, Kumar TG (2020) Graph similarity-based hierarchical clustering of trajectory data.
Procedia Comput Sci 171:32–41. https://doi.org/10.1016/j.procs.2020.04.004
Sahoo N, Callan J, Krishnan R et al (2006) Incremental hierarchical clustering of text documents. In: Pro-
ceedings of the 2006 ACM CIKM international conference on information and knowledge manage-
ment, Arlington, VA, November 6–11, 2006. ACM, pp 357–366. https://doi.org/10.1145/1183614.
1183667
Salton G (1975) A vector space model for information retrieval. J ASIS 18(11):613–620
Sander J, Qin X, Lu Z et al (2003) Automatic extraction of clusters from hierarchical clustering representa-
tions. In: Whang KY, Jeon J, Shim K et al (eds) Advances in knowledge discovery and data mining.
Springer, Berlin, Heidelberg, pp 75–87
Saunders A, Ashlock DA, Houghten SK (2018) Hierarchical clustering and tree stability. In: 2018 IEEE
conference on computational intelligence in bioinformatics and computational biology, CIBCB 2018,
Saint Louis, MO, May 30–June 2, 2018. IEEE, pp 1–8. https://doi.org/10.1109/CIBCB.2018.8404978
Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression
analysis. In: Proceedings of the eighth international conference on intelligent systems for molecular
biology, August 19–23, 2000, La Jolla/San Diego, CA. AAAI, pp 307–316
Sharma S, Batra N et al (2019) Comparative study of single linkage, complete linkage, and Ward method of
agglomerative clustering. In: 2019 international conference on machine learning, big data, cloud and
parallel computing (COMITCon). IEEE, pp 568–573
Shimizu T, Sakurai K (2018) Comprehensive data tree by actor messaging for incremental hierarchical
clustering. In: 2018 IEEE 42nd annual computer software and applications conference, COMPSAC
2018, Tokyo, 23–27 July 2018, vol 1. IEEE Computer Society, pp 801–802. https://doi.org/10.1109/
COMPSAC.2018.00127
Sisodia D, Singh L, Sisodia S et al (2012) Clustering techniques: a brief survey of different clustering algo-
rithms. Int J Latest Trends Eng Technol 1(3):82–87
Sneath PH, Sokal RR (1975) Numerical taxonomy: the principles and practice of numerical classification. W. H. Freeman, San Francisco
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. Technical report, University of Minnesota. https://hdl.handle.net/11299/215421
Székely GJ, Rizzo ML (2005) Hierarchical clustering via joint between-within distances: extending Ward’s
minimum variance method. J Classif 22(2):151–183. https://doi.org/10.1007/s00357-005-0012-9
Takumi S, Miyamoto S (2012) Top-down vs bottom-up methods of linkage for asymmetric agglomerative
hierarchical clustering. In: 2012 IEEE international conference on granular computing, GrC 2012,
Hangzhou, August 11–13, 2012. IEEE Computer Society, pp 459–464. https://doi.org/10.1109/GrC.
2012.6468689
Tan P, Steinbach M, Karpatne A et al (2019) Introduction to data mining, 2nd edn. Pearson, Harlow
Toujani R, Akaichi J (2018) GHHP: genetic hybrid hierarchical partitioning for community structure in
social medias networks. In: 2018 IEEE smartWorld, ubiquitous intelligence & computing, advanced
& trusted computing, scalable computing & communications, cloud & big data computing, internet
of people and smart city innovation, SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI 2018,
Guangzhou, October 8–12, 2018. IEEE, pp 1146–1153. https://doi.org/10.1109/SmartWorld.2018.
00199
Tripathy B, Ghosh A (2011a) SDR: an algorithm for clustering categorical data using rough set theory. In:
2011 IEEE recent advances in intelligent computational systems. IEEE, pp 867–872
Tripathy B, Ghosh A (2011b) SSDR: an algorithm for clustering categorical data using rough set theory.
Adv Appl Sci Res 2(3):314–326
Tripathy B, Goyal A, Chowdhury R et al (2017) MMeMeR: an algorithm for clustering heterogeneous
data using rough set theory. Int J Intell Syst Appl 9(8):25
Tsekouras G, Kotoulas P, Tsirekis C et al (2008) A pattern recognition methodology for evalua-
tion of load profiles and typical days of large electricity customers. Electr Power Syst Res
78(9):1494–1510
Turi R (2001) Clustering-based colour image segmentation. PhD Thesis, Monash University
Varshney AK, Muhuri PK, Lohani QMD (2022) PIFHC: the probabilistic intuitionistic fuzzy hierarchical
clustering algorithm. Appl Soft Comput 120:108584. https://doi.org/10.1016/j.asoc.2022.108584
Veldt N, Benson AR, Kleinberg JM (2020) Localized flow-based clustering in hypergraphs. CoRR abs/2002.09441
Vidal E, Granitto PM, Bayá A (2014) Discussing a new divisive hierarchical clustering algorithm. In:
XLIII Jornadas Argentinas de Informática e Investigación Operativa (43JAIIO)-XV Argentine
symposium on artificial intelligence (ASAI), Buenos Aires, 2014
von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416. https://doi.org/10.
1007/s11222-007-9033-z
Wang T, Lu Y, Han Y (2017) Clustering of high dimensional handwritten data by an improved hyper-
graph partition method. In: Intelligent computing methodologies—13th international conference,
ICIC 2017, Liverpool, August 7–10, 2017, proceedings, part III, lecture notes in computer sci-
ence, vol 10363. Springer, pp 323–334. https://doi.org/10.1007/978-3-319-63315-2_28
Wishart D (1969) An algorithm for hierarchical classifications. Biometrics 25:165–170
Xi Y, Lu Y (2020) Multi-stage hierarchical clustering method based on hypergraph. In: Intelligent com-
puting methodologies—16th international conference, ICIC 2020, Bari, October 2–5, 2020, pro-
ceedings, part III, lecture notes in computer science, vol 12465. Springer, pp 432–443. https://doi.
org/10.1007/978-3-030-60796-8_37
Xiong T, Wang S, Mayers A et al (2012) DHCC: divisive hierarchical clustering of categorical data.
Data Min Knowl Discov 24(1):103–135. https://doi.org/10.1007/s10618-011-0221-2
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193
Xu R, Wunsch DC II (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678.
https://doi.org/10.1109/TNN.2005.845141
Yamada Y, Masuyama N, Amako N et al (2020) Divisive hierarchical clustering based on adaptive reso-
nance theory. In: International symposium on community-centric systems, CcS 2020, Hachioji,
Tokyo, September 23–26, 2020. IEEE, pp 1–6. https://doi.org/10.1109/CcS49175.2020.9231474
Yang J, Grunsky E, Cheng Q (2019) A novel hierarchical clustering analysis method based on Kullback–
Leibler divergence and application on Dalaimiao geochemical exploration data. Comput Geosci
123:10–19. https://doi.org/10.1016/j.cageo.2018.11.003
Yu F, Dong K, Chen F et al (2007) Clustering time series with granular dynamic time warping method.
In: 2007 IEEE international conference on granular computing, GrC 2007, San Jose, CA, 2–4
November 2007. IEEE Computer Society, pp 393–398. https://doi.org/10.1109/GrC.2007.34
Yu M, Hillebrand A, Tewarie P et al (2015) Hierarchical clustering in minimum spanning trees. Chaos
25(2):023107
Zahn CT (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans
Comput 20(1):68–86. https://doi.org/10.1109/T-C.1971.223083
Zeng J, Gong L, Wang Q et al (2009) Hierarchical clustering for topic analysis based on variable fea-
ture selection. In: 2009 sixth international conference on fuzzy systems and knowledge discovery.
IEEE, pp 477–481
Zeng K, Ning M, Wang Y et al (2020) Hierarchical clustering with hard-batch triplet loss for person re-
identification. In: 2020 IEEE/CVF conference on computer vision and pattern recognition, CVPR
2020, Seattle, WA, June 13–19, 2020. IEEE, pp 13654–13662. https://doi.org/10.1109/CVPR4
2600.2020.01367
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large data-
bases. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data,
Montreal, QC, June 4–6, 1996. ACM Press, pp 103–114. https://doi.org/10.1145/233269.233324
Zhang W, Wang X, Zhao D et al (2012) Graph degree linkage: agglomerative clustering on a directed
graph. In: Computer vision—ECCV 2012—12th European conference on computer vision, Flor-
ence, October 7–13, 2012, proceedings, part I, lecture notes in computer science, vol 7572.
Springer, pp 428–441. https://doi.org/10.1007/978-3-642-33718-5_31
Zhang W, Zhao D, Wang X (2013) Agglomerative clustering via maximum incremental path integral.
Pattern Recognit 46(11):3056–3065. https://doi.org/10.1016/j.patcog.2013.04.013
Zhao D, Tang X (2008) Cyclizing clusters via zeta function of a graph. In: Advances in neural information processing systems 21, proceedings of the twenty-second annual conference on neural information processing systems, Vancouver, BC, December 8–11, 2008. Curran Associates, Inc., pp 1953–1960
Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: Third international conference on knowledge discovery and data mining, WKDD 2010, Phuket, 9–10 January 2010. IEEE Computer Society, pp 195–199. https://doi.org/10.1109/WKDD.2010.123
Zhao W, Li B, Gu Q et al (2020) Improved hierarchical clustering with non-locally enhanced features for unsupervised person re-identification. In: 2020 international joint conference on neural networks, IJCNN 2020, Glasgow, July 19–24, 2020. IEEE, pp 1–8. https://doi.org/10.1109/IJCNN48605.2020.9206722
Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the 2002 ACM CIKM international conference on information and knowledge management, McLean, VA, November 4–9, 2002. ACM, pp 515–524. https://doi.org/10.1145/584792.584877
Zhao Y, Karypis G, Fayyad UM (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168. https://doi.org/10.1007/s10618-005-0361-3
Zhou D, Huang J, Schölkopf B (2006) Learning with hypergraphs: clustering, classification, and embed-
ding. In: Advances in neural information processing systems 19, proceedings of the twentieth annual
conference on neural information processing systems, Vancouver, BC, December 4–7, 2006. MIT
Press, pp 1601–1608
Zhou R, Zhang Y, Feng S et al (2018) A novel hierarchical clustering algorithm based on density peaks for
complex datasets. Complexity 2018:2032461. https://doi.org/10.1155/2018/2032461
Zhu Y, Ting KM, Jin Y et al (2022) Hierarchical clustering that takes advantage of both density-peak and
density-connectivity. Inf Syst 103:101871. https://doi.org/10.1016/j.is.2021.101871
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.