
Artificial Intelligence Review (2023) 56:8219–8264

https://doi.org/10.1007/s10462-022-10366-3

Comprehensive survey on hierarchical clustering algorithms and the recent developments

Xingcheng Ran1,2 · Yue Xi1 · Yonggang Lu1 · Xiangwen Wang1 · Zhenyu Lu1

Accepted: 5 December 2022 / Published online: 26 December 2022


© The Author(s), under exclusive licence to Springer Nature B.V. 2022

Abstract
Data clustering is a commonly used data processing technique in many fields, which
divides objects into different clusters in terms of some similarity measure between data
points. Compared to partitioning clustering methods, which give a flat partition of the data, hierarchical clustering methods can give multiple consistent partitions of the same data at different levels without rerunning the clustering, so they can be used to better analyze
the complex structure of the data. There are usually two kinds of hierarchical clustering
methods: divisive and agglomerative. For the divisive clustering, the key issue is how to
select a cluster for the next splitting procedure according to dissimilarity and how to divide
the selected cluster. For agglomerative hierarchical clustering, the key issue is the similar-
ity measure that is used to select the two most similar clusters for the next merge. Although
both types of the methods produce the dendrogram of the data as output, the clustering
results may be very different depending on the dissimilarity or similarity measure used
in the clustering, and different types of methods should be selected according to different
types of the data and different application scenarios. So, we have reviewed various hierar-
chical clustering methods comprehensively, especially the most recently developed meth-
ods, in this work. Since the similarity measure plays a crucial role in the hierarchical clustering process, we have also reviewed different types of similarity measures along with the hierar-
chical clustering. More specifically, different types of hierarchical clustering methods are
comprehensively reviewed from six aspects, and their advantages and drawbacks are ana-
lyzed. The application of some methods in real life is also discussed. Furthermore, we have
also included some recent works in combining deep learning techniques and hierarchical
clustering, which is worth serious attention and may improve the hierarchical clustering
significantly in the future.

Keywords Hierarchical clustering · Divisive · Agglomerative · Dissimilarity · Similarity

* Yonggang Lu
[email protected]
Xingcheng Ran
[email protected]
1 School of Information Science and Engineering, Lanzhou University, No. 222 South Tianshui Road, Lanzhou 730000, Gansu, China
2 Center of Information Technology, Hexi University, No. 846 North Ring Road, Zhangye 734000, Gansu, China


1 Introduction

Data clustering, also known as clustering analysis, is the process of dividing objects into
sets (clusters), such that objects from the same set are similar, while the objects from dif-
ferent sets are dissimilar to each other. It is widely used in data mining (Judd et al. 1998),
machine learning, pattern recognition (Carpineto and Romano 1996), and many other
fields. Data clustering is a difficult unsupervised learning task because the data labels are not
available during learning, and different clusters may have different shapes and sizes (Jain
et al. 2000).
There are two different types of clustering techniques: partitional clustering and hier-
archical clustering (Frigui and Krishnapuram 1999). Partitional clustering divides a set of
data (or objects) into a specified number of non-overlapping clusters, so that each element belongs to exactly one cluster (Jain and Dubes 1988). Most partitional algorithms, such
as K-means (Forgy 1965) and EM (Redner and Walker 1984), need to specify the number of
clusters K beforehand, but K is usually unknown in practice (Omran et al. 2007). Hierar-
chical clustering obtains a set of nested clusters that are organized as a cluster tree called
dendrogram. Each leaf node of the tree usually contains only one object, while each inner
node represents the union of its children nodes (Tan et al. 2019). From different levels of
the dendrogram, consistent clusters can be obtained at different granularities. So, hierarchi-
cal clustering can be used to detect and analyze the complex structures of the data. Hierar-
chical clustering method can be divided into two types: divisive and agglomerative (Everitt
et al. 2001).
In this paper, the hierarchical clustering methods, including the recently developed
methods, are introduced in the following order:

(1) Divisive hierarchical clustering methods: The clustering starts with one cluster contain-
ing all objects, and then recursively divides it (or each sub-cluster) into two sub-clusters
according to the dissimilarity measure until each sub-cluster includes a single object
(Turi 2001).
(2) Agglomerative hierarchical clustering methods: The clustering begins with each cluster
containing one object and recursively merges the two most similar clusters in terms of
the similarity measure until all objects are included in a single cluster (Turi 2001).
(3) Graph-based hierarchical clustering methods: In some applications, the relationship
between objects can be better represented by a graph or hypergraph, so the task of hierarchical clustering can be converted into the construction and the partitioning of the
graph or hypergraph (von Luxburg 2007).
(4) Density-based hierarchical clustering methods: The methods that construct the hierarchy based on density estimation have aroused the interest of researchers. This type of method not only retains the advantages of density-based clustering, but also reveals the hierarchical structure among clusters, so more information can be obtained from the clustering result.
(5) Combination of the hierarchical clustering methods: Some hierarchical clustering meth-
ods are improved by combining the agglomerative and divisive hierarchical clustering
techniques, or by combining the clustering method and other advanced techniques.
(6) Improving the efficiency of the hierarchical clustering methods: Hierarchical clustering
usually incurs a high computational cost (Reddy and Vinzamuri 2013). So, improving
the efficiency of the hierarchical clustering methods is important for analyzing large
scale datasets.


Finally, the developing trend of hierarchical clustering is analyzed based on the above
survey.
The rest of this article is organized as follows. In Sect. 2, the generic algorithm scheme
and the distance-based similarities commonly used in hierarchical clustering are intro-
duced. In Sects. 3 and 4, the divisive and agglomerative hierarchical clustering methods are
surveyed respectively. Section 5 introduces the graph based hierarchical clustering methods
including simple graph based and hypergraph based. Section 6 surveys the density based
hierarchical clustering methods. An overview on the combination of the hierarchical clus-
tering methods is given in Sect. 7. The methods of improving the efficiency of the hierar-
chical clustering methods are surveyed in Sect. 8. Section 9 concludes this article with a
summary of the contributions and the direction for future work.

2 Hierarchical clustering

The basic idea of hierarchical clustering algorithms is to construct the hierarchical relation-
ship among data according to a dissimilarity/similarity measure between clusters (Johnson
1967). Hierarchical clustering methods are divided into two types: divisive and agglomerative (Everitt et al. 2001), as shown in Fig. 1.

Fig. 1  Two types of hierarchical clustering and flowchart


The divisive hierarchical clustering method first sets all data points into one initial clus-
ter, then divides the initial cluster into several sub-clusters, and iteratively partitions these
sub-clusters into smaller ones until each cluster contains only one data point or data points
within each cluster are similar enough (Turi 2001). The left branch in Fig. 1 demonstrates
the procedure in detail. When selecting a cluster to bisect, the dissimilarities need to be computed according to a certain criterion. The dissimilarity is a key function that strongly influences the subsequent operations and the resulting clusters.

Definition 1 The dissimilarity D of a dataset X, X = {x1 , ⋯ , xn }, is defined as a function that needs to meet the following conditions (Xu and Wunsch II 2005; Murtagh and Contreras 2012, 2017):

(1) Symmetric, D(xi , xj ) = D(xj , xi );


(2) Positivity, D(xi , xj ) ≥ 0 for all xi and xj;
(3) Triangle inequality, D(xi , xj ) ≤ D(xi , xk ) + D(xk , xj ) for all xi , xk and xj;
(4) D(xi , xj ) = 0 iff xi = xj.

Contrary to divisive clustering, the agglomerative hierarchical clustering begins with each
cluster containing only one data point, and then iteratively merges them into larger clusters
until all data points are in one cluster or some conditions are satisfied (Turi 2001). The
right branch in Fig. 1 demonstrates the procedure in detail. When selecting two clusters to merge, the similarities need to be computed according to a certain criterion. How the similarity is calculated exerts a strong influence on the subsequent operations and the resulting
clusters.
Definition 2 The similarity S of the data X is defined as a function that meets the following conditions (Xu and Wunsch II 2005; Murtagh and Contreras 2012, 2017):

(1) Symmetric, S(xi , xj ) = S(xj , xi );


(2) Positivity, 0 ≤ S(xi , xj ) ≤ 1 for all xi and xj;
(3) S(xi , xj )S(xj , xk ) ≤ [S(xi , xj ) + S(xj , xk )]S(xi , xk ) for all xi , xk and xj;
(4) S(xi , xj ) = 1 iff xi = xj.

For N data points of X, the dissimilarity/similarity matrix can be defined as an N × N matrix, whose entry D(i, j) or S(i, j) represents the dissimilarity/similarity measure between data
point i and j, i and j = 1, ⋯ , N . The selection of different measures is problem dependent.
For binary features, Dij = 1 − Sij (Xu and Wunsch II 2005).
Definition 3 A dendrogram is a tree which splits the data set recursively into smaller sub-
sets, and its structure satisfies the following condition (Ma and Dhavala 2018):

(1) 𝜃(0) = {{x1 }, ⋯ , {xn }};


(2) There exists t0 such that 𝜃(t) contains only single cluster for t ≥ t0;
(3) If r ≤ s, then 𝜃(r) refines 𝜃(s);
(4) For all r, there exists 𝜖 > 0 such that 𝜃(r) = 𝜃(t) for t ∈ [r, r + 𝜖].

The four conditions above formulate the structure of clustering tree, i.e. dendrogram.
Condition 1 ensures that each object forms a cluster initially. Condition 2 states that when t is large enough, all objects are contained in a single cluster. Condition 3 guarantees the nested structure of the dendrogram. The last condition shows that the partition is stable under a small


perturbation 𝜖 . 𝜃 is a scale parameter of dendrogram and reflects the height of different lev-
els. An example of dendrogram is shown in Fig. 2 (Sisodia et al. 2012).
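To make the idea of extracting consistent partitions from one dendrogram concrete, the following minimal Python sketch builds a dendrogram once and cuts it at two different levels; the dataset and the linkage scheme are illustrative choices, not taken from the cited works.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.datasets import make_blobs

    # Illustrative data; any numerical dataset works here.
    X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

    # Build the dendrogram once; each row of Z records one agglomerative merge.
    Z = linkage(X, method="average")

    # Cutting the same dendrogram at different levels yields nested, consistent partitions.
    coarse = fcluster(Z, t=2, criterion="maxclust")   # 2 clusters
    fine = fcluster(Z, t=6, criterion="maxclust")     # 6 clusters, a refinement of the coarse one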
Both divisive and agglomerative hierarchical clustering generate a dendrogram of the relationships between data points and terminate quickly; other advantages are as follows (Kotsiantis and Pintelas 2004; Berkhin 2006):

• Don’t need to specify the number of clusters in advance.


• The complete hierarchy of clusters can be obtained.
• Clustering results can be visualized.
• Easy to handle any form of similarity or distance.
• Suitable for clusters of any data type and arbitrary shape.
• A flat partition can be obtained at different granularities by cutting on different levels of
the dendrogram.

However, there are several disadvantages of the hierarchical clustering (Kotsiantis and Pin-
telas 2004; Berkhin 2006):

• Once a merging or division is done on one level of the hierarchy, it cannot be undone
later.
• It is computationally expensive in time and memory, especially for large scale prob-
lems. Generally, the time complexity of hierarchical clustering is quadratic about the
number of data points clustered.
• Termination criteria are ambiguous.

So, various hierarchical clustering algorithms have been designed, based on keeping the above advantages or reducing the influence of the above disadvantages.
Fig. 2  An example of dendrogram

During splitting in the divisive hierarchical clustering methods or merging in the agglomerative hierarchical clustering methods, the dissimilarity or similarity measure will affect the resulting clusters directly and is usually determined according to the application and the feature attributes of the data to be processed. The dissimilarity/similarity measures based on distance are most commonly used, such as the Euclidean distance, Minkowski distance, Cosine distance, City-block distance, and Mahalanobis distance. The details of those distance measures are shown in Table 1.
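As a small illustration, the distance measures in Table 1 can be computed directly with NumPy/SciPy; the vectors and the covariance matrix below are only toy values, and note that SciPy's helpers differ slightly from the table's formulas (SciPy returns 1 − cos 𝛼 for the cosine and the square-rooted Mahalanobis form).

    import numpy as np
    from scipy.spatial import distance

    z_u = np.array([1.0, 2.0, 3.0])
    z_v = np.array([2.0, 0.0, 4.0])

    d_euclidean = distance.euclidean(z_u, z_v)            # Euclidean distance
    d_minkowski = distance.minkowski(z_u, z_v, p=3)       # Minkowski distance with alpha = 3
    d_cityblock = distance.cityblock(z_u, z_v)            # City-block (Manhattan) distance
    cos_alpha = z_u @ z_v / (np.linalg.norm(z_u) * np.linalg.norm(z_v))  # the table's cosine measure

    # The Mahalanobis distance needs a within-group covariance matrix S (toy estimate here).
    sample = np.random.default_rng(0).normal(size=(100, 3))
    S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    d_mahalanobis = distance.mahalanobis(z_u, z_v, S_inv)  # SciPy returns the square-rooted form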

3 Divisive hierarchical clustering methods and the applications

Divisive hierarchical clustering starts from a cluster (root) containing all objects; a cluster with the largest dissimilarity is iteratively selected to be divided into two sub-clusters forming its children nodes, until each cluster includes only one object. The dissimilarity is usu-
ally computed in terms of the different types of distance measure. A lot of divisive meth-
ods have been proposed according to the above dissimilarity measure or newly designed
dissimilarity.

3.1 Methods based on division steps

A divisive hierarchical clustering can usually be divided into three steps: computing the
dissimilarity, splitting the selected cluster, and determining the node level of the new
clusters (Roux 2018). For the divisive method, the node levels in the dendrogram cannot
be naturally obtained like in the agglomerative method, so a special method needs to be
designed to determine the node levels.

3.1.1 Methods of computing dissimilarity

During the evaluation of sub-partitions, a criterion needs to be defined to select the


subpartition for the next splitting procedure. Bisecting k-Means is a divisive hierarchi-
cal clustering method (Steinbach et al. 2000). The method first treats all data points
as a cluster, and then uses k-means (k = 2) to iteratively divide the cluster into two
sub-clusters, and the sub-cluster with the Maximum Sum of Squares Error (SSE) is
selected for the next division. This process is repeated until certain termination condi-
tions are satisfied. This method has linear time complexity with respect to the number of documents because of the bisecting k-Means and the global similarity measure, and it can produce clusters of consistently good quality.

Table 1  The distance measures commonly used as dissimilarity/similarity measures

Euclidean distance: d(z_u, z_v) = ( Σ_{j=1}^{N_d} (z_{u,j} − z_{v,j})² )^{1/2} = ‖z_u − z_v‖
Minkowski distance: d_𝛼(z_u, z_v) = ( Σ_{j=1}^{N_d} |z_{u,j} − z_{v,j}|^𝛼 )^{1/𝛼} = ‖z_u − z_v‖_𝛼
City-block distance: d(z_u, z_v) = Σ_{j=1}^{N_d} |z_{u,j} − z_{v,j}|
Cosine distance: d(z_u, z_v) = cos 𝛼 = ( Σ_{j=1}^{N_d} z_{u,j} z_{v,j} ) / (‖z_u‖ ‖z_v‖)
Mahalanobis distance: d(z_u, z_v) = (z_u − z_v) S⁻¹ (z_u − z_v)ᵀ, where S is the within-group covariance matrix


Bisecting k-Means is thus an excellent algorithm for clustering large-scale document data.
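The following is a minimal Python sketch of the bisecting strategy described above (selecting the sub-cluster with the maximum SSE and bisecting it with k-means); it is a simplified illustration built on scikit-learn and omits the termination criteria and refinements of Steinbach et al. (2000).

    import numpy as np
    from sklearn.cluster import KMeans

    def bisecting_kmeans(X, n_clusters):
        """Repeatedly bisect the cluster with the largest SSE until n_clusters remain."""
        clusters = [np.arange(len(X))]                      # start with one cluster of all points
        while len(clusters) < n_clusters:
            sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
            target = clusters.pop(int(np.argmax(sse)))      # cluster with the maximum SSE
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[target])
            clusters += [target[labels == 0], target[labels == 1]]
        return clusters                                     # list of index arrays, one per cluster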
Other methods that evaluate which sub-cluster should be split have been proposed. A modified definition of the SSE, called ΔSSE, is used:

ΔSSE(X, j) = ( |s_j| |s_{j+1}| / (|s_j| + |s_{j+1}|) ) ‖m̂_j − m̂_{j+1}‖²,

and the sub-cluster with the maximal ΔSSE will be divided in each step (Gracia and Binefa
2011). This method can achieve high accuracy and a low over-segmentation rate, is suitable for clustering large video sequences, and can quickly provide responses for real-time work; however, it requires the final number of clusters as prior knowledge (Gracia and Binefa 2011).
A divisive hierarchical multi-criteria clustering algorithm based on PROMETHEE
(Brans and Vincke 1985; Brans et al. 1986) has been proposed by Ishizaka et al. (2020).
The algorithm takes advantage of the Stochastic Multiobjective Acceptability Analysis
(SMAA) and cluster ensemble methods to avoid the uncertainty and imprecision of
clustering solution. The matrix of the degree of preference in the PROMETHEE is
used as similarity matrix in Ishizaka et al. (2020). When dividing a cluster, pairwise
actions with the highest preference degree in the similarity matrix are selected as the
centroids of two new sub-clusters. Then the alternatives are assigned to the cluster that
has the smallest preference degree to the centroid. The division is accomplished by
comparing the degree of the preference between the action to classify and the centroid.
The algorithm is more stable and robust, and can select the most proper number of
preference clusters for clustering US banks. SMAA generates a large number of solu-
tions by randomly changing the PROMETHEE parameters, and then uses ensemble
clustering to obtain consistent solutions.
To sum up, the dissimilarity measure in divisive hierarchical clustering is determined in terms of the type of dataset, the objective of the clustering task, the ease of implementation, and so forth; it is problem dependent.

3.1.2 Methods of splitting the selected cluster

In the splitting procedure, the main problem is how to divide a selected partition into
sub-partitions. Macnaughton-Smith et al. (1964) have proposed a divisive method by
defining a dissimilarity measure between any two data points or sub-clusters. The dis-
similarity between a data point or sub-cluster and a sub-cluster is the average of the
dissimilarities between this data point and the data points in the sub-cluster. In the
divisive process, the method first selects the farthest data point from the target cluster
as an initial seed. It then grows the seed group by adding data points that are closer to the seed than to the other data points in the current cluster. This dissimilarity measure
is suitable for measuring the dissimilarity of variables and attributes. Hubert (1973)
has further developed the idea by using a pair of data points as initial sub-clusters. He
chose the two most dissimilar data points, and then created two new sub-clusters based
on the distances from other data points to the two initial sub-clusters. Kaufman and
Rousseeuw (1990) reused this idea in program DIANA.
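A rough Python sketch of this splinter-group splitting step, in the spirit of Macnaughton-Smith et al. and DIANA, is given below; the dissimilarity and the stopping rule are simplified assumptions for illustration.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def splinter_split(X):
        """Split one cluster into a splinter group and the remaining group."""
        D = squareform(pdist(X))                       # pairwise dissimilarity matrix
        rest = list(range(len(X)))
        seed = int(np.argmax(D.mean(axis=1)))          # farthest point on average: initial seed
        splinter, rest = [seed], [i for i in rest if i != seed]
        moved = True
        while moved and len(rest) > 1:
            moved = False
            for i in list(rest):
                others = [j for j in rest if j != i]
                if not others:
                    break
                to_splinter = D[i, splinter].mean()
                to_rest = D[i, others].mean()
                if to_splinter < to_rest:              # closer to the splinter group: move it over
                    splinter.append(i)
                    rest.remove(i)
                    moved = True
        return splinter, rest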


3.1.3 Determining the node level

In the agglomerative hierarchical clustering, the level of nodes is determined automatically


by the value of the merging criterion; unfortunately, this becomes problematic in the divisive
hierarchical methods. A consistent rule is needed to determine the node levels and a true
tree representation (Roux 2018). Kaufman and Rousseeuw (2009) use the diameter of con-
tinuous clusters as the node level in their program DIANA. Therefore, the two subsets cre-
ated by splitting C are always associated with the lower (or equal) node levels. Another
way to determine the node level is that the level of node is judged according to the rank of
node created during splitting, from the root node n − 1 down to the bottom level 1 of the
last created node (Roux 2018). But this way may not be satisfying, because some small nodes with higher dissimilarity are split at an early stage and a high node level is assigned to their children nodes. Another way to use ranking is to renumber the nodes from bottom to top after all the splits are completed, but this is still difficult to implement (Roux 2018).

3.2 Other variations of the divisive methods

Many other theories have been used in divisive hierarchical clustering to handle the dif-
ferent kinds of problems, such as Spectral Clustering (von Luxburg 2007), density peaks
(Zhou et al. 2018), rough set theory (Parmar et al. 2007), adaptive resonance theory (Yam-
ada et al. 2020), etc., and some variations of the divisive methods have also been proposed
by combing the above theories.
Vidal et al. (2014) have put forward a divisive hierarchical clustering method that can
identify clusters with arbitrary shapes. The method divides a larger cluster into two smaller
sub-clusters using Spectral Clustering (von Luxburg 2007) associated to random walk.
The Laplacian matrix is defined as L = P⁻¹W, wherein P is the diagonal matrix of node degrees and W is the similarity matrix among nodes. The similarity, named RBF-PKNNG, is calculated by S_pknng(xi, xj, 𝜃) = exp(−d_pknng(xi, xj)² / (2⋅𝜎²)), wherein the PKNNG metric d_pknng(xi, xj) is the geodesic distance between points xi and xj; it is very effective for finding clusters with arbitrary shape in high-dimensional data. The method can find a good number of clusters automatically and works well under different scales of clusters because of using the structure of
the tree.
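For illustration, a minimal sketch of one bisection step with the random-walk matrix L = P⁻¹W is given below; here the similarity W is built with a plain RBF kernel on Euclidean distances instead of the PKNNG geodesic metric of Vidal et al. (2014), so it is only a simplified stand-in for their procedure.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def random_walk_bisect(X, sigma=1.0):
        """Split one cluster in two using the second eigenvector of L = P^{-1} W."""
        d = squareform(pdist(X))
        W = np.exp(-d ** 2 / (2.0 * sigma ** 2))        # RBF similarity (stand-in for RBF-PKNNG)
        P = np.diag(W.sum(axis=1))                      # diagonal degree matrix
        L = np.linalg.solve(P, W)                       # random-walk matrix P^{-1} W
        vals, vecs = np.linalg.eig(L)
        order = np.argsort(-vals.real)                  # eigenvalues in decreasing order
        v2 = vecs[:, order[1]].real                     # eigenvector of the 2nd largest eigenvalue
        return v2 > np.median(v2)                       # boolean split into two sub-clusters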
In the hierarchical clustering based on density peaks (HCDP) method (Zhou et al.
2018), the local k-nearest density is computed to construct a tree where the data points are
nodes and edges are generated between a point and its k-nearest neighbors with high den-
sity values. The root point of the tree is of maximal density, and children points have lower
density than its parent point. All the edges in the subtree are sorted in descent order and are
iteratively removed according to the specified number of children. The modified stability
is proposed to extract and evaluate the suitable clusters. Finally, a clustering tree contain-
ing all partitions is produced. By using k-NN when calculating local density, unnecessary
parameter selections are avoided. The new stability is

S(Ci) = Σ_{xj∈Ci} ( 1/𝛿exclude(xj, Ci) − 1/𝛿emerge(Ci) ),

where 𝛿exclude is the maximal weight of removing edges by excluding xj from the cluster Ci ,
𝛿emerge is the maximal weight by removing edges of which cluster Ci emerges.


Each time an edge is cut from the tree, the current number of clusters is determined according to whether the stability calculated before splitting is greater than the sum of the stabilities of the clusters after splitting.
Nasiriani et al. (2019) have proposed a divisive hierarchical clustering to discover pos-
sible discrimination upon selection of the high impact feature set. In the splitting phase,
a sub-cluster with high impact feature set, that is the set of all high impact attributes, is
selected to divide into two parts. The divisive process is recursively performed until
homogeneous clusters are found. This method can not only identify several instances of
discrimination, but also cluster discrimination data even though providing little relevant
information.
The min-min-roughness (MMR) is a divisive hierarchical clustering algorithm based
on rough set theory (Parmar et al. 2007). The mean roughness, min-roughness, and min-
min-roughness are proposed based on roughness that is calculated by lower approxima-
tion and upper approximation in information system and can represent how strongly an
object belongs to a cluster. The min-roughness (MR) is the minimum among mean rough-
ness values and min-min-roughness (MMR) is the minimum of the min-roughness. MMR
can reflect the closeness of a cluster, the smaller the MMR is, the crisper the cluster is. A
cluster is selected to split binarily in terms of its MMR. MMR is capable of better dealing with the uncertainty when clustering categorical attributes and only requires the number of desired clusters as input. Later, an improved method named MMeR modified MMR to handle not only categorical data but also numerical data (Kumar and Tripathy 2009). SDR (standard deviation roughness) is a further improvement of MMeR: the standard deviation of the equivalence class is used as the similarity in the splitting phase, and the total purity ratio, which is the sum over clusters of the ratio between the number of data points existing in both a cluster and its corresponding class and the number of data points in the dataset, is used to evaluate the results. SDR is capable of clustering heterogeneous data and also handling the uncertainty during clustering, and its efficiency and purity ratio are also higher than those of MMR and MMeR (Tripathy and Ghosh 2011a). The Standard Deviation of Standard Deviation Roughness (SSDR) algorithm further optimizes SDR (Tripathy and Ghosh 2011b); it keeps the advantages of the above methods and achieves the best efficiency and total purity ratio. Min-Mean-Mean-Roughness (MMeMeR) is another improvement of SDR, in which the mean mean roughness, i.e., the mean of the mean roughness when the same attribute is assigned different values, is the similarity measure, and its efficiency is the highest among MMR, MMeR, and SDR (Tripathy et al. 2017).
Li et al. (2014) have proposed an attribute selecting and node splitting method named
MTMDP for categorical data. The mean distribution precision (MDP) and cohesion degree
(CD) are proposed, and are used to select the partition attribute based on probabilistic
rough set theory and to determine which child node should be selected to further cluster,
respectively. The total mean distribution precision is the sum of the MDP of all attrib-
utes in the attribute set. MDP reflects the coupling between the equivalence classes, the
larger the MDP is, the smaller the coupling between equivalence classes is. The clustering
starts according to the attribute with the maximal total MDP. The child node with the minimal CD should be selected for bi-partition. In the paper of Qin et al. (2014), the mean gain ratio (MGR) is used to select the clustering attribute, and the entropy of clusters is used to determine which equivalence class of the selected attribute is output as a cluster. The value of MGR reflects how much the partition defined by an attribute shares information with the partitions defined by the rest of the attributes. The larger the MGR is, the closer the partition defined by the attribute is, and it should be retained. In the partition with the highest closeness, the
entropy between objects is the lowest. Therefore, the attribute with the largest MGR value


is selected as the clustering attribute to define the partitions and equivalent classes, the one
with the smallest entropy among the equivalent classes defined by the selected attribute is
output as a cluster and removed from the dataset. For each attribute in the dataset, the above operation is repeated until the termination criterion is met. MGR can complete the clustering of categorical data with or without specifying the number of clusters, has better clustering accuracy, stability, efficiency, and scalability, and can be applied to small- or large-scale datasets with imbalanced or balanced class distributions (Qin et al. 2014).
Hierarchical fast topological CIM-based ATR (HFTCA) is a divisive hierarchical clus-
tering algorithm based on adaptive resonance theory-based clustering (Yamada et al. 2020).
During clustering, a network is first generated from all the data points by FTCA; then the data points split by the nodes in the previous layer are divided by running FTCA independently with a new similarity threshold shared within the current layer, and this procedure is recursively performed until the desired layer is reached or the network is stable. HFTCA is able to produce a proper number of nodes and to independently manage the growth of the hierarchy structure based on the data distribution. However, its main shortcoming is that the similarity threshold has to be set to a different value in every layer.

3.3 Applications of the divisive hierarchical clustering methods

The divisive hierarchical clustering methods can be utilized to deal with the different types
of data, e.g. community data, document sets, categorical data, and geospatial datasets.
The Principal Direction Divisive Partitioning (PDDP) method (Boley 1998) is proposed
for clustering document sets. It uses Vector Space Model (VSM) (Salton 1975) to represent
documents. In the process of clustering, the method introduces Singular Value Decomposi-
tions (SVD) (Golub and Loan 1996), and calculates the principal direction of each cluster
of each level of the hierarchical clustering. Then it uses a hyperplane, that passes through
the origin and is perpendicular to the principal direction, to divide the cluster into two
parts. Girvan and Newman (2002) have proposed a divisive hierarchical clustering method
to detect community within the real network. The betweenness for all edges is defined, and
the final clusters is obtained by repeatedly removing the edge with the highest between-
ness and recomputing the betweenness affected by the removed edge. The authors applied the method to two real-world networks, a collaboration network of scientists and a food web of marine organisms, and the results were very much in line with expectations. Xiong
et al. (2012) regarded the clustering of categorical data as an optimization problem. In their DHCC algorithm based on multiple correspondence analysis (MCA) (Abdi and Valentin 2007), the initialization and the refinement of the splitting of clusters are both based on MCA. The Chi-square distance between a single object and a group of objects is exploited as the dissimilarity measure to compute the MCA on the indicator matrix. In the refinement phase, the objective is to improve the splitting quality by minimizing the sum of the Chi-square error (SCE) as the algorithm iteratively selects one cluster to split into two sub-clusters until no cluster can be split or the clustering quality cannot be further improved. DHCC is fully automated and parameterless, regard-
less of the data processing order. In addition, its time complexity is linear with the num-
ber of clustering objects and the total number of categorical values. DHCC is applied to
detect the natural structure of data on four real data sets, and high-quality clustering is


obtained. However, DHCC consumes more memory, cannot be optimized globally, and may be affected by abnormal objects in the data. Li et al. (2017) have proposed a
Cell-Dividing Hierarchical Clustering (CDHC) method to handle geospatial datasets. The
method uses the global spatial context to identify the noise points and multi-density points,
and uses a boundary retraction structure to split the dataset into two parts iteratively until
the terminating condition is met. CDHC is used to perform the spatial analysis of retail
agglomeration in business circles in Wuhan, and the analysis results reflect well the current
situation of commercial development in Wuhan.

4 Agglomerative hierarchical clustering methods

Contrary to agglomerative hierarchical clustering, divisive hierarchical clustering is more


difficult in computing the dissimilarity of the node to be split, and needs to determine the
node level of children nodes after division. Moreover, how to divide the selected node is
also a considerable problem. The most important issue of agglomerative hierarchical clus-
tering is how to calculate the similarity between nodes to be merged.

4.1 The agglomerative hierarchical clustering methods based on similarity measure

A lot of agglomerative hierarchical methods have been developed by designing different types of similarity measures, e.g., distance-based similarity measures, nearest-neighbor-based similarity measures, similarity measures based on the quality evaluation of the partition after merging, and other similarity measures.

4.1.1 The methods using distance‑based similarity measure

Many agglomerative clustering algorithms have been proposed in terms of the different
way that similarity measure is defined based on various distance measure. Single linkage
(Anderberg 1973) uses the shortest distance between data points from two different clus-
ters to define the similarity measure; it works well for the small dataset; complete linkage
(Sneath and Sokal 1975) uses the largest distance between data points from two different
clusters to define the similarity measure; group average linkage (Jain and Dubes 1988) uses
the average distance between data points from two different clusters to define the similar-
ity measure; centroid linkage (Lewis-Beck et al. 2003) uses the shortest distance between
the centroids from two different clusters to define the similarity measure. Ward’s method
(Murtagh and Legendre 2014) uses the smallest increase of the sum of square error when
two clusters are merged. The update of the similarity measure in all of the methods afore-
mentioned is conveniently formulated by Lance–Williams dissimilarity update formula
(Lance and Williams 1967), where the parameters can define the different agglomerative
criterion (Reddy and Vinzamuri 2013; Murtagh and Contreras 2012) and its generaliza-
tions (Wishart 1969; Batagelj 1988; Jambu et al. 1989; Székely and Rizzo 2005). The
Lance–Williams dissimilarity update formula is as follows:
d(i ∪ j, k) = 𝛼i d(i, k) + 𝛼j d(j, k) + 𝛽 d(i, j) + 𝛾 |d(i, k) − d(j, k)|,


where i and j are the objects or data points to be agglomerated into cluster i ∪ j , 𝛼i , 𝛼j , 𝛽 ,
and 𝛾 are the coefficients defining the agglomerative criterion. The coefficient values of dif-
ferent kinds of the methods surveyed above are listed in Table 2.
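As a quick illustration of how the formula works, the sketch below applies one Lance–Williams update; the coefficient values are those listed in Table 2, and with the single-linkage setting the update reduces to min(d(i,k), d(j,k)).

    def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
        """One Lance-Williams update: dissimilarity between the merged cluster i∪j and cluster k."""
        return alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)

    # Single linkage (alpha_i = alpha_j = 0.5, beta = 0, gamma = -0.5) gives min(d_ik, d_jk):
    print(lance_williams(2.0, 5.0, 3.0, 0.5, 0.5, 0.0, -0.5))   # -> 2.0
    # Complete linkage (gamma = +0.5) gives max(d_ik, d_jk):
    print(lance_williams(2.0, 5.0, 3.0, 0.5, 0.5, 0.0, 0.5))    # -> 5.0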
CURE (Guha et al. 1998) uses the shortest distance between the representative data
points from two different clusters to define the similarity measure.
Müllner (2011) has proposed a new agglomerative hierarchical clustering algorithm,
called Generic_Linkage, which is suitable for any distance update scheme and achieves
better performance for the “centroid” and “median” clustering schemes. It can handle
inversion of the dendrogram. A priority queue built from the list of nearest neighbors and minimal distances is generated in the Generic_Linkage algorithm, and a pair of globally closest nodes is found in each iteration. In addition, in the same environment, MST performs best when single linkage is used to compute the similarity, while NN-chain achieves the best performance with the complete, average, weighted, and Ward schemes (Müllner 2011).
Furthermore, it is also known that Rohlf’s algorithm MST-LINKAGE for single-linkage clustering and Murtagh’s algorithm NN-CHAIN-LINKAGE for the “complete”, “average”,
“weighted” and “Ward” scenarios are the most efficient among algorithms based on dis-
tance dissimilarities (Müllner 2011).
Gullo et al. (2008) have proposed a centroid-linkage-based agglomerative hierarchi-
cal algorithm for clustering uncertain objects. They define multivariate and univariate uncertain prototype distances between cluster prototypes based on the Bhattacharyya coefficient, and the means of these distances are used as the similarity measure.
Sharma et al. (2019) have compared the single linkage, complete linkage, and Ward
method on agglomerative hierarchical clustering, and they have concluded that the complete linkage and Ward methods can find a more accurate similarity than single linkage during merging.
In a new agglomerative hierarchical clustering algorithm called HC-OT, hierarchi-
cal clustering is combined with optimal transport (OT) (Rabin et al. 2014; Courty et al.
2017) for the first time, and an OT-based distance is also proposed as the similarity measure used in HC-OT; this new distance considers the intra-cluster distribution of the data
(Chakraborty et al. 2020).
Cho et al. have utilized a new agglomerative clustering to implement the feature match-
ing and deformable object matching (Cho et al. 2009). In the agglomerative clustering,
a comprehensive pairwise dissimilarity function is defined based on photometric dissimilarity, which is the Euclidean distance between objects, and geometric dissimilarity, which is the mean of the conditional geometric dissimilarities between two feature correspondences. But this dissimilarity cannot deal well with the matching of deformable objects, because the deformation of objects causes different compactness in the geometric dissimilarity.
Based on this dissimilarity, two improved dissimilarity measures, the kNN linkage model and the adaptive kNN linkage model (AP-link), are then proposed (Cho et al. 2009).

Table 2  The coefficient values in the Lance–Williams formula for different kinds of the agglomerative hierarchical methods

Single linkage: 𝛼i = 0.5, 𝛽 = 0, 𝛾 = −0.5
Complete linkage: 𝛼i = 0.5, 𝛽 = 0, 𝛾 = 0.5
Group average: 𝛼i = |i| / (|i| + |j|), 𝛽 = 0, 𝛾 = 0
Centroid linkage: 𝛼i = |i| / (|i| + |j|), 𝛽 = −|i||j| / (|i| + |j|)², 𝛾 = 0
Ward's: 𝛼i = (|i| + |k|) / (|i| + |j| + |k|), 𝛽 = −|k| / (|i| + |j| + |k|), 𝛾 = 0


The kNN linkage model enhances the robustness to outliers and the compactness owing to considering the k nearest linkages. The k is adaptively determined in AP-link in terms of a computed threshold,
so the straggling compactness is effectively avoided when the number of matching pairs between two images is larger than k. The proposed bottom-up clustering framework based on AP-link can effectively handle multi-class clustering tasks with many distracting outliers.
Galdino and Maciel (2019) have proposed hierarchical clustering for interval-valued
data using Width of Range Euclidean Distance, and they have found that different combi-
nations of representative interval-valued distance measures and linkage methods are best
for clusters of particular shapes, and different linkage algorithms can yield totally different results when used on the same dataset because of their specific properties. They use the width of the Range Euclidean Distance Matrix between pairs of objects to produce the dissimilarity matrix.
In summary, in this type of methods, the similarity is calculated directly or indirectly
based on certain distance measure according to the type of dataset. For the numerical
data, the similarity between objects is computed using distance measures directly, such as
Euclidean distance, Geodesic distance. For the data whose attributes can be converted into
numerical metrics, after converting, the similarity between objects can also be computed
based on distance measures, e.g., the Cosine distance. When merging, the two objects with the maximal similarity are selected to merge; the similarity matrix is then updated based on the merged clusters using the same distance measure and is used in the next merging.

4.1.2 The methods based on nearest neighbor similarity measure

The nearest neighbors of an object can reflect the affinity around it. A variety of methods
have been proposed based on nearest neighbor as similarity measure during agglomerative
clustering.
Nazari et al. have proposed an agglomerative hierarchical clustering algorithm, in which
the nearest neighbor of each data point is found to form a pair, and all the pairs sharing intersection points are repeatedly joined to form a cluster until the number of desired clusters is achieved; the intersection points between the nearest-neighbor pairs are interpreted as the similarity measure between two clusters (Nazari and Kang 2018). The method
can run with high efficiency, the time complexity is O(n2 ), and results in clusters with good
quality in most cases regardless of dimensionality and number of data points.
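A minimal Python sketch of this nearest-neighbor pairing idea (joining every point to its single nearest neighbor and taking connected components as the first clusters; details such as the stopping rule are simplified here) is:

    import numpy as np
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import connected_components
    from sklearn.neighbors import NearestNeighbors

    def nearest_neighbor_pairs(X):
        """Link each point to its nearest neighbor; connected components form the clusters."""
        n = len(X)
        _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)   # idx[:, 1] is the nearest neighbor
        graph = coo_matrix((np.ones(n), (np.arange(n), idx[:, 1])), shape=(n, n))
        n_clusters, labels = connected_components(graph, directed=False)
        return n_clusters, labels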
A reciprocal-nearest-neighbors supported clustering algorithm called RSC has been
developed (Cheng et al. 2019b). The RSC is based on a simple hypothesis that the recipro-
cal nearest objects should be in the same cluster. RSC first constructs sub-clustering trees
(SCTs) according to the chain where each data point links to its nearest neighbor until all
data points are in the SCTs, then the data points to which the path length from the root is
larger than a calculated threshold in the SCTs are pruned and are linked to artificial roots.
After the above construction and pruning processes, each obtained SCTs is treated as a
node, and the above construction and pruning processes are repeated to obtain a high-level
clustering until one cluster is formed. Because of adopting the reciprocal nearest neighbors to construct the SCTs and pruning, the efficiency and effectiveness of RSC are improved: the time complexity is O(n log n) and the memory complexity is O(n). RSC also has a strong interpretation power owing to the hierarchy tree of the result.
Ros et al. (2020) have proposed a novel clustering algorithm called KdMutual which
combined the mutual neighborhood and agglomerative hierarchical methods using mutual
neighbor similarities based on single linkage. KdMutual first merges the data points


by their mutual neighborhood to identify potential core clusters. The core clusters then
grow based on a constrained hierarchical merging process like single linkage to handle
noise. In the last phase, the resulting clusters are obtained by ranking the core clusters
according to the new similarity measure, which contains the cluster size, the compactness
of the cluster, and the separability between clusters.
In summary, the methods based on nearest neighbors have higher efficiency and lower memory requirements, and are robust to noise. But for large-scale datasets, their computational cost is expensive owing to the huge number of samples, and they are influenced by the curse of dimensionality in high-dimensional space.

4.1.3 The methods based on the quality evaluation of the partition after merging

Because the results of hierarchical clustering cannot be undone once the merging or division is accomplished, the quality of the merging or division has a significant influence on the resulting clustering. The validity evaluation index is usually used to assess the validity of
the clustering method, and used to determine the optimal number of clusters in general,
such as Mean Square Error (MSE) (Han et al. 2011; Tsekouras et al. 2008), Clustering
Dispersion Index (CDI) (Han et al. 2011; Tsekouras et al. 2008; Duran and Odell 2013),
and within cluster sum of squares to between cluster variation ratio (WCBCR) (Tsekouras
et al. 2008). According to whether the ground truth labels are known or not, the clustering
validity evaluation indices can be divided into two types, external evaluation indices and
internal evaluation indices.

4.1.3.1 External evaluation indices Given the knowledge of the ground truth class assign-
ments and our clustering algorithm assignments of the same samples, we can evaluate the
quality of our clustering results using the external evaluation indices below. The seven com-
monly used evaluation indices in clustering are listed in Table 3 (Xu and Tian 2015).
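Most of these external indices are available off the shelf; a brief sketch with scikit-learn is given below (the label vectors are toy values, and rand_score requires a reasonably recent scikit-learn release).

    from sklearn import metrics

    y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # hypothetical ground-truth labels
    y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]      # hypothetical clustering labels

    print(metrics.rand_score(y_true, y_pred))                    # Rand indicator
    print(metrics.adjusted_rand_score(y_true, y_pred))           # ARI
    print(metrics.fowlkes_mallows_score(y_true, y_pred))         # Fowlkes-Mallows indicator
    print(metrics.normalized_mutual_info_score(y_true, y_pred))  # mutual-information based
    print(metrics.confusion_matrix(y_true, y_pred))              # confusion matrix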

4.1.3.2 Internal evaluation index When the ground truth class assignments are unknown and only the assignments produced by our clustering algorithm are available, we usually utilize the features of the original data to assess the clustering results. Some internal evaluation
indices commonly used are shown in Table 4 (Hubert and Arabie 1985).
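Similarly, the most common internal indices can be computed directly from the data and the predicted labels; a short scikit-learn sketch (with a synthetic dataset, purely for illustration) is:

    from sklearn import metrics
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

    print(metrics.silhouette_score(X, labels))          # Silhouette coefficient
    print(metrics.calinski_harabasz_score(X, labels))   # Calinski-Harabasz index
    print(metrics.davies_bouldin_score(X, labels))      # Davies-Bouldin index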

4.1.3.3 The measures for directly evaluating the dendrogram quality Besides using exter-
nal evaluation and internal evaluation to assess the quality of clusters, the quality of a cluster
tree can be also evaluated directly using a more holistic measure, such as dendrogram purity
(Kobren et al. 2017). Suppose the ground truth of a dataset X = {x_1, ⋯, x_N} is known and the data points are assigned into K clusters C* = {C_1*, ⋯, C_K*}. Let P* = {(x_i, x_j) | C*(x_i) = C*(x_j)} denote the pairs of points which are in the same cluster in the ground truth. For the cluster tree T generated by a hierarchical clustering algorithm, the dendrogram purity of T is defined as:

DP(T) = (1/|P*|) Σ_{k=1}^{K} Σ_{x_i, x_j ∈ C_k*} pur(lvs(LCA(x_i, x_j)), C_k*),

wherein LCA(x_i, x_j) is the least common ancestor of x_i and x_j in T, lvs(z) ⊆ X is the set of leaves under any internal node z in T, and pur(S_1, S_2) = |S_1 ∩ S_2| / |S_1|. The larger the DP(T), the higher the quality of the clustering result.
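A brute-force Python sketch of this definition for a SciPy linkage tree is given below (suitable only for small datasets; it simply enumerates the same-class pairs and looks up their least common ancestor in the merge sequence).

    import numpy as np
    from itertools import combinations
    from scipy.cluster.hierarchy import linkage

    def dendrogram_purity(X, y_true, method="average"):
        """DP(T) computed by enumerating all same-class pairs (small data only)."""
        n = len(X)
        Z = linkage(X, method=method)
        leaves = {i: {i} for i in range(n)}              # leaves[c]: data points under cluster id c
        for m, row in enumerate(Z):
            a, b = int(row[0]), int(row[1])
            leaves[n + m] = leaves[a] | leaves[b]        # merge node m gets id n + m
        total, pairs = 0.0, 0
        y_true = np.asarray(y_true)
        for k in np.unique(y_true):
            for i, j in combinations(np.where(y_true == k)[0], 2):
                # LCA = the first (lowest) merge node whose leaf set contains both points.
                lca = next(c for c in range(n, n + len(Z)) if i in leaves[c] and j in leaves[c])
                lvs = np.fromiter(leaves[lca], dtype=int)
                total += np.mean(y_true[lvs] == k)       # pur(lvs(LCA(xi, xj)), Ck*)
                pairs += 1
        return total / pairs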

Table 3  External evaluation indices

Rand indicator: RI = (TP + TN) / (TP + FP + FN + TN)
Fowlkes–Mallows indicator: FM = sqrt( (TP / (TP + FP)) ⋅ (TP / (TP + FN)) )
Jaccard indicator: J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)
F indicator: F_𝛽 = ((𝛽² + 1) ⋅ P ⋅ R) / (𝛽² ⋅ P + R), where P = TP / (TP + FP) and R = TP / (TP + FN)
ARI: ARI = (RI − E[RI]) / (max(RI) − E[RI]), where RI = (TP + TN) / (TP + FP + TN + FN) and E[RI] is the expected value of the RI
Mutual information: by measuring how much information the two clusterings share, the nonlinear correlation between them can be detected
Confusion matrix: used to find the difference between the clusters and the ground truth clusters

Here TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
Table 4  Internal evaluation indices

Silhouette coefficient: S = (1/n) Σ_{i=1}^{n} (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance between a sample and all other points in the same class, b_i is the mean distance between a sample and all other points in the next nearest cluster, and n is the number of samples
Calinski–Harabasz index: s = (tr(B_k) / tr(W_k)) × ((n_E − k) / (k − 1)), where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, with W_k = Σ_{q=1}^{k} Σ_{x∈C_q} (x − c_q)(x − c_q)ᵀ and B_k = Σ_{q=1}^{k} n_q (c_q − c_E)(c_q − c_E)ᵀ; C_q is the set of points in cluster q, c_q the center of cluster q, c_E the center of E, and n_q the number of points in cluster q
Davies–Bouldin index: DB = (1/k) Σ_{i=1}^{k} max_{i≠j} R_ij, with R_ij = (s_i + s_j) / d_ij, where s_i is the average distance between each point of cluster i and the centroid of that cluster, and d_ij is the distance between the centroids of clusters i and j
Dunn's index: DVI = min_{1≤i<j≤n} d(i, j) / max_{1≤k≤n} d′(k), where d(i, j) is the distance between clusters i and j, and d′(k) is the intracluster distance of cluster k

4.1.3.4 The methods based on the quality evaluation of the partition after merging For
improving the accuracy of agglomerative hierarchical clustering, some methods are devel-
oped by means of assessing the quality of the partition using some evaluation index after
each merging.
Heller and Ghahramani (2005) have proposed a Bayesian agglomerative hierarchical
clustering algorithm, that evaluates marginal likelihoods of a probabilistic model of the
data, and uses a model-based criterion to determine which clusters to be merged. Bayesian
hypothesis testing is used as similarity measure to decide which merges are advantageous
rather than distance measure in the algorithm, and to output the recommended depth of
the tree. Compared to traditional hierarchical clustering, this similarity measure provides a
guide of choosing the correct number of desired clusters and answers how good a cluster-
ing is.
Rocha and Dias (2013) have proposed that the evaluation on the quality of the partition
after merging two clusters is used as the similarity measure in agglomerative hierarchical
clustering. The quality of the partition is the ratio of pairs of alternatives that are indiffer-
ent, comparable, and consistent with the decision maker's preference model.
In the paper of Kaur et al. (2015), the performance of the agglomerative hierarchical
method is evaluated in the view of cluster quality. Cohesion measurement, elapsed time,
and silhouette index, for simplicity called quality index afterward, are adopted to assess
the cluster quality. For a given dataset, an agglomerative hierarchical clustering algorithm
whose quality indices are the best on the dataset has better fitness for the dataset among
alternative algorithms.
PERCH (Sander et al. 2003) uses dendrogram purity to evaluate the quality of the cluster tree. The measure assumes that the ground truth of the dataset is given. The larger the dendrogram purity, the higher the quality of the tree, and the better the clustering result.
This type of method must calculate the evaluation index for every merging to determine whether the merging is good or not, so the computational time complexity is very high for large-scale datasets. Besides that, the choice of evaluation index will affect the quality of the resulting clustering.

4.1.4 The methods based on other similarity measure

In addition to the aforementioned methods of similarity measure, other similarity measures


are also defined in clustering according to types, attributes, characteristics, natures, rela-
tions among the data points, and so on.
Saunders et al. (2018) have utilized an evolutionary algorithm to create extremely sta-
ble and unstable data sets for standard neighbor-joining algorithm, and then a new cluster-
ing called bubble clustering is developed to examine the stability. An associator is used
to measure the similarity between pairs of observations. Bubble clustering constructs the
association matrix by the sum of the outcomes of all the associators. Then the hierarchical
tree is constructed in terms of the most closely associated pairs from the association matrix
in a desired manner. Neighbor joining is usually used to produce hierarchical clustering
with weighted average between points.
Cai and Liu (2020) have proposed an agglomerative hierarchical clustering algorithm
called HiClub, which is a universal clustering framework applicable to multi-objective
bipartite networks. In HiClub, the similarity measure is defined with regard to Com-
mon Neighbor (CN), Preferential Attachment (PA), Cosine Index (COS), and Euclidean


Distance (EuD) between objects, it not only considers the neighborhood and weights of
objects, but also avoids the situation that EuD equals zero by adding COS. This similar-
ity can be robustly applied to generate the dendrogram of the complex structure bipartite
networks.

4.2 The agglomerative hierarchical clustering methods based on density estimation

The density of the data region is also used as similarity measure in the hierarchical cluster-
ing for clustering the data with complex structure and noise. Many agglomerative hierar-
chical clustering methods based on density estimation have been developed to deal with
different kinds of problems.
Lu et al. have used the estimated density values to help the hierarchical clustering, in
which an edge-weighted tree is first constructed from the data using the distance matrix
and the density values, then the hierarchical clustering result is efficiently obtained using
the sorted edges based on the edge weights (Lu and Wan 2013). The proposed method
PHA can handle non-spherical shape clusters, overlapping clusters, and clusters containing
noise, and can run faster and produce the high quality results.
Some agglomerative hierarchical clustering methods are based on dividing points
into core points with higher densities and border points with lower densities. Cheng
et al. (2019b) have proposed a local cores-based hierarchical clustering algorithm called
H-CLORE, which firstly divides the dataset into several clusters by finding local cores that
are the points with the local maximum density in the local neighbors. Then the lower den-
sity points are temporarily removed so that the boundary between clusters is clearer, and
clusters with the highest similarity measure are merged. Finally, the removed points are
assigned to the same clusters as their local cores belong to. The similarity measure between
clusters is defined as a function of the inter-connectivity and closeness to be merged. The
algorithm is effective and efficient for finding the clusters with complex structures. The
similarity function is as follows:

Sim(Ci, Cj) = Conn(Ci, Cj) × Close(Ci, Cj)²,

where Conn(Ci, Cj) = Σ_{e(vi,vj)∈CE(Ci,Cj)} w(vi, vj), CE(Ci, Cj) is the set of cut edges between the clusters Ci and Cj, w(vi, vj) = 1 / (1 + dist(vi, vj)), and

Close(Ci, Cj) = ( Σ_{e(vi,vj)∈CE(Ci,Cj)} w(vi, vj) ) / |CE(Ci, Cj)|.
H-CLORE initially divides dataset into many smaller clusters by searching the local
core instead of iteratively optimizing the partition like k-means, it reduces the total com-
puting cost to O(NlogN) (Cheng et al. 2019b). The local core is with the maximal density
around its neighbor points, the lower density points are removed temporally, the bound-
ary between clusters is clear, so H-CLORE is insensitive to outliers (Cheng et al. 2019b).
H-CLORE can also discover the clusters with complex structure efficiently and effectively,
and be applied in many fields such as pattern recognition, data analysis in 3D reconstruc-
tion and image segmentation (Cheng et al. 2019b).
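As a small illustration, the similarity above can be computed as follows once the cut edges between two candidate clusters are known; the cut_edges list of point-index pairs is a hypothetical input, not part of the cited method's code.

    import numpy as np

    def hclore_similarity(points, cut_edges):
        """Sim(Ci, Cj) = Conn(Ci, Cj) * Close(Ci, Cj)^2 from the cut edges between two clusters."""
        # cut_edges: non-empty iterable of (i, j) index pairs whose edge crosses the two clusters
        w = np.array([1.0 / (1.0 + np.linalg.norm(points[i] - points[j])) for i, j in cut_edges])
        conn = w.sum()          # Conn: total weight of the cut edges
        close = w.mean()        # Close: average weight of the cut edges
        return conn * close ** 2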


In the paper of Cheng et al. (2019a), a novel hierarchical clustering algorithm based on
noise removal (HCBNR) has been proposed. HCBNR marks the points with lower density
as noise by density curve and remove them from the dataset. Then a saturated neighbor
graph (SNG) is constructed, and the connected sub-graphs are divided into initial clusters
using M-Partition according to the number of clusters. The initial clusters are repeatedly
merged by a newly defined similarity measure between clusters until the desired number of
clusters is obtained. The similarity measure simultaneously takes into account the connectivity and the closeness between the sub-clusters to be merged, as in the Chameleon algorithm.
HCBNR can not only better handle noise data points and clusters with arbitrary shapes, but
also run faster than DBSCAN, Chameleon and CURE.
The Border-peeling clustering method iteratively peels border points of the clusters until
core points remain (Averbuch-Elor et al. 2020). During the peeling process, the associa-
tions between the peeled points and points in the inner layer are created by estimating the
local density of data points. After the peeling process terminates, the remaining set of core
points are merged with the close reachable neighborhood of the points according to the
associations. The border points at iteration t are the set X_B^(t) = {x_i ∈ X^(t) : B_i^(t) = 1}, where X^(t) is the set of unpeeled points at the start of the t-th iteration, and B_i^(t) is the border classification value of the point, which is 1 if x_i is a border point and 0 otherwise. B_i^(t) is determined by the reverse k nearest neighbors of x_i, i.e., RN_k^(t)(x_i) = {x_j | x_i ∈ N_k^(t)(x_j)}, and a pairwise relationship function f between points x_i and x_j (Averbuch-Elor et al. 2020). The unpeeled points set for the next iteration is X^(t+1) = X^(t) \ X_B^(t).
The association between each identified border point x_i ∈ X_B^(t) and a neighboring non-border point 𝜌_i ∈ X^(t+1) is computed as follows (Averbuch-Elor et al. 2020):

l_i = min{ Σ_{x_j ∈ N_NB,k^(t)(x_i)} C_k 𝜉(x_i, x_j), 𝜆 },

where 𝜆 is a parameter and N_NB,k^(t)(x_i) is the set of points which were peeled up to the current iteration and are not outliers.
The Border-peeling clustering method can deal with multiple distribution models, and has been shown to be stable because it is insensitive to the values of its hard-coded parameters.
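The following is a highly simplified Python sketch of the peeling idea, not the exact procedure of Averbuch-Elor et al. (2020): border points are identified here only by a low reverse-kNN count, whereas the original method uses local density estimates and association values; the parameters k, n_iterations, and border_quantile are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def peel_borders(X, k=10, n_iterations=3, border_quantile=0.2):
    """Simplified border peeling: points with few reverse-kNN links are
    treated as border points and removed iteratively; the survivors are
    the core points used in the final merging step."""
    active = np.arange(len(X))
    for _ in range(n_iterations):
        if len(active) <= k:
            break
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[active])
        _, idx = nn.kneighbors(X[active])       # idx[:, 0] is the point itself
        rknn_count = np.bincount(idx[:, 1:].ravel(), minlength=len(active))
        threshold = np.quantile(rknn_count, border_quantile)
        is_border = rknn_count <= threshold     # B_i^(t) = 1 for border points
        active = active[~is_border]             # X^(t+1) = X^(t) \ X_B^(t)
    return active                               # indices of the remaining core points
```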
To sum up, density-based methods can find clusters with complex structures in the presence of noise, and are stable.

4.3 The agglomerative hierarchical clustering methods based on deep learning

Deep learning is a popular technique in machine learning and has been applied to clustering analysis. Features extracted by a deep learning method are used to compute a more accurate similarity during clustering, which can improve the accuracy of the clustering results. This idea is also utilized in agglomerative hierarchical clustering.
In the paper of Zeng et al. (2020), the HCT method combines hierarchical clustering with a hard-batch triplet loss for person re-identification. The hierarchical clustering is used to generate pseudo labels by iteratively agglomerating the two samples with the minimal unweighted average linkage distance; the PK sampling algorithm then produces a new dataset using the pseudo labels for fine-tuning the CNN. In this way, the similarity among images in the target domain is fully exploited by hierarchical clustering.


The hard-batch triplet loss with PK can efficiently increase the distance between dissimilar
samples while reducing the distance between similar samples, so that the hard samples are
better distinguished.
The distance used in the unweighted average linkage between clusters is defined as

$$D_{ab} = \frac{1}{n_a n_b} \sum_{i \in C_a,\, j \in C_b} D(C_{ai}, C_{bj}),$$

where $C_{ai}$ and $C_{bj}$ are samples in the clusters $C_a$ and $C_b$ respectively, $n_a$ and $n_b$ are the numbers of data points in $C_a$ and $C_b$ respectively, and $D(\cdot)$ denotes the Euclidean distance. This distance considers all the pairwise distances between two clusters and assigns them the same weight when merging.
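A minimal Python sketch of this unweighted average linkage between two clusters of feature embeddings is shown below (our illustration; the function name average_linkage and the use of SciPy's cdist are assumptions, not part of the HCT implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def average_linkage(feats_a, feats_b):
    """Unweighted average linkage: mean Euclidean distance over all
    pairs of feature embeddings drawn from the two clusters."""
    pairwise = cdist(feats_a, feats_b, metric="euclidean")  # shape (n_a, n_b)
    return pairwise.mean()  # (1 / (n_a * n_b)) * sum of all pairwise distances

# Example with random embeddings of dimension 2048 (e.g. CNN features)
a = np.random.rand(5, 2048)
b = np.random.rand(8, 2048)
print(average_linkage(a, b))
```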
Zhao et al. combine the bottom-up hierarchical clustering and a CNN for person re-identification (Zhao et al. 2020). The proposed non-locally enhanced feature network can sufficiently extract the features of images by embedding non-local blocks into the CNN. The similarity between clusters is determined using an inter-cluster distance called the intermediate distance (IMD) and an intra-cluster distance called the compactness degree (CPD) simultaneously. IMD is the mean of the farthest distance and the nearest distance between two clusters, which can avoid some wrong merging. CPD measures the compactness as the mean intra-cluster distance. After training with the proposed CNN, the two clusters with the largest similarity are merged; training and agglomerative hierarchical clustering are performed iteratively, so images of the same person can be grouped into one cluster.
To avoid the false merging due to the minimum distance criterion of BUC (Lin et al. 2019),
the proposed intermediate distance (IMD) is defined as follows:
$$\mathrm{IMD}(A, B) = \frac{1}{2}\left(\min_{x_a \in A,\, x_b \in B} d(x_a, x_b) + \max_{x_a \in A,\, x_b \in B} d(x_a, x_b)\right),$$

wherein d(xa , xb ) indicates the Euclidean distance between the feature embeddings of two
images.
The compactness degree (CPD) which evaluates intra-cluster distance is defined as

$$\mathrm{CPD}(A) = \frac{1}{n} \sum_{i, j \in A} d(x_i, x_j),$$

wherein n is the number of samples in cluster A, d(xi , xj ) indicates the Euclidean distance
between the feature embeddings of two images in cluster A.
Considering the inter-cluster and intra-cluster distances simultaneously, the final distance between clusters A and B is formulated as

$$D(A, B) = \mathrm{IMD}(A, B) + \lambda\,(\mathrm{CPD}(A) + \mathrm{CPD}(B)),$$

wherein $\lambda$ is a parameter balancing the effect of IMD and CPD.
Clusters with small $D(A, B)$ are merged during hierarchical clustering. The proposed method efficiently reduces the incorrect merging in early stages thanks to the newly proposed measure composed of IMD and CPD, which promotes the quality of bottom-up clustering.
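The following minimal Python sketch illustrates how such a combined distance could be computed from two sets of feature embeddings (our illustration, not the authors' code; the 1/n normalization of CPD follows the formula above, and the value of lam is arbitrary):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def imd(feats_a, feats_b):
    """Intermediate distance: mean of the nearest and farthest
    pairwise distances between the two clusters."""
    d = cdist(feats_a, feats_b)
    return 0.5 * (d.min() + d.max())

def cpd(feats):
    """Compactness degree: sum of intra-cluster pairwise distances
    divided by the number of samples (the 1/n normalization above);
    zero for singleton clusters."""
    n = len(feats)
    if n < 2:
        return 0.0
    return pdist(feats).sum() / n

def merge_distance(feats_a, feats_b, lam=0.5):
    """D(A, B) = IMD(A, B) + lambda * (CPD(A) + CPD(B))."""
    return imd(feats_a, feats_b) + lam * (cpd(feats_a) + cpd(feats_b))

# Example with random embeddings
a = np.random.rand(6, 128)
b = np.random.rand(9, 128)
print(merge_distance(a, b))
```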
The above deep learning based methods all adopt ResNet50 as the backbone to learn the feature space of the re-ID objects. ResNet50 is initialized with weights pre-trained on ImageNet, the dropout rate is set to 0.5, and SGD with a momentum of 0.9 is used to optimize the model.


4.4 The agglomerative hierarchical clustering methods using hybrid strategy

Other variations of the agglomerative hierarchical clustering methods using mixed strate-
gies during merging are also proposed.
Takumi and Miyamoto (2012) have defined two asymmetric similarity measures and their no-reversal criteria. One is a probabilistic model using the top-down approach, which first defines the similarity measure between clusters so that the similarity between objects is a special case of it; the other is an extended updating formula using the bottom-up approach, which first defines the similarity between objects and then defines the similarity measure between clusters from it. An asymmetric dendrogram based on the Chi-square test for agglomerative hierarchical clustering has been proposed using the aforesaid asymmetric similarity measures.
In a new hierarchical clustering method proposed by Jalalat-evakilkandi and Mirzaei
(2010), the dendrograms of base hierarchical clustering methods, such as single-linkage
and complete linkage, are converted to matrices, then these matrices are combined in a
weighted procedure to lead to a final description matrix based on scatter matrices and near-
est neighbor criterion. Proficiency and robustness of hierarchical clustering are increased
by the combination of base hierarchical clustering methods. Zhao et al. have extended instance-level constraints to hierarchical constraints in agglomerative hierarchical clustering, called ordering constraints, by which hierarchical side information can be captured and hierarchical knowledge can be encoded into agglomerative hierarchical algorithms. An ordering constraint sets the merge preferences of objects but does not change similarities during clustering (Zhao and Qi 2010).
Mao et al. have proposed a new hierarchical clustering method, in which sequence data
are firstly divided into groups using a new landmark-based active hierarchical divisive
(AHDC) method, and then the ESPRIT-Tree clustering method (Cai and Sun 2011), a hierarchical agglomerative method with quasi-linear time complexity, is applied to each group individually to find the correct hierarchy of that group; finally, the hierarchy of the data is obtained by assembling the hierarchies from the above steps (Mao et al. 2015). This method is scalable to tens or even hundreds of millions of sequences by using high-performance parallel computing and requires linear time complexity with respect to the number of input sequences (Mao et al. 2015). It is clear that the method is a hybrid
hierarchical clustering, named as HybridHC, and has two stages, i.e. AHDC division and
ESPRIT-Tree agglomeration.
In the rough set based agglomeration hierarchical clustering algorithm (RAHCA) (Chen
et al. 2006), the data are mapped to a decision table (DT) in light of Rough set theory
(RST). An attribute membership matrix (AMM) is built using DT and then used to accom-
plish merging by the corresponding items. The similarity is the numerical measure of cat-
egorical data based on Euclidean distance on DT. Consistent degree is used to measure
the quality of cluster, agglomerate degree is used as stop criterion of the algorithm. The
clustering level is defined as a comprehensive measure of the consistent degree and the agglomerate degree. The cluster with the minimal clustering level value (Dmin) and the cluster among the remaining clusters with the highest similarity to Dmin are merged by updating the AMM.
Varshney et al. have proposed a Probabilistic Intuitionistic Fuzzy Hierarchical Cluster-
ing (PIFHC) Algorithm, that considers intuitionistic fuzzy sets (IFSs) to handle the uncer-
tainty in the data and leverages the probabilistic-weighted Euclidean distance measure
(PEDM) to compute the weights between the data points as similarity, then the clusters
are formed in the agglomerative way (Varshney et al. 2022). PIFHC can better identify uncertain data points using IFSs; however, the computational cost is higher because a probabilistic weight has to be computed for each data point, a suitable membership function depends on the particular problem, and the parameter $\alpha$ is determined experimentally (Varshney et al. 2022). The authors argued that uncertainty in the data may be represented
using fuzzy set variants such as Pythagorean fuzzy sets, interval-valued intuitionistic fuzzy
sets and type-2 fuzzy sets (T2 FSs), therefore, based on these variants, designing new hier-
archical clustering algorithms is worth studying further (Varshney et al. 2022).

4.5 Incremental agglomerative hierarchical clustering methods

Incremental agglomerative hierarchical clustering methods have also been proposed. COB-
WEB (Fisher 1987) is one of the most prominent algorithms that incrementally clusters the
objects to form a conceptual categorization tree. The quality of a cluster is measured by the category utility, which is a probabilistic description of how a document in a parent-level cluster is assigned to a cluster at the child level. COBWEB maximizes the category utility when a new object is inserted. Sahoo et al. (2006) have proposed an improvement of COBWEB by changing its underlying assumption that the observations follow a Normal distribution to Katz's distribution. The improved method is suitable for hierarchically clustering text documents.
BIRCH (Zhang et al. 1996) uses the natural closeness of points to incrementally cluster
the observations. It constructs the CF-Tree of the dataset firstly, the sparse clusters then are
deleted and dense clusters are merged in the leaf nodes. BIRCH is especially suitable for
very large dataset and can handle convex or spherical clusters of uniform size very well,
and effectively deal with noise.
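BIRCH is available in scikit-learn; a minimal usage sketch is given below (the parameter values are illustrative only):

```python
import numpy as np
from sklearn.cluster import Birch

# Toy data: two well-separated Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(5, 0.5, (100, 2))])

# threshold controls the radius of a CF subcluster; branching_factor
# limits the number of CF subclusters per node in the CF-tree.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))   # sizes of the two recovered clusters
```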
The key properties under which incremental clustering can detect the structures of clusters have been studied, including niceness, perfectness, and refinement (Ackerman and Dasgupta 2014). Nice clustering means that any data point is closer to the points in its own cluster than to the points in other clusters; in other words, the separability between clusters is always larger than the compactness within a cluster. Perfect clustering means that the minimal separability between clusters is larger than the maximal diameter of the clusters. Refinement requires that each cluster in the clustering result is contained in some cluster of a reference clustering, which usually allows additional clusters to be produced.
Zhang et al. (2013) have proposed an agglomerative clustering algorithm that is graph-
structural and define path integral as the structural descriptor of cluster on graph. The path
integral is the sum of all the paths within the clusters on the directed graph. Therefore, the
similarity measure between clusters is defined as the amount of incremental path integral,
that measures the structural change of clusters after merging and is calculated in a closed-
form exact solution when merging them. Different types of clustering methods can handle clusters with distinct structures.
Kobren et al. (2017) have proposed an incremental hierarchical clustering called
PERCH that can non-greedily handle a large number of data points as well as a large num-
ber of clusters. Under separability assumption, PERCH maintains the quality and growing
of the tree by rotating operation if the masking exists when a new point is inserted to leaf
node in the tree. PERCH achieves higher dendrogram purity and speeds up the clustering procedure regardless of the order in which items arrive, and can scale to large datasets and large numbers of clusters.
In the paper of Shimizu and Sakurai (2018), a parallel and distributed hierarchical clus-
tering method based on Actor Model (Agha 1990), called ABIRCH, has been proposed. In


this method, an added point is incrementally received and processed based on the BIRCH algorithm (Zhang et al. 1996) through the behavior of an actor, and a divisive hierarchical clustering is represented as an actor. An actor acts as a node, and a set of nodes can be summarized by a value called the clustering feature (CF), where each node in the tree maintains a CF value.
The CF value is updated when a point is added to a node, and the radius of a CF value is
expanded simultaneously. When the radius is greater than a threshold, the node will be
split.
The incremental clustering method proposed by Narita et al. (2018, 2020) can update
the partial clusters without re-clustering when a point is newly inserted, and saves execu-
tion time. The center and radius of a cluster are defined. If the distance between the point to be inserted and the center of the cluster is less than or equal to the radius, the point is inserted and the clustering result is updated. The center and radius of the clusters are then updated, from the root of the cluster tree down to the leaves, until the insertion requirement is no longer met.
Motivated by a notion of separability of clustering using linkage functions, the GRINCH
method uses generic similarity functions, which measure any affinity between two point
sets, to non-greedily perform hierarchical clustering for large-scale data set (Monath et al.
2019). GRINCH is mainly composed of its rotate subprogram and graft subprogram (Mon-
ath et al. 2019). When new data items arrive, these subprograms rearrange the generated hierarchical tree locally and globally, respectively. GRINCH can produce clustering results as consistent as possible
with ground truth when the linkage function is consistent with ground truth regardless of
data arriving in the order and obtain clusters with complex structures (Monath et al. 2019).
Based on graph theory, model-based separation is defined by characterizing the relation-
ship between a linkage function and a dataset. GRINCH can efficiently produce a clusters
tree with higher dendrogram purity in the separated setting (Monath et al. 2019).
The Sub-Cluster Component algorithm (SCC) (Monath et al. 2021) is an agglomerative hierarchical clustering method that scales to billions of data points and can produce hierarchies of optimal flat partitions. It uses a series of increasing distance thresholds to determine
which sub-clusters should be merged in a given round, under the separability assumption
and non-parametric DP-mean objective. SCC is applied to cluster large scale web queries
and achieves the competitive result with the state-of-the-art incremental agglomerative
hierarchical clustering, and requires less running time.
In the Internet of Things (IoT), the data, which are collected from IoT sensors and then annotated using the Resource Description Framework (RDF), are classified and analyzed by the agents and subsequently represented as data streams (DS). The data patterns in the DS with the minimum distance between them are merged. The nearest neighbor chain based
on incremental hierarchical clustering is used to cluster the streaming data (Núñez-Valdéz
et al. 2020). When a new DS is added, it is elaborated starting from any node in the hierar-
chical tree until a pair of data samples with Reciprocal Nearest Neighbor (RNN) is formed,
and then these data samples are aggregated. The same process is continued with RNNs for
the hierarchical tree of previously annotated objects (Núñez-Valdéz et al. 2020). The distance can be calculated using any distance function, such as the Manhattan, Euclidean or Minkowski distance functions in D-dimensional space (Núñez-Valdéz et al. 2020).
An incremental supervised learning based clustering method, called dynamic cluster-
ing of data stream with considering concept drift (DCDSCD), was developed some time
ago. In this method, data stream is automatically clustered in a supervised manner, where
the clusters whose values decrease over time are identified and then eliminated (Nikpour
and Asadi 2022). Moreover, the generated clusters can be used to classify unlabeled data.
Each chunk is clustered independently and automatically in a supervised manner that
presents timely results without obtaining the number of clusters from the user. Centroids


in consecutive chunks are merged so it is scalable in number of sequences (Nikpour and


Asadi 2022).
A criterion based on the weights of centroids and a predefined decay rate for detecting concept drift is proposed to eliminate or ignore outdated clusters; thus, the method also adapts sequentially to rapid changes (Nikpour and Asadi 2022).
When a chunk of the data set arrives, the points belonging to different classes in the chunk are split into subsets with the same label. The similarities of incoming chunks are defined on the basis of the Euclidean metric and the class information of each data point (Nikpour and Asadi 2022). The optimal number of clusters in each subset is determined by computing the silhouette criterion, as sketched below. The clusters in consecutive chunks are merged hierarchically according to the similarity criterion. Classless data points in each chunk are assigned labels according to the nearest centroid (Nikpour and Asadi 2022). After merging finishes, all centroids in all subsets are stored to save memory (Nikpour and Asadi 2022).
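A minimal Python sketch of the silhouette-based choice of the number of clusters is given below (our illustration; k-means is used here only as a stand-in for the clustering applied to each subset, and the candidate range of k is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 10)):
    """Pick the number of clusters for a subset by maximizing the
    silhouette criterion over a range of candidate k values."""
    best_k, best_score = None, -1.0
    for k in k_range:
        if k >= len(X):
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```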
DCDSCD can automatically determine the optimal number of clusters of a data
stream instead of specifying it in advance, and surpasses the static supervised clus-
tering algorithms in terms of clustering purity, running time and the number of clus-
ters (Nikpour and Asadi 2022). It is also able to deal with the old clusters compared to
SAIC, and can detect any type of concept drift efficiently and correctly using the cluster
weights and decay rate compared to the PHT algorithm and SNDC algorithm (Nikpour
and Asadi 2022).
StreaKHC, a kernel-based incremental hierarchical clustering for clustering massive
streaming data, adds a point by using the Isolation kernel to rearrange the cluster tree in
a top-down search mode within the linear time complexity when a new data point arrives
(Han et al. 2022). Isolation kernel can measure the similarity between the new point and a
node existing in the cluster tree (Han et al. 2022). StreaKHC is capable of hierarchically clustering large-scale stream data with different densities. The top-down search mode only searches one path from the root of the hierarchy tree to a leaf node 𝜂, one node at each level, when a new point arrives; the new point and its most affine leaf node 𝜂 are then merged to form a subtree that replaces 𝜂. StreaKHC has a time complexity of O(n) and runs more efficiently
because of the top-down search mode (Han et al. 2022).
In the method GCFI++, ticket data streams are incrementally mined and then soft
hierarchical clustering is performed to obtain interpretable clusters with auto-generated
issue topics (Dixit 2022). Graph representation of customer issue types is used to build
an incremental model to generate and detect problem categories and their labels. Moreo-
ver, with the power of embeddings, hierarchical clustering based on Generalized Closed
Frequent Itemset is utilized to obtain overlapping, limited as well as interpretable cluster
assignments. The keywords in a document are mapped to items. The features with top TF-IDF scores from the documents are selected. The best topics for a document are selected
according to label assignment scores, which can also detect an outlier document in terms of
a singleton label. Finally, the label names with grammatical meaning are obtained for the
selected label nodes in the final clustering graph, which is a directed acyclic graph (Dixit
2022).
Frequent itemset algorithm Apriori is employed to mine frequent itemsets and construct
a frequent pattern graph for the first pass. For the further pass, the higher levels of itemset
are generated by selecting a frequent itemset node based on generalized closed frequent
itemset property to eliminate superfluous nodes in the graph (Dixit 2022).
Next, the getTermOrder procedure extracts the order of terms in n_grams for the cluster
key nodes, transformByOrder procedure constructs a lexically valid phrase for these terms


as a candidate for the cluster’s label name. At this point, the hierarchy of clusters is gener-
ated (Dixit 2022). This method keeps a separate state for outlier/unassigned documents to be re-evaluated in the next data iteration (Dixit 2022). Moreover, it does not require rebuild-
ing the clustering procedure from scratch for the previous data. Instead, it is simply a mat-
ter of a new FI update.

4.6 Methods for extracting clusters from a given dendrogram

The clusters of hierarchical clustering are obtained by cutting the dendrogram from a cer-
tain height. So, how to extract clusters from a given dendrogram is worth studying. Some optimization methods consider global or local cuts, such as the following.
One method can convert a dendrogram into a reachability plot, and vice versa, and can automatically extract the important clusters from the reachability plot, where local maxima of the reachability values separate the clusters (Sander et al. 2003). It makes hierarchical clustering a preprocessing tool in data mining work where downstream tasks are based on clustering results (Sander et al. 2003).
In the paper of Campello et al. (2013), HDBSCAN constructs a cluster tree composed
of clusters generated by DBSCAN*, which is an improved DBSCAN. A mutual reachabil-
ity graph is constructed by assigning edge weight that is the mutual reachability distance
between two objects in dataset, an MST is computed from the mutual reachability graph,
and then an extended MSText is obtained by adding self-loops for each vertex in the MST.
The HDBSCAN hierarchy is extracted from MSText by iteratively removing edges from MSText in descending order of edge weight. A tree of significant clusters is generated from the simplified HDBSCAN hierarchy, and then the significant clusters are produced by optimal local cuts.
Using the HDBSCAN hierarchy and the proposed stability measure, globally optimal significant clusters can be obtained, and the time complexity can be reduced from $O(dn^2)$ to $O(n^2)$. However, the space complexity is increased from $O(dn)$ to $O(n^2)$, where $d$ is the dimensionality of a data point and $n$ denotes the number of data points.
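A minimal Python sketch of the mutual reachability distance that underlies the HDBSCAN hierarchy is given below (our illustration; the function name and the choice of the Euclidean base distance are assumptions):

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def mutual_reachability(X, mpts=5):
    """Mutual reachability distance:
    d_mreach(a, b) = max(core(a), core(b), d(a, b)),
    where core(a) is the distance from a to its mpts-th nearest neighbour."""
    D = squareform(pdist(X))               # pairwise Euclidean distances
    core = np.sort(D, axis=1)[:, mpts]     # mpts-th neighbour (column 0 is the point itself)
    return np.maximum(D, np.maximum.outer(core, core))
```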

4.7 Kernel based agglomerative hierarchical clustering algorithm

Kernel-based method is another hot topic for clustering as it can help the algorithm to
detect complex and non-linear separable clusters. Therefore, many kernel based hierarchi-
cal clustering algorithms are proposed to reveal more complex information in data.
In the paper of Qin et al. (2003), the comparing experiments show that the same hier-
archical clustering algorithm can produce different hierarchical trees in terms of different
kernel functions, and the results of kernel hierarchical clustering are not better than stand-
ard hierarchical clustering according to the internal twofold cross validation evaluation and
external PPR evaluation. Moreover, it also shows that kernelization usually increases the dimensionality of the data; combining it with SVM and related large-margin algorithms does not suffer from the curse of dimensionality, but combining kernelization with other methods should be done with caution.
Endo et al. (2004) combine five agglomerative hierarchical clustering methods with ker-
nel functions to construct the new kernel clustering, these hierarchical clustering methods
use square of Euclidean norm between two items in a feature space as dissimilarity when
merging. The new methods are respectively called Kernel Centroid Method, Kernel Ward’s

13
8244 X. Ran et al.

Method, Kernel Nearest Neighbor Method (K-AHC-N), Kernel Furthest Neighbor Method
(K-AHC-F), Kernel Average Linkage between the Merged Group (K-AHC-B), and Kernel
Average Linkage within the Merged Group (K-AHC-I). Because of the increasing cost of computing the dissimilarity and kernel function after each merging, they reduce the cost by reusing the dissimilarities that have already been computed before merging. Therefore, the calculation cost in this way is much lower than recalculating everything after the merging.
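The key observation that makes such kernel variants possible is that the squared Euclidean distance in the feature space can be computed from the kernel alone. A minimal Python sketch is given below (our illustration; the RBF kernel and the parameter gamma are only examples):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def feature_space_sq_distance(x, y, kernel=rbf_kernel):
    """Squared Euclidean distance in the feature space induced by the kernel:
    ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

x = np.array([0.0, 1.0])
y = np.array([1.0, 1.5])
print(feature_space_sq_distance(x, y))
```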
Hierarchical Kernel Spectral Clustering (Alzate and Suykens 2012) makes use of ker-
nel spectral clustering to discover the underlying cluster hierarchy in the dataset. A pair
of clustering parameters (k, 𝜎 2 ), the number of clusters and RBF kernel parameter respec-
tively, are found by training clustering model for each (k, 𝜎 2 ) and assessing Fisher crite-
rion on the validation set for out-of-sample data. During this process, the memberships of
all objects are created, and then the hierarchical structure is built by merging two closest
clusters according to a specified linkage criterion. Because of clustering model training in
a learning environment, it ensures the model has good generalization ability, but enumera-
tion model training for (k, 𝜎 2 ) may consume more running time. For any k, find the maxi-
mum value of the BLF criterion across the given range of 𝜎 2 values. If the maximum value
is larger than the BLF threshold 𝜃 , create a set of these optimal (k, 𝜎 2 ) pairs.
Multiple kernel clustering and fusion methods may lead to loss of clustering advanta-
geous details from kernels or graphs to partition matrix due to sudden drop of dimension-
ality (Liu et al. 2021). Hierarchical Multiple Kernel Clustering (HMKC) (Liu et al. 2021) gradually groups the samples into fewer clusters and generates a sequence of intermediate matrices with gradually decreasing size. The consensus partition is learned at the same time, which in turn guides the construction of the intermediate matrices. A newly designed optimization algorithm jointly optimizes the intermediate matrices and the consensus partition by alternately performing forward and backward propagation to update the variables. The experimental results show that larger intermediate matrices can preserve more informative details of the clusters, and the performance improves as the number of layers of intermediate matrices increases, but the computational complexity increases accordingly. So, the authors advise using the HMKC-2 model, which contains two layers of intermediate matrices, in clustering tasks to improve stability and lower complexity.
In conclusion, kernel hierarchical clustering can discover more complex and non-linearly separable clusters in the dataset; however, it increases the computing cost due to the introduction of the kernel.

4.8 Applications of agglomerative hierarchical clustering algorithm

Different agglomerative hierarchical clustering methods have also been developed to deal
with different types of datasets.
ROCK (Guha et al. 2000) uses the Goodness Measure between the two different clus-
ters based on links to define the similarity measure for categorical data. Two clusters with
maximal Goodness Measure are merged at each iterative step. ROCK is used on Congres-
sional Votes, Mushroom, and U.S. Mutual Fund data to discover the inherent distribution in the data, and the results are encouraging. Squeezer (He et al. 2002) hierarchically clusters categorical data by reading each tuple in the dataset in sequence; the tuple is either assigned to an existing cluster or used to create a new (initially singleton) cluster according to the similarity,
which is defined as the sum of each attribute in the tuple in the cluster. Because it reads the
dataset only once, both efficiency and quality of clustering results are high. d-Squeezer,


an alternative of Squeezer, is also proposed to deal with large databases by the means of
directly writing the cluster identifier of each tuple back to file instead of holding the cluster
in memory (He et al. 2002). Both Squeezer and d-Squeezer are suitable for clustering data streams.
A hierarchical clustering framework for clustering categorical data based on Multinomial
and Bernoulli mixture models is also proposed, in which the Bhattacharyya and Kullback–Leibler distances are used as the similarity measure between clusters (Alalyan et al. 2019).
This method is applied in text document clustering and computer vision applications. In
the two tasks, the objects are described as binary vector and the similarities are evaluated
using the Bhattacharyya and Kullback–Leibler distances. The accuracy of the clustering results is improved significantly in contrast to the methods based on Gaussian probability models and the Bayesian methods. The experiments show that it can be applied to many other
applications which involve hierarchy structure and count or binary data (Alalyan et al.
2019).
Lerato and Niesler (2015) have proposed a multi-stage agglomerative hierarchical clustering approach (MAHC) aimed at large datasets of speech segments based on the iterative
divide-and-conquer strategy. The dynamic time warping (DTW) algorithm (Myers et al.
1980; Yu et al. 2007) is used to compute similarity measure between two segments (Lerato
and Niesler 2015). In the first stage, the dataset is divided into several subsets, AHC is
separately applied on each subset. In the second stage, the average points computed in the
previous stage are clustered using AHC. In this way, the storage requirement is reduced
and clustering procedure can be implemented in parallel. Therefore, MAHC can be easily
extended to large-scale dataset. MAHC is applied to speech segments and the results show
that the performance of MAHC reaches and often exceeds that of AHC, and MAHC per-
forms well in parallel computing.
Rahman et al. have proposed a highly robust hierarchical clustering algorithm (RHC)
to cluster metabolomics data using covariance matrix of two-stage generalized S-estimator
in presence of cell-wise and case-wise outliers (Rahman et al. 2018). RHC successfully
reveals the original pattern for metabolomics analysis and performs better than the tradi-
tional HC in presence of cell-wise and case-wise outliers. A method has been utilized to
cluster the protocol feature words according to the longest common subsequence, which is
equivalent to similarity in the merging process, between the order sequences with the byte
position information (Li et al. 2019). The method based on hierarchical clustering can extract the feature words of an unknown protocol and has a higher recall compared to PoKE (Li et al. 2019).
One method has been proposed aiming at dealing with noise in single-linkage hierarchi-
cal clustering from two aspects (Ros and Guillaume 2019). First, the single link criterion
considers the local density to ensure that the distance involves the core point of each group.
Second, the hierarchical algorithm prohibits merging representative clusters that exceed the
minimum size after determination.
For omics analysis, Hulot et al. (2020) have proposed the mergeTrees method, which uses agglomerative hierarchical clustering to merge many trees sharing common leaf nodes, at the same heights, into a consensus tree. This method does not need the number of clusters to be specified in advance, only requires centering and standardization as preprocessing, and its time complexity is O(nq log(n)) for large datasets, where n is the number of leaves and q is the number of trees to be merged. The set of q trees can be obtained from the data with any HC method, and C(T) denotes the consensus tree based on T = T1, ..., Tq. The analysis shows that mergeTrees is robust to the existence of empty data tables.


Mulinka et al. (2020) have built a HUMAN method, using density-based hierarchical
clustering techniques, to detect anomalies in multi-dimensional analyzed data, and to ana-
lyze the dependence and relationships among the data hierarchies to interpret the potential
causes behind the detected behaviors, with minimal guidance and no ground truth.
An agglomerative hierarchical clustering method has been proposed by Fouedjio for multivariate geostatistical data; the method is model-free and based on a dissimilarity measure derived from a non-parametric kernel estimator that can reflect the multivariate spatial dependence structure of the data (Fouedjio 2016). This method can cluster irregularly spaced data and obtain adjacent clusters without any geometrical constraints. For sparse or small-scale datasets, however, the kernel estimator cannot correctly estimate the similarity among multivariate spatial data, and the proposed clustering method cannot give the membership degree of belonging to a specified cluster because it is model-free (Fouedjio 2016).
D’Urso and Vitale (2020) have proposed a robust similarity measure which is the exponen-
tial transformation of the kernel estimator proposed by Fouedjio (2016) for agglomerative
hierarchical clustering. The measure is not sensitive to noise and model-free and suitable
to cluster data indexed by geographical coordinates. The agglomerative clustering method
using proposed measure is used to cluster georeferenced data set and successfully gives
locations and top soil heavy metal concentrations.
A new agglomerative hierarchical clustering algorithm has been proposed and applied
to cluster geochemical data (Yang et al. 2019). After preprocessing, the geochemical data approximately follow a multivariate normal distribution, and the symmetric Kullback–Leibler divergences between the distributions are used as the dissimilarity measure dur-
ing merging in hierarchical clustering. The proposed method not only provides a tool to
reveal the relationship between geological objects according to geochemical data, but also
reveals that DKLS and its two parts can characterize geochemical differences from differ-
ent angles. These measures are expected to enhance the method of identifying geochemical
patterns.
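A minimal Python sketch of the symmetric Kullback–Leibler divergence between two multivariate normal distributions, which plays the role of the dissimilarity here, is given below (our illustration; clusters are assumed to be summarized by their sample means and covariance matrices):

```python
import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for two multivariate normal distributions."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0)
                  + diff @ inv1 @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def symmetric_kl(mu0, cov0, mu1, cov1):
    """Symmetric KL divergence used as the dissimilarity between two
    (approximately) normal clusters."""
    return kl_gaussian(mu0, cov0, mu1, cov1) + kl_gaussian(mu1, cov1, mu0, cov0)
```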

4.9 Comparison of agglomerative hierarchical clustering

Some of the state-of-the-art agglomerative hierarchical clustering methods proposed in


recent years are compared in terms of the similarity measure, the type of data that can be clus-
tered, the time complexity, and the type of output. The details are listed in Table 5.

5 Graph based hierarchical clustering

Graphs are receiving more and more attention in clustering analysis because of their natural structural character for representing the relationships among objects. Moreover, some operations on graphs can be used to deal with problems in clustering. Usually, clustering problems can be converted into the construction and partitioning of a simple graph in low-dimensional space or a hypergraph in high-dimensional space. This idea can also be used in hierarchical clustering and obtains better results.

5.1 Simple graph based hierarchical clustering

Some simple graph-based methods have been proposed, and the task of hierarchical clus-
tering is converted to the construction and the partitioning of the graph, e.g., the clustering

Table 5  Comparison of the state-of-the-art agglomerative hierarchical clustering methods

Methods                   | Similarity measure                         | Type of data           | Time complexity | Type of output
Generic_Linkage           | Any distance update scheme                 | Numerical data         | O(n^3)          | Stepwise dendrogram
HC-OT                     | OT based on distance                       | Numerical data, images | O(n^3 log(n))   | Dendrogram
Zazari et al.             | The nearest neighbors                      | Numerical data         | O(n^2)          | Dendrogram
RSC                       | Reciprocal-nearest-neighbors               | Numerical data         | O(n log n)      | Dendrogram
KdMutual                  | Mutual neighborhood                        | Numerical data         | O(n^3)          | Dendrogram
Heller et al.             | Bayesian hypothesis testing                | Numerical data, text   | O(p(n))         | Dendrogram
Bubble clustering         | Associator                                 | Numerical data         | O(n log(n))     | Dendrogram
HA                        | Density                                    | Numerical data         | O(n^2)          | Dendrogram
Border-peeling clustering | Density                                    | Numerical data         | O(n)            | Dendrogram
DC-HCP                    | Density-peak                               | Numerical data         | O(n^2)          | Dendrogram
RAHCA                     | Rough set                                  | Categorical data       | O(n^3)          | Dendrogram
COBWEB                    | Category utility                           | Categorical data       | O(n^2 log(n))   | Dendrogram
SCC                       | A series of increasing distance thresholds | Web queries            | O(n)            | Dendrogram
HDBSCAN                   | Mutual reachability distance               | Numerical data         | O(n^2)          | Significant clusters tree
Sander et al.             | Local maximal reachability                 | Numerical data         | –               | Significant clusters tree
HMKC                      | Kernel matrix                              | Numerical data, text   | O(n^3)          | Dendrogram

methods based on MST (Zahn 1971; Guan and Du 1998), CLICK (Sharan and Shamir
2000), Spectral clustering (von Luxburg 2007), improvements of Spectral clustering (Chen
et al. 2011; Cai and Chen 2015; He et al. 2019; Huang et al. 2020), and Kemighan–Lin
(KL) method (Kernighan and Lin 1970).
According to the model of calculating similarity for constructing graph, the methods of
agglomerative hierarchical clustering can be divided into two categories: static similarity
model based method and dynamic similarity model based method.

5.1.1 Static similarity model based methods

In the paper of Murtagh (1983), a general framework for hierarchical, agglomera-


tive clustering algorithms is discussed, and the designs of algorithms for hierarchical
clustering are also discussed. These algorithms include general hierarchical cluster-
ing algorithm using dissimilarity-update formula, hierarchical clustering algorithms
using original data and cluster data, multiple cluster algorithm for geometric cluster-
ing method, single cluster algorithm for geometric clustering method, constructing the
minimal spanning tree (MST) by subgraphs, single fragment hierarchical clustering
algorithm, and multiple fragments hierarchical clustering algorithm. These methods are
designed by leveraging the nearest neighbours graph (NN-graph) or the reciprocal nearest neighbours (RNNs), and the dissimilarities among the data points are updated using the Lance–Williams combinatorial formula, so their time complexity and space complexity are also improved. Because the reducibility property is obeyed, the resulting hierarchy is unique and exact regardless of the order in which the data are presented. However, the MST method discussed in Murtagh (1983) is only suitable for very low dimensional space.
In Multi-relational Hierarchical Clustering Algorithm Based on Shared Nearest
Neighbor Similarity (MHSNNS) Guo et al. (2007), the shared nearest neighbor similar-
ity (SNNS) is used to construct SNNS graph among objects. In relational database, the
tables’ relation is constructed by the tuple ID propagation. The SNNS graph with sparse
similarity threshold is then used to produce the smaller clusters by finding the connected
subgraph with the specified size based on the above table relation. The cluster is divided
into two parts if its vertices number and cohesion, the sum of the weights of links in
the cluster, are larger than the specified thresholds respectively. Then the clusters are
repeatedly merged by the separation, the sum of the weights of links between clusters,
until the desired number of remaining clusters is reached. By dividing the larger sub-clusters, the incorrect partitioning generated previously can be undone, which helps to improve the accuracy of the clustering result. The MHSNNS algorithm involves three parameters: for most data sets, setting the specified number of vertices MINSIZE to about 1–5% of the overall number of data points works fairly well, while the similarity threshold 𝜑 and the final number of clusters k are initialized according to the user's needs.
A novel clustering method called Zeta l-links (or Zell) has been developed in order
to hierarchically cluster complex data (Zhao and Tang 2008). l-link and Zeta merging
are also proposed. l-link constructs acyclic directed subgraphs by finding the shortest links among the kNN points of each point. A weighted adjacency graph P is then constructed according to the directional connectivity of the l-links. K-means and Affinity Propagation are combined to produce smaller clusters that are input to Zeta merging. Zeta merging
recursively agglomerates these smaller clusters to form a dendrogram using the maxi-
mum incremental popularity as similarity between clusters. The criterion of incremental


popularity is formulated based on P. Because of the global interaction in integration cycles using the Zeta function, the algorithm is insensitive to the variation of local data scales and is robust.
A kNN graph based hierarchical clustering algorithm for high-dimensional data has
been proposed by Zhang et al. (2012). The average indegree and average outdegree in
directed graph are used to represent the density and the local topology around an object
in the high-dimensional space, respectively. Therefore, the similarity measure between
clusters is defined as a product of the average indegree and average outdegree. During
clustering, the clusters with the highest similarity are selected to merge recursively. The
experimental results on image clustering and object matching demonstrate that the algorithm is simple, fast, and effective. The algorithm is robust to noise because of the product similarity, is computationally efficient, and is easy to implement.
Yu et al. (2015) have proposed a tree agglomerative hierarchical clustering method (TAHC)
to detect the clusters in undirected minimum spanning tree (MST) with unweighted links.
Trees are maximally sparse connected graphs. Clustering of MSTs exposes the hierarchical
structure of weighted graphs (Yu et al. 2015). The cluster is defined by star-motifs and line-
motifs in general trees. The geodesic distance matrix C of the data is input into the agglomera-
tive hierarchical clustering, then used to compute the similarity between pairs of data points
using Spearman distance, which is the similarity between all row pairs of C. Links between
node pairs are then added in order of descending similarity, beginning with the node pairs with the largest similarity. Average linkage is used when two objects are merged into one cluster. The results of TAHC surpass the previously reported clusterings, and its time complexity scales as $O(n^2)$ on an MST; however, no objective metric is provided to evaluate the clustering results.
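A minimal Python sketch of the similarity computation step of TAHC is given below (our illustration; it assumes the tree is given as a NetworkX graph and uses SciPy's Spearman correlation between the rows of the geodesic distance matrix):

```python
import numpy as np
import networkx as nx
from scipy.stats import spearmanr

def tahc_similarity(tree):
    """Similarity between the nodes of an unweighted tree: Spearman
    correlation between the corresponding rows of the geodesic
    (shortest-path) distance matrix C."""
    nodes = list(tree.nodes())
    n = len(nodes)
    C = np.zeros((n, n))
    lengths = dict(nx.all_pairs_shortest_path_length(tree))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            C[i, j] = lengths[u][v]
    rho, _ = spearmanr(C, axis=1)   # axis=1: each row of C is one variable
    return rho                      # n x n similarity matrix

# Example: a small star-shaped tree with five nodes
T = nx.star_graph(4)
print(tahc_similarity(T).shape)
```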
In clustering of trajectory data (Sabarish et al. 2020), the trajectories are preprocessed into
equal length representation using the Douglas–Peucker algorithm, a graph is then generated
from the string-based representation of the trajectory. Similarity measures between graphs
are computed using edge based and vertex based similarity. The edge based similarity is
the number of common edges in the corresponding graphs of two trajectories, and the ver-
tex based similarity is the number of the union vertices in two graphs. Finally, by applying
agglomerative clustering, the dendrogram of trajectories is generated, and the clustering result
is validated by the clustering validity index, i.e. Cophenetic Correlation Coefficient (CPCC),
Davies–Bouldin Index (DBI) and Dunn Index (DNI) (Halkidi et al. 2001), to choose the opti-
mal result.
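A minimal Python sketch of these two graph similarities is given below (our illustration; trajectory graphs are assumed to be given as sets of vertices and edges, and the vertex-based similarity follows the union-of-vertices description above):

```python
def edge_similarity(graph_a, graph_b):
    """Edge-based similarity: number of edges shared by the two
    trajectory graphs (edges given as sets of (u, v) tuples)."""
    return len(graph_a["edges"] & graph_b["edges"])

def vertex_similarity(graph_a, graph_b):
    """Vertex-based similarity: number of vertices in the union of the
    two trajectory graphs, following the description in the text."""
    return len(graph_a["vertices"] | graph_b["vertices"])

# Two toy trajectory graphs over grid cells A..D
g1 = {"vertices": {"A", "B", "C"}, "edges": {("A", "B"), ("B", "C")}}
g2 = {"vertices": {"B", "C", "D"}, "edges": {("B", "C"), ("C", "D")}}
print(edge_similarity(g1, g2), vertex_similarity(g1, g2))
```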

5.1.2 Dynamic similarity model based methods

Most of the agglomerative clustering methods introduced above are based on static intercon-
nectivity models (Karypis et al. 1999b). Chameleon, as a multi-stage hierarchical cluster-
ing algorithm, is based on a dynamic interconnectivity model that considers both aggregate
interconnectivity and the closeness of the data points when defining the similarity measure
(Karypis et al. 1999b). It initially constructs a graph according to the k-Nearest Neighbor of
the data points, then utilizes a graph partitioning algorithm to divide the objects into a lot of
relatively small sub-clusters so that the objects in each sub-cluster are highly similar. In the
final and most important stage, the Chameleon method uses the interconnectivity and close-
ness measure to continuously merge the sub-clusters until some criteria are met or only one
cluster is formed (Karypis et al. 1999b).


The relative interconnectivity $RI(C_i, C_j)$ of two clusters $C_i$ and $C_j$ is the absolute interconnectivity between $C_i$ and $C_j$ normalized with respect to their internal interconnectivities:

$$RI(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\frac{|EC(C_i)| + |EC(C_j)|}{2}},$$

where $EC(C_i, C_j)$ denotes the sum of the weights of the edges that straddle $C_i$ and $C_j$, and $EC(C_i)$ (or $EC(C_j)$), the internal interconnectivity, is the minimum weight sum of the edges that must be cut to divide the cluster $C_i$ (or $C_j$) into two roughly equal parts.
The relative closeness $RC(C_i, C_j)$ of two clusters $C_i$ and $C_j$ is the absolute closeness between $C_i$ and $C_j$ normalized with respect to their internal closeness:

$$RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)},$$

where $\bar{S}_{EC}(C_i, C_j)$ is the average weight of the edges that connect $C_i$ and $C_j$. Similarly, $\bar{S}_{EC}(C_i)$ (or $\bar{S}_{EC}(C_j)$) is the average weight of the cut edges that divide $C_i$ (or $C_j$) into two roughly equal parts.
Chameleon chooses pairs of clusters with both high RI and RC to merge, so a natural way is to take their product; namely, a pair of clusters is selected such that $RI(C_i, C_j) \times RC(C_i, C_j)$ is maximum. In this formula, RI and RC have equal importance, but sometimes we want to give a higher preference to one of the two measures. So, Chameleon's similarity measure is defined as:

$$S(C_i, C_j) = RI(C_i, C_j) \times RC(C_i, C_j)^{\alpha},$$

where $\alpha$ is a user-specified parameter: if $\alpha > 1$, Chameleon gives more weight to the relative closeness, and if $\alpha < 1$, it chooses the two clusters with higher relative interconnectivity to merge.
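A minimal Python sketch of this merging score is given below (our illustration; it assumes the edge-cut statistics have already been computed from the kNN graph and its bisections):

```python
def chameleon_score(ec_ij, ec_i, ec_j, s_ij, s_i, s_j, size_i, size_j, alpha=2.0):
    """Chameleon similarity S = RI * RC^alpha.

    ec_ij : total weight of the edges connecting clusters C_i and C_j
    ec_i, ec_j : internal interconnectivity (min-cut weight) of each cluster
    s_ij : average weight of the edges connecting C_i and C_j
    s_i, s_j : average weight of the internal min-cut edges of each cluster
    size_i, size_j : number of points in each cluster
    """
    ri = ec_ij / ((ec_i + ec_j) / 2.0)
    n = size_i + size_j
    rc = s_ij / ((size_i / n) * s_i + (size_j / n) * s_j)
    return ri * rc ** alpha
```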
Zhao et al. have applied the similar Chameleon method to cluster documents, called
constrained agglomerative algorithm which restricts some documents to only being avail-
able from the same cluster, a distance threshold-based graph is first constructed from the
documents dataset and the graph is partitioned to obtain the constraint clusters from which
the final hierarchical solution is constructed using the UPGMA agglomerative scheme
(Zhao and Karypis 2002). The constrained method leads to the better quality of the result-
ing clusters because the neighbors’ quality of each document is improved due to constraint.
Later, Zhao et al. (2005) proposed a Chameleon-like algorithm to cluster the documents
(high-dimensional data). Firstly, a kNN graph is constructed from the dataset, afterwards
the graph is divided into k clusters using a min-cut partitioning algorithm, and finally an
agglomerative hierarchical clustering is obtained using a single-link similarity measure
from the previous phrase.
Barton et al. have proposed the MoCham algorithm, which is very similar to the Chame-
leon method, a graph is constructed using a kd-tree in the first phrase, then uses a multi-
objective priority queue to determine the similarity measure between the sub-clusters
during the merging phrase (Barton et al. 2016). Cao et al. have introduced an optimized
Chameleon method based on local features and grid structure, firstly an adaptive neighbor
graph is established in terms of the distance between data points, then the graph is parti-
tioned based on local point density to produce many sub-clusters, in the merging phase, the
sub-clusters are repeatedly merged according to the local point density (Cao et al. 2018).
Dong et al. have proposed an improved Chameleon method, which introduces the recursive


dichotomy, flood fill method, and the quotient 𝛾 of cluster density, and proposes a cutoff
method that automatically selects the best clustering result (Dong et al. 2018). Barton et al.
have proposed Chameleon 2 algorithm, flood fill is used to refine partition and the measure
of internal cluster quality is modified to improve merging (Barton et al. 2019). Chameleon
2 can also be regarded as a general clustering framework, which is suitable for a wide
range of complex clustering problems, and performs well on the data sets commonly used
in clustering literature (Barton et al. 2019). In overlapping hierarchical clustering (OHC),
the dendrogram is constructed by a graph where the edge is gradually added in terms of
an increasing distance threshold 𝛿, that indicates the size of the formed clusters gradually
increases from 0 (Jeantet et al. 2020). Graph density, which is the ratio of the number of
edges in subgraph constructed by 𝛿 and the number of edges in connected graph, is used
as the merging condition. After each increase of 𝛿, the constructed subgraph is removed to
add to the correspondence level of the dendrogram. With the increasing of 𝛿, the edges in
the graph are gradually added until all vertices are linked. In the dendrogram, some nodes
have one or more parent node, so this structure is called a quasi-dendrogram.
In the paper of Toujani and Akaichi (2018), a combination method of the bottom-up
and top-down hierarchical clustering has been discussed in social medias networks. The
social medias networks are preprocessed to generate a weighted and directed graph, and
each node in the graph is a community. The graph is randomly partitioned, or a special partitioning technique is used, to obtain the initial partitions of the community. The degree of the
influential users is defined as the average of the covariance between edges in the graph and
Jaccard coefficient between nodes. The opinion leader-based modularity function (QOPL ) is
defined based on the influential users’ degree. The reproduction probability of each cluster
(Pri) is defined as the ratio of its QOPL and the QOPL of all the other clusters in the graph.
In the genetic hierarchical bottom-up algorithm (GBUA), the two clusters with the highest Pri are selected to merge. In the genetic top-down algorithm (GTDA), the cluster with the lowest Pri is selected. The genetic hybrid hierarchical algorithm (GHHA) applies GBUA and GTDA alternately until the resulting community structure is the same whether the GHHA starts from GBUA or GTDA.
The generic scheme of graph-based clustering algorithms is thus to take advantage of relationships found by data mining techniques to construct a graph, and then to obtain the clusters or dendrogram by partitioning the graph with a graph partitioning method. The dynamic similarity model based methods produce clusters of better quality than the static similarity model based methods.

5.2 Hypergraph based hierarchical clustering

However, the Chameleon method and its improvements stated above are all based on
simple graphs which represent the pairwise similarity between data points. This may be
because simple graphs can only represent pairwise relationships in data, but cannot repre-
sent high order relationships in high dimensional space. Therefore, hypergraph can be used
to avoid the shortage of simple graph for analyzing the high-dimensional data.


Hypergraph is the generalization of the simple graph. A hypergraph contains many


hyperedges which can represent the similarity of more than two data points, and of which
the weights represent the strength of the similarity. It is widely used as an ideal tool to
cluster the data in high dimensional space due to the attribute of hyperedge (Agarwal et al.
2005). Hypergraph-based clustering results can be obtained by using the hypergraph par-
tition method (Govindu 2005). Many hypergraph partition methods have been proposed.
hMETIS is a fast, high-quality hypergraph partitioning algorithm, which directly parti-
tions the nodes of hypergraph to clusters so that the size of the hyperedge cut is minimum
(Karypis et al. 1999a). Zhou et al. (2006) have proposed normalized hypergraph cut that
is a generalized version of the spectral clustering method. Liu et al. (2015) have proposed
dense subgraph partition of positive hypergraph (DSP), that can efficiently, exactly, volun-
tarily divide a positive hypergraph into dense subgraphs. Wang et al. (2017) have proposed
a merging dense subgraphs (MDSG) method in which an improved hypergraph partition
method is used to partition the hypergraph.
Hypergraph has already been used in the clustering process. Veldt et al. (2020) have
proposed a framework for local clustering in hypergraphs based on minimum cuts and
maximum flows. Kumar et al. (2018) have introduced a hypergraph null model that is used
to define a modularity function, and a refinement over clustering is proposed by iteratively
reweighting cut hyperedges. These two methods only applied the hypergraph to partitional
clustering methods. Cherng and Lo (2001) have applied hypergraph to the hierarchical
clustering, in which a hypergraph is constructed from the Delaunay triangulation graph of
the dataset, and then the hypergraph is partitioned by MMP algorithm to produce a large
number of the relatively small sub-clusters which are merged recursively until the genuine
clusters are found according to a similarity measure similar to the Chameleon method.
Cheng et al. (2012) have introduced hypergraph to hierarchical clustering and have pro-
posed a hierarchical clustering method based on Hyperedge Similarity (HCHS) to inves-
tigate the overlapping and hierarchy of complex community structure. In HCHS, a com-
munity is defined as a set of hyperedges, hyperedge similarity is defined as the cosine
similarity between hyperedges. Initially, each hyperedge is assigned to its own community; then the two communities whose hyperedges have the highest hyperedge similarity are chosen to merge, until only one community is formed. Finally, community density is
utilized to cut the hierarchical tree to obtain the desired clusters.
However, this method only uses hypergraphs with hyperedges that contain exactly three data points, which cannot capture the higher order relationships beyond the third order
among the data points. Moreover, it is difficult to apply the method to high-dimensional
data because of the difficulties in computing the Delaunay triangulation graph from the
high-dimensional data. Xi and Lu (2020) have proposed a multi-stage hypergraph-based
hierarchical clustering algorithm using the Chameleon similarity measure for analyzing
data in high dimensional space. The proposed method first constructs a hypergraph from
the shared-nearest-neighbor (SNN) graph of the dataset and then employs a hypergraph
partitioning method hMETIS to obtain a series of sub-hypergraphs, finally those sub-
hypergraphs are merged sequentially to get the final hierarchical clustering results.
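A minimal Python sketch of the shared-nearest-neighbor (SNN) graph construction that this method starts from is given below (our illustration; the hyperedge construction and the hMETIS partitioning step are not shown):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_graph(X, k=10):
    """Shared-nearest-neighbor graph: the weight of edge (i, j) is the
    number of k-nearest neighbours that points i and j have in common."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbours = [set(row[1:]) for row in idx]     # drop the point itself
    n = len(X)
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = len(neighbours[i] & neighbours[j])
    return W
```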
To sum up, the methods all utilize the hypergraph feature that represents high order
relationships in high dimensional space. Hypergraph is also an ideal tool for agglomerative
hierarchical clustering in high dimensional space.


6 Density‑based hierarchical clustering

The methods based on density estimation to construct hierarchy have aroused the inter-
est of researchers. This type of method not only retains the advantages of density-based
clustering, but also reveals the hierarchical structure among clusters, so more information can be obtained from the clustering result.
RNG-HDBSCAN* has improved the strategy of computing multiple density-based
clustering hierarchies by replacing the Mutual Reachability Graph with a Relative Neigh-
borhood Graph that incrementally leverage the solutions of HDBSCAN* with respect to a
series of values of mpts (Neto et al. 2021). RNG-HDBSCAN* speeds up about 60 times
compared to HDBSCAN* by replacing the complete graph in HDBSCAN* for every mpts
with an equivalent smaller graph for a larger mpts, and has the better scalability for the
larger dataset. In some case, however, it cannot detect all of the cluster structures in the
data in a single hierarchy using a single value of mpts (Neto et al. 2021).
Hierarchical Quick Shift (HQuick-Shift) algorithm (Altinigneli et al. 2020) constructs a
Mode Attraction Graph (MAG) to overcome drawbacks of Quick Shift (QShift), in which
the hierarchy information of flat clusters cannot be reflected and all flat clusters obey the
assumption that groups correspond to modes of an invariant density threshold. An RNN constructs clusters with a hierarchy from the density estimation function, and the MAG is used to find the optimal neighborhood parameter 𝜏. Solving the constrained optimization problem generates the globally optimal solution for the specific parameter 𝜏 in an unsupervised manner, and multi-level mode sets are automatically extracted. HQuick-Shift can cope with clusters with
arbitrary shape and size, and varied densities, and can recognize noise, as well as can select
the optimal parameter to prevent clustering results from under- or over-fragmentation. It
can also learn the quasitemporal-relationships of the object to a general extractor with
RNN back-end during processing.
Zhu et al. (2022) have introduced density-connected hierarchical density-peak cluster-
ing (DC-HDP), which can obtain clusters with varied densities by viewing them through
the hierarchical structure of clusters rather than a flat clustering, based on two types of
clusters, i.e., 𝜂-linked clusters and 𝜂-density-connected clusters, where 𝜂 denotes the
nearest neighbour with higher density. It gives formal cluster definitions that did not
exist in previous work and remedies the shortcomings of DP and DBSCAN: DBSCAN is not
suitable for clustering datasets with varied densities, while DP fails to find 𝜂-linked
clusters with varied densities when not all points are ranked, and fails on non-𝜂-linked
clusters. Two clusters are merged from the bottom up only if their modes are density-connected
at the current level. DC-HDP has three parameters to be set: k ∈ {2, 3, ..., 50}, 𝜖 ∈ {0.1%, 0.2%, ..., 99.9%}, and 𝜏 = 1.
DC-HDP makes use of the respective advantages of DBSCAN and DP: it
enhances the ability to recognize clusters of arbitrary shape and different densities, and
provides more information on the hierarchical structure of the clusters, so it can be applied
widely in various applications from a new perspective. Moreover, DC-HDP consumes only the same
computational time as DP (Zhu et al. 2022).
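To make the 𝜂 notion concrete, the sketch below computes, for each point, a k-nearest-neighbor density estimate and its 𝜂 neighbor (the nearest point with higher density), which is the basic linking step that DC-HDP builds its hierarchy on; the density estimator and the toy data are illustrative assumptions, not the exact formulation of Zhu et al. (2022).

# Sketch: kNN-based density and the eta neighbor (nearest point of higher density).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def eta_links(X, k=10):
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # simple kNN density estimate (assumption)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    eta = np.full(n, -1)
    for i in range(n):
        higher = np.where(density > density[i])[0]        # candidates with higher density
        if higher.size > 0:
            eta[i] = higher[np.argmin(D[i, higher])]      # nearest higher-density point
    return density, eta                                   # eta[i] == -1 marks a density peak

X = np.random.rand(300, 2)
density, eta = eta_links(X, k=10)
print("number of density peaks (potential cluster modes):", int((eta == -1).sum()))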
As can be seen from the above, density-based hierarchical clustering methods still
use density as the similarity measure, can identify clusters of arbitrary shape and size, and
construct hierarchical relationships among the clusters, thereby providing richer information
than traditional density-based clustering.


7 Combination of the hierarchical clustering with other techniques

Some hierarchical clustering methods combine bottom-up and top-down strategies, or
combine clustering with other advanced techniques, to better produce hierarchical clustering
results.
Hierarchical self-organizing maps (SOM) (Kohonen 2001) use an artificial neural network
architecture to cluster data. A SOM has an input layer and a hidden computational layer
in which each node represents a desired cluster, and a topology is defined between the clusters.
Once a centroid is determined, the data points closest to it are assigned to the
cluster containing that centroid; the centroid is then updated, and the centroids sur-
rounding it are updated accordingly. SOM can supply data visualization and an elegant
topology diagram.
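The sketch below is a minimal flat SOM training loop written from scratch to illustrate the centroid-update behaviour described above (the best-matching node and its topological neighbours are pulled toward each input); the grid size, learning rate, and neighbourhood function are arbitrary assumptions rather than the hierarchical SOM of Kohonen (2001).

# Minimal self-organizing map sketch (flat SOM, illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                 # hypothetical 3-dimensional data
grid_h, grid_w = 5, 5                     # 5x5 map of nodes (clusters)
W = rng.random((grid_h, grid_w, 3))       # node weight vectors (centroids)
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

lr, sigma, epochs = 0.5, 1.5, 10
for epoch in range(epochs):
    for x in X:
        # best matching unit: node whose centroid is closest to x
        d = np.linalg.norm(W - x, axis=2)
        bmu = np.unravel_index(np.argmin(d), d.shape)
        # Gaussian neighbourhood on the grid: nearby nodes are also updated
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[:, :, None]
        W += lr * h * (x - W)             # pull centroids toward the input
    lr *= 0.9                             # decay learning rate
    sigma *= 0.9                          # shrink the neighbourhood

# assign each point to its closest node (cluster)
labels = [np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)), (grid_h, grid_w)) for x in X]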
Geng and Ali (2005) have considered the influence of all clusters when selecting the
clusters to be merged in their stochastic message passing clustering (SMPC)
method, which is based on an ensemble probability distribution. SMPC can undo sub-clusters
whose objects do not follow a good probability distribution, thereby improving the clustering
performance.
Zeng et al. (2009) have proposed a feature selection method based on the long-tailed distri-
bution of words in documents (LTD-Selection) together with a hierarchical clustering algorithm.
The hierarchical clustering algorithm uses LTD-Selection to select the feature words in the docu-
ments and computes the frequency of each word; the first k feature words, sorted in
descending order of frequency, are then clustered by k-Means to obtain the topic set, and the
parent-child relations between clusters at neighboring levels are set up. The next k feature
words are then selected to extend the tree-like structure, until all feature words are contained in
one tree.
Muhr et al. (2010) have developed an algorithm that uses a growing k-Means method to
cluster the documents in a list into clusters that meet certain criteria; the documents are
then treated as the children of these clusters. According to the specified constraints,
the clusters are split using growing k-Means or merged according to the highest similarity
between clusters. By performing the above steps recursively, the hierarchy is generated.

8 The methods of improving the efficiency of hierarchical clustering

Compared to partitioning clustering, hierarchical clustering is more time consuming and
space consuming, which limits its application in some specific cases, especially for large-
scale data sets. So, many research works have been done to improve the efficiency of the
hierarchical clustering methods. This section introduces the methods of improving the effi-
ciency of hierarchical clustering, some of them improve the time complexity, while some
of them reduce the memory requirements during clustering, and the others improve both
efficiencies.
Day and Edelsbrunner (1984) have obtained a reasonably general combinatorial
Sequential, Agglomerative, Hierarchical, Nonoverlapping (SAHN) clustering algorithm
that requires O(n²) time and space in the worst case using the stored matrix
approach, where the SAHN algorithm is based on the efficient construction of nearest-neighbor chains.
The combinatorial SAHN method is insensitive to distortions of the data space,
and surpasses the group average (UPGMA), weighted average (WPGMA) and flexible
combinatorial SAHN clustering algorithms. Moreover, they have designed an efficient
agglomerative hierarchical clustering algorithm, called centroid SAHN clustering algo-
rithm, which is based on a geometric model in which clusters are represented by points in
k-dimensional real space and points being agglomerated are replaced by a single (centroid)
point. The centroid SAHN clustering algorithm requires O(n²) time and O(n)
space in the worst case, and is suitable for problems using the stored data
approach and any of a large family of dissimilarity measures. The efficiency of agglomera-
tive hierarchical clustering can also be improved by building the hierarchy on a group of cen-
troids instead of on the raw data points (Bouguettaya et al. 2015). Because a
centroid is used to represent a group of adjacent points, this approach is suitable for clustering high-
dimensional data without dimensionality reduction and is insensitive to the distribution of
the dataset and to the distance measure. However, it cannot effectively control the resulting clusters
because it lacks prior knowledge of the number of clusters.
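In practice, the nearest-neighbor-chain family of SAHN algorithms is available off the shelf; for instance, SciPy's linkage routine produces the full dendrogram for centroid, Ward, and other linkages, as in the hedged sketch below (the data and the linkage choice are only illustrative).

# Sketch: efficient SAHN clustering via SciPy's linkage (centroid linkage shown as an example).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(500, 8)                        # hypothetical data in 8-dimensional space
Z = linkage(X, method="centroid")                 # (n-1) x 4 merge table encoding the dendrogram
labels = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 flat clusters
print(Z.shape, np.unique(labels))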
Zhu et al. (2022) have summarized some common facts about traditional hierarchi-
cal clustering methods as follows. The time complexities of agglomerative and divisive
hierarchical clustering are O(n²) and O(2^(n−1)) respectively, where n is the number of objects
to be clustered. In time series clustering, the most widely used similarity measures are the Pear-
son correlation, the Spearman distance, Dynamic Time Warping (DTW), and the Euclidean
distance.
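Of these measures, DTW is the only one that needs more than a one-line formula; a standard dynamic-programming implementation is sketched below for completeness (the squared-error local cost and the absence of a warping window are simplifying assumptions).

# Classic dynamic-programming DTW distance between two 1-D series (no warping window).
import numpy as np

def dtw(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2            # local cost (squared difference)
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return np.sqrt(D[n, m])

s1 = np.sin(np.linspace(0, 2 * np.pi, 60))
s2 = np.sin(np.linspace(0, 2 * np.pi, 80) + 0.3)         # similar series, different length/phase
print("DTW distance: %.3f" % dtw(s1, s2))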
In the AMOEBA method (Estivill-Castro and Lee 2000), the Delaunay diagram is used
as the similarity measure to discover multi-level clusters in exploratory data analysis, and the
time complexity is reduced to O(n log n). The method does not need any prior knowl-
edge of the dataset or user-specified parameters, and has low sensitivity to noise, outliers,
and the type of data distribution.
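The Delaunay edges on which such proximity-based methods operate can be obtained directly, for example with SciPy, as in the sketch below (2-D points only; the clustering logic of AMOEBA itself is not reproduced).

# Sketch: extract the Delaunay edge set (with lengths) that proximity-based methods such as AMOEBA use.
import numpy as np
from scipy.spatial import Delaunay

pts = np.random.rand(100, 2)               # hypothetical 2-D spatial data
tri = Delaunay(pts)

edges = set()
for simplex in tri.simplices:              # each simplex is a triangle (3 vertex indices)
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edges.add((a, b))

lengths = {e: float(np.linalg.norm(pts[e[0]] - pts[e[1]])) for e in edges}
print(len(edges), "Delaunay edges; mean length %.3f" % np.mean(list(lengths.values())))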
An approximate k-nearest neighbor graph is used in an agglomerative clustering method
to improve the time complexity (Fränti et al. 2006). A weighted directed graph is first con-
structed using a K-d tree, divide-and-conquer, and projection-based search. In the graph,
each node is a single cluster and the edges represent the pointers from nodes to their
k-nearest neighbor nodes. The two nodes incident to the edge with the minimal weight in
the graph are selected and merged to form a new node, and the child nodes of the two
merged nodes are all linked to the new node. Based on the above, a double-linked algorithm (DLA)
is also proposed by adding back pointers that point to the clusters for which a given cluster is
among the k nearest neighbors. DLA reduces the time consumed in searching for the neighbors,
and thus improves the efficiency of the agglomerative hierarchical clustering. However, the
graph creation is still a bottleneck and a challenge.
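The general idea of restricting merges to a k-nearest-neighbor graph is also available in standard libraries; for example, scikit-learn's agglomerative clustering accepts a k-NN connectivity matrix, as in the sketch below (this is a simplification of, not the same as, the graph-based method and the DLA described above).

# Sketch: agglomerative clustering constrained by a k-nearest-neighbor connectivity graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(1000, 10)                                      # hypothetical data
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)    # sparse k-NN graph
model = AgglomerativeClustering(n_clusters=8, linkage="ward", connectivity=conn)
labels = model.fit_predict(X)
print("cluster sizes:", np.bincount(labels))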
Jeon and Yoon (2015) have proposed a parallel agglomerative hierarchical clustering method
for multi-threaded shared-memory machines, based on the nearest-neighbor (NN) chain
algorithm, to reduce the cost of executing hierarchical agglomerative clustering on
large-scale data. The method was later improved by using a travel-time based similarity
measure for building the edge-weighted tree (Lu et al. 2016). The method grows multiple
chains simultaneously through thread partitioning: for N threads, some threads are assigned
to grow chains while the remaining threads are assigned to update the distance matrix, and
the role of each thread remains fixed. In experiments, this method is much faster than the
compared alternatives in terms of runtime.
Pang et al. (2019) have proposed a MapReduce-based agglomerative hierarchical sub-
space-clustering algorithm, in which the data are partitioned by locality sensitive hashing
(LSH) into a set of non-overlapping sub-datasets distributed among Hadoop nodes. Clustering efficiency
is improved by the LSH-based data partitioning, which allows local clusters to be discovered
in each data node in Hadoop. The algorithm is a two-stage parallel clustering method that
integrates a subspace clustering algorithm with a conventional agglomerative hierarchical
clustering algorithm.
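The key LSH idea (hashing nearby points to the same bucket so that each bucket can be clustered locally) can be sketched with random hyperplane projections as below; this single-machine sketch only mimics the partitioning step and is not the MapReduce implementation of Pang et al. (2019).

# Sketch: random-hyperplane LSH bucketing, the partitioning idea behind LSH-based parallel clustering.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.random((5000, 16))                     # hypothetical data
Xc = X - X.mean(axis=0)                        # center so hyperplanes through the origin separate the data
n_bits = 6                                     # 2**6 = 64 possible buckets
planes = rng.standard_normal((n_bits, X.shape[1]))

# hash each point to a bit string according to the side of each random hyperplane
codes = (Xc @ planes.T > 0).astype(int)
keys = ["".join(map(str, row)) for row in codes]

buckets = defaultdict(list)
for i, key in enumerate(keys):
    buckets[key].append(i)                     # points in one bucket are clustered locally (e.g. per Hadoop node)

sizes = sorted((len(v) for v in buckets.values()), reverse=True)
print(len(buckets), "buckets; largest:", sizes[:5])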
Cheung and Zhang (2019) have proposed the growing multilayer topology training algorithm,
called GMTT, which trains a collection of seed points that are linked in a growing and
hierarchical manner and can represent the data distribution. The topology trained by the
GMTT can accelerate the similarity computation between data points and guide the merging.
A DL linkage, a type of density-based metric based on the topology trained by the GMTT,
is then proposed and used as the similarity measure during merging. Many sub-MSTs are
formed according to the topology and the DL linkage, and the MST is then formed by adding
links between the subsets based on the sub-MSTs. Finally, the MST is transformed into a
dendrogram according to the corresponding topology and MST. An incremental version of
GMTT, called IGMTT, has also been proposed to process streaming data. Both GMTT and
IGMTT improve the time complexity of the agglomerative hierarchical clustering and do
not lose the accuracy of the algorithm.
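The final MST-to-dendrogram step mentioned above can be illustrated with standard tools: once an MST over the data (or over seed points) is available, processing its edges in increasing weight order reproduces a single-linkage dendrogram, as in the hedged sketch below (which uses plain Euclidean distances rather than the DL linkage of GMTT).

# Sketch: build an MST and note its equivalence to the single-linkage dendrogram (Euclidean distances assumed).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.cluster.hierarchy import linkage

X = np.random.rand(200, 5)                                  # hypothetical data
D = squareform(pdist(X))                                    # full pairwise distance matrix
mst = minimum_spanning_tree(D)                              # sparse matrix with n-1 MST edges

mst_weights = np.sort(mst.data)                             # MST edge lengths in increasing order
Z = linkage(X, method="single")                             # single-linkage dendrogram
merge_heights = Z[:, 2]                                     # heights at which merges happen

# The sorted MST edge weights coincide with the single-linkage merge heights.
print(np.allclose(mst_weights, merge_heights))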

9 Conclusion

In this paper, a relatively comprehensive survey on hierarchical clustering algorithms is
conducted, especially the most recently developed methods. For the divisive clustering, the
key issue is how to select a cluster for the next splitting procedure and how to divide the
selected sub-cluster. The dissimilarity measure between any two objects in a cluster is usu-
ally used to select the cluster to be split. For agglomerative hierarchical clustering, the key
issue is the similarity measure that is used to select the two most similar clusters for the
next merge. The similarity measure based on distance is usually used for numerical data,
and the similarity measure based on attribute is applied to the categorical data.
For the divisive hierarchical clustering, how to compute the dissimilarity and how to
exactly determine the node level of children nodes produced after splitting need to be stud-
ied further. For the agglomerative hierarchical clustering, how to compute the similarity
measure between clusters is still one of the key issues which need to be studied further,
especially in the high-dimensional space.
The methods which combine traditional hierarchical clustering with CNN or the hier-
archical clustering based on deep learning techniques have been proposed recently (Zeng
et al. 2020; Zhao et al. 2020; Lin et al. 2019). This is an interesting idea, and how to use
deep learning techniques to improve hierarchical clustering needs further study.
The available source code links for some representative algorithms are listed in Table S1
in the supplement.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10462-022-10366-3.

Funding This work is supported by the National Key R&D Program of China (Grant No. 2017YFE0111900,
2018YFB1003205), the Higher Education Innovation Fund of Gansu Province (Grants No. 2020B-214), and
the Natural Science Foundation of Gansu Province (Grants No. 20JR10RG304).

Declarations
Conflict of interest The authors declare that they have no competing interests.


References
Abdi H, Valentin D (2007) Multiple correspondence analysis. Encycl Meas Stat 2(4):651–657
Ackerman M, Dasgupta S (2014) Incremental clustering: the case for extra clusters. In: Advances in neu-
ral information processing systems 27: annual conference on neural information processing systems
2014, December 8–13 2014, Montreal, QC. pp 307–315
Agarwal S, Lim J, Zelnik-Manor L et al (2005) Beyond pairwise clustering. In: 2005 IEEE computer society
conference on computer vision and pattern recognition (CVPR 2005), 20–26 June 2005, San Diego,
CA. IEEE Computer Society, pp 838–845. https://​doi.​org/​10.​1109/​CVPR.​2005.​89
Agha GA (1990) ACTORS—a model of concurrent computation in distributed systems. MIT Press
series in artificial intelligence. MIT Press, Cambridge
Alalyan F, Zamzami N, Bouguila N (2019) Model-based hierarchical clustering for categorical data. In:
28th IEEE international symposium on industrial electronics, ISIE 2019, Vancouver, BC, June
12–14, 2019. IEEE, pp 1424–1429. https://​doi.​org/​10.​1109/​ISIE.​2019.​87813​07
Altinigneli MC, Miklautz L, Böhm C et al (2020) Hierarchical quick shift guided recurrent clustering.
In: 2020 IEEE 36th international conference on data engineering (ICDE). pp 1842–1845. https://​
doi.​org/​10.​1109/​ICDE4​8307.​2020.​00184
Alzate C, Suykens JA (2012) Hierarchical kernel spectral clustering. Neural Netw 35(C):21–30. https://​
doi.​org/​10.​1016/j.​neunet.​2012.​06.​007
Anderberg MR (1973) Chapter 6–hierarchical clustering methods, probability and mathematical statis-
tics: a series of monographs and textbooks, vol 19. Academic Press, Cambridge. https://​doi.​org/​
10.​1016/​B978-0-​12-​057650-​0.​50012-0
Averbuch-Elor H, Bar N, Cohen-Or D (2020) Border-peeling clustering. IEEE Trans Pattern Anal Mach
Intell 42(7):1791–1797. https://​doi.​org/​10.​1109/​TPAMI.​2019.​29249​53
Barton T, Bruna T, Kordík P (2016) Mocham: robust hierarchical clustering based on multi-objective
optimization. In: IEEE international conference on data mining workshops, ICDM workshops
2016, December 12–15, 2016, Barcelona. IEEE Computer Society, pp 831–838. https://​doi.​org/​10.​
1109/​ICDMW.​2016.​0123
Barton T, Bruna T, Kordík P (2019) Chameleon 2: an improved graph-based clustering algorithm. ACM
Trans Knowl Discov Data 13(1):10. https://​doi.​org/​10.​1145/​32998​76
Batagelj V (1988) Generalized ward and related clustering problems. Classification and related methods
of data analysis. Jun: 67–74
Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas CK, Teboulle M
(eds) Grouping multidimensional data–recent advances in clustering. Springer, Berlin, pp 25–71.
https://​doi.​org/​10.​1007/3-​540-​28349-8_2
Boley D (1998) Principal direction divisive partitioning. Data Min Knowl Disc 2(4):325–344
Bouguettaya A, Yu Q, Liu X et al (2015) Efficient agglomerative hierarchical clustering. Expert Syst
Appl 42(5):2785–2797
Brans JP, Vincke P (1985) Note—a preference ranking organisation method: (the PROMETHEE method
for multiple criteria decision-making). Manag Sci 31(6):647–656
Brans JP, Vincke P, Mareschal B (1986) How to select and how to rank projects: the PROMETHEE
method. Eur J Oper Res 24(2):228–238
Cai D, Chen X (2015) Large scale spectral clustering via landmark-based sparse representation. IEEE
Trans Cybern 45(8):1669–1680. https://​doi.​org/​10.​1109/​TCYB.​2014.​23585​64
Cai Q, Liu J (2020) Hierarchical clustering of bipartite networks based on multiobjective optimization.
IEEE Trans Netw Sci Eng 7(1):421–434. https://​doi.​org/​10.​1109/​TNSE.​2018.​28308​22
Cai Y, Sun Y (2011) ESPRIT-tree: hierarchical clustering analysis of millions of 16s rRNA pyrose-
quences in quasilinear computational time. Nucleic Acids Res 39(14):e95–e95. https://​doi.​org/​10.​
1093/​nar/​gkr349
Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density
estimates. In: Pei J, Tseng VS, Cao L et al (eds) Advances in knowledge discovery and data min-
ing. Springer, Berlin, Heidelberg, pp 160–172
Cao X, Su T, Wang P et al (2018) An optimized chameleon algorithm based on local features. In: Pro-
ceedings of the 10th international conference on machine learning and computing, ICMLC 2018,
Macau, February 26–28, 2018. ACM, pp 184–192
Carpineto C, Romano G (1996) A lattice conceptual clustering system and its application to browsing
retrieval. Mach Learn 24(2):95–122. https://​doi.​org/​10.​1007/​BF000​58654
Chakraborty S, Paul D, Das S (2020) Hierarchical clustering with optimal transport. Stat Probab Lett
163(108):781. https://​doi.​org/​10.​1016/j.​spl.​2020.​108781


Chen D, Cui DW, Wang CX et al (2006) A rough set-based hierarchical clustering algorithm for cat-
egorical data. Int J Inf Technol 12(3):149–159
Chen W, Song Y, Bai H et al (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pat-
tern Anal Mach Intell 33(3):568–586. https://​doi.​org/​10.​1109/​TPAMI.​2010.​88
Cheng Q, Liu Z, Huang J et al (2012) Hierarchical clustering based on hyper-edge similarity for com-
munity detection. In: 2012 IEEE/WIC/ACM international conferences on web intelligence, WI
2012, Macau, December 4–7, 2012. IEEE Computer Society, pp 238–242. https://​doi.​org/​10.​1109/​
WI-​IAT.​2012.9
Cheng D, Zhu Q, Huang J et al (2019a) A hierarchical clustering algorithm based on noise removal. Int J
Mach Learn Cybern 10(7):1591–1602. https://​doi.​org/​10.​1007/​s13042-​018-​0836-3
Cheng D, Zhu Q, Huang J et al (2019b) A local cores-based hierarchical clustering algorithm for data
sets with complex structures. Neural Comput Appl 31(11):8051–8068. https://​doi.​org/​10.​1007/​
s00521-​018-​3641-8
Cherng J, Lo M (2001) A hypergraph based clustering algorithm for spatial data sets. In: Proceedings
of the 2001 IEEE international conference on data mining, 29 November–2 December 2001, San
Jose, CA. IEEE Computer Society, pp 83–90. https://​doi.​org/​10.​1109/​ICDM.​2001.​989504
Cheung Y, Zhang Y (2019) Fast and accurate hierarchical clustering based on growing multilayer
topology training. IEEE Trans Neural Netw Learn Syst 30(3):876–890. https://​doi.​org/​10.​1109/​
TNNLS.​2018.​28534​07
Cho M, Lee J, Lee KM (2009) Feature correspondence and deformable object matching via agglomera-
tive correspondence clustering. In: IEEE 12th international conference on computer vision, ICCV
2009, Kyoto, September 27–October 4, 2009. IEEE Computer Society, pp 1280–1287. https://​doi.​
org/​10.​1109/​ICCV.​2009.​54593​22
Courty N, Flamary R, Tuia D et al (2017) Optimal transport for domain adaptation. IEEE Trans Pattern
Anal Mach Intell 39(9):1853–1865. https://​doi.​org/​10.​1109/​TPAMI.​2016.​26159​21
Day WH, Edelsbrunner H (1984) Efficient algorithms for agglomerative hierarchical clustering methods.
J Classif 1(1):7–24
Dixit V (2022) GCFI++: embedding and frequent itemset based incremental hierarchical clustering with
labels and outliers. In: CODS-COMAD 2022: 5th joint international conference on data science &
management of data (9th ACM IKDD CODS and 27th COMAD), Bangalore, January 8–10, 2022.
ACM, pp 135–143. https://​doi.​org/​10.​1145/​34937​00.​34937​27
Dong Y, Wang Y, Jiang K (2018) Improvement of partitioning and merging phase in chameleon clus-
tering algorithm. In: 2018 3rd international conference on computer and communication systems
(ICCCS). IEEE, pp 29–32
Duran BS, Odell PL (2013) Cluster analysis: a survey, vol 100. Springer Science & Business Media,
Berlin
D’Urso P, Vitale V (2020) A robust hierarchical clustering for georeferenced data. Spat Stat 35(100):407.
https://​doi.​org/​10.​1016/j.​spasta.​2020.​100407
Endo Y, Haruyama H, Okubo T (2004) On some hierarchical clustering algorithms using kernel func-
tions. In: 2004 IEEE international conference on fuzzy systems (IEEE Cat. No.04CH37542), vol
3. pp 1513–1518. https://​doi.​org/​10.​1109/​FUZZY.​2004.​13753​99
Estivill-Castro V, Lee I (2000) AMOEBA: hierarchical clustering based on spatial proximity using
delaunay diagram. In: Proceedings of the 9th international symposium on spatial data handling.
Beijing, pp 1–16
Everitt B, Landau S, Leese M (2001) Cluster analysis. Arnold, London
Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–
172. https://​doi.​org/​10.​1007/​BF001​14265
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifica-
tions. Biometrics 21:768–769
Fouedjio F (2016) A hierarchical clustering method for multivariate geostatistical data. Spat Stat
18:333–351. https://​doi.​org/​10.​1016/j.​spasta.​2016.​07.​003
Fränti P, Virmajoki O, Hautamäki V (2006) Fast agglomerative clustering using a k-nearest neighbor
graph. IEEE Trans Pattern Anal Mach Intell 28(11):1875–1881. https://​doi.​org/​10.​1109/​TPAMI.​
2006.​227
Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer
vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465. https://​doi.​org/​10.​1109/​34.​765656
Galdino SML, Maciel PRM (2019) Hierarchical cluster analysis of interval-valued data using width of
range Euclidean distance. In: IEEE Latin American conference on computational intelligence, LA-
CCI 2019, Guayaquil, Ecuador, November 11–15, 2019. IEEE, pp 1–6. https://​doi.​org/​10.​1109/​
LA-​CCI47​412.​2019.​90367​54


Geng H, Ali HH (2005) A new clustering strategy with stochastic merging and removing based on ker-
nel functions. In: Fourth international IEEE computer society computational systems bioinformat-
ics conference workshops & poster abstracts, CSB 2005 workshops, Stanford, CA, August 8–11,
2005. IEEE Computer Society, pp 41–42. https://​doi.​org/​10.​1109/​CSBW.​2005.​10
Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad
Sci USA 99(12):7821–7826
Golub GH, Loan CFV (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
Govindu VM (2005) A tensor decomposition for geometric grouping and segmentation. In: 2005 IEEE com-
puter society conference on computer vision and pattern recognition (CVPR 2005), 20–26 June 2005,
San Diego, CA. IEEE Computer Society, pp 1150–1157. https://​doi.​org/​10.​1109/​CVPR.​2005.​50
Gracia C, Binefa X (2011) On hierarchical clustering for speech phonetic segmentation. In: Proceedings
of the 19th European signal processing conference, EUSIPCO 2011, Barcelona, August 29–Sep-
tember 2, 2011. IEEE, pp 2128–2132
Guan X, Du L (1998) Domain identification by clustering sequence alignments. Bioinformatics
14(9):783–788. https://​doi.​org/​10.​1093/​bioin​forma​tics/​14.9.​783
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas
LM, Tiwary A (eds) SIGMOD 1998, proceedings ACM SIGMOD international conference on
management of data, June 2–4, 1998, Seattle, Washington. ACM Press, pp 73–84. https://​doi.​org/​
10.​1145/​276304.​276312
Guha S, Rastogi R, Shim K (2000) ROCK: a robust clustering algorithm for categorical attributes. Inf
Syst 25(5):345–366. https://​doi.​org/​10.​1016/​S0306-​4379(00)​00022-3
Gullo F, Ponti G, Tagarelli A et al (2008) A hierarchical algorithm for clustering uncertain data via an
information-theoretic approach. In: Proceedings of the 8th IEEE international conference on data
mining (ICDM 2008), December 15–19, 2008, Pisa. IEEE Computer Society, pp 821–826. https://​
doi.​org/​10.​1109/​ICDM.​2008.​115
Guo JF, Zhao YY, Li J (2007) A multi-relational hierarchical clustering algorithm based on shared near-
est neighbor similarity. In: 2007 international conference on machine learning and cybernetics.
IEEE, pp 3951–3955. https://​doi.​org/​10.​1109/​ICMLC.​2007.​43708​36
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst
17(2–3):107–145. https://​doi.​org/​10.​1023/A:​10128​01612​483
Han J, Kamber M, Pei J (2011) Data mining concepts and techniques: third edition. Morgan Kaufmann
Ser Data Manag Syst 5(4):83–124
Han X, Zhu Y, Ting KM et al (2022) Streaming hierarchical clustering based on point-set kernel. In:
KDD ’22: the 28th ACM SIGKDD conference on knowledge discovery and data mining, Wash-
ington, DC, August 14–18, 2022. ACM, pp 525–533. https://​doi.​org/​10.​1145/​35346​78.​35393​23
He Z, Xu X, Deng S (2002) Squeezer: an efficient algorithm for clustering categorical data. J Comput
Sci Technol 17(5):611–624. https://​doi.​org/​10.​1007/​BF029​48829
He L, Ray N, Guan Y et al (2019) Fast large-scale spectral clustering via explicit feature mapping. IEEE
Trans Cybern 49(3):1058–1071. https://​doi.​org/​10.​1109/​TCYB.​2018.​27949​98
Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Raedt LD, Wrobel S (eds)
Machine learning, proceedings of the twenty-second international conference (ICML 2005), Bonn,
August 7–11, 2005, ACM international conference proceeding series, vol 119. ACM, pp 297–304.
https://​doi.​org/​10.​1145/​11023​51.​11023​89
Huang D, Wang C, Wu J et al (2020) Ultra-scalable spectral clustering and ensemble clustering. IEEE
Trans Knowl Data Eng 32(6):1212–1226. https://​doi.​org/​10.​1109/​TKDE.​2019.​29034​10
Hubert L (1973) Monotone invariant clustering procedures. Psychometrika 38(1):47–62
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Hulot A, Chiquet J, Jaffrézic F et al (2020) Fast tree aggregation for consensus hierarchical clustering.
BMC Bioinform 21(1):120. https://​doi.​org/​10.​1186/​s12859-​020-​3453-6
Ishizaka A, Lokman B, Tasiou M (2020) A stochastic multi-criteria divisive hierarchical clustering algo-
rithm. Omega 103:102370
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Hoboken
Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal
Mach Intell 22(1):4–37. https://​doi.​org/​10.​1109/​34.​824819
Jalalat-evakilkandi M, Mirzaei A (2010) A new hierarchical-clustering combination scheme based on
scatter matrices and nearest neighbor criterion. In: 2010 5th international symposium on telecom-
munications, IEEE, pp 904–908. https://​doi.​org/​10.​1109/​ISTEL.​2010.​57341​51
Jambu M, Tan SH, Stern D (1989) Exploration informatique et statistique des données. Dunod, Paris
Jeantet I, Miklós Z, Gross-Amblard D (2020) Overlapping hierarchical clustering (OHC). In: Advances
in intelligent data analysis XVIII—18th international symposium on intelligent data analysis, IDA 2020, Konstanz, April 27–29, 2020, proceedings, lecture notes in computer science, vol 12080.
Springer, pp 261–273. https://​doi.​org/​10.​1007/​978-3-​030-​44584-3_​21
Jeon Y, Yoon S (2015) Multi-threaded hierarchical clustering by parallel nearest-neighbor chaining.
IEEE Trans Parallel Distrib Syst 26(9):2534–2548. https://​doi.​org/​10.​1109/​TPDS.​2014.​23552​05
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32(3):241–254
Judd D, McKinley PK, Jain AK (1998) Large-scale parallel data clustering. IEEE Trans Pattern Anal
Mach Intell 20(8):871–876. https://​doi.​org/​10.​1109/​34.​709614
Karypis G, Aggarwal R, Kumar V et al (1999a) Multilevel hypergraph partitioning: applications in VLSI
domain. IEEE Trans Very Large Scale Integr Syst 7(1):69–79. https://​doi.​org/​10.​1109/​92.​748202
Karypis G, Han E, Kumar V (1999b) Chameleon: hierarchical clustering using dynamic modeling. Com-
puter 32(8):68–75. https://​doi.​org/​10.​1109/2.​781637
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hobo-
ken. https://​doi.​org/​10.​1002/​97804​70316​801
Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis, vol 344.
Wiley, Hoboken
Kaur PJ et al (2015) Cluster quality based performance evaluation of hierarchical clustering method. In:
2015 1st international conference on next generation computing technologies (NGCT). IEEE, pp
649–653
Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J
49(2):291–307. https://​doi.​org/​10.​1002/j.​1538-​7305.​1970.​tb017​70.x
Kobren A, Monath N, Krishnamurthy A et al (2017) A hierarchical algorithm for extreme clustering. In:
Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data
mining, Halifax, NS, August 13–17, 2017. ACM, pp 255–264. https://​doi.​org/​10.​1145/​30979​83.​
30980​79
Kohonen T (2001) Self-organizing maps, third edition. Springer series in information sciences. Springer,
Cham. https://​doi.​org/​10.​1007/​978-3-​642-​56927-2
Kotsiantis S, Pintelas P (2004) Recent advances in clustering: a brief survey. WSEAS Trans Inf Sci Appl
1(1):73–81
Kumar P, Tripathy B (2009) MMeR: an algorithm for clustering heterogeneous data using rough set theory.
Int J Rapid Manuf 1(2):189–207
Kumar T, Vaidyanathan S, Ananthapadmanabhan H et al (2018) Hypergraph clustering: a modularity maxi-
mization approach. CoRR. arXiv:​abs/​1812.​10869
Lance GN, Williams WT (1967) A general theory of classificatory sorting strategies: 1. Hierarchical sys-
tems. Comput J 9(4):373–380. https://​doi.​org/​10.​1093/​comjnl/​9.4.​373
Lerato L, Niesler T (2015) Clustering acoustic segments using multi-stage agglomerative hierarchical clus-
tering. PLoS ONE 10(10):e0141756
Lewis-Beck M, Bryman AE, Liao TF (2003) The Sage encyclopedia of social science research methods.
Sage Publications, Thousand Oaks
Li M, Deng S, Wang L et al (2014) Hierarchical clustering algorithm for categorical data using a probabilis-
tic rough set model. Knowl Based Syst 65:60–71. https://​doi.​org/​10.​1016/j.​knosys.​2014.​04.​008
Li S, Li W, Qiu J (2017) A novel divisive hierarchical clustering algorithm for geospatial analysis. ISPRS
Int J Geo-Inf 6(1):30. https://​doi.​org/​10.​3390/​ijgi6​010030
Li Y, Hong Z, Feng W et al (2019) A hierarchical clustering based feature word extraction method. In: 2019
IEEE 3rd advanced information management, communicates, electronic and automation control con-
ference (IMCEC). IEEE, pp 883–887
Lin Y, Dong X, Zheng L et al (2019) A bottom-up clustering approach to unsupervised person re-identifica-
tion. In: The thirty-third AAAI conference on artificial intelligence, AAAI 2019, the thirty-first inno-
vative applications of artificial intelligence conference, IAAI 2019, the ninth AAAI symposium on
educational advances in artificial intelligence, EAAI 2019, Honolulu, Hawaii, January 27–February 1,
2019. AAAI Press, pp 8738–8745. https://​doi.​org/​10.​1609/​aaai.​v33i01.​33018​738
Liu H, Latecki LJ, Yan S (2015) Dense subgraph partition of positive hypergraphs. IEEE Trans Pattern Anal
Mach Intell 37(3):541–554. https://​doi.​org/​10.​1109/​TPAMI.​2014.​23461​73
Liu J, Liu X, Yang Y et al (2021) Hierarchical multiple kernel clustering. In: Thirty-fifth AAAI conference
on artificial intelligence. AAAI, pp 2–9
Lu Y, Wan Y (2013) PHA: a fast potential-based hierarchical agglomerative clustering method. Pattern Rec-
ognit 46(5):1227–1239. https://​doi.​org/​10.​1016/j.​patcog.​2012.​11.​017
Lu Y, Hou X, Chen X (2016) A novel travel-time based similarity measure for hierarchical clustering. Neu-
rocomputing 173:3–8. https://​doi.​org/​10.​1016/j.​neucom.​2015.​01.​090
Ma X, Dhavala S (2018) Hierarchical clustering with prior knowledge. CoRR. arXiv:​abs/​1806.​03432


Macnaughton-Smith P, Williams W, Dale M et al (1964) Dissimilarity analysis: a new technique of hierarchical sub-division. Nature 202(4936):1034–1035
Mao Q, Zheng W, Wang L et al (2015) Parallel hierarchical clustering in linearithmic time for large-scale
sequence analysis. In: 2015 IEEE international conference on data mining, ICDM 2015, Atlantic City,
NJ, November 14–17, 2015. IEEE Computer Society, pp 310–319. https://​doi.​org/​10.​1109/​ICDM.​
2015.​90
Monath N, Kobren A, Krishnamurthy A et al (2019) Scalable hierarchical clustering with tree grafting. pp
1438–1448. https://​doi.​org/​10.​1145/​32925​00.​33309​29
Monath N, Dubey KA, Guruganesh G et al (2021) Scalable hierarchical agglomerative clustering. In: Pro-
ceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. Association
for Computing Machinery, New York, NY, KDD ’21, pp 1245–1255. https://​doi.​org/​10.​1145/​34475​
48.​34674​04
Muhr M, Sabol V, Granitzer M (2010) Scalable recursive top-down hierarchical clustering approach with
implicit model selection for textual data sets. In: Database and expert systems applications, DEXA,
international workshops, Bilbao, August 30–September 3, 2010. IEEE Computer Society, pp 15–19.
https://​doi.​org/​10.​1109/​DEXA.​2010.​25
Mulinka P, Casas P, Fukuda K et al (2020) HUMAN—hierarchical clustering for unsupervised anomaly
detection & interpretation. In: 11th international conference on network of the future, NoF 2020, Bor-
deaux, October 12–14, 2020. IEEE, pp 132–140. https://​doi.​org/​10.​1109/​NoF50​125.​2020.​92491​94
Müllner D (2011) Modern hierarchical, agglomerative clustering algorithms. CoRR. arXiv:​abs/​1109.​2378
Murtagh F (1983) A survey of recent advances in hierarchical clustering algorithms. Comput J 26(4):354–
359. https://​doi.​org/​10.​1093/​comjnl/​26.4.​354
Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdiscip Rev
Data Min Knowl Discov 2(1):86–97. https://​doi.​org/​10.​1002/​widm.​53
Murtagh F, Contreras P (2017) Algorithms for hierarchical clustering: an overview, II. Wiley Interdiscip
Rev Data Min Knowl Discov. https://​doi.​org/​10.​1002/​widm.​1219
Murtagh F, Legendre P (2014) Ward’s hierarchical agglomerative clustering method: which algorithms
implement Ward’s criterion? J Classif 31(3):274–295. https://​doi.​org/​10.​1007/​s00357-​014-​9161-z
Myers C, Rabiner L, Rosenberg A (1980) Performance tradeoffs in dynamic time warping algorithms for
isolated word recognition. IEEE Trans Acoust Speech Signal Process 28(6):623–635
Narita K, Hochin T, Nomiya H (2018) Incremental clustering for hierarchical clustering. In: 5th interna-
tional conference on computational science/intelligence and applied informatics, CSII 2018, Yonago,
July 10–12, 2018. IEEE, pp 102–107. https://​doi.​org/​10.​1109/​CSII.​2018.​00025
Narita K, Hochin T, Hayashi Y et al (2020) Incremental hierarchical clustering for data insertion and its
evaluation. Int J Softw Innov 8(2):1–22. https://​doi.​org/​10.​4018/​IJSI.​20200​40101
Nasiriani N, Squicciarini AC, Saldanha Z et al (2019) Hierarchical clustering for discrimination discovery:
a top-down approach. In: 2nd IEEE international conference on artificial intelligence and knowledge
engineering, AIKE 2019, Sardinia, June 3–5, 2019. IEEE, pp 187–194. https://​doi.​org/​10.​1109/​AIKE.​
2019.​00041
Nazari Z, Kang D (2018) A new hierarchical clustering algorithm with intersection points. In: 2018 5th
IEEE Uttar Pradesh section international conference on electrical, electronics and computer engineer-
ing (UPCON). IEEE, pp 1–5
Neto ACA, Sander J, Campello RJGB et al (2021) Efficient computation and visualization of multiple den-
sity-based clustering hierarchies. IEEE Trans Knowl Data Eng 33(8):3075–3089. https://​doi.​org/​10.​
1109/​TKDE.​2019.​29624​12
Nikpour S, Asadi S (2022) A dynamic hierarchical incremental learning-based supervised clustering for
data stream with considering concept drift. J Ambient Intell Humaniz Comput 13(6):2983–3003.
https://​doi.​org/​10.​1007/​s12652-​021-​03673-0
Núñez-Valdéz ER, Solanki VK, Balakrishna S et al (2020) Incremental hierarchical clustering driven auto-
matic annotations for unifying IoT streaming data. Int J Interact Multim Artif Intell 6(2):1–15. https://​
doi.​org/​10.​9781/​ijimai.​2020.​03.​001
Omran MGH, Engelbrecht AP, Salman AA (2007) An overview of clustering methods. Intell Data Anal
11(6):583–605
Pang N, Zhang J, Zhang C et al (2019) Parallel hierarchical subspace clustering of categorical data. IEEE
Trans Comput 68(4):542–555. https://​doi.​org/​10.​1109/​TC.​2018.​28793​32
Parmar D, Wu T, Blackhurst J (2007) MMR: an algorithm for clustering categorical data using rough set
theory. Data Knowl Eng 63(3):879–893. https://​doi.​org/​10.​1016/j.​datak.​2007.​05.​005
Qin J, Lewis DP, Noble WS (2003) Kernel hierarchical gene clustering from microarray expression data.
Bioinformatics 19(16):2097–2104. https://​doi.​org/​10.​1093/​bioin​forma​tics/​btg288


Qin H, Ma X, Herawan T et al (2014) MGR: an information theory based hierarchical divisive clustering
algorithm for categorical data. Knowl Based Syst 67:401–411. https://​doi.​org/​10.​1016/j.​knosys.​2014.​
03.​013
Rabin J, Ferradans S, Papadakis N (2014) Adaptive color transfer with relaxed optimal transport. In: 2014
IEEE international conference on image processing, ICIP 2014, Paris, October 27–30, 2014. IEEE, pp
4852–4856. https://​doi.​org/​10.​1109/​ICIP.​2014.​70259​83
Rahman MA, Rahman MM, Mollah MNH et al (2018) Robust hierarchical clustering for metabolomics
data analysis in presence of cell-wise and case-wise outliers. In: 2018 international conference on
computer, communication, chemical, material and electronic engineering (IC4ME2). IEEE, pp 1–4.
https://​doi.​org/​10.​1109/​IC4ME2.​2018.​84656​16
Reddy CK, Vinzamuri B (2013) A survey of partitional and hierarchical clustering algorithms. In: Aggar-
wal CC, Reddy CK (eds) Data clustering: algorithms and applications. CRC Press, Boca Raton, pp
87–110
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev
26(2):195–239
Rocha C, Dias LC (2013) MPOC: an agglomerative algorithm for multicriteria partially ordered clustering.
4OR 11(3):253–273. https://​doi.​org/​10.​1007/​s10288-​013-​0228-1
Ros F, Guillaume S (2019) A hierarchical clustering algorithm and an improvement of the single linkage
criterion to deal with noise. Expert Syst Appl 128:96–108
Ros F, Guillaume S, Hajji ME et al (2020) KdMutual: a novel clustering algorithm combining mutual neigh-
boring and hierarchical approaches using a new selection criterion. Knowl Based Syst 204(106):220.
https://​doi.​org/​10.​1016/j.​knosys.​2020.​106220
Roux M (2018) A comparative study of divisive and agglomerative hierarchical clustering algorithms. J
Classif 35(2):345–366. https://​doi.​org/​10.​1007/​s00357-​018-​9259-9
Sabarish B, Karthi R, Kumar TG (2020) Graph similarity-based hierarchical clustering of trajectory data.
Procedia Comput Sci 171:32–41. https://​doi.​org/​10.​1016/j.​procs.​2020.​04.​004
Sahoo N, Callan J, Krishnan R et al (2006) Incremental hierarchical clustering of text documents. In: Pro-
ceedings of the 2006 ACM CIKM international conference on information and knowledge manage-
ment, Arlington, VA, November 6–11, 2006. ACM, pp 357–366. https://​doi.​org/​10.​1145/​11836​14.​
11836​67
Salton G (1975) A vector space model for information retrieval. J ASIS 18(11): 613–620
Sander J, Qin X, Lu Z et al (2003) Automatic extraction of clusters from hierarchical clustering representa-
tions. In: Whang KY, Jeon J, Shim K et al (eds) Advances in knowledge discovery and data mining.
Springer, Berlin, Heidelberg, pp 75–87
Saunders A, Ashlock DA, Houghten SK (2018) Hierarchical clustering and tree stability. In: 2018 IEEE
conference on computational intelligence in bioinformatics and computational biology, CIBCB 2018,
Saint Louis, MO, May 30–June 2, 2018. IEEE, pp 1–8. https://​doi.​org/​10.​1109/​CIBCB.​2018.​84049​78
Sharan R, Shamir R (2000) Center CLICK: a clustering algorithm with applications to gene expression
analysis. In: Proceedings of the eighth international conference on intelligent systems for molecular
biology, August 19–23, 2000, La Jolla/San Diego, CA. AAAI, pp 307–316
Sharma S, Batra N et al (2019) Comparative study of single linkage, complete linkage, and ward method of
agglomerative clustering. In: 2019 international conference on machine learning, big data, cloud and
parallel computing (COMITCon). IEEE, pp 568–573
Shimizu T, Sakurai K (2018) Comprehensive data tree by actor messaging for incremental hierarchical
clustering. In: 2018 IEEE 42nd annual computer software and applications conference, COMPSAC
2018, Tokyo, 23–27 July 2018, vol 1. IEEE Computer Society, pp 801–802. https://​doi.​org/​10.​1109/​
COMPS​AC.​2018.​00127
Sisodia D, Singh L, Sisodia S et al (2012) Clustering techniques: a brief survey of different clustering algo-
rithms. Int J Latest Trends Eng Technol 1(3):82–87
Sneath PH, Sokal RR (1975) Numerical taxonomy. The principles and practice of numerical classification,
vol 50. Williams WT published in association with Stony Brook University. https://​doi.​org/​10.​1086/​
408956
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. https://​hdl.​
handle.​net/​11299/​215421, May 23, 2000
Székely GJ, Rizzo ML (2005) Hierarchical clustering via joint between-within distances: extending Ward’s
minimum variance method. J Classif 22(2):151–183. https://​doi.​org/​10.​1007/​s00357-​005-​0012-9
Takumi S, Miyamoto S (2012) Top-down vs bottom-up methods of linkage for asymmetric agglomerative
hierarchical clustering. In: 2012 IEEE international conference on granular computing, GrC 2012,
Hangzhou, August 11–13, 2012. IEEE Computer Society, pp 459–464. https://​doi.​org/​10.​1109/​GrC.​
2012.​64686​89


Tan P, Steinbach M, Karpatne A et al (2019) Introduction to data mining, Second Edition. Pearson, Harlow
Toujani R, Akaichi J (2018) GHHP: genetic hybrid hierarchical partitioning for community structure in
social medias networks. In: 2018 IEEE smartWorld, ubiquitous intelligence & computing, advanced
& trusted computing, scalable computing & communications, cloud & big data computing, internet
of people and smart city innovation, SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI 2018,
Guangzhou, October 8–12, 2018. IEEE, pp 1146–1153. https://​doi.​org/​10.​1109/​Smart​World.​2018.​
00199
Tripathy B, Ghosh A (2011a) SDR: an algorithm for clustering categorical data using rough set theory. In:
2011 IEEE recent advances in intelligent computational systems. IEEE, pp 867–872
Tripathy B, Ghosh A (2011b) SSDR: an algorithm for clustering categorical data using rough set theory.
Adv Appl Sci Res 2(3):314–326
Tripathy B, Goyal A, Chowdhury R et al (2017) MMeMeR: an algorithm for clustering heterogeneous
data using rough set theory. Int J Intell Syst Appl 9(8):25
Tsekouras G, Kotoulas P, Tsirekis C et al (2008) A pattern recognition methodology for evalua-
tion of load profiles and typical days of large electricity customers. Electr Power Syst Res
78(9):1494–1510
Turi R (2001) Clustering-based colour image segmentation. PhD Thesis, Monash University
Varshney AK, Muhuri PK, Lohani QMD (2022) PIFHC: the probabilistic intuitionistic fuzzy hierarchical
clustering algorithm. Appl Soft Comput 120(108):584. https://​doi.​org/​10.​1016/j.​asoc.​2022.​108584
Veldt N, Benson AR, Kleinberg JM (2020) Localized flow-based clustering in hypergraphs. CoRR.
arXiv:​abs/​2002.​09441
Vidal E, Granitto PM, Bayá A (2014) Discussing a new divisive hierarchical clustering algorithm. In:
XLIII Jornadas Argentinas de Informática e Investigación Operativa (43JAIIO)-XV Argentine
symposium on artificial intelligence (ASAI)(Buenos Aires, 2014)
von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416. https://​doi.​org/​10.​
1007/​s11222-​007-​9033-z
Wang T, Lu Y, Han Y (2017) Clustering of high dimensional handwritten data by an improved hyper-
graph partition method. In: Intelligent computing methodologies—13th international conference,
ICIC 2017, Liverpool, August 7–10, 2017, proceedings, part III, lecture notes in computer sci-
ence, vol 10363. Springer, pp 323–334. https://​doi.​org/​10.​1007/​978-3-​319-​63315-2_​28
Wishart D (1969) An algorithm for hierarchical classifications. Biometrics 25:165–170
Xi Y, Lu Y (2020) Multi-stage hierarchical clustering method based on hypergraph. In: Intelligent com-
puting methodologies—16th international conference, ICIC 2020, Bari, October 2–5, 2020, pro-
ceedings, part III, lecture notes in computer science, vol 12465. Springer, pp 432–443. https://​doi.​
org/​10.​1007/​978-3-​030-​60796-8_​37
Xiong T, Wang S, Mayers A et al (2012) DHCC: divisive hierarchical clustering of categorical data.
Data Min Knowl Discov 24(1):103–135. https://​doi.​org/​10.​1007/​s10618-​011-​0221-2
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193
Xu R, Wunsch DC II (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678.
https://​doi.​org/​10.​1109/​TNN.​2005.​845141
Yamada Y, Masuyama N, Amako N et al (2020) Divisive hierarchical clustering based on adaptive reso-
nance theory. In: International symposium on community-centric systems, CcS 2020, Hachioji,
Tokyo, September 23–26, 2020. IEEE, pp 1–6. https://​doi.​org/​10.​1109/​CcS49​175.​2020.​92314​74
Yang J, Grunsky E, Cheng Q (2019) A novel hierarchical clustering analysis method based on Kullback–
Leibler divergence and application on dalaimiao geochemical exploration data. Comput Geosci
123:10–19. https://​doi.​org/​10.​1016/j.​cageo.​2018.​11.​003
Yu F, Dong K, Chen F et al (2007) Clustering time series with granular dynamic time warping method.
In: 2007 IEEE international conference on granular computing, GrC 2007, San Jose, CA, 2–4
November 2007. IEEE Computer Society, pp 393–398. https://​doi.​org/​10.​1109/​GrC.​2007.​34
Yu M, Hillebrand A, Tewarie P et al (2015) Hierarchical clustering in minimum spanning trees. Chaos
25(2):023107
Zahn CT (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans
Comput 20(1):68–86. https://​doi.​org/​10.​1109/T-​C.​1971.​223083
Zeng J, Gong L, Wang Q et al (2009) Hierarchical clustering for topic analysis based on variable fea-
ture selection. In: 2009 sixth international conference on fuzzy systems and knowledge discovery.
IEEE, pp 477–481
Zeng K, Ning M, Wang Y et al (2020) Hierarchical clustering with hard-batch triplet loss for person re-
identification. In: 2020 IEEE/CVF conference on computer vision and pattern recognition, CVPR
2020, Seattle, WA, June 13–19, 2020. IEEE, pp 13654–13662. https://​doi.​org/​10.​1109/​CVPR4​
2600.​2020.​01367


Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large data-
bases. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data,
Montreal, QC, June 4–6, 1996. ACM Press, pp 103–114. https://​doi.​org/​10.​1145/​233269.​233324
Zhang W, Wang X, Zhao D et al (2012) Graph degree linkage: agglomerative clustering on a directed
graph. In: Computer vision—ECCV 2012—12th European conference on computer vision, Flor-
ence, October 7–13, 2012, proceedings, part I, lecture notes in computer science, vol 7572.
Springer, pp 428–441. https://​doi.​org/​10.​1007/​978-3-​642-​33718-5_​31
Zhang W, Zhao D, Wang X (2013) Agglomerative clustering via maximum incremental path integral.
Pattern Recogn 46(11):3056–3065. https://​doi.​org/​10.​1016/j.​patcog.​2013.​04.​013
Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets. In: Pro-
ceedings of the 2002 ACM CIKM international conference on information and knowledge manage-
ment, McLean, VA, November 4–9, 2002. ACM, pp 515–524. https://​doi.​org/​10.​1145/​584792.​584877
Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: Third international
conference on knowledge discovery and data mining, WKDD 2010, Phuket, 9–10 January 2010.
IEEE Computer Society, pp 195–199. https://​doi.​org/​10.​1109/​WKDD.​2010.​123
Zhao D, Tang X (2008) Cyclizing clusters via zeta function of a graph. In: Advances in neural information
processing systems 21, proceedings of the twenty-second annual conference on neural information
processing systems, Vancouver, BC, December 8–11, 2008. Curran Associates, Inc., pp 1953–1960
Zhao Y, Karypis G, Fayyad UM (2005) Hierarchical clustering algorithms for document datasets. Data Min
Knowl Discov 10(2):141–168. https://​doi.​org/​10.​1007/​s10618-​005-​0361-3
Zhao W, Li B, Gu Q et al (2020) Improved hierarchical clustering with non-locally enhanced features for
unsupervised person re-identification. In: 2020 international joint conference on neural networks,
IJCNN 2020, Glasgow, July 19–24, 2020. IEEE, pp 1–8. https://​doi.​org/​10.​1109/​IJCNN​48605.​2020.​
92067​22
Zhou D, Huang J, Schölkopf B (2006) Learning with hypergraphs: clustering, classification, and embed-
ding. In: Advances in neural information processing systems 19, proceedings of the twentieth annual
conference on neural information processing systems, Vancouver, BC, December 4–7, 2006. MIT
Press, pp 1601–1608
Zhou R, Zhang Y, Feng S et al (2018) A novel hierarchical clustering algorithm based on density peaks for
complex datasets. Complex. https://​doi.​org/​10.​1155/​2018/​20324​61
Zhu Y, Ting KM, Jin Y et al (2022) Hierarchical clustering that takes advantage of both density-peak and
density-connectivity. Inf Syst 103(C):101871. https://​doi.​org/​10.​1016/j.​is.​2021.​101871

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.
