
Ortega-Legarreta et al. BMC Bioinformatics (2025) 26:219
https://doi.org/10.1186/s12859-025-06249-3

SOFTWARE Open Access

GeneSetCluster 2.0: a comprehensive toolset for summarizing and integrating gene-sets analysis
Asier Ortega-Legarreta1†, Alberto Maillo2†, Daniel Mouzo1, Ana Rosa López-Pérez1, Lara Kular3, Majid Pahlevan Kakhki3, Maja Jagodic3, Jesper Tegner2,4, Vincenzo Lagani2,5, Ewoud Ewing3* and David Gomez-Cabrero2*

†Asier Ortega-Legarreta and Alberto Maillo have contributed equally to this work.
*Correspondence: Ewoud Ewing, [email protected]; David Gomez-Cabrero, [email protected]
1 Translational Bioinformatics Unit, Navarrabiomed, Hospital Universitario de Navarra (HUN), Universidad Pública de Navarra (UPNA), IdiSNA, 31008 Pamplona, Spain
2 Biological and Environmental Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
3 Department of Clinical Neuroscience, Karolinska Institutet, and Center for Molecular Medicine, Karolinska University Hospital, SE-171 76 Stockholm, Sweden
4 Unit of Computational Medicine, Department of Medicine, Center for Molecular Medicine, Karolinska Institutet, Karolinska University Hospital, L8:05, SE-171 76 Stockholm, Sweden
5 Institute of Chemical Biology, Ilia State University, 0162 Tbilisi, Georgia

Abstract
Background Gene-Set Analysis (GSA) is commonly used to analyze high-throughput experiments. However, GSA cannot readily disentangle clusters or pathways due to redundancies in upstream knowledge bases, which hinders comprehensive exploration and interpretation of biological findings. To address this challenge, we developed GeneSetCluster, an R package designed to summarize and integrate GSA results. Over time, we and users as well identified limitations in the original version, such as difficulties in managing redundancies across multiple gene-sets, large computational times, and its lack of accessibility for users without programming expertise.
Results We present GeneSetCluster 2.0, a comprehensive upgrade that delivers methodological, computational, interpretative, and user-experience enhancements. Methodologically, GeneSetCluster 2.0 introduces a novel approach to address duplicated gene-sets and implements a seriation-based clustering algorithm that reorders results, aiding pattern identification. Computationally, the package is optimized for parallel processing, significantly reducing execution time. GeneSetCluster 2.0 enhances cluster annotations by associating clusters with relevant tissues and biological processes to improve biological interpretation, particularly for human and mouse data. To broaden accessibility, we have developed a user-friendly web application enabling non-programmers to use it. This version also ensures seamless integration between the R package, catering to users with programming expertise, and the web application for broader audiences. We evaluated the updates in a single-cell RNA public dataset.
Conclusion GeneSetCluster 2.0 offers substantial improvements over its predecessor. Furthermore, by bridging the gap between bioinformaticians and clinicians in multidisciplinary teams, GeneSetCluster 2.0 facilitates collaborative research. The R package and web application, along with detailed installation and usage guides, are available on GitHub (https://github.com/TranslationalBioinformaticsUnit/GeneSetCluster2.0), and the web application can be accessed at https://translationalbio.shinyapps.io/genesetcluster/.

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,
sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article
are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the
article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Keywords Gene-set analysis, Gene-set enrichment analysis, Functional annotation, Seriation-based clustering, Web application, Data-mining

Background
High-throughput technologies are fundamental for understanding biological systems by enabling the profiling of thousands to millions of features (e.g., genes) on a genome-wide scale. The final stages of their associated bioinformatic analysis aim to identify the system’s most relevant features. However, interpreting these features in a biological context can be overwhelming [1]. To address these challenges, methods have been developed to identify functionally related groups of features, offering biologists higher-order, interpretable summaries of their experiments [2].
Gene-Set Analysis (GSA) has emerged as the standard for functional bioinformatic analysis in gene expression studies. Methodologically, two main approaches dominate GSA: Over-Representation Analysis (ORA [3]) and Gene-Set Enrichment Analysis (GSEA [4]). Briefly, ORA determines whether the proportion of relevant genes in a gene-set exceeds that expected by chance, whereas GSEA ranks all genes based on their association with a trait and tests whether the genes within a particular set cluster toward the top of the ranking, reflecting their importance. Beyond gene expression, tools such as GREAT [5] extend GSA to other genomic features, including DNA methylation and chromatin accessibility, by first linking genomic ranges to genes. Equally critical to the methodologies are the gene-sets, curated from existing scientific literature or derived from molecular experiments. Prominent resources include the Gene Ontology (GO) project [6], Reactome [7], the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [8], and the Molecular Signatures Database (MSigDB) [9].
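As an illustration of the ORA idea described above, the following minimal sketch computes the upper-tail hypergeometric probability that is commonly used for this test. The function name, argument names, and the toy numbers are hypothetical and are not taken from any of the cited tools.

```python
from math import comb


def ora_p_value(universe_size, set_size, hits_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe_size: number of genes in the background
    set_size:      number of background genes that belong to the gene-set
    hits_size:     number of relevant (e.g. differentially expressed) genes
    overlap:       number of relevant genes that fall inside the gene-set
    """
    total = comb(universe_size, hits_size)
    upper = min(set_size, hits_size)
    tail = sum(
        comb(set_size, x) * comb(universe_size - set_size, hits_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 background genes, a gene-set of 5,
# 4 relevant genes, 3 of which belong to the gene-set.
p = ora_p_value(universe_size=20, set_size=5, hits_size=4, overlap=3)
```

A small p-value here indicates that the gene-set contains more relevant genes than expected by chance; real analyses additionally correct for multiple testing across gene-sets.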
Despite the utility of GSA, and regardless of the methodology used, interpreting its results remains challenging. First, the focus shifts from interpreting individual genes to interpreting gene-sets or pathways, but GSA often identifies thousands of overlapping processes, complicating the interpretation. This redundancy derives from gene-set overlap, where highly related pathways are repeatedly significant, resulting in top-ranked processes that reflect the same underlying signal. Second, researchers frequently need to analyze multiple contrasts within a single study (e.g., screening various drugs against controls or performing knockouts), multiple gene-sets derived from the same contrast but from several gene-set databases, or both, which produces extensive lists of overlapping gene-sets. Several approaches have been developed to improve GSA interpretability, which we will denote GSA interpretation tools (GSAit). “Slim Ontologies” reduce redundancy by collapsing databases into discrete categories [9–11], while other methods incorporate the graph structure of databases like GO into statistical frameworks [12–14]. However, both frameworks are database-specific and not widely generalizable. A more versatile, data-driven framework [15–20] defines distances between gene-sets (e.g., based on shared genes or semantic similarity), clusters them using these distances, and interprets the clusters through text mining or representative gene-sets.
We initially developed GeneSetCluster 1.0 [21], a GSAit tool to address these challenges. It measured distances between gene-sets using relative risk (RR) and applied hierarchical clustering to identify clusters of gene-sets. However, despite its successful application [22, 23], the tool had limitations, including non-interpretable clusters caused by identical or “outlier” gene-sets, and no way to refine a clustering once such problems had been identified. Furthermore, GeneSetCluster 1.0 was available only as an R package, restricting its use to bioinformatics experts with programming skills.
GeneSetCluster 2.0 introduces several significant enhancements to overcome the limitations of its predecessor. Methodologically, it incorporates a novel approach to address duplicated gene-sets and utilizes a seriation-based clustering algorithm to reorder results, facilitating the identification of meaningful patterns. Computationally, the tool is optimized for parallel processing, which significantly reduces execution times. To improve biological interpretation, particularly for human and mouse data, GeneSetCluster 2.0 enriches cluster annotations by associating clusters with relevant tissues and biological processes. Additionally, the tool broadens accessibility through a user-friendly web application, allowing non-programmers to leverage its functionality while maintaining the R package for users with programming expertise. The web application and the R package are also integrated, so that an analysis started in one can be continued in the other.
Together, these improvements make GeneSetCluster 2.0 a robust and versatile solution for Gene-Set Interpretation Analysis, enabling a diverse user base and facilitating better integration of bioinformatics into multidisciplinary research workflows.

Implementation
In this section we first briefly describe version 1.0 of GeneSetCluster. Then we separately
illustrate the new functionalities implemented in version 2.0 for the gene-set cluster
identification and interpretation.

GeneSetCluster 1.0
GeneSetCluster 1.0 [21] is an R package designed to address a critical challenge in gene-
set analysis (GSA): interpreting results that often encompass hundreds to thousands of
potentially overlapping gene-sets. Furthermore, a common issue in GSA is that many
gene-sets represent similar biological processes but are labeled differently, making it
challenging to identify overarching themes.
GeneSetCluster 1.0 addresses this challenge by grouping gene-sets based on shared genes, using RR as the distance metric, and employing k-means or hierarchical clustering methods. To determine the optimal number of clusters, the package uses silhouette analysis [24] and the elbow method [25]. Notably, this approach assigns every input gene-set to a cluster, which can be a limiting factor. To facilitate interpretation, GeneSetCluster 1.0 supports three visualization schemes for gene-set clusters: a network, a dendrogram, or a heatmap.
In summary, GeneSetCluster 1.0 (v1.0) enables the integration of results from different GSA tools and experimental conditions, offering a unified framework for exploring multiple GSA results simultaneously. For extended details, we refer to the original publication [21].

GeneSetCluster 1.0 limitations in the clustering analysis


Several limitations have been identified over the past years of user experience with version 1.0. The first limitation was related to clustering analysis. While sub-clusters could be visually identified after a clustering analysis, the tool did not allow for re-clustering these sub-groups to achieve greater granularity.
Second, multiple GSA results often identified the same gene-sets (e.g., identical Gene Ontology IDs), despite slight variations in the subsets of genes associated with each result. In GeneSetCluster 1.0, these duplicated gene-sets were treated independently, a methodology we refer to as the “Raw Gene-Sets” approach. This occasionally introduced bias into the clustering process, leading, for example, to the largest cluster being composed of these repeated gene-sets. Third, the clustering methods used (k-means and hierarchical clustering) forced each gene-set into a cluster. This constraint limited the biological interpretability of the resulting clusters.
To address these limitations, we implemented three modules—sub-clustering analysis,
merging duplicated gene-sets, and seriation analysis.

Sub-clustering analysis
GeneSetCluster 2.0 implements BreakUpCluster, which enables selecting a gene-set cluster and identifying sub-clusters within it (“breaking it” into smaller sub-clusters) (Fig. 1). This targeted refinement makes large clusters, which are otherwise difficult to interpret, more tractable. By allowing researchers to focus on specific clusters of interest, BreakUpCluster provides a detailed exploration of finer gene-set relationships while preserving the overall clustering framework.

Fig. 1 Workflow of GeneSetCluster 2.0. The illustration outlines all the features, with new implementations indicated by red asterisks. After uploading the GSA results, users can choose between two methods for handling duplicated gene-sets, “Raw Gene-sets” and “Unique Gene-sets”, during the CombineGeneSets step. Subsequently, the new “Seriation-based” clustering approach can be applied. Within this clustering method, the function ClusterIndependentGeneSets calculates the groups based on an optimal number, which can be manually adjusted using SetPathway. On the other hand, with the “Classic” method, clusters can now be subdivided into smaller clusters using BreakUpCluster. Finally, the outputs of both clustering methods can be annotated with additional functional interpretations (ORA and Wordcloud) or through tissue enrichment analysis using the TissueExpressionPerGeneSet function and its corresponding plot, PlotTissueExpression

Merging duplicated gene-sets


Multiple GSA results often identify the same gene-sets (e.g., identical Gene Ontology
IDs), even though the subsets of genes associated with each GSA result may vary slightly.
In GeneSetCluster 1.0, these duplicated gene-sets were treated independently, an
approach we refer to as the “Raw Gene-Sets” methodology. This approach occasionally
introduced bias into the clustering process, sometimes resulting in the largest cluster
being dominated by these repeated gene-sets, which were distinct from other clusters in
the analysis.
To address this limitation, GeneSetCluster 2.0 introduces a new approach called “Unique Gene-Sets”. This method detects repeated gene-sets with identical ID labels and merges them into a single, unified entry that contains the union of all genes associated with these sets. For example, GO:0007612 (a biological process related to “learning”) might be identified in one GSA analysis due to the genes Pak6, Reln, and Adcy3, and in another due to Reln, Adcy3, and Eif2ak4. The “Unique Gene-Sets” methodology merges these results, counting GO:0007612 only once and consolidating the associated genes into a single list: Pak6, Reln, Adcy3, and Eif2ak4. Consequently, each gene-set is treated as a unique entity during the clustering process, eliminating the bias caused by duplications (Fig. 2a).
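The merging rule described above can be sketched in a few lines. This is only an illustration of the union-by-ID idea using the GO:0007612 example from the text; the function name and data layout are hypothetical and do not reflect the package's R implementation.

```python
def merge_unique_gene_sets(results):
    """Merge entries sharing a gene-set ID into one entry with the union of genes.

    results: iterable of (gene_set_id, genes) pairs, possibly from several GSA runs.
    Returns a dict mapping each ID to its genes in first-seen order.
    """
    merged = {}
    for gene_set_id, genes in results:
        bucket = merged.setdefault(gene_set_id, [])
        for gene in genes:
            if gene not in bucket:  # keep each gene only once per ID
                bucket.append(gene)
    return merged


raw = [
    ("GO:0007612", ["Pak6", "Reln", "Adcy3"]),     # first GSA result
    ("GO:0007612", ["Reln", "Adcy3", "Eif2ak4"]),  # second GSA result
]
unique = merge_unique_gene_sets(raw)
# unique == {"GO:0007612": ["Pak6", "Reln", "Adcy3", "Eif2ak4"]}
```

Under the “Raw Gene-Sets” approach, by contrast, the two input entries would simply be kept as separate items.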
This refined approach simplifies the analysis, resulting in more precise and interpretable clusters. It also facilitates more consistent comparisons across studies and enhances the understanding of the biological significance across different research contexts. While the “Unique Gene-sets” method is recommended in most cases, the “Raw Gene-sets” method is retained for specific scenarios (such as comparative analysis across tools or information layers, tracking variability in gene-set membership, or reproducing legacy analyses) where preserving duplicated gene-sets may be beneficial (see Supplementary material for details).

Seriation-based clustering approach


GeneSetCluster 2.0 enhances its predecessor by offering a new “seriation-based” clustering approach. Briefly, seriation methods aim to reorder data, typically the rows and columns of a similarity or distance matrix, to uncover patterns that might not otherwise be apparent. These methods offer improvements over k-means or hierarchical clustering in certain contexts by focusing on the relative order of elements rather than assigning them to discrete clusters or nested hierarchies. For consistency, the k-means and hierarchical methods remain available in the updated version.
Generally, seriation-based [25] clustering identifies coherent groups by arranging gene-sets in a linear sequence according to their pairwise similarities. Unlike k-means or hierarchical clustering, which strictly divide gene-sets into separate groups, seriation emphasizes uncovering patterns within the data [26]. Specifically, GeneSetCluster 2.0 tests 32 seriation algorithms (Table S1) from the seriation R package [27] to automatically find the optimal algorithm (Fig. 2b). Briefly, each seriation algorithm undergoes a four-step evaluation process:

1. Initial Seriation: The distance matrix, which represents the pairwise similarities
between gene-sets, is reordered by the algorithm, optimizing the placement of similar
gene-sets next to each other.

Fig. 2 New implementations of GeneSetCluster 2.0. A Methods for handling duplicate gene-sets. Left side illustrates “Raw Gene-sets” method, considering all input gene-sets as separate entities. Right side depicts the new “Unique Gene-sets” approach, which merges duplicate gene-sets by combining their genes. B Visualization of the “Seriation-based” method

2. Threshold-Based Segmentation: Potential clusters are identified within the reordered matrix by applying a predefined similarity threshold (default set at 0.6). This process groups adjacent gene-sets with similarity scores above this threshold. The similarity threshold can be manually adjusted by the users.
3. Cluster Size Optimization: Clusters are refined to meet a specified size requirement (minimum size from 4 to 10 gene-sets per cluster), ensuring each cluster contains a meaningful number of gene-sets. This requirement prevents excessive fragmentation and over-large clusters that may overlook biologically relevant relationships. Users can also modify the minimum size threshold manually.
4. Score computation: The output of each algorithm is scored based on three weighted
criteria: Hamiltonian path length (40%) [28], anti-Robinson form criterion (40%) [29],
and the total number of clustered gene-sets (20%).
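Steps 2 and 3 of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: it assumes that "adjacent" means consecutive positions in the seriated order, that a group is extended only while the similarity between neighbours is strictly above the threshold, and that groups below the minimum size are left unclustered. The function and variable names are hypothetical.

```python
def segment_ordered_gene_sets(similarity, order, threshold=0.6, min_size=4):
    """Split a seriated ordering into clusters of adjacent, similar gene-sets.

    similarity: square matrix (list of lists) of pairwise similarities
    order:      gene-set indices in seriated order
    Returns (clusters, unclustered); each cluster has at least min_size members.
    """
    if not order:
        return [], []
    groups = [[order[0]]]
    for prev, curr in zip(order, order[1:]):
        if similarity[prev][curr] > threshold:
            groups[-1].append(curr)   # similar enough: extend the current group
        else:
            groups.append([curr])     # similarity drops: start a new group
    clusters = [g for g in groups if len(g) >= min_size]
    unclustered = [i for g in groups if len(g) < min_size for i in g]
    return clusters, unclustered


# Toy matrix (hypothetical): gene-sets 0-3 are mutually similar, 4 and 5 are outliers.
n = 6
similarity = [[0.1] * n for _ in range(n)]
for i in range(4):
    for j in range(4):
        similarity[i][j] = 0.9
clusters, unclustered = segment_ordered_gene_sets(similarity, list(range(n)))
# clusters == [[0, 1, 2, 3]], unclustered == [4, 5]
```

The example shows how outlier gene-sets remain outside any cluster instead of being forced into one, which is the behaviour the seriation-based approach is meant to provide.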

These metrics were selected to jointly balance ordering coherence, structural consistency, and coverage of relevant gene-sets. The Hamiltonian path length measures how compact and locally coherent the seriation order is by minimizing the sum of pairwise dissimilarities between consecutive gene-sets. The anti-Robinson form criterion quantifies how well the dissimilarity matrix conforms to an anti-Robinson structure, where dissimilarities increase with distance from the diagonal, indicating a consistent similarity progression. The total number of clustered gene-sets reflects the inclusiveness of the solution, promoting configurations that assign more gene-sets to meaningful clusters. The relative weights were determined empirically by evaluating seriation results across internal benchmarks using gene-set collections of various types and sizes, thereby optimizing both biological interpretability and the internal consistency of the formed clusters (data not shown).
The Hamiltonian path length:

$$L(D) = \sum_{i=1}^{n-1} d_{i,i+1} \qquad (1)$$

where:
$D$ is the dissimilarity matrix.
$d_{i,i+1}$ is the dissimilarity between consecutive gene-sets $i$ and $i+1$.
$n$ is the number of gene-sets.
The Anti-Robinson form criterion:

$$AR(D) = \sum_{i=1}^{n-1} (n - i)\, d_{i,i+1} \qquad (2)$$

where:
$D$ is the dissimilarity matrix.
$d_{i,i+1}$ is the dissimilarity between consecutive gene-sets $i$ and $i+1$.
$n$ is the number of gene-sets.
Total number of clustered gene-sets:

$$C(G) = \sum_{i} |C_i| \qquad (3)$$

where $C(G)$ is the total count of gene-sets in clusters and $|C_i|$ is the number of gene-sets in cluster $i$.
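The three quantities defined in Eqs. 1 to 3 can be computed directly from a dissimilarity matrix and a candidate ordering. The sketch below follows the formulas as printed above; the function names are hypothetical, and because the text does not state how the three values are normalized before the 40/40/20 weighting is applied, the sketch stops at the raw values.

```python
def hamiltonian_path_length(dissimilarity, order):
    """Eq. 1: sum of dissimilarities between consecutive gene-sets in the order."""
    return sum(dissimilarity[a][b] for a, b in zip(order, order[1:]))


def anti_robinson_criterion(dissimilarity, order):
    """Eq. 2 as printed: sum over i = 1..n-1 of (n - i) * d_{i,i+1}."""
    n = len(order)
    pairs = zip(order, order[1:])
    return sum((n - i) * dissimilarity[a][b] for i, (a, b) in enumerate(pairs, start=1))


def clustered_gene_set_count(clusters):
    """Eq. 3: total number of gene-sets assigned to any cluster."""
    return sum(len(cluster) for cluster in clusters)


# Toy dissimilarity matrix for three gene-sets (hypothetical values).
D = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
good_order = [0, 1, 2]   # neighbours are similar
poor_order = [0, 2, 1]   # dissimilar gene-sets placed next to each other

length_good = hamiltonian_path_length(D, good_order)   # 1 + 2
length_poor = hamiltonian_path_length(D, poor_order)   # 4 + 2
ar_good = anti_robinson_criterion(D, good_order)       # 2*1 + 1*2
ar_poor = anti_robinson_criterion(D, poor_order)       # 2*4 + 1*2
covered = clustered_gene_set_count([[0, 1], [2]])
```

In this toy case the ordering that keeps similar gene-sets adjacent yields the smaller values for both ordering criteria.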
Finally, the seriation algorithm with the highest score from step 4 is automatically
selected as the optimal solution. However, users can override this automatic selection by
specifying their preferred algorithm. Similarly, the optimal minimum number of gene-
sets required to define a cluster is automatically determined but can be modified by
users.
The “Seriation-based” method provides several key advantages: (1) it effectively isolates outlier gene-sets that share low similarity with others, preventing them from introducing noise into the clustering results; (2) the optimal seriation algorithm is automatically selected based on objective scoring criteria, minimizing user intervention; and (3) it can detect gradual transitions or hierarchical structures that might be overlooked by traditional clustering methods (see Results section). This new method provides a more detailed and comprehensive understanding of gene-set relationships.
In summary, GeneSetCluster 2.0 incorporates both “Classic” (k-means, hierarchical) and “Seriation-based” clustering techniques, providing a robust toolkit for exploring and understanding complex gene-set relationships. The “Classic” methods can discover well-defined patterns, while the “Seriation-based” approach reveals subtle transitions or hierarchical structures.

Functional annotations
In v1.0, interpretation of gene-set clusters was delegated to the user (e.g., by investigating a user-supplied gene subset) or relied on plugins to WebgestaltR [30] and StringDB [31]. In v2.0, we aim to enable data-driven interpretation within the tool. A key concept used during annotation is cluster-associated genes, which correspond to the union of the genes in all gene-sets in a specific cluster.

Automatic gene-set cluster annotations


Within v2.0, each gene-set cluster can be annotated through two approaches. In the first approach, v2.0 applies ORA to the cluster-associated genes, using the enrichGO function from the clusterProfiler R package [32]. In the second approach, GeneSetCluster 2.0 conducts a semantic enrichment analysis for Gene Ontology (GO) using the simplifyGO function from the simplifyEnrichment package [33]. Internally, simplifyGO extracts biological themes by computing how closely related GO terms are in the GO hierarchy. This process clarifies the biological context of each cluster by emphasizing functional themes within the GO terms. This meta-analysis helps researchers determine whether particular pathways or processes are enriched in the cluster-associated genes.

Tissue enrichment
GeneSetCluster 2.0 incorporates a tissue enrichment analysis (limited to human data) that identifies possible associations between gene-set clusters and human tissues. v2.0 conducts a GSEA [4] per gene-set cluster, where the gene-set is the set of cluster-associated genes and the ranking per tissue is based on the gene expression available in the GTEx database [34], which contains median expression levels across 54 human tissues. As a result, we obtain a ranking of relevant tissues per gene-set cluster. For computational efficiency, users can access the tissue expression database via the API integrated into GeneSetCluster 2.0 or by downloading it directly from the repository for local use.

Computational enhancements
GeneSetCluster 1.0 faced limitations in computational scalability, particularly when processing large numbers of gene-sets. To address this issue, GeneSetCluster 2.0 incorporates a parallelization scheme. The improvement is especially evident in the CombineGeneSets step (Fig. 1), where distances between gene-sets are calculated. Parallel processing was implemented using the R packages doParallel [35] and foreach [36], which distribute the computational workload across multiple processors or threads. As a result, GeneSetCluster 2.0 achieves faster execution times, making it well-suited for handling large-scale datasets.
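The general pattern of distributing pairwise distance calculations over a pool of workers can be sketched as below. This is not the package's code: the package uses the R packages doParallel and foreach with RR as the distance, whereas this sketch substitutes the Jaccard distance as a stand-in (the RR formula is not given in the text) and uses a thread pool to stay self-contained. In CPython, threads do not speed up CPU-bound work, so a process pool would be the closer analogue in practice. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """Stand-in distance: 1 - |A and B| / |A or B| (not the RR metric of the package)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def pairwise_distances(gene_sets, workers=4):
    """Compute the distance for every pair of gene-sets using a worker pool."""
    pairs = list(combinations(list(gene_sets), 2))

    def distance_for(pair):
        first, second = pair
        return jaccard_distance(gene_sets[first], gene_sets[second])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `pairs`, so zip lines results up correctly
        return dict(zip(pairs, pool.map(distance_for, pairs)))


# Toy input (hypothetical gene-set names).
gene_sets = {
    "A": ["Pak6", "Reln", "Adcy3"],
    "B": ["Reln", "Adcy3", "Eif2ak4"],
    "C": ["Gapdh"],
}
distances = pairwise_distances(gene_sets)
```

Because each pair is independent of the others, the number of pairs grows quadratically with the number of gene-sets while the work remains trivially divisible, which is why this step benefits most from parallel execution.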

Web application
To make the tool accessible to users without programming experience, we developed a web-based Shiny application. Overall, we consider the web application to be part of the v2.0 improvements described in Sect. “Implementation”. The Shiny application, developed with R 4.3, is hosted on a Shiny server at https://translationalbio.shinyapps.io/genesetcluster/. It is fully compatible with all operating systems and web browsers. Upon accessing the application, users are prompted to select the mandatory input parameters and upload their GSA results. Once the “Run Analysis” button is clicked, the results, along with corresponding plots, are generated and displayed. Users can then examine the results, add additional annotations, and perform further analyses to gain biological insights. All results and plots are available for download. Additionally, users can save their analysis by downloading an RData file for future use, allowing them to continue by re-uploading it in the Shiny application or moving to the R package version. This transition can also be made from the R package to the Shiny application. For a more user-friendly experience, the application can also be deployed locally from the R package using the command run_app() (Fig. 3).

Graphical user interface


The user interface of the web application is divided into two main sections, as illustrated
in Fig. 4. In the input section, users specify parameter values such as Source, Gene ID…
Users can also start from a previously saved analysis or import an analysis performed
using the R package.
The results section is further divided into two parts. At the top, users can view different plots, including heatmaps of the RR matrix (under the Heatmap_S tab for the “Seriation-based” clustering and under the Heatmap tab for the other clustering methods), along with their tissue enrichment plots (available in the Tissue_S and Tissue tabs, respectively). Below the plots, the results are displayed in tables, showing which gene-sets belong to each cluster (under the Data tab), ORA details (under the ORA tab), gene information (under the Genes tab), and tissue data (under the Tissue enrichment tab). The source code of the application can be accessed at https://github.com/TranslationalBioinformaticsUnit/GeneSetCluster2.0.

Analysis workflow
The application supports input files from GREAT, IPA, and GSEA in various formats, including .csv, .tsv, and Excel. For other tools, users can use our provided template, available in Excel format through the application, to input the necessary information for GeneSetCluster compatibility. The required fields in the template are: ID (gene-set identifier), Count (number of genes from your list found in the gene-set), GeneRatio (calculated as the number of genes found divided by the total genes in the gene-set), p.adjust (adjusted p-value), and geneID (list of genes in Ensembl ID, SYMBOL or ENTREZ ID). Additionally, regardless of the input source, users must specify complementary information, including the gene ID format, organism, and the preferred method for handling duplicates.

Fig. 3 General overview of the interaction. Overview of the seamless integration between the web application and R package
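To make the template fields (ID, Count, GeneRatio, p.adjust, geneID) concrete, the sketch below assembles one row from the output of an arbitrary GSA tool. The helper name, the "/" separator used to join gene identifiers, and the example values (gene-set size and adjusted p-value) are assumptions for illustration only; the template distributed with the application defines the exact layout.

```python
def make_template_row(gene_set_id, hit_genes, gene_set_size, adjusted_p):
    """Build one row with the five required template fields.

    hit_genes:     genes from the user's list that were found in the gene-set
    gene_set_size: total number of genes in the gene-set
    """
    if gene_set_size <= 0:
        raise ValueError("gene_set_size must be positive")
    count = len(hit_genes)
    return {
        "ID": gene_set_id,
        "Count": count,
        "GeneRatio": count / gene_set_size,  # genes found / total genes in the set
        "p.adjust": adjusted_p,
        "geneID": "/".join(hit_genes),       # separator is an assumption
    }


# Hypothetical example: 4 genes found in a gene-set of 40 genes.
row = make_template_row(
    "GO:0007612", ["Pak6", "Reln", "Adcy3", "Eif2ak4"], gene_set_size=40, adjusted_p=0.01
)
```

A list of such rows can then be written to a spreadsheet or delimited file using whichever tabular library is at hand.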
Once the data are uploaded and the user specifies their preferences, clicking the “Run analysis” button initiates the default analysis using the k-means method. The results, including corresponding plots, are displayed, with the input gene-sets initially classified into clusters determined by the OptimalGeneSets function (Fig. 1). Users can later customize the number of clusters or subdivide larger clusters into smaller subclusters for more detailed exploration.
Furthermore, users can apply the “Seriation-based” method (Sect. “Seriation-based clustering approach”) or the tissue enrichment analysis to gain further insights. We also provide interactive features that allow users to explore the results:

Targeted analysis
Users may want to investigate specific conditions or phenotypes in their data, focusing on a particular group of genes. This functionality allows them to assess whether their cluster-associated genes are enriched in these specific gene groups. Users can either import a custom gene list or select a phenotype from well-known databases for enrichment analysis. Two key databases are accessible through the application: (1) the Human Phenotype Ontology (HPO) database, which provides standardized terms for human disease-related phenotypes [37], and (2) the Mammalian Phenotype (MP) database, which offers terms for annotating mouse phenotype data [38]. Overall, this feature helps uncover relevant genetic insights by emphasizing specific groups of genes within the cluster-associated genes.

Fig. 4 Web application interface of GeneSetCluster 2.0. The screenshot presents two main sections. The input section, highlighted in blue, allows the user to specify parameter values. After the user clicks on “Run Analysis”, the results section, highlighted in green, is displayed. This section shows the plots at the top and the results in tables at the bottom

Gene-set cluster functional characterization


Users may required to explore data at the gene level rather than just focusing on gene-
sets. In this case, the application displays a table listing all cluster-associated genes and
the frequency of each gene in the cluster. This allows them to filter genes based on fre-
quency, helping them identify the most prevalent genes within each cluster, discover
genes exclusive to a specific cluster, or search for a particular gene. Moreover, an ORA
can be conducted on the filtered gene list to streamline the biological interpretation.
Additionally, each gene is linked to external databases such as GeneCards for human
data [39] and Mouse Genome Informatics for mice [40], providing further context about
gene functions and implications. This feature enhances flexibility and depth in gene
exploration, allowing users to focus on the most relevant genes and obtain more detailed
biological insights.
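
The gene-level table described above can be sketched as follows. The sketch assumes that the frequency of a gene is the number of gene-sets in the cluster that contain it; the function names and the placeholder gene and gene-set identifiers are hypothetical and do not come from the application.

```python
from collections import Counter

def gene_frequencies(cluster_gene_sets):
    """Count in how many gene-sets of a cluster each gene appears."""
    counts = Counter()
    for genes in cluster_gene_sets.values():
        counts.update(set(genes))  # count a gene at most once per gene-set
    return counts

def filter_by_frequency(counts, min_count):
    """Keep genes found in at least `min_count` gene-sets, most frequent first."""
    kept = [(gene, c) for gene, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

def exclusive_genes(freq_by_cluster, cluster):
    """Genes that occur in `cluster` and in no other cluster."""
    others = set()
    for name, counts in freq_by_cluster.items():
        if name != cluster:
            others.update(counts)
    return sorted(set(freq_by_cluster[cluster]) - others)

# Placeholder data: two clusters, each mapping gene-set names to gene lists
clusters = {
    "cluster1": {"gs1": ["G1", "G2", "G3"], "gs2": ["G2", "G3"], "gs3": ["G3", "G4"]},
    "cluster2": {"gs4": ["G4", "G5"]},
}
freqs = {name: gene_frequencies(sets) for name, sets in clusters.items()}
print(filter_by_frequency(freqs["cluster1"], min_count=2))
print(exclusive_genes(freqs, "cluster1"))
```

The filtered list returned by `filter_by_frequency` is the kind of input on which a follow-up ORA could be run.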
Finally, all results can be downloaded directly from the application, and the plots can
be exported in several formats, including .jpg, .png, and .pdf. Furthermore, users can save their
analysis as an RData file, allowing them to quickly resume their work by re-uploading it to
the application or importing it into the R package for further analysis. This functionality

fosters effective interaction and collaboration among multidisciplinary teams, improves
workflow efficiency, and supports ongoing research efforts.
In summary, GeneSetCluster 2.0 is accessible online and allows users to perform all
its functions in a user-friendly manner. A comprehensive and informative user guide is
available in the About tab at https://translationalbio.shinyapps.io/genesetcluster/.

Results
Enhancing biological interpretation
To evaluate the performance of GeneSetCluster 2.0 and compare it with both GeneSetCluster
1.0 and a custom manual analysis, we applied both versions to a publicly available
single-cell RNA-seq dataset [41]. The selected dataset explored the molecular basis of
myelodysplastic syndromes with a deletion of the long arm of chromosome 5, del(5q), by
analyzing the transcriptional and regulatory landscape of CD34+ progenitor cells using
single-cell RNA-seq. Additionally, the study evaluated the impact of lenalidomide
treatment on transcriptional alterations.
The authors conducted differential expression analyses across five comparisons: (1)
non-del(5q) cells of complete responders versus at diagnosis; (2) non-del(5q) cells of
partial responders versus at diagnosis; (3) del(5q) cells of partial responders versus
non-responders; (4) non-del(5q) cells of complete responders versus healthy cells; and (5)
non-del(5q) cells of partial responders versus healthy cells. Complete responders are
patients who achieved a complete cytogenetic response to lenalidomide treatment, while
partial responders are those with a partial cytogenetic response. For each comparison,
GSEA was conducted, identifying 20 distinct gene-sets: 12 gene-sets were
found in the first three comparisons, while the remaining 8 were identified in the last
two comparisons. The authors manually grouped them into 7 clusters based on their
biological implications: (1) Cluster 1: ubiquitin processes (7 gene-sets); (2) Cluster 2:
proteasome-mediated processes (2 gene-sets); (3) Cluster 3: autophagy (2 gene-sets);
(4) Cluster 4: erythropoietin signaling (2 gene-sets); (5) Cluster 5: PD-L1/PD-1
checkpoint pathway (1 gene-set); (6) Cluster 6: phosphatidylinositol signaling system
(1 gene-set); and (7) Cluster 7: mitochondrial and ribosomal translation (8
gene-sets). These original results are visualized in Fig. 7a of the publication [41].
GeneSetCluster 1.0 and GeneSetCluster 2.0 were applied to these gene-sets. In the
previous version, using “Raw Gene-sets” and k-means/hierarchical clustering methods,
only two large clusters emerged: one associated with mitochondrial translation (8
gene-sets) and another with proteasome-mediated processes (12 gene-sets). In contrast,
GeneSetCluster 2.0, applying the “Unique Gene-sets” and “Seriation-based” methods,
generated four distinct clusters: (1) Cluster 1: mitochondrial translation (8 gene-sets);
(2) Cluster 2: protein polyubiquitination (4 gene-sets); (3) Cluster 3: proteasome-mediated
processes (2 gene-sets); and (4) Cluster 4: autophagosome assembly (2 gene-sets). Three
gene-sets remained unclustered, as the new method does not force gene-sets into clusters
and requires a minimum of two gene-sets per cluster. The biological significance
of each cluster was determined through an ORA of the cluster-associated genes against
the biological process database. Figure 5 illustrates these results. This analysis can be
performed using both the R package and the web version of the tool. In this case study,
default parameters were used, and the optimal seriation algorithm selected automati-
cally was OLO_average [27].
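
The minimum-size rule mentioned above (at least two gene-sets per cluster, with the rest left unclustered) can be illustrated with a short sketch. It only shows the rule applied to a tentative grouping; how the seriation-based method arrives at that grouping is not reproduced here, and the function name and labels are hypothetical.

```python
from collections import Counter

def mark_unclustered(labels, min_size=2):
    """Set the label to None for gene-sets whose group has fewer than `min_size` members."""
    sizes = Counter(labels.values())
    return {name: (label if sizes[label] >= min_size else None)
            for name, label in labels.items()}

# Hypothetical tentative grouping of five gene-sets
tentative = {"gs1": "A", "gs2": "A", "gs3": "B", "gs4": "C", "gs5": "C"}
final = mark_unclustered(tentative)
print(final)  # gs3 is alone in group "B", so it is reported as unclustered (None)
```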

Fig. 5 Performance comparison between GeneSetCluster 1.0 and GeneSetCluster 2.0. The first column shows the 20
gene-sets being analyzed. The “Manual” column presents the clusters manually curated by experts. The “GeneSetCluster
1.0” column displays results generated using the “Raw Gene-sets” and “Classic” methods. The “GeneSetCluster
2.0” column shows the improved results using the “Unique Gene-sets” and “Seriation-based” methods

Reducing computational times


To assess the effectiveness of the parallelization scheme, the CombineGeneSets function
was tested using datasets of various sizes: small (239 gene-sets), medium (1000 gene-sets),
and large (2287 gene-sets), with different numbers of threads: 1, 2, 4, 6, 8, and 10.
Each condition was run ten times and its execution time recorded. The tests were
conducted on a workstation equipped with an Apple M1 Pro processor with a 10-core CPU
and 16 GB of RAM.
The mean execution times (in seconds) for each configuration are shown in Table S2.
Performance was slowest with a single thread, and execution times decreased markedly as
threads were added, up to four threads, especially for the medium and large datasets.
For instance, the medium dataset’s processing time dropped from 125 s with one thread to
42 s with four. Similarly, the large dataset achieved substantial gains, with processing
time falling from 1945 s with a single thread to 440 s with six. This pattern suggests
that using four to six threads effectively balances computational load and efficiency in
GeneSetCluster 2.0.
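
The timings quoted above can be turned into speedup and parallel-efficiency figures with a short calculation. The sketch uses only the four mean times reported in the text; the function names are illustrative.

```python
def speedup(t_single, t_parallel):
    """Ratio of single-thread time to multi-thread time."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, threads):
    """Speedup divided by the number of threads (1.0 would be ideal scaling)."""
    return speedup(t_single, t_parallel) / threads

# (single-thread seconds, multi-thread seconds, threads) as reported in the text
timings = {"medium": (125.0, 42.0, 4), "large": (1945.0, 440.0, 6)}
for name, (t1, tp, threads) in timings.items():
    print(f"{name}: speedup {speedup(t1, tp):.2f}x, "
          f"efficiency {efficiency(t1, tp, threads):.2f}")
# medium: speedup 2.98x, efficiency 0.74
# large: speedup 4.42x, efficiency 0.74
```

Both configurations therefore run at roughly three quarters of ideal linear scaling, which is consistent with the recommendation of four to six threads.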
By implementing parallelization, GeneSetCluster 2.0 enhances the scalability and
speed of the CombineGeneSets step, allowing users to analyze large datasets more
effectively.

Discussion
High-throughput technologies are crucial for profiling biological systems, but interpret-
ing the large number of features identified can be overwhelming. Gene-Set Analysis
(GSA) simplifies this by grouping related genes for functional interpretation. Common
methods include Over-Representation Analysis and Gene-Set Enrichment Analysis,
which assess the importance of gene-sets. However, redundant and overlapping pro-
cesses often make GSA results difficult to interpret. To overcome this challenge, we

developed GeneSetCluster 1.0 [21], a data-driven approach that clusters gene-sets
and enables the biological interpretation of those clusters. Despite its successful
application [22, 23], GeneSetCluster 1.0 had limitations, including non-interpretable
clusters and accessibility barriers. We therefore developed GeneSetCluster 2.0 to overcome
these challenges through three advances: methodological development, reduced
computational times, and a user-friendly web-based version of the tool.
The methodological improvements in GeneSetCluster 2.0 can be directly observed in
its ability to avoid the over-aggregation of unrelated gene-sets that occurred in the
previous version. By combining the “Unique Gene-sets” approach with the “Seriation-based”
clustering method, the tool separates gene-sets more effectively, allowing biologically
related processes to be grouped with greater precision. For example, whereas
GeneSetCluster 1.0 combined 12 gene-sets into a single broad proteasome-mediated
processes cluster, GeneSetCluster 2.0 refined this into biologically distinct clusters,
including protein polyubiquitination, proteasome-mediated processes, and autophagosome
assembly. This additional granularity in the clustering enables more precise biological
interpretations.
By leaving three gene-sets unclustered, GeneSetCluster 2.0 avoids forcing unrelated
gene-sets into clusters, maintaining a higher degree of biological relevance;
furthermore, those unclustered gene-sets can be investigated separately if
required. Thus, in this case study, version 2.0 demonstrates superior clustering
capability, yielding clearer and more meaningful biological insights from complex
gene-set data.
Another key improvement is the significant reduction in computational times
achieved through optimized parallelization in GeneSetCluster 2.0. The CombineGene-
Sets step, which calculates distances between gene-sets, now leverages parallel process-
ing techniques. By distributing the workload efficiently across multiple processors, the
tool achieves faster execution times, especially for larger datasets. This optimization
enhances scalability and ensures users can analyze extensive data more effectively.
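
The idea of distributing pairwise distance calculations across workers can be sketched as follows. Several assumptions apply: the sketch is in Python, not the package's R code; it uses the Jaccard distance as a stand-in for whatever distance CombineGeneSets computes; and it uses a thread pool purely to show how the pairs are split among workers (in CPython a process pool would be needed for a real speedup on CPU-bound work). All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def pairwise_distances(gene_sets, threads=4):
    """Symmetric distance table between gene-sets, computed pair by pair in a pool."""
    names = sorted(gene_sets)
    sets = {name: set(gene_sets[name]) for name in names}
    pairs = list(combinations(names, 2))  # each unordered pair once
    with ThreadPoolExecutor(max_workers=threads) as pool:
        values = list(pool.map(
            lambda pair: jaccard_distance(sets[pair[0]], sets[pair[1]]), pairs))
    dist = {a: {b: 0.0 for b in names} for a in names}
    for (a, b), d in zip(pairs, values):
        dist[a][b] = dist[b][a] = d  # fill both halves of the symmetric table
    return dist

# Placeholder gene-sets (not real data)
example = {"gs1": ["A", "B", "C"], "gs2": ["B", "C", "D"], "gs3": ["X"]}
table = pairwise_distances(example, threads=2)
print(table["gs1"]["gs2"], table["gs1"]["gs3"])
```

Because the number of pairs grows quadratically with the number of gene-sets, this is the step where splitting the work pays off most for large inputs.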
Finally, incorporating the web-based version makes GeneSetCluster 2.0 more acces-
sible to users with limited bioinformatics expertise. Similar to tools developed in other
areas, such accessibility promotes collaboration and more efficient communication
within multidisciplinary teams [42–44]. Furthermore, a standout feature of GeneSetCluster
2.0 is the seamless integration between the R package and the web application,
enabling users to transition their analyses and results between platforms effortlessly. For
instance, analyses initiated in the R package can be saved and uploaded into the web
application to continue exploration with its intuitive interface. Similarly, work started in
the web application can be exported back to R for more detailed or customized work-
flows. This two-way compatibility ensures that users can tailor their workflow to suit
their preferences and expertise, facilitating real-time collaboration and efficient sharing
of insights.
Despite the novelties and improvements introduced in version 2.0, the discoveries
made using GeneSetCluster remain inherently limited to known pathways and biological
processes. To overcome such limitations, future development
should focus on integrating exploratory tools that relate gene-set clusters to phenotype
information in a more dynamic manner, similar to existing tools [45].
However, to our knowledge, no existing tool provides the same framework and flexibility
as GeneSetCluster 2.0.

It is noteworthy that, more than 20 years after the inception of GSEA, the method
is still under development [46]. Furthermore, significant advancements are still required to
facilitate deeper yet low-complexity biological interpretations. Responding to this need,
GeneSetCluster 2.0 improves upon version 1.0 and brings a unique exploratory
framework to the field, enabling researchers to better characterize gene-sets and their
associated clusters.

Conclusion
In conclusion, GeneSetCluster 2.0 represents a significant advancement in gene-set
analysis interpretation for the research community. Methodologically, it enhances the
identification of gene-set clusters and their exploration. Computationally, it ensures
faster processing of large datasets. Most importantly, the new web application improves
accessibility for researchers with varying levels of coding expertise. Integrating the web
application with the R package fosters collaboration between bioinformaticians and
biologists, supporting multidisciplinary research efforts. The web application is available at
https://translationalbio.shinyapps.io/genesetcluster/. The updated R package, comprehensive
documentation, and supporting materials can be downloaded from GitHub at
https://github.com/TranslationalBioinformaticsUnit/GeneSetCluster2.0.

Abbreviations
GSA Gene-set analysis
ORA Over-representation analysis
GSEA Gene-set enrichment analysis
GO Gene ontology
KEGG Kyoto encyclopedia of genes and genomes
MSigDB Molecular signatures database
GTEx Genotype-tissue expression
RR Relative risk
HPO Human phenotype ontology
MP Mammalian phenotype

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s12859-025-06249-3.

Supplementary Material 1

Acknowledgements
We thank all users of GeneSetCluster 1.0 who provided valuable feedback that helped shape the development of version
2.0.
Author contributions
AO-L, EE, and DG-C developed the concept. AO-L and AM conducted methodology development, software
implementation, data analysis, visualization, validation, and manuscript preparation with the supervision of EE and DG-C.
JT, VL, and DG-C contributed to manuscript writing, review, and editing. All authors reviewed the tool and provided
suggestions for the development of the R tool and the user-friendly environment. All authors proofread the paper.
Funding
This project has received funding from the European Union’s Horizon Europe program under grant agreement No.
101070950 (X-PAND). Additionally, this work is supported by grants from the Swedish Research Council, the Swedish
Brain Foundation, the Swedish Association for Persons with Neurological Disabilities, and the Swedish MS Foundation. LK is
supported by a fellowship from the Margaretha af Ugglas Foundation.
We acknowledge the KAUST Baseline Awards, with D.G.-C. supported by KAUST Baseline Award no. BAS/1/1093-01-01,
and J.T. supported by KAUST Baseline Award no. BAS/1/1078-01-01.
Data availability
The GeneSetCluster 2.0 R package and associated documentation are freely available on GitHub at
https://github.com/TranslationalBioinformaticsUnit/GeneSetCluster2.0. The web-based Shiny application can be accessed at
https://translationalbio.shinyapps.io/genesetcluster/. The tissue expression database derived from GTEx data can be downloaded from
https://doi.org/10.6084/m9.figshare.25965664.v1 and is also accessible through the package's API. Example datasets and
documentation are included in the package repository. The example datasets analyzed in this study are available in the

Gene Expression Omnibus (GEO) database under accession codes GSE111385 and GSE198256. The single-cell RNA-seq
data used in the Results section for performance comparison can be accessed under GEO accession code GSE245452.

Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.

Received: 4 February 2025 / Accepted: 7 August 2025

References
1. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput
Biol. 2012;8(2): e1002375.
2. Reimand J, Isserlin R, Voisin V, Kucera M, Tannus-Lopes C, Rostamianfar A, et al. Pathway enrichment analysis and
visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 2019;14(2):482–517.
3. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, et al. GO::TermFinder—open source software for accessing gene
ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformat-
ics. 2004;20(18):3710–5.
4. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowl-
edge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.
5. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, et al. GREAT improves functional interpretation of cis-
regulatory regions. Nat Biotechnol. 2010;28(5):495–501.
6. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat
Genet. 2000;25(1):25–9.
7. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The reactome pathway knowledgebase.
Nucleic Acids Res. 2018;46(D1):D649–55.
8. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks
involving diseases and drugs. Nucleic Acids Res. 2010;38:355.
9. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene
set collection. Cell Syst. 2015;1(6):417–25.
10. Yon Rhee S, Wood V, Dolinski K, Draghici S. Use and misuse of the gene ontology annotations. Nat Rev Genet.
2008;9(7):509–15.
11. Davis MJ, Sehgal MSB, Ragan MA. Automatic, context-specific generation of Gene Ontology slims. BMC Bioinfo.
2010;11(1):498.
12. Grossmann S, Bauer S, Robinson PN, Vingron M. Improved detection of overrepresentation of gene-ontology annotations
with parent–child analysis. Bioinformatics. 2007;23(22):3024–31.
13. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23(2):257–8.
14. Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating
GO graph structure. Bioinformatics. 2006;22(13):1600–7.
15. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, et al. ClueGO: a Cytoscape plug-in to decipher func-
tionally grouped gene ontology and pathway annotation networks. Bioinformatics. 2009;25(8):1091–3.
16. Bu D, Luo H, Huo P, Wang Z, Zhang S, He Z, et al. KOBAS-i: intelligent prioritization and exploratory visualization of biologi-
cal functions for gene enrichment analysis. Nucleic Acids Res. 2021;49(W1):W317–25.
17. Prummer M. Enhancing gene set enrichment using networks. F1000Res. 2019;8:129.
18. Wang G, Oh DH, Dassanayake M. GOMCL: a toolkit to cluster, evaluate, and extract non-redundant associations of Gene
Ontology-based functions. BMC Bioinfo. 2020;21(1):139.
19. Zhou Y, Zhou B, Pache L, Chang M, Khodabakhshi AH, Tanaseichuk O, et al. Metascape provides a biologist-oriented
resource for the analysis of systems-level datasets. Nat Commun. 2019;10(1):1523.
20. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):
e1000443.
21. Ewing E, Planell-Picola N, Jagodic M, Gomez-Cabrero D. GeneSetCluster: a tool for summarizing and integrating gene-set
analysis results. BMC Bioinfo. 2020;21(1):443.
22. Kular L, Ewing E, Needhamsen M, Pahlevan Kakhki M, Covacu R, Gomez-Cabrero D, et al. DNA methylation changes in glial
cells of the normal-appearing white matter in multiple sclerosis patients. Epigenetics. 2022;17(11):1311–30.
23. Ewing E, Kular L, Fernandes SJ, Karathanasis N, Lagani V, Ruhrmann S, et al. Combining evidence from four immune cell
types identifies DNA methylation patterns that implicate functionally distinct pathways during multiple sclerosis progres-
sion. EBioMedicine. 2019;43:411–23.
24. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math.
1987;20:53–65.
25. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 2010;31(8):651–66.
26. Liiv I. Seriation and matrix reordering methods: an historical overview. Stat Anal Data Min: ASA Data Sci J. 2010;3(2):70–91.
27. Hahsler M, Hornik K, Buchta C. Getting things in order: an introduction to the R package seriation. J Stat Softw.
2008;25(3):1–34.

28. Caraux G, Pinloche S. PermutMatrix: a graphical environment to arrange gene expression profiles in optimal linear order.
Bioinformatics. 2005;21(7):1280–1.
29. Earle D, Hurley CB. Advances in dendrogram seriation for application to visualization. J Comput Graph Stat.
2015;24(1):1–25.
30. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive
gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
31. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: protein-protein
association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res.
2023;51(D1):D638–46.
32. Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS.
2012;16(5):284–7.
33. Gu Z, Hübschmann D. simplifyEnrichment: a Bioconductor package for clustering and visualizing functional enrichment
results. Genom Proteom Bioinfo. 2023;21(1):190–202.
34. Aguet F, Anand S, Ardlie KG, Gabriel S, Getz GA, Graubert A, et al. The GTEx Consortium atlas of genetic regulatory
effects across human tissues. Science. 2020;369(6509):1318–30.
35. Weston S, Corporation M. doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.15. 2019.
36. Weston M, others. foreach: Provides Foreach Looping Construct. R package version 1.5.1. 2020.
37. Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The human phenotype ontology: a tool for annotating and
analyzing human hereditary disease. Am J Hum Genet. 2008;83(5):610–5.
38. Smith CL, Goldsmith CAW, Eppig JT. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing
phenotypic information. Genome Biol. 2004;6(1):R7. https://doi.org/10.1186/gb-2004-6-1-r7.
39. Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. GeneCards: a novel functional genomics compendium with automated
data mining and query reformulation support. Bioinformatics. 1998;14(8):656–64.
40. Eppig JT. Mouse Genome Informatics (MGI) resource: genetic, genomic, and biological knowledgebase for the laboratory
mouse. ILAR J. 2017;58(1):17–41.
41. Serrano G, Berastegui N, Díaz-Mazkiaran A, García-Olloqui P, Rodriguez-Res C, Huerga-Dominguez S, et al. Single-cell
transcriptional profile of CD34+ hematopoietic progenitor cells from del(5q) myelodysplastic syndromes and impact of
lenalidomide. Nat Commun. 2024;15(1):5272.
42. Jagadish HV. Big data and science: myths and reality. Big Data Res. 2015;2(2):49–52.
43. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3.
44. van der Velde KJ, Imhann F, Charbon B, Pang C, van Enckevort D, Slofstra M, et al. MOLGENIS research: advanced bioinfor-
matics data software for non-bioinformaticians. Bioinformatics. 2019;35(6):1076–8.
45. Bhuva DD, Tan CW, Liu N, Whitfield HJ, Papachristos N, Lee SC, et al. vissE: a versatile tool to identify and visualise higher-
order molecular phenotypes from functional enrichment analysis. BMC Bioinfo. 2024;25(1):64.
46. Koopmans F. GOAT: efficient and robust identification of gene set enrichment. Commun Biol. 2024;7(1):744.

Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
