
Journal Pre-proof

Categorical classification of skin cancer using a weighted ensemble of transfer learning with test time augmentation

Aliyu Tetengi Ibrahim, Mohammed Abdullahi, Armand Florentin Donfack Kana, Mohammed Tukur Mohammed, Ibrahim Hayatu Hassan

PII: S2666-7649(24)00053-5
DOI: https://doi.org/10.1016/j.dsm.2024.10.002
Reference: DSM 123

To appear in: Data Science and Management

Received Date: 14 February 2024


Revised Date: 21 October 2024
Accepted Date: 22 October 2024

Please cite this article as: Ibrahim, A.T., Abdullahi, M., Donfack Kana, A.F., Mohammed, M.T., Hassan,
I.H., Categorical classification of skin cancer using a weighted ensemble of transfer learning with test
time augmentation, Data Science and Management, https://doi.org/10.1016/j.dsm.2024.10.002.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.

© 2024 Xi’an Jiaotong University. Publishing services by Elsevier B.V. on behalf of KeAi
Communications Co. Ltd.
Categorical classification of skin cancer using a weighted ensemble of transfer learning with test time augmentation

Aliyu Tetengi Ibrahim (a,1), Mohammed Abdullahi (a,2), Armand Florentin Donfack Kana (a,3), Mohammed Tukur Mohammed (c,4), Ibrahim Hayatu Hassan (a,b,5)

a Department of Computer Science, Ahmadu Bello University, Zaria, 810006, Nigeria
b Institute for Agricultural Research, Ahmadu Bello University, Zaria, 810006, Nigeria
c National Cereals Research Institute, Badeggi, 912104, Nigeria

1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected]

Categorical classification of skin cancer using a weighted ensemble of transfer learning with test time augmentation
Abstract
Skin cancer is the abnormal development of cells on the surface of the skin and is one of the most fatal diseases in humans. It usually appears in locations that are exposed to the sun but can also appear in areas that are not regularly exposed. Due to the striking similarities between benign and malignant lesions, skin cancer detection remains a problem, even for expert dermatologists. Considering the inability of dermatologists to diagnose skin cancer accurately, a convolutional neural network (CNN) approach was used for skin cancer diagnosis. However, CNN models require a large number of images for good performance; because medical images are scarce, image augmentation and transfer learning techniques were used in this study to boost both the number of images and the performance of the model. This study proposes an ensemble transfer-learning-based model that can efficiently classify skin lesions into one of seven categories to aid dermatologists in skin cancer detection: (i) actinic keratoses, (ii) basal cell carcinoma, (iii) benign keratosis, (iv) dermatofibroma, (v) melanocytic nevi, (vi) melanoma, and (vii) vascular skin lesions. Five transfer learning models were used as the basis of the ensemble: MobileNet, EfficientNetV2B2, Xception, ResNeXt101, and DenseNet201. In addition to stratified 10-fold cross-validation, the results of the individual models were fused to achieve greater classification accuracy. An annealing learning rate scheduler and test-time augmentation (TTA) were also used to increase the performance of the model during the training and testing stages. A total of 10,015 publicly available dermoscopy images from the HAM10000 (Human Against Machine) dataset, which contains samples from the seven common skin lesion categories, were used to train and evaluate the models. The proposed technique attained 94.49% accuracy on the dataset, with weighted-average F1-score, recall, and precision of 94.68%, 94.49%, and 95.07%, respectively. These results suggest that this strategy can be useful for improving the accuracy of skin cancer classification.

Keywords: Skin cancer; Test time augmentation; Annealing learning rate scheduler; Dermoscopy; Transfer learning; Deep convolutional neural network.

1. Introduction

The broad category of cancer includes a vast range of illnesses that can affect any part of the body. The rapid development of aberrant cells that outgrow their normal range is a characteristic of cancer that enables the disease to invade nearby body parts and spread to other organs. In 2022, nearly 20 million new cancer cases were reported worldwide, along with 9.7 million cancer-related deaths, both figures including nonmelanoma skin cancers (NMSCs). Estimates indicate that approximately one in five individuals will develop cancer during their lifetime, whereas approximately one in nine men and one in 12 women will succumb to the disease (Bray et al., 2024). Skin cancer is the most prevalent form of malignancy among humans, especially in Caucasians (Simoes et al., 2015). Nonmelanoma and melanoma are the two subtypes of this cancer, with melanoma being the more severe form because of its tendency for rapid metastasis. Globally, approximately 232,100 cases (1.7%) of all newly diagnosed primary malignant cancers, excluding NMSCs, are cutaneous melanoma. Additionally, approximately 55,500 cancer-related deaths (0.7% of all cancer-related deaths) occur annually due to cutaneous melanoma (Schadendorf et al., 2018). According to the World Health Organization, approximately 132,000 cases of melanoma and 2–3 million cases of nonmelanoma skin lesions are discovered worldwide each year, with skin lesions accounting for one in every three cancer cases. In 2018, the incidence of skin cancer (melanoma as well as nonmelanoma) was 1,329,781 new cases, ranking third among all cancers (World Health Organization, 2017). Although melanoma is less prevalent than other skin malignancies, accounting for only approximately 1% of all cases, it is responsible for most deaths caused by skin cancer. Over the last few decades, the number of people dying of skin cancer has increased.
Thinking alphabetically is a quick way to recall typical melanoma traits, namely the melanoma ABCDEs: asymmetry, border, color, diameter, and evolution. These diagnoses can be made visually. However, most skin tumors exhibit ABCDE features, which increases the likelihood of misdiagnosis. Clinical processes such as biopsies, seven-point checklists, and pattern analysis are also error-prone (Charan et al., 2020). Clinicians frequently perform biopsies for the detection of skin cancer: to determine whether a suspected skin lesion is cancerous, a sample must be collected for medical testing. This process is painful, slow, and time-consuming. Computer technology allows for a more comfortable, less expensive, and faster diagnosis of skin cancer symptoms (Dildar et al., 2021).
Although skin cancer can be fatal, research suggests that if it is detected and treated early, the chance of survival increases (Conic et al., 2018). If these types of skin cancers can be diagnosed at an early stage, there is a better chance of a successful cure. Specifically, melanoma is treatable with a 95% five-year survival rate if discovered early (World Health Organization, 2017). Notably, survival rates vary according to ethnicity. Because their skin tissue contains a significant amount of melanin, people of color have a lower risk of developing skin cancers. Research cites sun exposure as the primary cause of skin cancer, and one study found that 86% of individuals with melanoma were exposed to ultraviolet (UV) rays for a prolonged period (Parkin et al., 2011). Malignant skin lesions are difficult to distinguish from benign skin lesions because they appear similar. Dermatologists can diagnose skin cancer with 60% accuracy using only their eyes; with a dermoscope and proper training, they can attain a superior accuracy of 75% to 84% (Vestergaard et al., 2008). Dermatologists mostly identify skin cancer through visual inspection, which is a difficult process given the apparent similarity of skin malignancies. This method is useful for detecting the early stages of skin cancer, leading to fewer deaths, but thoroughly inspecting a skin lesion with the naked eye takes time and may be ineffective. Study findings indicate that dermatologists require substantial training and expertise to appropriately diagnose the lesion class through visual inspection, which is why relying on it alone is strongly discouraged (Morton et al., 1998). According to studies on clinical dermatologists' ability to make accurate diagnoses, a dermatologist with more than ten years of experience can achieve a diagnostic accuracy of 80%, whereas dermatologists with 3–5 years of experience only reach 62%, and accuracy decreases further with less experience (Morton et al., 1998). Because inexperienced dermatologists may diagnose skin lesions with reduced accuracy, studies on dermoscopy suggest the necessity of establishing an automated, robust, and efficient system for skin cancer detection. To overcome inter-observer discrepancies, melanoma diagnosis requires the experience of well-trained specialists. Consequently, automatic melanoma recognition can improve the efficiency and accuracy of early cancer detection (Hosny et al., 2019).
The events of the last 50 years have shown that artificial intelligence (AI) is as important as electricity was when it was discovered. Brinker et al. (2019) built a convolutional neural network (CNN) model for skin lesion categorization and compared its performance with that of 145 skin cancer physicians from 12 German university hospitals. To compare the performance of the CNN with that of dermatologists, the CNN was trained using cutting-edge deep learning techniques on 12,378 open-source dermoscopic images and 100 clinical images. Sensitivity and specificity were evaluated for both the dermatologists and the deep neural network (DNN). Using clinical images, the dermatologists achieved an average sensitivity of 89.4% (range 55.0%–100%) and specificity of 64.4% (range 22.5%–92.5%). At the same sensitivity, the CNN model achieved a mean specificity of 68.2% (range 47.5%–86.25%). On average, the CNN had a specificity of 61.1% at a high sensitivity of 92.8%. At the specificity of each dermatologist, only 19 of the 145 dermatologists achieved higher sensitivity than the CNN.
Several medical issues can be resolved with the assistance of AI. In contrast to dermatologists alone, the use of AI to improve the precision of skin cancer diagnosis can lead to better outcomes. Dermoscopy has enhanced the diagnostic accuracy of physicians in recognizing skin cancer; however, its accuracy remains relatively low. AI offers significant assistance in the early assessment and diagnosis of skin cancer. Over the past decade, there has been a surge in AI research and publications. Studies have demonstrated that CNN algorithms can classify skin lesions from dermoscopic images with performance equal to or better than that of clinicians (Liopyris et al., 2022). In seven out of 11 studies, AI showed better diagnostic accuracy than dermatologists (Brancaccio et al., 2023). There has been considerable interest in DNNs owing to their remarkable performance in various computer vision tasks. These models comprise numerous layers that work together to extract meaningful representations from high-dimensional input data at many levels of abstraction (LeCun et al., 2015). Globally, deep learning is the most advanced method for analyzing medical images, and the CNN is an effective deep learning method for automatically extracting significant features from image data.

The gold standard for diagnosing melanoma involves a combination of clinical evaluation, dermoscopy, and histopathological examination. During clinical evaluation, dermatologists or trained healthcare providers perform a thorough examination of the skin to identify moles and other signs that can be indicative of skin cancer, using the ABCDE approach to spot suspicious lesions. Dermoscopy has gained popularity for skin cancer diagnosis because of its ability to accurately identify skin lesions that are not visible to the naked eye (Chaturvedi et al., 2020). This method allows for the visualization of subsurface structures that are invisible to the naked eye and are correlated with histopathological structures (Rao & Ahn, 2012). Systematic reviews have shown that assessing skin lesions with dermoscopy is more accurate than visual inspection alone for detecting melanomas and basal cell carcinomas, particularly in secondary care settings (Dinnes et al., 2018). However, a clinical examination combined with dermoscopy alone is insufficient for a definitive diagnosis. Although dermoscopy plays a significant role in early detection, histopathological examination remains the definitive method for diagnosing melanoma (Mihulecea et al., 2023). This process involves obtaining a biopsy specimen of a suspicious lesion and analyzing the tissue under a microscope to confirm the diagnosis. A pathologist evaluates cell morphology and other characteristics to confirm whether the lesion is malignant. Although conventional pathology remains the gold standard for the diagnosis of melanoma, it has some limitations. Significant discrepancies exist among pathologists regarding classification, terminology, the significance of subtypes, and the model of tumor progression; most notably, there is disagreement on diagnoses (Elmore et al., 2017).
Melanoma differs from NMSCs in that it has the potential to spread locally, regionally, and distantly. Melanoma and NMSCs are the most commonly diagnosed skin cancers in the Caucasian population; NMSC is the more prevalent subtype, whereas melanoma is the more lethal (Sung et al., 2021). NMSCs usually affect non-melanocyte cells such as keratinocytes or epidermal cells, whereas melanoma affects melanocytes, the pigment-producing cells of the skin (Prajapat et al., 2023). NMSCs originate from various types of skin cells and include basal cell carcinoma (BCC), squamous cell carcinoma (SCC), actinic keratosis (AK), sebaceous cell carcinoma, and other rare tumors such as Merkel cell carcinoma, Bowen's disease, adnexal tumors, and Kaposi sarcoma (Hyeraci et al., 2023; Grosu-Bularda et al., 2018). Melanoma usually develops in the lower layer of the epidermis and results from the neoplastic transformation of melanocytes located in the stratum basale of the skin. Melanoma also develops in other areas, such as the inner ear, bones, heart, and uvea of the eyes (Leiter et al., 2020). Melanoma development is a multistep process that includes five distinct clinical and histological stages (Liu et al., 2018). Melanoma is frequently ulcerated and prone to inflammation, bleeding, scaling, and shedding (Arivazhagan et al., 2022). It begins by growing and spreading through the outer skin layer before penetrating deeper layers, eventually connecting with the blood and lymph vessels. A normal mole is usually a single uniform color, such as brown, black, or tan, with a distinctive border separating it from the surrounding skin. These moles are generally round or oval and smaller than 0.25 inches (approximately 6 mm) in diameter, about the size of a pencil eraser (Albahar, 2019). BCC, a type of NMSC, typically grows slowly and rarely metastasizes, unlike melanoma. However, BCC can cause significant patient morbidity and generally occurs only in areas with hair follicles and sebaceous glands (Mortada et al., 2023; Loh et al., 2016). Most BCCs occur on the head and neck (Camela et al., 2023).

Benign lesions frequently emerge from the hyperplasia of skin cells and lead to growths such as moles or dermatofibromas (Cohen et al., 2019). These lesions are normally noncancerous and pose minimal health risks (Reddy et al., 2024). In contrast, malignant skin lesions emerge from the uncontrolled growth of mutated skin cells, typically triggered by exposure to ultraviolet (UV) radiation, genetic predispositions, or other carcinogenic elements (Alonso-Belmonte et al., 2022). Benign skin lesions encompass a diverse array of conditions ranging from common moles to seborrheic keratoses. They are usually harmless growths or abnormalities characterized by specific visual cues such as regular borders, consistent coloration, and a symmetrical structure (Marghoob et al., 2019). Moles, cysts, and dermatofibromas are common benign lesions (Senel, 2011). The proliferation of melanocytes, the pigment-producing cells in the skin, leads to the formation of moles or nevi. In contrast, malignant skin lesions, which are essentially skin cancers, originate from the malignant transformation of skin cells (Esfahani et al., 2023). Malignant lesions may present abnormal borders, multiple colors, and asymmetry, reflecting the invasive characteristics of cancerous growth (Kazaj et al., 2022). More alarming evidence includes alterations in the size, shape, and elevation of pre-existing moles (Catalano et al., 2019).

Recently, the DNN field has progressed significantly, and several network architectures have shown promising results in image classification challenges. Using a process known as transfer learning (TL), such designs are employed in various applications worldwide (Weiss et al., 2016). TL is a machine learning (ML) research topic whose goal is to store knowledge acquired in solving one problem and transfer it to a different but related problem. In this strategy, the weights of a model that has already been trained for a specific task are reused for a range of tasks. Deep learning presents certain difficulties in accurately classifying skin lesions. First, access to high-quality data is essential to achieve good classification results with deep learning techniques; when employing a CNN for image classification, a large number of clear images is critical. If the number of images is small, the model cannot generalize. An image dataset containing five thousand (5,000) images is quite small for training a CNN model, and clear medical images are scarce. As a result, TL is required to improve deep learning performance in the classification of skin cancer.

The deep CNN approach proposed in this study is highly accurate for classifying skin lesions, using a variety of DNN models and a publicly available dataset (HAM10000) in combination with test-time augmentation (TTA). Different weights were assigned to each model based on its performance during individual training. The performance of each model was recorded, and the base models were not permitted to contribute equally during the ensemble; in other words, models that performed well were rewarded, whereas those that performed poorly were penalized.
This work makes the following notable contributions:

• ResNeXt101, MobileNet, EfficientNetV2, Xception, and DenseNet201 were adapted for TL. Features were retrieved from the dense layers using TL, and weighted majority voting was utilized in an ensemble framework to aggregate the predictions provided by softmax for the categorization of seven different classes of skin lesions.
• Image augmentation and model fine-tuning were performed during the training and testing phases to address the variance in image resolution and the class imbalance in the image data and to enhance the performance of the proposed model.
• A dynamic learning rate based on cosine annealing is proposed for optimal model performance.
• Lesion segmentation and other intensive image preprocessing procedures were not utilized in this work, making the task less complex and more efficient.
• A comparison study was conducted to analyze the performance of the five fine-tuned deep learning models and the associated weighted ensemble model with TTA to discover which model performed best on the HAM10000 dataset.
The remainder of this paper is organized as follows. Related studies are discussed in Section 2, along with a brief overview of the techniques. Section 3 describes the proposed deep-learning-based approach for skin cancer categorization, including the dataset, preprocessing, classification models, fine-tuning, the ensemble model with TTA, and the performance metrics. Section 4 presents the findings of the experiments and an analysis of the proposed model. Finally, a summary, conclusion, and future work are presented in Section 5.
2. Related works

Many attempts have been made to solve skin cancer diagnostic issues using computer algorithms since the 1990s. Binder et al. (1994) constructed an artificial neural network (ANN) to classify benign and malignant tumors using dermoscopic images. In contrast to human diagnosis, which has a specificity of 90%, the ANN model achieved only 88% specificity. Consequently, a wide range of approaches for diagnosing melanoma using dermoscopy or clinical imaging have been developed. For skin lesion categorization, several hybrid approaches that combine ML techniques, including deep learning approaches, were examined and achieved an 84% accuracy rate (Ramlakhan et al., 2011). However, a performance comparison of CNN models with other ML methods and techniques revealed that modern deep learning models produce better outcomes.

Esteva et al. (2017) used a deep CNN trained end-to-end directly from raw images, using only pixels and disease labels as inputs. The study used the Inception V3 CNN architecture, demonstrating AI capable of classifying skin cancer with a level of competence comparable to that of dermatologists. Brinker et al. (2019) evaluated a CNN trained using dermoscopic images to classify melanoma. It demonstrated performance comparable to that of dermatologists, achieving high sensitivity and specificity. However, there was a lack of diversity in the dataset (e.g., rare melanoma types) and an unaddressed class imbalance, which could bias the model toward majority classes, such as atypical nevi, potentially reducing its effectiveness in detecting melanomas in real-world applications.

Chaturvedi et al. (2020) used fine-tuned pretrained CNNs and ensemble models to classify skin cancer images. Five pretrained CNNs and four ensemble models were compared to analyze their performance; both ResNeXt101 and InceptionResNetV2 as individual models achieved an exceptionally high accuracy of 93.20%. Although the ensemble of InceptionResNetV2 and ResNeXt101 achieved good performance, it did not significantly surpass the accuracy of the individual ResNeXt101 model, which may be because weights were not assigned to the base models within the ensemble. Afza et al. (2022) presented a three-step classification technique for skin lesions. The proposed model was developed on a deep learning framework that followed the ResNet50 pretrained architecture, combining superpixel-based methods and deep learning. Although the hierarchical three-step superpixel segmentation aims to improve lesion boundary detection, this technique may not perform as well on very small lesions or irregular shapes, which are common in melanoma.

Le et al. (2020) presented a deep learning model designed to classify skin lesions into seven categories,
including melanoma. Using TL with pretrained ResNet50 models in combination with class-weighted
and focal loss functions, the authors addressed the imbalance between benign and malignant lesions.
However, the class weights were manually adjusted based on the ratios between the classes. An adaptive
or dynamic weighting approach could be explored to automatically fine-tune the class weights during
training. Chaturvedi et al. (2021) developed an efficient automated method for skin cancer classification
using a MobileNet model with TL. Although data augmentation is typically applied during the training
phase to artificially increase the size and diversity of the training data, it should also be applied during
the test phase (known as TTA) to improve the robustness and generalization of the model.
Rahman et al. (2021) proposed a deep-learning system for classifying skin lesions. Their study used
five state-of-the-art architectures (ResNeXt, SeResNeXt, ResNet, DenseNet, and Xception) combined
into a weighted ensemble model. However, three of the five ensemble models (ResNeXt, SeResNeXt,
and ResNet) belong to similar architectural families, potentially limiting the diversity of the extracted
features. Adding models from different architectural families could enhance performance. Mohamed
and El-Behaidy (2019) designed a model that focused on improving skin lesion classification using the
MobileNet and DenseNet-121 models. The two models exhibited strong individual performance.
Leveraging the strengths of both models by combining them into a weighted ensemble could lead to
further performance gains. Because oversampling is applied to address data imbalance, using k-fold
cross-validation would give a more robust performance evaluation, reduce the risk of overfitting, and
improve model generalization.

Gessert et al. (2018) presented a method for classifying skin lesions using an ensemble of CNNs. A total of 54 base models were used in the ensemble. Although ensemble models are effective in improving performance, using more than 50 models in an ensemble can lead to computational inefficiency. Such a large ensemble would require significant training time and resource consumption, rendering it impractical for real-world applications.

Zhang et al. (2019) proposed an attention residual learning CNN (ARL-CNN) for skin lesion classification, which integrates residual learning and attention mechanisms to improve skin lesion classification in dermoscopy images. However, the 30 h training time of the proposed ARL-CNN50 model on an NVIDIA GTX Titan XP GPU underscores its computational inefficiency. Keerthana et al. (2023) developed two new hybrid CNN models with a support vector machine (SVM) that classify dermoscopy images as benign or melanoma lesions with higher accuracy than a traditional CNN model. However, CNNs often select a large number of features, many of which may be redundant or irrelevant for classification tasks. Without feature selection, irrelevant features introduce noise and hence reduce the efficiency of the SVM classifier.

Bansal and Sridhar (2022) designed a skin lesion classification method based on TL in tandem with ensemble techniques, namely DenseNet201 and MobileNet. However, no technique was used to prioritize the base models that performed well during individual training, which could reduce the overall efficiency of the ensemble results. Ali et al. (2022) proposed a framework for classifying skin cancer types using EfficientNet models. This study investigated the performance of EfficientNet models B0–B7 in classifying skin cancer types by identifying the version with the best performance. Different versions of EfficientNet were tested, but the potential benefit of combining these models in a hybrid approach was not explored. A hybrid technique could improve both accuracy and robustness across a wider range of skin lesion types.

Xin et al. (2022) demonstrated an improved transformer network called SkinTrans for skin cancer image classification. They achieved high accuracy on two datasets by leveraging the strengths of vision transformers (ViTs), contrastive learning, and label shuffling, and by capturing more complex spatial dependencies in the image data. Although label shuffling can improve class balance, it may disrupt natural patterns or relationships in the data. Exploring other imbalance strategies, such as focal loss or class weighting, could provide a more robust solution.

Priyadharshini et al. (2023) developed a skin cancer detection system using image processing techniques such as noise removal, contrast enhancement, fuzzy c-means clustering, principal component analysis (PCA), and an extreme learning machine with teaching-learning-based optimization (ELM-TLBO), achieving an accuracy rate of 93.18%. Although PCA is useful for dimensionality reduction, it may not always capture the relevant features and the complex relationships between features in skin images. Combining PCA with other feature extraction techniques or domain-specific features may improve performance.

3. Materials and methods

The sources of the datasets, data preparation, and model training are covered in this section, together with all the other materials and techniques utilized to develop the proposed model.

3.1 Data source

The training, evaluation, and testing of the proposed model were performed using the well-known HAM10000 challenge dataset. Early skin cancer categorization studies were hampered by the small number of available dermoscopic images. To address this issue, the HAM10000 dataset, which includes a large number of dermoscopy images, was published in 2018 (Tschandl et al., 2018). The collection contains 10,015 dermoscopy images with a resolution of 600 × 450 pixels, comprising 514 BCC images, 1,113 melanoma images, 6,705 melanocytic nevi images, 327 AK images, 1,099 benign keratosis images, 142 vascular images, and 115 dermatofibroma images. According to Le et al. (2020), the images were collected over a 20-year period from Cliff Rosendahl's skin cancer clinic in Queensland, Australia, and the Department of Dermatology at the Medical University of Vienna, Austria.
3.1.1 Data preprocessing

Although raw data can be fed into neural networks without preprocessing, thorough preparation is required to improve the performance of the system. To this end, several techniques were applied to prepare the image data so that the networks could extract important features effortlessly. A Keras ImageDataGenerator was used to preprocess the skin lesion images. To make the images suitable for the corresponding models, the skin lesion images in the dataset were rescaled from a 600 × 450 pixel resolution to a 224 × 224 pixel resolution. Stratified sampling was used at every level to preserve the interclass ratio between subgroups and to avoid the possibility of a minority class being excluded. To assess the consistency of the final model on HAM10000, stratified 10-fold cross-validation was performed on the final experiment's train-validation split.


3.1.2 Training data augmentation

The deep convolutional models used in this study required a large number of training images to achieve improved performance. The chosen dataset contained many images, but they were insufficient. To increase the stability of the model, the training data should be as diverse as possible. Data augmentation was utilized to increase the number and variety of the available images, which also helps improve prediction accuracy. The images in the HAM10000 dataset were unevenly distributed across the seven classes; data augmentation allowed the dataset classes to be rebalanced, reducing the underrepresentation of the minority classes. The image augmentation techniques used include height and width shifting, shearing, brightness adjustment, and zooming.
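As an illustration, a minimal sketch of such a training generator is shown below. It assumes the training images sit in a `train/` directory with one subfolder per class; the directory layout and the augmentation magnitudes are illustrative placeholders, not values reported in this paper.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentations named above: width/height shift, shear, brightness,
# and zoom; pixel values are also rescaled to [0, 1].
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    width_shift_range=0.1,       # illustrative magnitudes
    height_shift_range=0.1,
    shear_range=0.1,
    brightness_range=(0.8, 1.2),
    zoom_range=0.1,
)

# Images are resized from 600 x 450 to the 224 x 224 input size
# expected by the pretrained backbones.
train_gen = train_datagen.flow_from_directory(
    "train/",                    # hypothetical directory layout
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",    # seven lesion classes
)
```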
3.2 Model architectures

This study used TL to classify skin lesions by modifying the architectures and weights of the ResNeXt101, MobileNet, EfficientNetV2B2, Xception, and DenseNet201 models, which had already been trained on the ImageNet dataset. The following modifications were made, among others: (1) the use of global average pooling, and (2) the replacement of the top (i.e., classification) layer of each model with a dropout layer of 50% (0.5) followed by a final fully connected layer. To learn the new HAM10000 dataset properties, training was performed on Kaggle with the pretrained models' entire layer structure unfrozen. Before adding the final classification layer with seven neurons, global average pooling and a dropout layer were applied.
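A minimal sketch of this modification for a single backbone is given below (MobileNet is used as the example; the other four backbones would be adapted the same way, and the optimizer shown here is a placeholder):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNet

# ImageNet-pretrained backbone without its original classification head.
base = MobileNet(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3))
base.trainable = True  # the entire layer structure is left unfrozen

# Replacement top: global average pooling -> batch normalization ->
# 50% dropout -> dense softmax layer with seven neurons (one per class).
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```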
3.3 Implementation details
In the realm of medical imaging, limited labeled data are available, making the development of a high-performance deep learning system difficult. Insufficient training data was an issue, but it was mitigated using the data augmentation technique and by fine-tuning significant architectures trained on the ImageNet dataset. Deep pretrained models extract broad features in their initial layers before extracting target-specific information in their later layers. In this study, the features extracted by the initial pretrained model layers were employed. Fig. 1 depicts the general architecture of the proposed model. The modification began by removing the dense layers from each of the five pretrained models and replacing them with a global average pooling layer, followed by batch normalization, dropout, a final dense layer with seven neurons corresponding to the seven classes of the classifier system, and softmax activation.

In this study, multiple pretrained CNN architectures, including EfficientNetV2B2, Xception, ResNeXt101, DenseNet201, and MobileNet, were used to classify skin cancer images, relying exclusively on these architectures for feature extraction. Kim et al. (2022) performed experiments showing that TL works with limited data and suggested that practitioners and data scientists use deep models only as feature extractors, which might help reduce computational resources and time without losing predictive accuracy. The purpose of each of these models is to automatically learn features from the input images (skin cancer images in this context) during training that contribute to the classification of the lesions, capturing low- to high-level features including edges, textures, patterns, shapes, motifs, and fine details. The input images are processed through the convolutional layers, which extract hierarchical features that increasingly capture complex and abstract information. These features are relevant to skin cancer classification for the following reasons: texture and edge detection are vital for differentiating fine details in skin cancers; pattern and shape recognition assist in identifying irregularities in lesion shape and structure; and high-level semantic information is critical for understanding the overall context and morphology of lesions and for differentiating benign from malignant lesions. A dense layer of seven neurons with a softmax activation function was added to each architecture after feature extraction. This layer transformed the high-dimensional feature vectors into a probability distribution over the seven classes by outputting a probability for each class; the class with the highest probability was selected as the final prediction.

3.4 Hyperparameter tuning

Hyperparameters that determine the performance of neural networks include the learning rate, number of layers, activation functions, and optimization algorithm. The models were tuned to determine the optimum hyperparameter combination for improving their discrimination ability. Five deep learning models were trained in this study, and their hyperparameters were fine-tuned to achieve the best performance. All of them were trained numerous times with varied hyperparameter values for tuning purposes, because no technique for optimizing hyperparameters for TL was found in the studied literature. To achieve a good outcome, the models were first fine-tuned with various numbers of trainable layers, batch sizes, epochs, and learning rates using ImageNet weight initialization.

Fig. 1. The proposed model architecture. Note: TTA: test-time augmentation.

Finding the best learning rate is difficult. Therefore, a cosine annealing learning rate scheduler was used, which changes the learning rate at the start of each epoch. A dropout rate of 50% was applied, dropping units that did not contribute significantly to the model performance.
3.5 CNN ensemble
The results of an ensemble of DNNs were consistently superior to those of a single DNN. An automated strategy was therefore explored and developed using an ensemble of deep CNNs to obtain the maximum feasible accuracy in our image categorization scenario. The outputs of the classification layers, which use the output of the fully connected layers to determine the confidence values for each class, were considered for the ensemble of CNNs (there are n = 7 classes in this application).
ALGORITHM 1: FINDING THE OPTIMUM WEIGHT COMBINATION
Input:  Predictions P = {set of predictions p by the base models}
        Labels y = {set of true labels of the dataset, y ∈ ℝⁿ}
Output: Optimum weights for the ensemble (w1, ..., wk)
        Weighted ensemble accuracy score with optimum weights (max_a)
1   max_a = 0                           // initialize accuracy score
2   ow = 0                              // initialize optimum weights
3   For w1 = 0, ..., n:
4     For w2 = 0, ..., n:
5       For w3 = 0, ..., n:
6         For w4 = 0, ..., n:
7           For w5 = 0, ..., n:
8             w = [w1/10, w2/10, w3/10, w4/10, w5/10]
9             Pi·wi: prediction * weight for each base model
10            ŵ = (P1·w1, ..., Pk·wk)   // combine weighted predictions
11            ŷ = argmax(ŵ)             // get the prediction with max value
12            a = accuracy_score(y, ŷ)  // calculate weighted accuracy
13            if a > max_a:
14              max_a = a
15              ow = w
16            end
17          end
18        end
19      end
20    end
21  end

Fig. 2. Pseudocode of the grid search algorithm to find the optimum weight combination (OWC)

According to Harangi (2018), for proper formalization, a CNN is regarded as a function $g: y \to \mathbb{R}^n$ that assigns $n$ confidence values $p_i \in \mathbb{R}$ to a new, previously unseen image $y$, where $p_i \in [0,1]$ for $i = 1, \dots, n$ and $\sum_{i=1}^{n} p_i = 1$. In our specific work, the values $p_1, p_2, p_3, p_4, p_5, p_6$, and $p_7$ show the confidence of the given CNN that $y$ should be categorized as AK ($K_1$), benign keratosis ($K_2$), BCC ($K_3$), melanocytic nevi ($K_4$), melanoma ($K_5$), dermatofibroma ($K_6$), or vascular ($K_7$). As a straightforward decision, the CNN assigns $y$ to the class with the highest likelihood:

$$y \to K_i, \quad \text{if } p_i = \max(g(y)). \tag{1}$$

The CNNs in this situation should give each test image seven confidence values, showing the likelihood that the image would be classified as melanocytic nevi, vascular, benign keratosis, melanoma, AK, dermatofibroma, or BCC. As a result, we need to derive probabilities $p_i'$ ($p_i' \in [0,1]$, where $i = 1, \dots, 7$, and $\sum_{i=1}^{7} p_i' = 1$) from the confidence values of the five CNN models in the ensemble for each image. There are various aggregation schemes, such as the product of the probabilities (PP), the sum of maximal probabilities (SMP), simple majority voting (SMV), and the sum of the probabilities (SP). In most studies, the latter is used owing to its performance.
3.5.1 The sum of probabilities (SP)

According to Harangi (2018), SP is the most frequently used aggregation model and is represented as

$$p_i' = \frac{\sum_{j=1}^{m} p_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij}}, \quad i = 1, \dots, n, \tag{2}$$

where $p_{ij}$ denotes the confidence value of $\mathrm{CNN}_j$ that $y$ is contained in class $C_i$. In this application, we consider $j = 1, \dots, m = 5$ CNN model classifiers. The normalization term $\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij}$ is used to obtain $p_i' \in [0,1]$ for $i = 1, \dots, n$ with $\sum_{i=1}^{n} p_i' = 1$. In this fusion model, the confidence values of the members $\mathrm{CNN}_j$ ($j = 1, \dots, m$) are added, and the final label of each image is determined by independently selecting the class with the highest normalized sum. However, misclassification can occur easily when this approach is applied: if a classifier with poor overall accuracy misclassifies an image by giving the wrong class a high probability, while the other classifiers rate that class with low but nonzero values, the wrong class can still win. To address this issue, we assigned different weights to the models, giving the greatest weight to the models that classified the data correctly.
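As a sketch, the SP fusion of Eq. (2) amounts to summing the base models' class confidences and renormalizing; here `preds` is assumed to be a list of per-model softmax outputs of shape (num_samples, 7) (the name is hypothetical):

```python
import numpy as np

def sum_of_probabilities(preds):
    """Fuse base-model outputs by summing class confidences (Eq. (2))."""
    total = np.sum(preds, axis=0)               # sum over the m models
    total /= total.sum(axis=1, keepdims=True)   # normalize so rows sum to 1
    return np.argmax(total, axis=1)             # final class label per image
```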
3.5.2 Weighted ensemble of CNNs

Additionally, to determine the final label, we incorporated the classifiers' individual accuracies into the fusion model. To this end, the idea proposed by Harangi (2018) was employed, which assigns weights to the voters and uses weighted majority voting with the addition of the highest weighted probabilities. We used the grid search (GS) approach to discover the optimal weights by treating them as ensemble parameters. After determining the appropriate weight $\omega_j$ ($j = 1, \dots, 5$) for each individual CNN, the confidence (probability) values $p_{ij}$ of $\mathrm{CNN}_j$ ($i = 1, \dots, 7$) are multiplied by $\omega_j$, and the probability $p_i'$ for each class is computed from the weighted confidence values $\omega_j p_{ij}$ rather than the original $p_{ij}$ values. In other words, these weights can be regarded as information about the reliability of the respective CNNs.
3.5.3 Optimum weight combination using the grid search algorithm

Currently, there is no automated technique for determining the optimum weight combination (OWC) for a weighted ensemble of CNN models. However, a GS, which is a brute-force approach, was used to examine many possible combinations of weights, and the combination that resulted in the most accurate prediction was used as the OWC. To accomplish this, a nested Python loop over the range from 0 (minimum) to 9 (maximum, exclusive) was utilized, with one loop per base model. The total number of possible combinations is $K^n$, where $K$ is the number of candidate values per weight and $n$ is the number of base models; consequently, the total number of possible weight combinations is $9^5 = 59{,}049$. The pseudocode and flowchart of the GS technique are shown in Figs. 2 and 3.
Fig. 3. Flowchart of the grid search algorithm to find the OWC
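In addition to the pseudocode and flowchart, a runnable sketch of this brute-force search is given below. It assumes `preds` is a list of five arrays of shape (num_samples, 7) holding each base model's softmax outputs and `y_true` holds the integer class labels; both names are hypothetical.

```python
import itertools
import numpy as np
from sklearn.metrics import accuracy_score

def grid_search_weights(preds, y_true):
    """Brute-force search over the 9^5 = 59,049 weight combinations."""
    best_acc, best_w = 0.0, None
    for combo in itertools.product(range(9), repeat=len(preds)):
        w = np.array(combo) / 10.0
        # Weighted sum of the base models' class probabilities.
        fused = sum(wi * p for wi, p in zip(w, preds))
        y_pred = np.argmax(fused, axis=1)
        acc = accuracy_score(y_true, y_pred)
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```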



3.5.4 Test time augmentation (TTA)

The goal of TTA is to make random changes to the test images, similar to what data augmentation does on the training set. As a result, rather than being presented with the ordinary clean image only once, the trained model sees augmented versions of each image numerous times. The predictions for each corresponding image are then averaged and used as the final estimate. Before ensembling the individual predictions to obtain the final label, we applied TTA to each model. Averaging the predictions over randomly modified images also averages out the errors, thereby improving the performance of the model: in a single prediction vector the error can be large, resulting in an erroneous answer, but when the errors are averaged, the correct answer prevails. This technique is represented by the following equation:

$$\text{prediction} = \frac{\sum_{i=1}^{N} p_i}{N}, \tag{3}$$

where $p_i$ is the probability (confidence) vector for image $i$, and $N$ is the total number of original and augmented images.
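A minimal sketch of this averaging is shown below, assuming `model` is a trained Keras model, `image` is a single preprocessed test image of shape (224, 224, 3), and `tta_datagen` is an ImageDataGenerator configured with the same shift, shear, brightness, and zoom settings used in training (all names are illustrative):

```python
import numpy as np

def tta_predict(model, image, tta_datagen, n_aug=10):
    """Average predictions over the original image and n_aug random
    augmentations, as in Eq. (3)."""
    batch = [image]                      # include the original, clean image
    for _ in range(n_aug):
        batch.append(tta_datagen.random_transform(image))
    probs = model.predict(np.stack(batch), verbose=0)
    return probs.mean(axis=0)            # averaged class probabilities
```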
3.6 Cosine annealing learning rate

Cosine annealing is a dynamic learning rate schedule that starts with a high learning rate, drops relatively quickly to a low value, and then rises sharply again. At the beginning of each epoch, the schedule adjusts the learning rate. Huang et al. (2017) proposed this method, which was used in Snapshot Ensembles (i.e., ensembles of models generated from a single training run). The scheduling equation is as follows:

$$\alpha(t) = \frac{\alpha_0}{2}\left(\cos\left(\frac{\pi \, \mathrm{mod}\!\left(t-1, \lceil T/M \rceil\right)}{\lceil T/M \rceil}\right) + 1\right), \tag{4}$$

where $\alpha(t)$ is the learning rate at epoch $t$, $\alpha_0$ denotes the maximum learning rate, $T$ represents the total number of epochs, $M$ is the number of cycles during training, mod denotes the modulo operation, and $\lceil \cdot \rceil$ denotes rounding $T/M$ to an integer. The function then returns the learning rate for the specified epoch.
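A sketch of this schedule as a Keras callback is given below; the values chosen for α0, T, and M are illustrative placeholders, not values reported in the paper.

```python
import math
from tensorflow.keras.callbacks import LearningRateScheduler

ALPHA_0 = 1e-4   # maximum learning rate (placeholder)
T = 40           # total number of training epochs
M = 4            # number of annealing cycles

def cosine_annealing(epoch, lr):
    """Implements Eq. (4); Keras passes a zero-based epoch, which
    plays the role of (t - 1) in the schedule."""
    cycle_len = math.ceil(T / M)
    return (ALPHA_0 / 2) * (math.cos(math.pi * (epoch % cycle_len) / cycle_len) + 1)

lr_callback = LearningRateScheduler(cosine_annealing)
# model.fit(..., epochs=T, callbacks=[lr_callback])
```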
3.7 Evaluation metrics

The effectiveness of the proposed model was evaluated using five quantitative measures: categorical accuracy, F1-score, precision, recall (sensitivity), and specificity. These metrics are represented by the following mathematical expressions:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{5}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

$$\text{F1-score} = \frac{2TP}{2TP + FP + FN} \tag{7}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$
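As a sketch, the weighted-average variants of these metrics reported in Section 4 can be computed directly with scikit-learn, assuming `y_true` and `y_pred` are arrays of integer class labels (both names are hypothetical):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

acc = accuracy_score(y_true, y_pred)
# 'weighted' averages the per-class scores, weighted by class support.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print(f"Accuracy {acc:.4f}  Precision {prec:.4f}  "
      f"Recall {rec:.4f}  F1 {f1:.4f}")
```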
4. Results and discussion

The results obtained after training and evaluating the models are presented in this section. The performances of the five base models without TTA, the base models with TTA, the ensemble model without TTA and weights, the ensemble model with weights but without TTA, the ensemble model with TTA but without weights, and the ensemble model with both TTA and weights were compared. Accuracy, weighted-average F1-score, weighted-average recall, and weighted-average precision were used as evaluation metrics throughout the study.

4.1 The environment

All scripts were written in Python in a Jupyter notebook hosted on the Kaggle cloud, using the Keras and TensorFlow frameworks. An NVIDIA Tesla P100 GPU with 16 GB of VRAM, together with 13 GB of RAM and two Intel Xeon CPU cores, enhanced the training speed by a factor of 12.5, according to the Kaggle cloud specifications.

4.2 Dataset analysis

The HAM10000 dataset, which is available on the Kaggle cloud, was used and comprises 10,015 images. The International Skin Imaging Collaboration (ISIC) archive also makes this dataset of skin lesion images available to the public. It comprises seven categories: AK, dermatofibroma, melanocytic nevus, vascular lesions, BCC, melanoma, and benign keratosis. The classification of these skin groups is not a simple task, and there is a substantial risk of misdiagnosis because of their significant resemblance. The original images have dimensions of 600 × 450 pixels; in this work, all images were scaled to 224 × 224 pixels. All images were loaded and processed using the preprocessing function of the Keras/TensorFlow image data generator, which allows the model parameters to be initialized with those from models already trained on ImageNet while reducing the computational cost of the model. The entire dataset was initially divided so that 80% of the samples of each class were assigned to the training set and the remaining 20% to the validation and testing sets. The base models were found not to perform well with this traditional train-test split. Stratified ten-fold cross-validation was then performed, with each fold containing an equal proportion of samples from each class. This means the neural network was trained on each of the ten training and validation sets created by repeatedly dividing the total training set into two groups.
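A sketch of this split with scikit-learn is shown below, assuming `images` is an array of preprocessed images and `labels` holds the corresponding integer classes (both names are hypothetical):

```python
from sklearn.model_selection import StratifiedKFold

# Ten folds, each preserving the per-class proportions of HAM10000.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(images, labels)):
    x_train, y_train = images[train_idx], labels[train_idx]
    x_val, y_val = images[val_idx], labels[val_idx]
    # ... build, train, and evaluate a model on this fold ...
```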

4.3 Results

The five base models were fine-tuned by eliminating the last dense layer, substituting it with batch normalization and dropout layers, and adding a dense layer comprising seven neurons with softmax activation. No additional dense layers were added between the batch normalization, dropout, and fully connected layers. Because the ImageNet dataset differs from our image dataset, all layers of the base models were unfrozen so that new features could be learned during training. Because the models did not produce reproducible results, each was trained 38 times to obtain average scores for each of the evaluation metrics used in this study. Throughout the training of the five base models, batch sizes of 32 and 16 were used for training and validation, respectively; these two batch sizes were settled on after several training sessions were performed for each model with different batch sizes. A categorical cross-entropy loss function with label smoothing was used so that the model would not become overconfident on the respective labels.
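For illustration, such a loss can be configured in Keras as follows; the smoothing factor of 0.1 is an assumed placeholder, not a value reported in the paper.

```python
import tensorflow as tf

# Label smoothing spreads a small amount of probability mass over the
# other classes, discouraging overconfident predictions.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
# model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
```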
4.3.1 Training base models individually without TTA

Thirty epochs were used for Xception and DenseNet201, and 40 epochs for MobileNet, ResNeXt101, and EfficientNetV2B2. These epoch counts were adopted after testing several alternatives and finding that they outperformed the others. The Adam optimizer was used for the EfficientNetV2B2, DenseNet201, Xception, and MobileNet models, and a stochastic gradient descent (SGD) optimizer for the ResNeXt101 model. The learning rate was 0.0001 for EfficientNetV2B2 and MobileNet, and 0.00001 for DenseNet201, Xception, and ResNeXt101. After several analyses, only Xception and DenseNet201 performed better with the annealing learning rate scheduler, which accepts minimum and maximum learning rates and modifies the learning rate at the beginning of each epoch; values of 0.00001 and 0.0001 were used as the minimum (initial) and maximum learning rates, respectively. The performance of each of the five base models on the evaluation metrics is presented in Table 1, where bold text indicates the maximum result achieved by a specific model on any evaluation metric. DenseNet201 achieved 90.47, 90.63, 90.47, and 91.28 for accuracy, F1-score, recall, and precision, respectively, the highest among the models. The results are also shown in Fig. 4.
Table 1
Results for individual models without TTA

Model Accuracy F1-score Recall Precision


Efficientnetv2b2 88.41 88.22 88.41 88.62
Resnext101 87.55 87.93 87.55 89.34
Xception 90.31 90.47 90.31 90.89
Densenet201 90.47 90.63 90.47 91.28
Mobilenet 87.72 87.61 87.72 88.83

Fig. 4. Performance of the individual models without TTA
4.3.2 Training base models individually with TTA

The same parameters used to train the models in Section 4.3.1 were replicated in this phase. The only difference was that several augmented images were presented to the model during the test phase instead of only a single (original) image. Because the images were augmented during the training phase, replicating the same image augmentation techniques enhanced the performance of the model. However, some of the reviewed literature reported that certain augmentation techniques (such as rotation and horizontal and vertical flipping) decrease the performance of the model; consequently, the proposed model used width and height shifting, shearing, brightness adjustment, and zooming as the image augmentation techniques in both the training and testing phases. After applying TTA, the performance of the models improved, as shown in Table 2 and Fig. 5. Among the base models, Xception achieved 91.26, 91.46, 91.26, and 91.99 for accuracy, F1-score, recall, and precision, respectively, higher than the other models. This shows that the improvement from TTA does not strictly depend on the performance of the model without TTA, because DenseNet201 performed better than Xception without TTA, as displayed in Table 1.
Table 2
The results for individual models with TTA

Model Accuracy F1-score Recall Precision


Efficientnetv2b2 90.51 90.51 90.51 90.77
Resnext101 89.35 89.85 89.35 90.90
Xception 91.26 91.46 91.26 91.99
Densenet201 91.12 91.36 91.12 91.92
Mobilenet 89.47 89.51 89.47 90.34
Fig. 5. Performance of the individual models with TTA

4.3.3 Ensemble model without weights and TTA

The results from each model in Section 4.3.1 were combined, and the average was taken to obtain the final result. The ensemble model outperformed each of the five base models. However, at this stage, each model contributed equally to the ensemble, without considering how the models outperformed one another during individual training and testing. The performance of this ensemble is presented in Table 3 and Fig. 6.


4.3.4 Ensemble model with weights but without TTA

The steps taken in Section 4.3.3 were repeated here; the only difference was the assignment of weights to each of the five base models, which was not done in the previous section. The performances of the base models were not identical, as reported in Table 1 and Fig. 4, because the complexity and design of deep learning models differ; they do not deliver uniform results, and the results of certain models are superior to those of others. It is therefore beneficial to give the better-performing models larger weights and thereby obtain the best output from each model. Accordingly, weights were assigned to each base model based on its performance during individual training and testing, to prioritize the contribution of high-performing models to the ensemble over low-performing ones. The task was to discover the optimal weight combination for the base models, and there was no automatic method for doing so. Therefore, a grid search approach was employed to generate all possible combinations of weights; in total, 59,049 weight combinations were generated. The search process continued until all feasible weight combinations had been examined, ultimately determining the one that maximized the evaluation parameter (accuracy). The results of this weighted ensemble are presented in Table 3 and Fig. 6.
4.3.5 Ensemble model with TTA but without weights

The ensemble model was further evaluated for generalization ability using TTA, but without assigning weights to the models. This meant that in the ensemble phase, each of the five models contributed uniformly, because individual performance was not taken into account. The same techniques used in Section 4.3.2 to train and test each base model were replicated here; the only difference was that the final results from all the models in Section 4.3.2 were averaged to obtain a single result, which improved on the results of the individual models in Section 4.3.2. The results of this ensemble with TTA are shown in Table 3 and Fig. 6. The ensemble model with weights but no TTA clearly outperformed the ensemble model with TTA but no weights, as shown in Table 3. This demonstrates that assigning weights to the base models of an ensemble improves the results when their performances are distinct.
Table 3
Results for ensemble models

Model                  Accuracy   F1-score   Recall   Precision
Ensemble               92.63      92.83      92.63    93.32
Weighted Ensemble      93.83      94.00      93.83    94.40
Ensemble TTA           93.21      93.47      93.21    93.96
Weighted Ensemble TTA  94.49      94.68      94.49    95.07

Fig. 6. Chart for ensemble models
4.3.6 Ensemble model with TTA and weights
In Section 4.3.5, weights were not assigned to the models because individual performance was not considered. Here, the same grid search strategy described in Section 4.3.4 was utilized to account for how much each model individually influenced the ensemble with TTA, which improved the performance of the model. Compared with the results of the previous phases, this phase produced the best results. The results of this ensemble with TTA and weights are shown in Table 3 and Fig. 6. The accuracy, f1-score, recall, and precision for this model were 94.49, 94.68, 94.49, and 95.07, respectively, which were not only the highest among all the analyses carried out in this work but also outperformed many works in the reviewed literature.
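Combining the two earlier sketches, this final configuration can be illustrated as follows; tta_predict and the weights found by the grid search refer to the illustrative helpers above, not to named components of the original code.

import numpy as np

def weighted_tta_ensemble(models, images, weights, n_aug=5):
    # TTA predictions from each base model, combined with the weights found
    # by the grid search (see the sketches in the preceding sections).
    tta_probs = [tta_predict(m, images, n_aug) for m in models]
    stacked = np.stack(tta_probs, axis=0)  # (num_models, num_samples, num_classes)
    probs = np.tensordot(np.asarray(weights), stacked, axes=1)
    return probs.argmax(axis=1)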
4.4 Comparative analysis with recent similar studies from the literature
This study compared the effectiveness of the proposed model with those of several other techniques. The HAM10000 dataset was used for all the analyses. Table 4 presents the results obtained using several relevant methodologies. In this table, the technique proposed by Chaturvedi et al. (2020) achieves an accuracy of 93.20%, which is very close to the accuracy achieved by the proposed model. In addition, the technique proposed by Afza et al. (2022), one of the most recent in Table 4, was tested on two distinct skin lesion image datasets (ISIC 2016 and PH2). This model also performed well and, as one of its advantages, is adaptable to both binary and multiclass classification tasks.
Table 4
Comparison with similar studies on the same HAM10000 dataset

Work                      Year   Model                           Acc.    F1-Sco.   Rec.    Prec.
Chaturvedi et al. (2020)  2020   Ensemble model                  93.20   88.00     88.00   88.00
Le et al. (2020)          2020   Ensemble of modified Resnet50   93.00   87.00     85.00   88.00
Chaturvedi et al. (2021)  2021   Mobilenet                       91.30   83.00     83.00   89.00
Nidhi et al. (2022)       2022   Densenet and Mobilenet          91.00   90.00     90.00   92.00
Rahman et al. (2021)      2021   Ensemble model                  88.00   89.00     94.00   87.00
Ali et al. (2022)         2022   EfficientnetB0-B4               87.00   87.91     88.00   88.00
Afza et al. (2022)        2022   Resnet50                        85.80   86.14     ----    86.26
Xin et al. (2022)         2022   VIT model                       94.30   ----      ----    94.10
Proposed model            2024   Ensemble model with TTA         94.49   94.68     94.49   95.07

4.5 Managerial implications
The results of this study have a significant impact on healthcare management, especially in dermatology practices and healthcare services in general. By demonstrating the effectiveness of a weighted ensemble of deep learning models with TL techniques and TTA, this work shows how healthcare managers can better integrate AI-driven diagnostic tools into the clinical workflow. Such integration can help increase the accuracy of disease diagnosis in hospitals, especially in the case of skin cancers, reducing the burden on dermatologists and potentially reducing the number of misdiagnoses. The model's ability to accurately classify different types of skin lesions enables early detection and treatment and minimizes the need for unnecessary procedures such as biopsies, which improves patient outcomes and reduces the healthcare costs associated with advanced skin cancer treatments. Combining the model proposed in this study with other AI tools in healthcare institutions would also optimize resource allocation in healthcare facilities, ensuring efficient utilization of medical personnel and equipment, particularly when dealing with high patient traffic. Healthcare managers should consider investing in AI technologies such as the one proposed in this study because they can streamline diagnostic processes, ultimately reducing operational costs and improving patient care. The proposed method can also be incorporated into telemedicine platforms to enable the remote evaluation of skin cancer, which would significantly improve the effectiveness of telemedicine services, especially in regions with limited access to dermatological expertise. Training programs for medical staff on the use of such AI tools will also be crucial for ensuring effective adoption and for maximizing the benefits of these technologies. The insights gained from this study can also contribute to the development of personalized treatment plans based on individual patient characteristics and skin cancer type.
4.6 Theoretical implications
This study contributes to the existing literature on AI and ML in medical image analysis in the context of skin cancer detection. The research also advances the theoretical understanding of ensemble learning, showing that weighted ensembles improve the performance of skin cancer classification models. It thereby contributes to the discussion of how best to combine multiple models to achieve superior performance, especially in complex medical image classification tasks. The findings further emphasize the significance of TTA in enhancing the robustness of ML models in medical imaging, providing valuable insights for future research aimed at improving model performance in real-world applications where data variability is a significant challenge.
This study confirms the success of TL in medical domains, supports the theory that models pretrained on large, general datasets can be effectively adapted to specific tasks with limited data, contributes to the ongoing development of TL methodologies, and highlights their potential to overcome the challenges associated with limited datasets in healthcare. The illustrations of the use of pretrained models in this study provide insights into the advantages of TL in medical image analysis and offer useful lessons for future research in this area. These findings suggest that future research should explore the hybridization of various deep learning models and the application of advanced data augmentation techniques during both the training and testing phases to further enhance model robustness and performance.

5. Conclusion and future work
After experimenting with a variety of pretrained models and methodologies, we discovered that the accuracy of skin cancer categorization could be improved using TL techniques. Furthermore, we conclude that, among the base models, the pretrained Densenet201 and Xception models are of significant assistance in the successful classification of skin cancer. In addition, only the Xception and Densenet pretrained models performed effectively with the learning rate scheduler. The final outcome demonstrates that combining networks from several architecture families increases performance. Moreover, the use of TTA techniques and the assignment of weights to the base models proved helpful. The weighted averages of the f1-score, recall, precision, and accuracy scores obtained indicate the capability of the model to accurately detect true positives, which can assist dermatologists in making decisions. However, this study had some limitations. For additional improvements, artifacts (such as hair) that cause bias in the model should be eliminated. Instead of employing typical image augmentation techniques, which enlarge the image dataset by transforming original images into various forms, generative adversarial networks (GANs) could be used to generate synthetic images and thereby increase the number of images available for training the model. One of the primary limitations of our study was the demographic composition of the HAM10000 dataset, which predominantly consists of images from Caucasian patients. This demographic skew may limit the generalizability of our model when applied to populations with different skin tones. Although the HAM10000 dataset provides a robust foundation for benchmarking because of its extensive annotation and validation, future research should involve a more diverse range of skin tones to improve the robustness of the model and its applicability across different demographic groups.
References
Afza, F., Sharif, M., Mittal, M., et al., 2022. A hierarchical three-step superpixels and deep learning framework for skin lesion classification. Methods, 202, 88-102.
Albahar, M.A., 2019. Skin lesion classification using convolutional neural network with novel regularizer. IEEE Access, 7, 38306-38313.
Ali, K., Shaikh, Z.A., Khan, A.A., et al., 2022. Multiclass skin cancer classification using EfficientNets – a first step towards preventing skin cancer. Neurosci. Inform., 2(4), 100034.
Alonso-Belmonte, C., Montero-Vilchez, T., Arias-Santiago, S., et al., 2022. [Translated article] Current State of Skin Cancer Prevention: A Systematic Review. Actas Dermosifiliogr., 113(8), T781-T791.
Arivazhagan, N., Mukunthan, M.A., Sundaranarayana, D., et al., 2022. Analysis of skin cancer and patient healthcare using data mining techniques. Comput. Intell. Neurosci., 2022, 2250275.
Bansal, N., Sridhar, S., 2022. Skin lesion classification using ensemble transfer learning. In Second International Conference on Image Processing and Capsule Networks: ICIPCN 2021, 557-566. Springer Int. Publ.
Binder, M., Steiner, A., Schwarz, M., et al., 1994. Application of an artificial neural network in epiluminescence microscopy pattern analysis of pigmented skin lesions: a pilot study. Br. J. Dermatol., 130(4), 460-465.
Brancaccio, G., Balato, A., Malvehy, J., et al., 2023. Artificial Intelligence in Skin Cancer Diagnosis: A Reality Check. J. Invest. Dermatol.
Bray, F., Laversanne, M., Sung, H., et al., 2024. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin., 74(3), 229-263.
Brinker, T.J., Hekler, A., Enk, A.H., et al., 2019. A convolutional neural network trained with dermoscopic images performed on par with 145 dermatologists in a clinical melanoma image classification task. Eur. J. Cancer, 111, 148-154.
Camela, E., Ilut Anca, P., Lallas, K., et al., 2023. Dermoscopic Clues of Histopathologically Aggressive Basal Cell Carcinoma Subtypes. Medicina, 59(2), 349.
Catalano, O., Roldán, F.A., Varelli, C., et al., 2019. Skin cancer: findings and role of high-resolution ultrasound. J. Ultrasound, 22(4), 423-431.
Charan, D.S., Nadipineni, H., Sahayam, S., et al., 2020. Method to classify skin lesions using dermoscopic images. arXiv prepr. arXiv:2008.09418.
Chaturvedi, S.S., Gupta, K., Prasad, P.S., 2021. Skin lesion analyser: an efficient seven-way multi-class skin cancer classification using MobileNet. In Adv. Mach. Learn. Technol. Appl. Proc. AMLTA 2020, 165-176.
Chaturvedi, S.S., Tembhurne, J.V., Diwan, T., 2020. A multi-class skin cancer classification using deep convolutional neural networks. Multimed. Tools Appl., 79(39), 28477-28498.
Cohen, P.R., Erickson, C.P., Calame, A., 2019. Atrophic dermatofibroma: a comprehensive literature review. Dermatol. Ther., 9, 449-468.
Conic, R.Z., Cabrera, C.I., Khorana, A.A., et al., 2018. Determination of the impact of melanoma surgical timing on survival using the National Cancer Database. J. Am. Acad. Dermatol., 78(1), 40-46.
Dildar, M., Akram, S., Irfan, M., et al., 2021. Skin cancer detection: a review using deep learning techniques. Int. J. Environ. Res. Public Health, 18(10), 5479.
Dinnes, J., Deeks, J.J., Chuchu, N., et al., 2018. Visual inspection and dermoscopy, alone or in combination, for diagnosing keratinocyte skin cancers in adults. Cochrane Database Syst. Rev., (12).
Elgendi, M., Nasir, M.U., Tang, Q., et al., 2021. The effectiveness of image augmentation in deep learning networks for detecting COVID-19: a geometric transformation perspective. Front. Med., 8, 629134.
Elmore, J.G., Barnhill, R.L., Elder, D.E., et al., 2017. Pathologists' diagnosis of invasive melanoma and melanocytic proliferations: observer accuracy and reproducibility study. BMJ, 357.
Esfahani, P.R., Mazboudi, P., Reddy, A.J., et al., 2023. Leveraging machine learning for accurate detection and diagnosis of melanoma and nevi: an interdisciplinary study in dermatology. Cureus, 15(8).
Esteva, A., Kuprel, B., Novoa, R.A., et al., 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115-118.
Gessert, N., Sentker, T., Madesta, F., et al., 2018. Skin lesion diagnosis using ensembles, unscaled multi-crop evaluation and loss weighting. arXiv prepr. arXiv:1808.01694.
Grosu-Bularda, A., Lăzărescu, L., Stoian, A., et al., 2018. Immunology and skin cancer. Arch. Clin. Cases, 5(3).
Harangi, B., 2018. Skin lesion classification with ensembles of deep convolutional neural networks. J. Biomed. Inform., 86, 25-32.
Hosny, K.M., Kassem, M.A., Foaud, M.M., 2019. Classification of skin lesions using transfer learning and augmentation with Alex-net. PLoS ONE, 14(5), e0217293.
Huang, G., Li, Y., Pleiss, G., et al., 2017. Snapshot ensembles: Train 1, get M for free. arXiv prepr. arXiv:1704.00109.
Hyeraci, M., Papanikolau, E.S., Grimaldi, M., et al., 2023. Systemic photoprotection in melanoma and non-melanoma skin cancer. Biomolecules, 13(7), 1067.
Kazaj, P.M., Koosheshi, M., Shahedi, A., et al., 2022. U-net-based models for skin lesion segmentation: more attention and augmentation. arXiv prepr. arXiv:2210.16399.
Keerthana, D., Venugopal, V., Nath, M.K., et al., 2023. Hybrid convolutional neural networks with SVM classifier for classification of skin cancer. Biomed. Eng. Adv., 5, 100069.
Kim, H.E., Cosa-Linan, A., Santhanam, N., et al., 2022. Transfer learning for medical image classification: a literature review. BMC Med. Imaging, 22(1), 69.
Le, D.N., Le, H.X., Ngo, L.T., et al., 2020. Transfer learning with class-weighted and focal loss function for automatic skin cancer classification. arXiv prepr. arXiv:2009.05977.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature, 521(7553), 436-444.
Leiter, U., Keim, U., Garbe, C., 2020. Epidemiology of skin cancer: update 2019. Sunlight, Vitamin D and Skin Cancer, 123-139.
Liopyris, K., Gregoriou, S., Dias, J., et al., 2022. Artificial intelligence in dermatology: challenges and perspectives. Dermatol. Ther., 12(12), 2637-2651.
Liu, Q., Das, M., Liu, Y., et al., 2018. Targeted drug delivery to melanoma. Adv. Drug Deliv. Rev., 127, 208-221.
Loh, T.Y., Rubin, A.G., Jiang, S.I.B., 2016. Basal cell carcinoma of the dorsal hand: an update and comprehensive review of the literature. Dermatol. Surg., 42(4), 464-470.
Marghoob, N.G., Liopyris, K., Jaimes, N., 2019. Dermoscopy: a review of the structures that facilitate melanoma detection. J. Osteopath. Med., 119(6), 380-390.
Mihulecea, C.R., Iancu, G.M., Leventer, M., et al., 2023. The Many Roles of Dermoscopy in Melanoma Detection. Life, 13(2), 477.
Mohamed, E.H., El-Behaidy, W.H., 2019. Enhanced skin lesions classification using deep convolutional networks. In 2019 9th Int. Conf. Intell. Comput. Inf. Syst. (ICICIS), 180-188. IEEE.
Mortada, H., Aldihan, R., Alhindi, N., et al., 2023. Basal cell carcinoma of the hand: a systematic review and meta-analysis of incidence of recurrence. JPRAS Open, 35, 42-57.
Morton, C.A., Mackie, R.M., 1998. Clinical accuracy of the diagnosis of cutaneous malignant melanoma. Br. J. Dermatol., 138(2), 283-287.
Parkin, D.M., Mesher, D., Sasieni, P., 2011. Cancers attributable to solar (ultraviolet) radiation exposure in the UK in 2010. Br. J. Cancer, 105(2), S66-S69.
Prajapat, V.M., Mahajan, S., Paul, P.G., et al., 2023. Nanomedicine: a pragmatic approach for tackling melanoma skin cancer. J. Drug Deliv. Sci. Technol., 104394.
Priyadharshini, N., Selvanathan, N., Hemalatha, B., et al., 2023. A novel hybrid Extreme Learning Machine and Teaching–Learning-Based Optimization algorithm for skin cancer detection. Healthc. Anal., 3, 100161.
Rahman, Z., Hossain, M.S., Islam, M.R., et al., 2021. An approach for multiclass skin lesion classification based on ensemble learning. Informatics Med. Unlocked, 25, 100659.
Ramlakhan, K., Shang, Y., 2011. A mobile automated skin lesion classification system. In 2011 IEEE 23rd Int. Conf. Tools Artif. Intell., 138-141. IEEE.
Rao, B.K., Ahn, C.S., 2012. Dermatoscopy for melanoma and pigmented lesions. Dermatol. Clin., 30(3), 413-434.
Reddy, S., Shaheed, A., Patel, R., 2024. Artificial intelligence in dermoscopy: enhancing diagnosis to distinguish benign and malignant skin lesions. Cureus, 16(2).
Russakovsky, O., Deng, J., Su, H., et al., 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115, 211-252.
Schadendorf, D., Van Akkooi, A.C., Berking, C., et al., 2018. Melanoma. The Lancet, 392(10151), 971-984.
Senel, E., 2011. Dermatoscopy of non-melanocytic skin tumors. Indian J. Dermatol. Venereol. Leprol., 77, 16.
Simoes, M.F., Sousa, J.S., Pais, A.C., 2015. Skin cancer and new treatment perspectives: a review. Cancer Lett., 357(1), 8-42.
Sung, H., Ferlay, J., Siegel, R.L., et al., 2021. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin., 71(3), 209-249.
Tschandl, P., Rosendahl, C., Kittler, H., 2018. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data, 5(1), 1-9.
Vestergaard, M.E., Macaskill, P.H.P.M., Holt, P.E., et al., 2008. Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting. Br. J. Dermatol., 159(3), 669-676.
Weiss, K., Khoshgoftaar, T.M., Wang, D., 2016. A survey of transfer learning. J. Big Data, 3, 1-40.
World Health Organization, 2017. Radiation: Ultraviolet (UV) radiation and skin cancer. World Health Organization, Geneva, Switzerland. https://www.who.int/news-room/questions-and-answers/item/radiation-ultraviolet-(uv)-radiation-and-skin-cancer.
Xin, C., Liu, Z., Zhao, K., et al., 2022. An improved transformer network for skin cancer classification. Comput. Biol. Med., 149, 105939.
Zhang, J., Xie, Y., Xia, Y., et al., 2019. Attention residual learning for skin lesion classification. IEEE Trans. Med. Imaging, 38(9), 2092-2103.
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

☐ The author is an Editorial Board Member/Editor-in-Chief/Associate Editor/Guest Editor for [Journal


name] and was not involved in the editorial review or the decision to publish this article.

☐ The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:
