Cyber Malware
Offensive and Defensive Systems
Security Informatics and Law Enforcement
Series Editor
Babak Akhgar
CENTRIC (Centre of Excellence in Terrorism, Resilience,
Intelligence and Organised Crime Research)
Sheffield Hallam University
Sheffield, UK
The primary objective of this book series is to explore contemporary
issues related to law enforcement agencies, security services and industries
dealing with security related challenges (e.g., government organizations,
financial sector insurance companies and internet service providers) from
an engineering and computer science perspective. Each book in the series
provides a handbook style practical guide to one of the following security
challenges:
Cyber Crime – Focuses on new and evolving forms of crimes. Books
describe the current status of cybercrime and cyber terrorism develop-
ments, security requirements and practices.
Big Data Analytics, Situational Awareness and OSINT – Provides
unique insight for computer scientists as well as practitioners in security and
policing domains on big data possibilities and challenges for the security
domain, current and best practices as well as recommendations.
Serious Games – Provides an introduction to the use of serious
games for training in the security domain, including advice for
designers/programmers, trainers and strategic decision makers.
Social Media in Crisis Management – Explores how social media
enables citizens to empower themselves during a crisis, from terrorism
and public disorder to natural disasters.
Law enforcement, Counterterrorism, and Anti-Trafficking –
Presents tools from those designing the computing and engineering
techniques, architecture or policies related to applications confronting
radicalisation, terrorism, and trafficking.
The books pertain to engineers working in law enforcement and
researchers studying the capabilities of law enforcement agencies (LEAs),
though the series is truly multidisciplinary – each book will have hard core
computer science, application of ICT in security and security / policing
domain chapters.
The books strike a balance between theory and practice.
Iman Almomani • Leandros A. Maglaras •
Mohamed Amine Ferrag • Nick Ayres
Editors
Iman Almomani
Security Engineering Lab
Prince Sultan University
Riyadh, Saudi Arabia
Computer Science Department
The University of Jordan
Amman, Jordan

Leandros A. Maglaras
School of Computing
Edinburgh Napier University
Edinburgh, UK

Mohamed Amine Ferrag
AI and Digital Science Research Center
Technology Innovation Institute
Masdar City, Abu Dhabi, United Arab Emirates

Nick Ayres
School of Computer Science and Informatics
De Montfort University
Leicester, UK
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher,
whether the whole or part of the material is concerned, specifically the rights of translation,
reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in
any other physical way, and transmission or information storage and retrieval, electronic
adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information
in this book are believed to be true and accurate at the date of publication. Neither the
publisher nor the authors or the editors give a warranty, expressed or implied, with respect
to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland
AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
vi PREFACE
and wrapping attacks are among the most frequently reported security
threats, according to the study. The majority of these attacks are the result
of multiple malware variants.
x INTRODUCTION: EMERGING TRENDS IN CYBER-MALWARE
then use this information for identity theft, financial fraud, or other
malicious purposes.
• Rootkit: A rootkit is a type of malware that allows cybercriminals to
gain administrative access to the victim’s computer system. Rootkits
often remain hidden from antivirus software and can be difficult to
detect and remove, allowing cybercriminals to maintain access to the
victim’s system for an extended period.
To conclude, cyber-malware is a type of malicious software that can
cause significant damage and disrupt computer systems, networks, and
devices. There are several types of cyber-malware, including viruses, Tro-
jans, worms, ransomware, adware, spyware, and rootkits [16, 17]. Under-
standing the different types of cyber-malware and taking proactive mea-
sures to protect against them is essential for individuals, businesses, and
governments.
response plans, such as evacuation plans, first aid procedures, and disaster
relief efforts. Mitigation strategies can also involve restoration efforts, such
as rebuilding infrastructure or rehabilitating natural habitats [28].
Effective prevention and mitigation strategies are essential for reducing
the impact of disasters and crises. By taking proactive steps to prevent
events from occurring or mitigating their effects, we can reduce the risk of
harm and save lives [29]. Additionally, these strategies can also help reduce
the economic and environmental impact of disasters, making recovery and
restoration efforts more manageable.
Examples of prevention and mitigation strategies include [27–29]:
• Hazard assessments: Conduct regular assessments to identify potential
hazards and develop appropriate prevention and mitigation strategies.
• Early warning systems: Implement systems that provide early warning
of potential hazards, such as natural disasters or industrial accidents,
to allow for timely response and mitigation.
• Infrastructure improvement: Upgrade and maintain infrastructure,
such as roads, bridges, and buildings, to make them more resilient
to disasters.
• Community education and outreach: Educate communities about
potential hazards, how to prepare for disasters, and what to do in
case of emergency.
• Disaster response planning: Develop comprehensive plans for respond-
ing to disasters and crises, including evacuation plans, emergency
communication systems, and disaster relief efforts.
• Environmental protection measures: Implement measures to protect
the environment, such as reducing pollution and conserving natural
resources, to prevent or mitigate the impact of disasters.
• Risk assessments: Conduct regular assessments to identify potential
hazards and develop appropriate prevention and mitigation strategies.
• Use Antivirus Software: Install and regularly update a reputable
antivirus software program on your computer or device. Antivirus
software can help detect and remove malware from your system.
• Keep Software Up-to-date: Keep your operating system, web browser,
and other software applications up-to-date with the latest security
patches and updates. Cybercriminals often exploit vulnerabilities in
outdated software.
• Use Strong Passwords: Use strong, unique passwords for all your
accounts and avoid using the same password across multiple accounts.
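As a minimal sketch of the password advice above, a strong and unique password per account can be generated with Python's standard `secrets` module. The function names, the 12-character minimum, and the default length of 20 are illustrative assumptions, not recommendations taken from this book.

```python
import secrets
import string


def generate_password(length: int = 20) -> str:
    """Return a random password drawn from letters, digits, and punctuation."""
    if length < 12:
        # Assumed lower bound for this sketch; adjust to local policy.
        raise ValueError("use at least 12 characters")
    alphabet = string.ascii_letters + string.digits + string.punctuation
    # secrets.choice uses a cryptographically secure random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


def generate_per_account(accounts):
    """Create one independent password per account so none is reused."""
    return {account: generate_password() for account in accounts}
```

In practice the generated values would be stored in a password manager rather than memorized or written down.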
FUTURE OF CYBER-MALWARE
The future of cyber-malware is a topic of concern for cybersecurity pro-
fessionals and businesses worldwide. As technology continues to evolve
and become more complex, so do the threats posed by cyber-malware.
One trend that is likely to continue in the future is the use of artificial
intelligence (AI) by cybercriminals to develop more sophisticated and
effective malware. AI-powered malware can adapt to its environment,
evade detection, and target specific vulnerabilities in a network or system.
This type of malware can also learn from its actions and adjust its behavior
accordingly, making it more difficult to stop.
Another potential development in cyber-malware is the increased use
of ransomware attacks. Ransomware is a type of malware that encrypts a
victim’s files or data and demands payment in exchange for the decryption
key. This type of attack has become increasingly common in recent years
and is likely to continue in the future, as it can be highly profitable
for attackers. In fact, some experts predict that ransomware attacks may
become more targeted, with attackers focusing on specific industries or
organizations with high-value data.
The Internet of Things (IoT) is another area of concern when it comes
to the future of cyber-malware. IoT devices are often connected to the
internet and can be vulnerable to attacks, as they may not have strong
from a third-party provider. This lowers the barrier to entry for less
technically savvy criminals, who can now launch sophisticated attacks
without having to develop their own malware. As MaaS becomes more
prevalent, we can expect to see more varied and sophisticated malware
being developed and deployed.
• Advanced evasion techniques: As cybersecurity defenses become more
sophisticated, malware developers are turning to advanced evasion
techniques to avoid detection. These techniques include using encryp-
tion to hide malicious code, exploiting vulnerabilities in antivirus
software, and creating polymorphic malware that can change its code
to evade detection. As evasion techniques become more sophisticated,
it will become increasingly difficult to detect and prevent malware
attacks.
• Targeted attacks: Rather than launching mass attacks, cybercriminals
are increasingly targeting specific individuals or organizations. This
allows them to conduct more sophisticated attacks, such as spear-
phishing, that are tailored to the victim’s interests or behaviors. As
more data becomes available on individuals and organizations, we can
expect to see more targeted attacks that leverage this information to
bypass defenses and gain access to sensitive data.
• IoT malware: With the rise of the Internet of Things (IoT), there is a
growing concern about the security of these devices. IoT devices are
often not designed with security in mind and can be easily hacked,
giving cybercriminals access to sensitive data or control over critical
infrastructure. As the number of IoT devices continues to grow, we
can expect to see more malware specifically designed to target these
devices.
• Machine Learning-Based Malware: Machine learning has become
a powerful tool for cybersecurity, and malware developers are adopting
it as well. By using machine learning algorithms, malware can adapt
to its environment and learn how to evade detection.
• Deepfakes: Deepfakes are videos or images that have been manipulated
using artificial intelligence to make them appear real. In the future, we
can expect to see more malware that uses deepfakes to trick users into
downloading or installing malicious software.
• Mobile Malware: With the increasing use of mobile devices, mobile
malware has become a growing concern. In the future, we can expect
to see more mobile-specific malware that can steal sensitive data or
take control of the device.
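The point about polymorphic code in the evasion item above can be illustrated from the defender's side. The sketch below, which uses harmless placeholder bytes and a hypothetical `known_bad_hashes` set, shows why an exact-hash signature stops matching as soon as a single byte of a file changes.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Harmless placeholder content standing in for a previously seen sample.
original = b"example payload bytes"
# The same content with one padding byte appended.
variant = original + b"\x00"

# Hypothetical signature database holding exact file hashes only.
known_bad_hashes = {sha256_hex(original)}


def matches_signature(data: bytes) -> bool:
    """Exact-hash lookup: true only for byte-identical content."""
    return sha256_hex(data) in known_bad_hashes
```

Here `matches_signature(original)` is true while `matches_signature(variant)` is false, which is one reason detection increasingly relies on extracted features and behavior rather than exact file hashes.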
REFERENCES
1. Aziz S, Irshad M, Haider SA, Wu J, Deng DN, Ahmad S (2022)
Protection of a smart grid with the detection of cyber-malware
attacks using efficient and novel machine learning models. Front
Energy Res 10:1102
2. Choi KS, Lee CS, Merizalde J (2023) Spreading viruses and mali-
cious codes. In: Handbook on crime and technology. Edward Elgar
Publishing, Florida, United States, pp 232–250
3. Riebe T, Kaufhold MA, Reuter C (2021) The impact of organi-
zational structure and technology use on collaborative practices in
computer emergency response teams: an empirical study. Proc ACM
Hum-Comput Interact 5(CSCW2):1–30
4. Gazet A (2010) Comparative analysis of various ransomware virii. J
Comput Virol 6:77–90
5. Bridges L (2008) The changing face of malware. Netw Secur
2008(1):17–20
6. Alkhadra R, Abuzaid J, AlShammari M, Mohammad N (2021)
Solar winds hack: in-depth analysis and countermeasures. In: 2021
12th international conference on computing communication and
networking technologies (ICCCNT). IEEE, Kharagpur, India,
pp 1–7
CONTENTS
Index
CHAPTER 1
A Deep-Vision-Based Multi-class
Classification System of Android Malware
Apps
1.1 INTRODUCTION
Nowadays, smartphones have become an essential part of our lives because
they are used not only for phone calls but also for personal
payments, storing personal data, healthcare services, and other
personal services and applications [17]. Furthermore, it is commonly
known that Android Operating System (OS) is considered the most
I. Almomani
Security Engineering Lab, Prince Sultan University, Riyadh, Saudi Arabia
Computer Science Department, The University of Jordan, Amman, Jordan
W. El-Shafai
Security Engineering Lab, Computer Science Department, Prince Sultan
University, Riyadh, Saudi Arabia
(RNN) for identifying malicious attacks in Android apps. First, they
collected two types of attributes from Android apps: application programming
interface (API) calls and permissions. Then, they tested the proposed
detection algorithm on the different Android families of the CICAndMal2017
dataset. The results reveal that the developed DL algorithm outperforms
several other detection methods, achieving a 98.2% detection accuracy.
In [31], the authors suggested an ML-based ransomware detection
model. This model uses different ML algorithms to extract and analyze the
valuable features of Android ransomware apps. They tested the detection
efficacy of their proposed model on the CICMalDroid2020 dataset to
check its classification accuracy of 10 distinct families of ransomware
apps. The obtained results prove that the random forest classifier achieved
superior ransomware detection efficiency compared to those of the other
employed ML-based classifiers.
The authors in [4] presented an Android malware detection system using
five different ML algorithms and one DL algorithm to analyze and extract
the main static features of the Android apps. These extracted features were
permissions, API calls, permissions rate, and monitoring system events.
The suggested detection system was examined using the CICAndMal2017
dataset, which comprises both benign and malware apps, and it achieved a
detection accuracy of 98%.
In [30], a semi-supervised ML algorithm is presented to distinguish
ransomware from benign Android apps. The proposed algorithm
comprises different feature extraction and selection techniques that were
tested on different labeled and unlabeled Android apps of the
CICAndMal2017 dataset. In [34], a host-level encrypted traffic shaping-based
Android malware classification approach was proposed. The classification
approach tested three different ML algorithms on the real-world
CICMalDroid2020 dataset for feature extraction and detection mechanisms. In
addition, the authors simulated two experimental scenarios: malware family
classification and binary malware detection. The results showed that the
proposed classification approach had an accuracy of 98.8% for binary
malware classification, while it achieved a 95.2% detection accuracy for
the malware family classification scenario.
In [2], a conversation-level traffic feature-based Android malware cate-
gorization and detection approach was presented. This approach consisted
of four phases: feature extraction, data cleaning, feature selection, and
training and testing. Different ML-based classifiers were tested, and the
attained results tested on the CICMalDroid2020 dataset proved that the
6 I. ALMOMANI ET AL.
10% of the visual Android images for the validation process and 10% for
the testing process. In the evaluation of the employed CNN algorithms,
different security and detection assessment parameters are estimated, as
will be discussed and clarified in Sect. 1.4.3.
Table 1.2 shows samples of the vision-based color and grayscale formats
of the binary Android APKs for the two examined datasets. It is observed
that each Android family, benign or malware, has distinct features and
characteristics compared to the other Android families. So, vision-based
Android malware detection models are highly recommended for malware
classification systems compared to other static or dynamic detection models.
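To make the idea of a vision-based format concrete, the sketch below lays out the raw bytes of a binary file as a grayscale pixel grid and as a color (RGB) pixel grid. The excerpt does not state how the chapter's images were produced, so the one-byte-per-pixel and three-bytes-per-pixel mappings, the default width of 256, the zero padding, and the function names are all assumptions chosen for illustration.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Lay bytes out row by row; each byte becomes one pixel intensity (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad so the last row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def bytes_to_rgb(data: bytes, width: int = 256) -> list[list[tuple[int, int, int]]]:
    """Group bytes in threes; each triple becomes one (R, G, B) pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels_needed = math.ceil(len(data) / 3)
    height = math.ceil(pixels_needed / width)
    padded = data + bytes(height * width * 3 - len(data))
    rows = []
    for r in range(height):
        row = []
        for c in range(width):
            i = (r * width + c) * 3
            row.append((padded[i], padded[i + 1], padded[i + 2]))
        rows.append(row)
    return rows
```

The resulting grids could then be resized to the input shape expected by a CNN; that step is omitted here.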
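\[ \text{F1-Score} = \frac{2TP}{2TP + FN + FP} \tag{1.1} \]
\[ \text{Precision (PPV)} = \frac{TP}{FP + TP} \tag{1.2} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{1.3} \]
\[ \text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN} \tag{1.4} \]
\[ \text{Misclassification rate (MR)} = \frac{FP + FN}{TN + FP + TP + FN} \tag{1.5} \]
\[ \text{FOR} = \frac{FN}{FN + TN} \tag{1.6} \]
\[ \text{FDR} = \frac{FP}{FP + TP} \tag{1.7} \]
\[ \text{FNR} = \frac{FN}{FN + TP} \tag{1.8} \]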
Table 1.2 Vision-based color and grayscale samples of Android APKs
(i) CICAndMal2017 dataset: color and grayscale samples for the Benign, Adware, Ransomware, Scareware, and SMSMalware classes (images not reproduced).
Second dataset: color and grayscale samples for the Benign, Adware, Banking, Riskware, and SMSMalware classes (images not reproduced).
\[ \text{FPR} = \frac{FP}{FP + TN} \tag{1.9} \]
\[ \text{NPV} = \frac{TN}{FN + TN} \tag{1.10} \]
\[ \text{TNR} = \frac{TN}{FP + TN} \tag{1.11} \]
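The assessment parameters in Eqs. (1.1)–(1.11) can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name, the dictionary keys, and the choice to return NaN when a denominator is zero are assumptions made for this example. For a multi-class problem, the counts would be taken per class in a one-vs-rest fashion.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics of Eqs. (1.1)-(1.11) from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as NaN.
        return num / den if den else float("nan")

    total = tn + fp + tp + fn
    return {
        "f1": ratio(2 * tp, 2 * tp + fn + fp),   # Eq. (1.1)
        "precision": ratio(tp, fp + tp),         # Eq. (1.2), PPV
        "recall": ratio(tp, tp + fn),            # Eq. (1.3)
        "accuracy": ratio(tn + tp, total),       # Eq. (1.4)
        "mr": ratio(fp + fn, total),             # Eq. (1.5)
        "for": ratio(fn, fn + tn),               # Eq. (1.6)
        "fdr": ratio(fp, fp + tp),               # Eq. (1.7)
        "fnr": ratio(fn, fn + tp),               # Eq. (1.8)
        "fpr": ratio(fp, fp + tn),               # Eq. (1.9)
        "npv": ratio(tn, fn + tn),               # Eq. (1.10)
        "tnr": ratio(tn, fp + tn),               # Eq. (1.11)
    }
```

Several of these quantities are complements of one another (for example, accuracy and MR sum to 1, as do recall and FNR), which gives a quick consistency check on reported results.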
Table 1.3 Experimental parameters of the CNN algorithms used in the proposed
classification system
Simulation parameter                 Value
Software                             Python libraries TensorFlow and Keras
Training/testing/validation ratio    80/10/10 (%)
CNN optimizer                        ADAM
Learning rate                        0.0001
Regularizer decay rate               0.001
CNN regularizer                      L2 regularizer algorithm
Epochs number                        128
Function of loss                     Categorical cross-entropy function
Minimum batch size                   64
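The 80/10/10 training/testing/validation ratio listed in Table 1.3 could be realized with a simple shuffled split, sketched below. The function name, the fixed seed, and the use of integer arithmetic for the split points are assumptions; the chapter excerpt does not say how the split was implemented or whether it was stratified by class.

```python
import random


def split_80_10_10(items, seed: int = 0):
    """Shuffle the items and split them into train/test/validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    n = len(shuffled)
    n_train = n * 8 // 10                    # 80% for training
    n_test = n // 10                         # 10% for testing
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:] # remaining ~10% for validation
    return train, test, validation
```

For imbalanced malware families, a stratified split that preserves the class proportions in each part would usually be preferable to this plain shuffle.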
Fig. 1.2 Confusion matrices of the best-performing CNN algorithms on the color
and grayscale formats of the CICAndMal2017 dataset. (a) MobileNetV3Large. (b)
VGG16
Fig. 1.3 Confusion matrices of the best-performing CNN algorithms on the color
and grayscale formats of the CICMalDroid2020 dataset. (a) EfficientNetB7. (b)
ResNet50
Fig. 1.4 Accuracy and loss curves for the best-performing CNN algorithms
on the color and grayscale formats of the CICAndMal2017 dataset. (a)
MobileNetV3Large. (b) VGG16
Fig. 1.5 Accuracy and loss curves for the best-performing CNN algorithms on the
color and grayscale formats of the CICMalDroid2020 dataset. (a) EfficientNetB7.
(b) ResNet50
Acknowledgments The authors would like to thank Prince Sultan University
for its support. Moreover, this research was done during author Iman Almomani’s
sabbatical year 2021/2022 from the University of Jordan, Amman, Jordan.
REFERENCES
1. Abadi M et al (2016) TensorFlow: a system for large-scale machine
learning. In: 12th USENIX symposium on operating systems
design and implementation (OSDI 16), pp 265–283
2. Abuthawabeh MKA, Mahmoud KW (2019) Android malware
detection and categorization based on conversation-level network
traffic features. In: 2019 International Arab conference on infor-
mation technology (ACIT). IEEE, Piscataway, pp 42–47
3. Alkahtani H, Aldhyani TH (2022) Artificial intelligence algorithms
for malware detection in android-operated mobile devices. Sensors
22(6):2268
4. Almahmoud M, Alzu’bi D, Yaseen Q (2021) Redroiddet: android
malware detection based on recurrent neural network. Proc Com-
put Sci 184:841–846
5. Almohaini R, Almomani I, AlKhayer A (2021) Hybrid-based anal-
ysis impact on ransomware detection for android systems. Appl Sci
11(22):10976
6. Almomani I, AlKhayer A, Ahmed M (2021) An efficient machine
learning-based approach for android v. 11 ransomware detection.
In: 2021 1st international conference on artificial intelligence and
data analytics (CAIDA). IEEE, Piscataway, pp 240–244
2.1 INTRODUCTION
With the development and the increasing number of available Android-
based systems and application software, such as in industrial IoT systems
and smartphones [2], the latter are also becoming more popular targets
for cyber criminals, who plant their malicious apps as an exploit to conduct
serious and devastating cyber attacks over a large network of connected
Fig. 2.1 A taxonomy of malware analysis techniques and detection strategies. (a)
Taxonomy of Malware analysis techniques for feature extraction. (b) Taxonomy of
Malware detection techniques
FEDERATED LEARNING BASED ANDROID MALWARE DETECTION 25
2.3 METHODOLOGY
2.3.1 Federated Learning Paradigm
Recently, a novel collaborative learning paradigm and a decentralized
optimization strategy named federated learning (FL) have been proposed
to train ML and DL models based on datasets and computational resources
\[ \min_{W \in \mathbb{R}^d} f(W) = \frac{1}{N} \sum_{i=1}^{N} \ell(W, D_i) \tag{2.1} \]
Fig. 2.3 A flow chart of the proposed FDL-based Android malware detection
such as the learning rate, the local batch size, and the local training
epochs.
2. The server sends this information to pre-selected clients (i.e., clients
with resource availability and sufficient training data) to compute
local updates in an asynchronous manner.
3. Each client performs a number of local training epochs on the
received model and then sends back the computed updates (i.e., the
new model parameters) to the server.
32 D. HAMOUDA ET AL.
4. To update the global model, the server aggregates all local updates
from selected clients. After that, Steps 2, 3, and 4 are repeated for
another round of FDL until the model converges.
5. The server evaluates and maintains the final version of the global
model for future use. Each participating client is free to keep any
global model state from the FDL training, depending on how well that
state performs locally when deployed for malware detection.
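The round structure described in the steps above can be sketched with a toy model. The example below fits a single weight w in y ≈ w·x and aggregates client results with a sample-size-weighted average (the FedAvg rule). The chapter does not state which aggregation rule or model it uses, and it describes asynchronous client updates, so the weighted averaging, the synchronous loop, the toy data, and all names here are assumptions for illustration only.

```python
import random


def local_update(w: float, data, lr: float, epochs: int) -> float:
    """Step 3: run a few local epochs of gradient descent on one client's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w: float, clients, lr: float, epochs: int) -> float:
    """Steps 2-4: send the model out, train locally, aggregate on the server."""
    updates = [(local_update(global_w, data, lr, epochs), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    # Weighted average of client models (assumed FedAvg-style aggregation).
    return sum(w * n for w, n in updates) / total


def train(clients, rounds=50, lr=0.05, epochs=5, clients_per_round=None, seed=0):
    """Repeat rounds, optionally selecting a subset of clients each round."""
    rng = random.Random(seed)
    w = 0.0  # Step 1: the server initializes the global model.
    for _ in range(rounds):
        if clients_per_round is None:
            selected = clients
        else:
            selected = rng.sample(clients, clients_per_round)
        w = federated_round(w, selected, lr, epochs)
    return w


# Toy client datasets, all generated from y = 3x; raw data never leaves a client.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(1.5, 4.5)],
    [(1.0, 3.0), (1.2, 3.6), (1.8, 5.4)],
]
```

Calling `train(clients)` drives the global weight toward 3.0. Only model parameters are exchanged between clients and server, which is the privacy argument usually made for federated learning.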
\[ \mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2.2} \]
\[ \mathrm{Dr} = \frac{TP}{TP + FN} \tag{2.4} \]
strategy using the same settings and compared it with the proposed FDL-
based detection approach. The FDL-based model classified the “Benign”
class, which represents normal apps, with a recall of 92%, and the “malware”
class, which comprises all ten Android malware families, with a detection
rate of 71%. The results demonstrate the efficiency of FDL, with
practically the same performance as the centralized approach. However,
the results of both detection approaches are not sufficient for real-world
application, considering the high rates of false positives and false negatives
illustrated in Fig. 2.4.
Figure 2.5 illustrates a comparison of model accuracy, loss, and time
complexity using different training approaches. In terms of time complex-
ity, we can demonstrate the efficacy of the proposed FDL approach. How-
ever, when using a large number of participating clients, the global model’s
accuracy decreased from 83.74% to 78.47%, as depicted in Table 2.4.
Fig. 2.4 Confusion matrix results. (a) With the centralized approach. (b) With the
federated deep learning (FDL) approach
Fig. 2.5 Comparison of model accuracy, loss, and time complexity using different
training approaches
2.5 CONCLUSION
In this chapter, we propose a novel, cost-effective DL-based Android mal-
ware system (FDL) leveraging the emergent federated learning paradigm.
The analysis was conducted using the network layer features of malware
samples to detect any variation from their normal behavior. Experimen-
tal results proved the efficiency and effectiveness of the proposed FDL
REFERENCES
1. Acharya S, Rawat U, Bhatnagar R (2022) A low computational cost
method for mobile malware detection using transfer learning and
familial classification using topic modelling. Appl Comput Intell
Soft Comput 2022:1–22
2. Al-Fuqaha A, Guizani M, Mohammadi M, Aledhari M, Ayyash
M (2015) Internet of things: a survey on enabling technolo-
gies, protocols, and applications. IEEE Commun Surv Tutorials
17(4):2347–2376
3. Andresini G, Appice A, Malerba D (2021) Autoencoder-based deep
metric learning for network intrusion detection. Inf Sci 569:706–
727
4. Arora A, Garg S, Peddoju SK (2014) Malware detection using
network traffic analysis in android based mobile devices. In: 2014
eighth international conference on next generation mobile apps,
services and technologies. IEEE, New York, pp 66–71
5. Aslan ÖA, Samet R (2020) A comprehensive review on malware
detection approaches. IEEE Access 8:6249–6271
6. Garg S, Peddoju SK, Sarje AK (2017) Network-based detection of
android malicious apps. Int J Inf Secur 16(4):385–400
7. Hamouda D, Ferrag MA, Benhamida N, Seridi H (2021) Intrusion
detection systems for industrial internet of things: A survey. In:
2021 International Conference on Theoretical and Applicative
Aspects of Computer Science (ICTAACS). IEEE, New York,
pp 1–8
3.1 INTRODUCTION
Our modern world is rapidly moving toward digitalization and automation,
where everything is converging into an automated version. As technology
takes over our lives, we are at the start of the 4th industrial revolution,
which mainly focuses on a world that relies heavily on technology and
innovation. The use of technology provides us not only with convenience
but with comfort as well. However, the rapid development of technology comes
at the price of ensuring cybersecurity. Attackers are finding many ways
to achieve their malicious goals, which requires us to take precautions to
I. Almomani
Security Engineering Lab, Prince Sultan University, Riyadh, Saudi Arabia
Computer Science Department, The University of Jordan, Amman, Jordan
R. Alkhadra • M. Ahmed
Security Engineering Lab, Computer Science Department, Prince Sultan
University, Riyadh, Saudi Arabia
face such security issues. One of the most common forms of
security invasion in our digital world is malicious code, often referred
to as malware [27]. Malware is code written by attackers to
intrude into a specific computer system or software to perform malicious
acts such as stealing data or causing damage. Malware comes
in different forms, such as worms, viruses, trojans, spyware, adware, and
ransomware. Therefore, it is essential to protect any system from malware.
This can be done by detecting the malware and then classifying its type.
A tremendous amount of research has been conducted in recent
years on malware detection and classification [11].
According to recent reports, malware creation has been
increasing rapidly on a daily basis. It is estimated that around one million
malware files are created daily [31]. This increase could seriously threaten
the economy, both financially and technically. The increase in cyber threats
and crimes cost the economy around 1 trillion dollars in 2022 for cyber
insurance, an increase of 50% compared to the previous 2
years [12]. The term malware refers to any malicious entity that changes
the original behavior by utilizing software flaws and vulnerabilities. In this
chapter, the term malware will be used to refer to any malicious software
that may include any of the following malware families: ransomware,
adware, viruses, or keyloggers [11].
Depending on the purpose and behavior of the malware, it is categorized
into different families. Every family has common features. For instance,
stealing information, creating vulnerabilities, and denial of service are all
examples of malware behavior. Such behaviors are essential in detecting
malware since this information is used to analyze the software and
categorize it as benign or malware [35]. To differentiate between
malicious and benign apps, we need to scan the program code first, extract
its features, and analyze them [6]. Feature extraction can be achieved
in two main ways: static analysis [3] and dynamic analysis [13].
Another possible way is hybrid analysis [2], a combination of the
previous two [25]. Static analysis is concerned with contextual data from
the source code without running the program, whereas dynamic analysis
involves executing the program and extracting the runtime features.
Hybrid analysis uses both contextual and runtime features to detect malware
[11].
Over the years, researchers have been developing new techniques for
malware detection. The latest trend in this field is using machine learning
for malware detection. However, this technique cannot be used without
ASPARSEV3: AUTO-STATIC PARSER AND CUSTOMIZABLE VISUALIZER 43
analyzing the program code and extracting important features that help in
discriminating the malware families [22]. It is possible to avoid the risk
of malware if the related features are available. Therefore, a collection of
advanced detection methods using machine learning depends on feature
engineering as well as reverse engineering [33]. Feature engineering is
a technique used to transform unstructured data into features that can
be understood by the computer or machine [32]. However, attackers can
use other techniques, such as binary obfuscation, to design
a file that resists reverse engineering [30]. Moreover, deep learning can be
used in an advanced model of neural networks to capture features, learn,
and adapt during training. Even though a few studies report the use of deep
learning, some do not sufficiently discuss scalability and the different
architectures for malware detection [5, 33].
One of the main benefits of static analysis over any other technique
is that it does not require executing the program, making it a
safer choice [25]. Another vital benefit is that the code is examined
without regard to the diversity of IoT architectures or the physical
capabilities of an IoT device. Hence, the analysis considers all possible
inspection methods with no reference to physical performance [24].
Furthermore, due to the nature of static analysis, the malware may not
be able to avoid, hide, and/or obfuscate during the analysis process because
the analysis runs passively [34]. Finally, static analysis is easy to automate,
which makes it stand out [16].
Therefore, this chapter introduces a new comprehensive static parsing
software called ASParseV3, an extension of ASParseV1 [1]. It is a
GUI-based tool with various features such as (a) selecting many files or
directories to be scanned in one experiment, (b) adding or removing
keywords/features, (c) filtering the keywords/features and specific file types,
(d) an efficient scanning process in which many files are scanned
simultaneously, (e) customizable visualization dashboards with the ability
to export the chart(s), and (f) exporting the results in different formats
such as JSON and CSV.
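The feature list above (file-type filtering, keyword counting, simultaneous scanning, and JSON/CSV export) can be sketched in a few lines of Python. This is not the ASParseV3 implementation, which is not shown in this excerpt; the function names, the thread-pool approach, and the plain substring counting are assumptions used only to illustrate the workflow.

```python
import csv
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_keywords(path: Path, keywords) -> dict:
    """Count occurrences of each keyword in one file without executing it."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return {kw: text.count(kw) for kw in keywords}


def scan(root, keywords, extensions) -> dict:
    """Scan all files under root whose suffix is in extensions, in parallel."""
    files = sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.suffix in extensions
    )
    with ThreadPoolExecutor() as pool:
        counts = list(pool.map(lambda p: count_keywords(p, keywords), files))
    return {str(p): c for p, c in zip(files, counts)}


def export_json(results: dict, out_path) -> None:
    """Write the per-file keyword counts as JSON."""
    Path(out_path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def export_csv(results: dict, keywords, out_path) -> None:
    """Write one row per file and one column per keyword."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", *keywords])
        for path, counts in results.items():
            writer.writerow([path, *(counts[kw] for kw in keywords)])
```

A call such as `scan("decoded_apk", ["sendTextMessage", "getDeviceId"], {".smali"})` (directory and keywords are hypothetical) would return a mapping from file path to keyword counts that either export function can write out.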
The rest of the chapter sections present and discuss the related works
regarding malware analysis techniques, malware detection, and the use of
static analysis for malware detection. Moreover, they present the proposed
developed software (ASParseV3), which performs static features extraction
and parsing. Also, the chapter demonstrates a use case of Android OS
malware static features extraction using the ASParseV3 software. Finally,
conclusions with a summary of possible future works are presented.
The efficiency of the parsing approach strongly affects the overall static
analysis process. The authors of [18] applied a canonical representation
to enhance the parsing of Android code in their static analysis tool,
PetaDroid. The core of this solution is to define the application’s
behavior by tracking the APIs it uses and the actions it performs, and
consequently to fingerprint malware applications. Besides API calls,
permissions can be used to determine a malicious application’s behavior.
In [29], the APK file was decomposed using APKtool to retrieve the
Manifest file and the classes.dex file. These files were parsed to extract
the permissions and the API calls, respectively. Then, multidimensional
behavior analysis was conducted on the extracted features to develop a
malware portrait. Even though many static parsing tools exist, they are
not flexible in accepting many file systems and can extract only a limited
number of features. Moreover, they do not have a customizable graphical
user interface (GUI). Therefore, there is a need for a customizable
GUI-based system that can scan an unlimited number of features on various
file systems.
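The permission-extraction step mentioned above can be illustrated with a
minimal Python sketch. It is not the implementation used in [29] or in
ASParseV3; it only assumes that the manifest has already been decoded to
plain XML (for example by Apktool). The sample manifest and the function
name extract_permissions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.sample">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.READ_SMS"/>
    <uses-permission android:name="android.permission.INTERNET"/>
    <application android:label="Sample"/>
</manifest>
"""


def extract_permissions(manifest_xml: str) -> list[str]:
    """Return the sorted, de-duplicated permissions requested in a manifest."""
    root = ET.fromstring(manifest_xml)
    names = {
        elem.get(f"{{{ANDROID_NS}}}name")
        for elem in root.iter("uses-permission")
    }
    names.discard(None)  # ignore malformed entries without a name attribute
    return sorted(names)


permissions = extract_permissions(SAMPLE_MANIFEST)
print(permissions)
```

The resulting list could then serve as a keyword list for a static scan of
the remaining application files.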
the files to export the results. Finally, after the results are exported, they
can be visualized via a customizable dashboard.
Fig. 3.3 Selecting and customizing file types windows. (a) Selecting Window. (b)
Customizing Window
Fig. 3.4 Selecting and customizing keywords windows. (a) Selecting Window.
(b) Customizing Window
ASPARSEV3: AUTO-STATIC PARSER AND CUSTOMIZABLE VISUALIZER 51
the settings button can be used to edit the list of keywords, as illustrated
in Fig. 3.4b.
Fig. 3.6 Visualization window and page. (a) Visualization Window. (b)
Dashboard Page
on the “Scan” button. Finally, the progress bar provides the user with real-
time updates on the scanning progress.
1 https://ibotpeaches.github.io/Apktool/.
2 https://apkcombo.com/.
3 https://www.virustotal.com/gui/home/upload.
3.3.3.3 Validation
The validation process for ASParseV3 was carried out thoroughly to ensure
that its performance, user interface (UI), and user experience (UX) met
the requirements. The Security Engineering Lab (SEL) conducted the
validation and compared the scanning results of ASParseV3 with those of
previous releases of ASParse. In addition, VirusTotal was used to retrieve
information such as the permissions used in the applications/APKs, which
was compared with the ASParseV3 output to further verify the accuracy of
its scanning results. To validate the use case, VirusTotal was used to
collect the permissions used by the APK. Figure 3.10 shows a sample of the
permissions used by the APK validation test sample. The resulting
permissions were then used to scan the same APK using ASParseV3. The
results showed that ASParseV3 could scan the uploaded APK and accurately
report the number of occurrences of each permission. Overall, the
validation process demonstrates that ASParseV3 is a reliable and efficient
tool for scanning features of applications and APKs, such as permissions.
The comparison with previous releases and the use of VirusTotal helped
ensure the accuracy of the scanning results. For example, Table 3.3 lists
the number of occurrences of each permission found by ASParseV3 during the
validation process. Moreover, using ASParseV3 to scan the same application
without specifying any keywords revealed additional permissions/API calls
beyond those retrieved from VirusTotal, as Table 3.4 illustrates. Hence,
this validates the accuracy of ASParseV3 and its additional capabilities
compared with similar tools.
{
  {
    "ApplicationPath": [
      "/Users/rahaf/Desktop/Use Case/Malware Sample 5",
      "/Users/rahaf/Desktop/Use Case/Malware Sample 4",
      "/Users/rahaf/Desktop/Use Case/Malware Sample 3",
      "/Users/rahaf/Desktop/Use Case/Malware Sample 2",
      "/Users/rahaf/Desktop/Use Case/Malware Sample 1",
      ...
      "/Users/rahaf/Desktop/Use Case/Benign Sample 5",
      "/Users/rahaf/Desktop/Use Case/Benign Sample 4",
      "/Users/rahaf/Desktop/Use Case/Benign Sample 3",
      "/Users/rahaf/Desktop/Use Case/Benign Sample 2",
      "/Users/rahaf/Desktop/Use Case/Benign Sample 1"
    ],
    "OutputPath": "/Users/rahaf/Desktop",
    "filetypes": [
      "xml",
      "smali",
      "dex"
    ],
    "selectedFileTypes": [
      "smali",
      "xml",
      "dex"
    ],
    "keywords": [
      "android",
      "android/accessibilityservice",
      "android/accounts",
      "android/animation",
      "android/annotation",
      "android/app",
      "android/app/admin",
      "android/app/assist",
      "android/app/backup",
      "android/app/blob",
      "android/app/job",
      "android/app/role",
      "android/app/slice",
      "android/app/usage",
      "android/appwidget",
      "android/bluetooth",
      "Button",
      "Bundle",
      "Callback"
      ...
    ],
    "selectedKeywords": [
      "android",
      "android/animation",
      "android/app",
      "Button",
      "Bundle",
      "Callback"
    ],
    "ExperimentName": "Experiment_One"
  }
}
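To make the role of these settings concrete, the following Python sketch
shows one way a configuration with the keys ApplicationPath,
selectedFileTypes, and selectedKeywords could drive a keyword count. It is
an illustration under stated assumptions, not the ASParseV3 implementation:
the function name count_keywords is hypothetical, and plain substring
matching is assumed as the counting rule.

```python
import os
import tempfile
from collections import Counter


def count_keywords(config: dict) -> Counter:
    """Count occurrences of each selected keyword in files of the selected types."""
    extensions = {"." + ext.lower() for ext in config["selectedFileTypes"]}
    counts = Counter({keyword: 0 for keyword in config["selectedKeywords"]})
    for app_path in config["ApplicationPath"]:
        for folder, _dirs, files in os.walk(app_path):
            for name in files:
                if os.path.splitext(name)[1].lower() not in extensions:
                    continue  # skip file types that were not selected
                path = os.path.join(folder, name)
                with open(path, errors="ignore") as handle:
                    text = handle.read()
                for keyword in counts:
                    # Assumption: a keyword "occurrence" is a substring match.
                    counts[keyword] += text.count(keyword)
    return counts


# Build a tiny throwaway "application" so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as app_dir:
    with open(os.path.join(app_dir, "Main.smali"), "w") as f:
        f.write("invoke-virtual Landroid/app/Activity;\nLandroid/os/Bundle;\n")
    with open(os.path.join(app_dir, "notes.txt"), "w") as f:
        f.write("Bundle Bundle Bundle")  # ignored: .txt is not selected
    demo_config = {
        "ApplicationPath": [app_dir],
        "selectedFileTypes": ["smali", "xml"],
        "selectedKeywords": ["android/app", "Bundle"],
    }
    result = count_keywords(demo_config)

print(dict(result))
```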
Acknowledgments The authors would like to thank Prince Sultan University
for its support. Moreover, this research was done during the author Iman
Almomani’s sabbatical year 2021/2022 from the University of Jordan, Amman,
Jordan.
REFERENCES
1. Al Khayer A, Almomani I, Elkawlak K (2020) ASAF: android static
   analysis framework. In: 2020 first international conference of smart
   systems and emerging technologies (SMARTTECH). IEEE, New York,
   pp 197–202
2. Almohaini R, Almomani I, AlKhayer A (2021) Hybrid-based analysis
   impact on ransomware detection for Android systems. Appl Sci
   11(22):10976
3. Almomani I, Ahmed M, El-Shafai W (2022) Android malware analysis in a
   nutshell. PloS One 17(7):e0270647
4. Almomani I, AlKhayer A, Ahmed M (2021) An efficient machine
   learning-based approach for Android v. 11 ransomware detection. In:
   2021 1st international conference on artificial intelligence and data
   analytics (CAIDA). IEEE, New York, pp 240–244
5. Almomani I, Alkhayer A, El-Shafai W (2022) An automated vision-based
   deep learning model for efficient detection of android malware
   attacks. IEEE Access 10:2700–2720
6. Almomani I, Khayer A (2019) Android applications scanning: the guide.
   In: 2019 International conference on computer and information
   sciences (ICCIS). IEEE, New York, pp 1–5
7. Alsoghyer S, Almomani I (2019) Ransomware detection system for Android
   applications. Electronics 8(8):868
8. Anupama ML, et al (2021) Detection and robustness evaluation of
   android malware classifiers. J Comput Virol Hacking Tech 18(3):1–24
9. Ardito L, et al (2020) Automated test selection for Android apps based
   on APK and activity classification. IEEE Access 8:187648–187670
4.1 INTRODUCTION
The Internet has witnessed an explosion in the kinds of tools and attack
techniques available to attackers in recent years. Attackers continuously
develop advanced tools and techniques to bypass defense technologies,
conceal their identities, and evade detection. A wide variety of tools is
available for controlling compromised systems in target environments
[1, 25, 40, 41]. These tools implement different ways to communicate
across the network. This has resulted in a remarkable increase
B. Al-Duwairi
Department of Network Engineering and Security, Jordan University of Science
and Technology, Irbid, Jordan
A. S. Shatnawi
Department of Software Engineering, Jordan University of Science and
Technology, Irbid, Jordan
domain name of the mothership server does not map directly to the real
IP address assigned to the mothership server.
The botherder configures the bot machines to act as proxies that forward
traffic between end users and mothership servers (Step 3). Each bot
machine that is part of this network is therefore referred to as a flux
agent. When visiting the website hosted at myfastfluxdomain.com, end users
resolve the mothership domain name myfastfluxdomain.com to IP addresses
through the domain name system (Step 4). The domain name resolves to a set
of IP addresses that belong to the botnet and represent fast-flux agents.
Note that flux agents are mainly compromised machines with intermittent
connectivity, limited computational power, and low to average bandwidth.
Finally, the end user accesses the content through one of the flux agents
returned in the DNS reply (Step 5).
The botnet of flux agents thus forms a protective layer for the hidden
malicious server. To increase the resilience of the network and to evade
detection, the botherder keeps changing the domain name registration
rapidly. This type of FFSN is called single-flux. There is a more
sophisticated type called double-flux, in which the botherder also quickly
changes the mapping between the authoritative name server of the FFSN and
its IP addresses, resulting in a constantly changing set of DNS servers
and therefore providing a layer of protection for the FFSN’s original
authoritative name server. In this type
68 B. AL-DUWAIRI AND A. S. SHATNAWI
Fig. 4.2 Output of the first dig of the fast-flux domain rgyui.top (performed on
June 12 2022)
FAST-FLUX SERVICE NETWORKS: ARCHITECTURE, CHARACTERISTICS, AND… 69
the DNS response message for domain timeline.com. The common thing
about these two domains is that they resolve to multiple IP addresses.
However, a careful inspection of the response messages reveals major
differences that would allow us to distinguish between fast-flux domains
and CDN-hosted domains. Tables 4.1 and 4.2 show the country and ASN of
each IP address for both domains.
It is clear that the IP addresses of a fast-flux domain are usually
distributed across different countries and belong to several autonomous
systems, while the IP addresses of CDN-hosted domains are located in the
same country (in most cases) and belong to the same autonomous system.
This is expected because of the intrinsic behavior of fast-flux service
networks, where domain names are registered with the IP addresses of
botnet flux agents located in different countries and belonging to
different autonomous systems. One of the main characteristics of fast-flux
networks is the short TTL value assigned to their domain names compared to
other domains. This is necessary to ensure the frequent and rapid change
in the mapping between the IP addresses of flux agents and fast-flux
domain names. As expected, performing another dig for the fast-flux domain
name rgyui.top shortly after the first dig returned another set of IP
addresses, as shown in Fig. 4.4.
Fast-flux networks are characterized by frequent and fast mapping
changes between domain names and IP addresses. As a result, a particular
domain name maps to many IP addresses (selected from the pool of flux
agents controlled by the botherder) over a short period of time. For
example, the domain name rgyui.top maps to 17 distinct IP addresses based
on two consecutive DNS lookups. Of course, this number grows quickly when
more DNS queries are performed over a longer period of time (Table 4.3).
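The IP address growth described above amounts to tracking the size of the
union of all answers seen so far. The following Python sketch illustrates
this; the addresses are hypothetical documentation-range values, not the
real answers for rgyui.top, and the function name ip_growth is an
assumption.

```python
def ip_growth(lookups: list[list[str]]) -> list[int]:
    """Cumulative number of distinct IP addresses after each DNS lookup."""
    seen: set[str] = set()
    growth = []
    for answer in lookups:
        seen.update(answer)       # add the IPs returned by this lookup
        growth.append(len(seen))  # distinct IPs observed so far
    return growth


# Hypothetical answers of two consecutive lookups (RFC 5737 test addresses).
first_lookup = ["192.0.2.1", "192.0.2.2", "198.51.100.7"]
second_lookup = ["192.0.2.2", "203.0.113.5", "203.0.113.9"]

growth = ip_growth([first_lookup, second_lookup])
print(growth)
```

For a domain whose records never change, the sequence stays flat, whereas
for a fast-flux domain it keeps increasing with every new lookup.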
In order to provide a reliable service and overcome the problem of
blacklisting fast-flux domain names, fast-flux operators keep registering
new domain names for their content servers. These domain names remain
active for a short period and are assigned IP addresses from the pool of IP
Fig. 4.4 Output of the second dig of the fast-flux domain rgyui.top (performed
on June 12 2022)
Fig. 4.6 Sample FF hostnames and their FFSNs observed in the study conducted
in [2]
ing the domain name with many IP addresses provides high availability
of the malicious server as it increases the probability that one of the
flux agents is up and running.
• Large IP growth. To avoid blacklisting, the mapping between a fast-flux
domain and agent IP addresses keeps changing over time. Therefore,
the number of IP addresses associated with a given fast-flux domain
becomes large.
• Low TTL value. Since the mapping between a domain name and IP
addresses changes very fast in FFSNs, the TTL values are kept low.
This guarantees that the cached records expire soon after the fast-flux
domain is resolved, so that users obtain the new list of IP addresses.
• Large number of autonomous systems. The IP addresses returned in
response to a DNS query for a fast-flux domain represent compromised
machines that belong to different organizations
Table 4.4 List of features extracted from DNS response message for fast-flux
domain rgyui.top and legitimate domain timeline.com

Feature                                        rgyui.top               timeline.com
# IP addresses returned in one DNS lookup      10                      12
Domain name length                             9                       12
TTL value of DNS record                        75                      284
# of distinct ASNs for all IP addresses in     6                       1
  a single DNS lookup
# unique IP addresses returned in all DNS      Additional information  Additional information
  lookups (IP address growth)                  required                required
strategically placed network sensors while users are surfing the Internet.
For example, with reference to the DNS response message of the fast-
flux domain rgyui.top shown in Fig. 4.2 and legitimate domain name
timeline.com shown in Fig. 4.3, the features shown in Table 4.4 can be
extracted directly from their DNS response messages.
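As a rough illustration of how the single-lookup features of Table 4.4
could be computed, the Python sketch below works on an already parsed DNS
answer. The record type ARecord, the function dns_features, the domain
flux.example, the addresses, and the ASN values are all hypothetical; a real
system would obtain them from a DNS library and an IP-to-ASN service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    ip: str   # resolved IPv4 address
    ttl: int  # time to live in seconds


def dns_features(domain: str, records: list[ARecord],
                 ip_to_asn: dict[str, int]) -> dict[str, int]:
    """Compute single-lookup features from the A records of one DNS answer."""
    ips = {record.ip for record in records}
    return {
        "num_ips": len(ips),
        "domain_length": len(domain),
        # Records of one RRset normally share a TTL; min() is a safe summary.
        "ttl": min(record.ttl for record in records),
        "num_asns": len({ip_to_asn[ip] for ip in ips if ip in ip_to_asn}),
    }


# Hypothetical answer and ASN table (documentation-range values only).
answer = [
    ARecord("192.0.2.10", 60),
    ARecord("198.51.100.20", 60),
    ARecord("203.0.113.30", 60),
]
asn_table = {"192.0.2.10": 64496, "198.51.100.20": 64497, "203.0.113.30": 64496}

features = dns_features("flux.example", answer, asn_table)
print(features)
```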
Active DNS probing starts with a list of suspect domain names, which are
usually obtained from email spam traps by extracting the URLs embedded in
spam emails and stripping the domain names from them. A DNS lookup is then
performed for each suspect domain name using the Unix dig utility or
another DNS lookup tool such as nslookup. The main features are then
extracted from the DNS replies. A significant problem with this approach
is that it results in a large number of DNS queries, which may be mistaken
for a form of DDoS attack. In passive DNS probing, information about all
domain names queried by users in an organizational network is collected
passively. The collected DNS traffic traces are analyzed to filter suspect
domain names based on specific criteria. While this approach does not
incur additional DNS traffic, it must process many DNS traffic traces,
which requires significant computational and storage resources. However,
it has the advantage of avoiding the false DNS replies that attackers
controlling authoritative name servers might provide when they observe a
large number of DNS queries. Moreover, it can discover fast-flux domains
that appear in different malicious sources such as phishing emails, hacker
forums, and online social networks.
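The first step of active probing, turning spam email text into a list of
suspect domain names, can be sketched in Python as follows. The regular
expression, the function name suspect_domains, and the sample message are
illustrative assumptions; production systems typically need more careful
URL parsing.

```python
import re
from urllib.parse import urlparse

# Simplified pattern: an http(s) URL up to whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def suspect_domains(email_body: str) -> list[str]:
    """Extract the distinct host names of all URLs embedded in an email body."""
    domains = set()
    for url in URL_PATTERN.findall(email_body):
        host = urlparse(url).hostname  # lower-cased host without port or path
        if host:
            domains.add(host)
    return sorted(domains)


# Hypothetical spam message using reserved example domains.
spam = (
    "Cheap pills at http://shop.flux.example/buy?id=1 or "
    '<a href="https://FLUX.example/x">here</a>.'
)

domains = suspect_domains(spam)
print(domains)
```

Each extracted name would then be resolved (for example with dig) and the
reply passed on to feature extraction.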
Table 4.5 List of features extracted from geolocation databases for fast-flux
domain rgyui.top and legitimate domain timeline.com

Feature                                        rgyui.top               timeline.com
# of distinct ASNs for all IP addresses in     6                       1
  a single DNS lookup
# of distinct ASNs for all IP addresses in     Additional information  Additional information
  all DNS lookups                              required                required
# of distinct countries                        4                       1
# of distinct countries for all IP addresses   Additional information  Additional information
  in all DNS lookups                           required                required
Fig. 4.7 Shodan search result for fast-flux domain rgyui.top IP address
211.171.233.126
Fig. 4.8 Shodan search result for fast-flux domain rgyui.top IP address
222.232.238.243
Fig. 4.9 Shodan search result for legitimate domain timeline.com IP address
52.2.173.203
DNS response message was eight. On the other hand, Shodan search results
for two IP addresses (IP1, 52.1.173.203, and IP2, 52.6.3.192) selected
arbitrarily from the set of IP addresses of the legitimate domain name
timeline.com are shown in Figs. 4.9 and 4.10, respectively. The search
results show that both IP addresses have the same ports, 80 and 443, open.
In fact, all IP addresses that correspond to this domain have the same
ports open. Table 4.6 shows the list of features extracted from Shodan.io
for fast-flux domain rgyui.top and legitimate domain timeline.com.
Fig. 4.10 Shodan search result for legitimate domain timeline.com IP address
52.1.119.170
Table 4.6 List of features extracted from Shodan.io for fast-flux domain rgyui.top
and legitimate domain timeline.com

Feature                                   rgyui.top      timeline.com
# of distinct open ports                  11             2
# IP addresses found in the database      8 out of 10    12 out of 12
ing a non-fast-flux domain. Flux agents work as proxy nodes that relay
traffic between end users and mothership servers. Going through these
agents takes additional processing and communication time. Typically,
fast-flux agents are office or home machines with limited computational
power and intermittent Internet connectivity over low-speed links [15].
In addition, a flux agent’s actual owner is expected to run several
applications and use the available bandwidth. This means that there is a
good chance that a connection to a flux agent does not succeed on the
first attempt or incurs additional overhead, resulting in additional delay
in setting up the connection.
Performing active delay measurement can indicate whether a domain name
is a fast-flux domain name and may contribute to detecting fresh fast-flux
domains that have not yet appeared on any blacklist or do not have enough
DNS-related information to decide whether they are fast-flux domains or
not. Here, response time variations can be observed spatially and
temporally. Spatial variations arise because a fast-flux domain name maps
to multiple IP addresses that are distributed in different locations, and
temporal variations are due to the fluctuating workload on flux agents
over time. In other words, a delay measurement between an end user machine
and a flux agent is affected by the workload on that flux agent, which
depends on the running applications and Internet usage.

Table 4.7 List of features commonly used by different fast-flux detection
mechanisms

Feature                                        Source(s)                        Mode (Active/Passive)
# IP addresses returned in one DNS lookup      DNS                              Active/Passive
# unique IP addresses returned in all DNS      DNS                              Active/Passive
  lookups (IP address growth)
# nameserver (NS) records in one single        DNS                              Active/Passive
  lookup
TTL value of DNS record                        DNS                              Active/Passive
# of distinct ASNs for all IP addresses in     IP to ASN service                Active
  a single DNS lookup
# of distinct ASNs for all IP addresses in     IP to ASN service                Active
  all DNS lookups
# of distinct countries                        IP to location service           Active
# of distinct open ports                       Internet wide scanning database  Active
# of distinct open ports                       Internet wide scanning database  Active
Response time difference                       Active delay measurement         Active
Table 4.7 summarizes the main features commonly used by fast-flux
detection mechanisms and shows the source of each feature and whether
this feature can be obtained actively or passively. It is to be noted that
various fast-flux detection mechanisms may use other features that are
primarily derived from the list shown in this table.
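One simple way to quantify the spatial and temporal response-time
variations mentioned for active delay measurement is sketched below in
Python. This is only an illustrative statistic under stated assumptions; it
is not the scoring used by any of the cited detection systems, and the
function name delay_variation and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def delay_variation(samples: dict[str, list[float]]) -> dict[str, float]:
    """Summarize response-time variation.

    `samples` maps each resolved IP address to the response times (in ms)
    measured for it at different points in time.
    """
    per_ip_means = [mean(times) for times in samples.values()]
    return {
        # Spread of the average delay across the resolved IP addresses.
        "spatial_std": pstdev(per_ip_means),
        # Spread over time for each IP address, averaged over all addresses.
        "temporal_std": mean(pstdev(times) for times in samples.values()),
    }


# Hypothetical measurements for two resolved addresses.
measurements = {
    "192.0.2.1": [100.0, 300.0],
    "198.51.100.7": [400.0, 400.0],
}

variation = delay_variation(measurements)
print(variation)
```

Large values of either statistic would be consistent with the behavior
expected from flux agents, although a threshold would have to be determined
empirically.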
names and their resolved IP addresses. For a given domain name, the DNS
Message Aggregator module aggregates information from all observed DNS
messages corresponding to that domain during a certain time interval. This
includes the set of resolved IP addresses for that domain, the number of
DNS queries observed during the monitoring time interval, and the average
TTL value of the collected A records. The aggregated DNS messages pass
through the Message Pre-filtering module, which filters out messages that
correspond to unlikely fast-flux domains. The remaining domain names are
processed by the Domain Clustering module, where domain names that share
the same set of IP addresses are grouped together in one cluster. Finally,
a machine learning-based classifier is used to classify each domain as a
fast-flux or non-fast-flux domain name.
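The clustering step can be illustrated with a short Python sketch that
groups domain names by their exact set of resolved IP addresses. This is a
literal reading of "share the same set of IP addresses"; the actual module
may rely on a similarity measure instead. The function name
cluster_domains and the sample data are hypothetical.

```python
from collections import defaultdict


def cluster_domains(resolved: dict[str, set[str]]) -> list[list[str]]:
    """Group domain names that resolve to exactly the same set of IPs."""
    clusters: dict[frozenset, list[str]] = defaultdict(list)
    for domain, ips in resolved.items():
        clusters[frozenset(ips)].append(domain)  # set of IPs is the cluster key
    return sorted(sorted(group) for group in clusters.values())


# Hypothetical aggregated observations (reserved example names and addresses).
observed = {
    "a.example": {"192.0.2.1", "192.0.2.2"},
    "b.example": {"192.0.2.2", "192.0.2.1"},
    "c.example": {"203.0.113.9"},
}

clusters = cluster_domains(observed)
print(clusters)
```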
PASSVM [2] is a mechanism that performs online detection of fast-flux
domain names based on features extracted from the DNS response message
itself, a local Censys database, and a local geolocation database. The
features include the number of IP addresses in the DNS response message,
the TTL value, the domain name length, the number of distinct countries
where the IP addresses are located, and the number of distinct ASNs to
which the IP addresses belong. As depicted in Fig. 4.12, whenever a user
visits a website, the A records of a suspect domain name received in the
DNS reply to a DNS query are analyzed, and the required features are
obtained on the fly. The SVM machine learning algorithm then decides, also
on the fly, whether the domain name is a fast-flux domain or a
non-fast-flux domain.
Among the different features used in PASSVM, two new features extracted
from the Censys database have significantly improved the accuracy of
fast-flux detection:
• IP ratio: the ratio of the number of IP addresses returned from Censys
to the number of IP addresses submitted in the query.
• Ports: the number of distinct open port protocols for all IP addresses
returned from the Censys search engine.
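The two features can be computed from a lookup table that stands in for
the scan database, as the following Python sketch shows. No real Censys or
Shodan API is called here; the dictionary db_hits, the function name
scan_db_features, and all values are hypothetical.

```python
def scan_db_features(queried_ips: list[str],
                     db_hits: dict[str, set[tuple[int, str]]]) -> dict[str, float]:
    """Compute the IP ratio and the number of distinct open port protocols.

    `db_hits` maps every IP address found in the scan database to its set
    of open (port, protocol) pairs.
    """
    found = [ip for ip in queried_ips if ip in db_hits]
    ports: set[tuple[int, str]] = set()
    for ip in found:
        ports.update(db_hits[ip])
    ratio = len(found) / len(queried_ips) if queried_ips else 0.0
    return {"ip_ratio": ratio, "ports": len(ports)}


# Hypothetical query of four addresses, three of which are in the database.
queried = ["192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.5"]
hits = {
    "192.0.2.1": {(80, "http"), (443, "https")},
    "192.0.2.2": {(80, "http"), (22, "ssh")},
    "198.51.100.7": {(8080, "http")},
}

db_features = scan_db_features(queried, hits)
print(db_features)
```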
The authors in [16] proposed a fast-flux detection mechanism that
is based on computing a FastFlux Score value. The system, called fast-
flux domain detector (FFDD), consists of three major modules that
include retriever, resolver, and recorder. The retriever performs active
DNS probing using the UNIX dig utility for fast-flux domain names and
legitimate domain names obtained from public sources. Also, it performs
active delay measurements between the client machine and each of the
resolved IP addresses after making the necessary formatting of the URL
link. For each domain name, the resolver calculates the FF-Score value
based on the response time measurements collected by the retriever and
stored by the recorder module.
FFDD has a training phase and a testing phase. The main objective
of the training phase is to determine the FF-Score of known fast flux
and legitimate domain names. The FF-Score threshold value is selected
4.6 CONCLUSION
Fast-flux service networks represent a major trend in the operation and
management of botnets, malware distribution networks, and online
spam/scam campaigns. In these campaigns, spammers flood the inboxes of
thousands of email users with advertisements about specific products or
services (e.g., pharmaceutical, adult content, and phishing). The
advertisements usually include hyperlinks to websites representing these
campaigns’ point of sale. Traditionally, spammers host the point-of-sale
website using a domain name that maps to a single IP address or to
multiple IP addresses that remain constant for a considerable amount of
time, which allows defenders to quickly identify and blacklist these IP
addresses, therefore denying access to spammers’ websites. On the other
hand, fast-flux service networks provide a layer of protection for the
point-of-sale website by mapping the website to multiple IP addresses that
keep changing at a fast rate.
REFERENCES
1. Agarwal V, Mishra P, Kumar S, Pilli ES (2022) A review on attack
and security tools at network layer of IoT. Optical and Wireless
Technologies 2020:497–506
2. Al-Duwairi B, Jarrah M, Shatnawi AS (2021) PASSVM: a highly
accurate fast flux detection system. Comput Secur 110:102431
3. Alexa: top sites. https://www.alexa.com/topsites (Accessed on
20/6/2022)
4. Almomani A (2018) Fast-flux hunter: a system for filtering online
fast-flux botnet. Neural Comput Applic 29(7):483–493
5. Al-Nawasrah A, Almomani AA, Atawneh S, Alauthman M (2020)
A survey of fast flux botnet detection with fast flux cloud comput-
ing. International Journal of Cloud Applications and Computing
(IJCAC) 10(3):17–53
6. Anagnostopoulos M, Kambourakis G, Kopanos P, Louloudakis G,
Gritzalis S (2013) DNS amplification attack revisited. Comput
Secur 39:475–485
7. Aslan ÖA, Samet R (2020) A comprehensive review on malware
detection approaches. IEEE Access 8:6249–6271
8. Bakiras S, Loukopoulos T (2005) Combining replica placement
and caching techniques in content distribution networks. Comput
Commun 28(9):1062–1073
31. MacFarland DC, Shue CA, Kalafut AJ (2017) The best bang for
the byte: characterizing the potential of DNS amplification attacks.
Comput Netw 116:12–21
32. Man K, Qian Z, Wang Z, Zheng X, Huang Y, Duan H (2020) Dns
cache poisoning attack reloaded: revolutions with side channels. In:
Proceedings of the 2020 ACM SIGSAC conference on computer
and communications security, pp 1337–1350
33. Nagunwa T, Kearney P, Fouad S (2022) A machine learning
approach for detecting fast flux phishing hostnames. J Inf Secur
Appl 65:103125
34. Passerini E, Paleari R, Martignoni L, Bruschi D (2008) Fluxor:
detecting and monitoring fast-flux service networks. In: Interna-
tional conference on detection of intrusions and malware, and
vulnerability assessment. Springer, Berlin, pp 186–206
35. Perdisci R, Corona I, Giacinto G (2012) Early detection of mali-
cious flux networks via large-scale passive DNS traffic analysis. IEEE
Trans Dependable Secure Comput 9(5):714–726
36. Rana S, Aksoy A (2021) Automated fast-flux detection using
machine learning and genetic algorithms. In: IEEE INFOCOM
2021-IEEE conference on computer communications workshops
(INFOCOM WKSHPS). IEEE, New York, pp 1–6
37. Salusky W, Danford R (2007) Know your enemy: fast-flux service
networks. In: The Honeynet Project, pp 1–24
38. Shodan search engine. https://www.shodan.io/ (Accessed on
20/6/2022)
39. Silva SS, Silva RM, Pinto RC, Salles RM (2013) Botnets: a survey.
Comput Netw 57(2):378–403
40. Thanh Vu SN, Stege M, El-Habr PI, Bang J, Dragoni N (2021)
A survey on botnets: incentives, evolution, detection and current
trends. Future Internet 13(8):198
41. Tuan TA, Long HV, Taniar D (2022) On detecting and classifying
DGA botnets and their families. Comput Secur 113:102549
42. VirusTotal. https://www.virustotal.com/ (Accessed on
20/6/2022)
43. Wang HT, Mao CH, Wu KP, Lee HM (2012) Real-time fast-flux
identification via localized spatial geolocation detection. In: 2012
IEEE 36th annual computer software and applications conference.
IEEE, New York, pp 244–252
44. Williams J, King J, Smith B, Pouriyeh S, Shahriar H, Li L (2021)
Phishing prevention using defense in depth. In: Advances in secu-
rity, networks, and Internet of Things. Springer, Cham, pp 101–
116
45. Zang XD, Gong J, Mo SH, Jakalan A, Ding DL (2018) Identifying
fast-flux botnet with AGD names at the upper DNS hierarchy. IEEE
Access 6:69713–69727
46. Zhou S (2015) A survey on fast-flux attacks. Inf Secur J: Global
Perspect 24(4–6):79–97
CHAPTER 5
5.1 INTRODUCTION
Cyberattacks are continuously increasing, and the total value at risk
globally due to these attacks is estimated to reach 5.2 trillion USD by
2023 [37]. Cyberattacks employ different attack vectors, and a common
goal is the insertion of malware into target systems. Malware is defined
as a piece of software designed to cause damage or a program that performs
an undesired action, whether to disrupt a system or to gain unauthorized
access to it. Malware detection and classification is a hard task that
becomes increasingly difficult when we consider the high production rate
of new malicious executables and re-purposing of previously deployed
B. Tsouvalas
Stony Brook University, Stony Brook, NY, USA
D. Serpanos
Computer Technology Institute and Press DIOPHANTUS, University of Patras,
Patras, Greece
such as VirusTotal [5] and VirusShare [2], using their hash. Due to
these limitations, most efforts that require the availability of executable
samples employ proprietary or privately collected datasets, and, thus,
their results are not reproducible, and there is no ability for independent
comparisons. Since our approach requires analysis of executable samples
prior to classification, we have created our own dataset, collecting benign
and malicious executable samples; to demonstrate the dissimilarity and
diversity of the collected samples, we provide a similarity metric for both
sets of samples as we describe in Sect. 5.4.
\[ A_\times = A_1 \times A_2 \tag{5.1} \]
\[ p_\times = p_1 \otimes p_2 \quad \text{and} \quad q_\times = q_1 \otimes q_2 \tag{5.2} \]
The random walk graph kernel for the pair $G_1$ and $G_2$ is defined as
\[ \kappa(G_1, G_2) = \sum_{k=0}^{T} \mu(k)\, q_\times A_\times^{k}\, p_\times \tag{5.3} \]
where:
– $A_\times$ is the adjacency matrix of $G_\times$, and, thus, $A_\times^{k}$
represents the probability of simultaneous k-length random walks on
$G_1$ and $G_2$;
– $p_\times$ and $q_\times$ are the initial and stopping probability
distributions of $G_\times$, respectively (as described in Eq. (5.2));
– $T$ is the maximum length of a random walk;
– $\mu(k) = \lambda^{k} \in [0, 1]$ is a coefficient that controls the
importance of length in random walks, which we use to ensure that the sum
converges and the kernel value is well defined [66].
In the implementation of our method, we use $p_1 = q_1 = 1/|N_1|$ and
$p_2 = q_2 = 1/|N_2|$, i.e., uniform distributions over the nodes of $G_1$
and $G_2$, as commonly used [21, 28]. For the extracted abstract API call
graphs, we note that:
– $A_\times$ in the kernel definition refers to an unweighted graph, while
in the case of a weighted graph, $A_\times$ is replaced with the matrix
$W_\times$, which contains the edge weights;
– the nodes of the graphs in our method are labeled, where the node
labels are the names of the corresponding distinct API calls. Thus,
$W_\times = A_\times$, leading to Eq. (5.3) [28, 57, 66].
\[ \hat{\kappa}(G_1, G_2) = \frac{\kappa(G_1, G_2)}{\max(\kappa(G_1, G_1), \kappa(G_2, G_2))} \tag{5.5} \]
Normalization leads to kernel values in the range $[0, 1]$, where 1 is the
result when a graph is compared to itself.
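The kernel of Eq. (5.3) and its normalization in Eq. (5.5) can be
illustrated with a small pure-Python sketch for unweighted graphs. It
assumes that the product in Eq. (5.1) is the Kronecker product of the two
adjacency matrices, which is the usual construction for the direct product
graph, and it uses the uniform distributions stated above. The function
names, the default values of lam and max_len, and the toy graphs are
hypothetical; node labels are ignored for brevity, and the sketch makes no
attempt at the efficiency of the authors' implementation.

```python
def kron_matrix(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_vector(u, v):
    """Kronecker product of two vectors."""
    return [x * y for x in u for y in v]


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def random_walk_kernel(a1, a2, lam=0.5, max_len=3):
    """Sum over k = 0..max_len of lam**k * q_x . (A_x**k p_x)."""
    n1, n2 = len(a1), len(a2)
    p = kron_vector([1.0 / n1] * n1, [1.0 / n2] * n2)  # uniform start
    q = list(p)                                         # uniform stop
    a_x = kron_matrix(a1, a2)
    value, walk = 0.0, p  # walk holds A_x**k p_x for the current k
    for k in range(max_len + 1):
        value += (lam ** k) * sum(q_i * w_i for q_i, w_i in zip(q, walk))
        walk = mat_vec(a_x, walk)
    return value


def normalized_kernel(a1, a2, **params):
    """Kernel value divided by the larger of the two self-similarities."""
    k12 = random_walk_kernel(a1, a2, **params)
    k11 = random_walk_kernel(a1, a1, **params)
    k22 = random_walk_kernel(a2, a2, **params)
    return k12 / max(k11, k22)


# Toy graphs: a single edge and a path with three nodes.
edge = [[0, 1], [1, 0]]
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

print(random_walk_kernel(edge, edge, lam=0.5, max_len=2))
print(normalized_kernel(edge, path3))
```

Comparing a graph with itself yields a normalized value of 1, in line with
the remark on Eq. (5.5).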
EFFICIENT GRAPH-BASED MALWARE DETECTION USING MINIMIZED KERNEL… 103
5.3.4 Classification
Classification is performed using all kernel values of the executable pairs
.(S, Di ), where .Di ∈ DS, as well as all kernel values of the executable pairs
.(Dk , Dl ), where .Dk , Dl ∈ DS; these values are .M , where .M = |DS|. These
2
kernel values are used for a support vector machine (SVM) classification
scheme.
The training of the SVM is performed using the set of the M vectors,
where each vector has the form $[(x_{11}, y_1), \ldots, (x_{1n}, y_1)]$,
$[(x_{21}, y_2), \ldots, (x_{2n}, y_2)]$, $\ldots$,
$[(x_{n1}, y_n), \ldots, (x_{nn}, y_n)]$, where $x_{ij}$ represents the
kernel value of the comparison of samples $S_i$ and $S_j$, and $y_i$ is the
class label, which is an integer value $a$ with $a = -1$ or $a = 1$ for
malicious and benign executables, respectively.
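The structure of these training vectors can be sketched in Python as a
matrix of pairwise kernel values plus a label per row. The kernel used
below is a hypothetical stand-in (overlap of API call names), not the
random walk graph kernel of the method, and the sample data and function
names are illustrative assumptions.

```python
def kernel_training_set(samples, labels, kernel):
    """Return (rows, labels) where rows[i][j] = kernel(samples[i], samples[j])."""
    rows = [[kernel(s_i, s_j) for s_j in samples] for s_i in samples]
    return rows, list(labels)


def toy_kernel(a: frozenset, b: frozenset) -> float:
    """Hypothetical stand-in kernel: Jaccard overlap of API call names."""
    return len(a & b) / len(a | b)


# Hypothetical samples described by the API calls they contain.
samples = [
    frozenset({"CreateFileW", "WriteFile"}),
    frozenset({"CreateFileW", "WriteFile", "CryptEncrypt"}),
    frozenset({"MessageBoxW"}),
]
labels = [1, -1, 1]  # 1 = benign, -1 = malicious, as in the text

rows, y = kernel_training_set(samples, labels, toy_kernel)
for row, label in zip(rows, y):
    print(label, [round(value, 3) for value in row])
```

A matrix of this form can be passed to an SVM implementation that accepts
precomputed kernels, for example scikit-learn's SVC with
kernel="precomputed".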
Using the computed kernel values for the benign and the malicious
samples, we also evaluate binary classification using SVM. Considering
as baseline the case of the unweighted AACG, we evaluate the weighted
AACG approach.
5.4.1 Dataset
We have created our own dataset for the experimentation and testing of
our method, due to lack of appropriate public datasets. Focusing on the
Windows operating system, the dataset includes both benign and malicious
Windows executable files, containing only labeled executable samples. The
included benign samples are installation and support files and have been
collected from trusted sources such as Windows [3], Git [4], Cygwin [6],
and Codeblocks [7]. The dataset includes 567 benign executables, with
size ranging from several hundred KB to several MB. The malware samples
have been drawn from VirusShare [2], which is a website that provides
malware samples for academic and scientific purposes. We collected 827
malicious executables, with sizes ranging from several hundred KB to
several MB.
the maximum average similarity metric, is 32% for benign and 28.58%
for malware samples, respectively. This indicates that the similarity metric
between benign and malicious samples is sufficiently different.
Fig. 5.3 Benign-malware kernel values for dataset size: (from left to right) 50,
200, and 300, for unweighted graph (top) and weighted graph (bottom)
Fig. 5.4 SVM accuracy for unweighted and weighted abstract API call graph for
the different dataset sizes and train-to-test splits
5.5 CONCLUSIONS
We introduced an efficient and effective static method for malware detec-
tion, which employs API call graphs. Our method is based on the cal-
culation of an appropriate abstract API call graph, with reduced size
taking into account problem constraints. Furthermore, it includes efficient
calculation of a random walk graph kernel as a similarity metric. Through
experiments using an appropriate dataset, we show that the calculated
kernel constitutes an effective metric, which can be readily used for
malware classification with machine learning methodologies such as SVM.
Employing SVM and considering two different cases for the abstract API
call graph, an unweighted and a weighted one, we demonstrate that our
method is comparable to available alternatives when using unweighted
graphs, reaching more than 99% accuracy, and outperforms alternatives
when employing weighted graphs.
REFERENCES
1. (2019). https://ghidra-sre.org/ [Online; accessed 12-July-2022]
2. (2022). https://virusshare.com/ [Online; accessed 12-July-2022]
3. (2022). https://www.microsoft.com/en-us/windows [Online; accessed 12-July-2022]
4. (2022). https://git-scm.com/ [Online; accessed 12-July-2022]
5. (2022). https://www.virustotal.com/gui/home/upload [Online; accessed 12-July-2022]
6. (2022). https://www.cygwin.com/ [Online; accessed 12-July-2022]
7. (2022). https://www.codeblocks.org/ [Online; accessed 12-July-2022]
8. Ah-Pine J (2010) Normalized kernels as similarity indices, pp 362–373. https://doi.org/10.1007/978-3-642-13672-6_36
9. Alazab M, Layton R, Venkataraman S, Watters P (2010) Malware
detection based on structural and behavioural features of api calls
10. Alazab M, Venkataraman S, Watters P (2010) Towards understanding
malware behaviour by the extraction of api calls. In: 2010
112 B. TSOUVALAS AND D. SERPANOS
6.1 INTRODUCTION
The emergence of the Internet has provided a powerful means of com-
munication and data sharing, which has a huge impact on the worldwide
economic growth. However, systems and networks have become exposed
M. Belaoued • N. Chekkai
Caplogy, Velizy-Villacoublay, France
e-mail: [email protected]; [email protected]
A. Derhab ()
Center of Excellence in Information Assurance (CoEIA), King Saud University,
Riyadh, Saudi Arabia
e-mail: [email protected]
C. Ramdane
LICUS, University of 20 Aout 1955 Skikda, Skikda, Algeria
e-mail: [email protected]
1 https://cybersecurityventures.com/cybercrime-damages-6-trillion-by-2021/.
2 https://www.comparitech.com/antivirus/malware-statistics-facts/.
3 https://www.securitymagazine.com/articles/97166-ransomware-attacks-nearly-doubled-in-2021.
N. Seddari
LIRE Laboratory, Abdelhamid Mehri-Constantine 2 University, Constantine,
Algeria
e-mail: [email protected]
A. Bouras
Department of Industrial Engineering, College of Engineering, Alfaisal
University, Riyadh, Saudi Arabia
e-mail: [email protected]
Z. Guessoum
CReSTIC EA 3804, University of Reims Champagne Ardenne, Reims, France
e-mail: [email protected]
DEEP LEARNING FOR WINDOWS MALWARE ANALYSIS 121
[Figure: deep learning architectures: RBMs, DNNs, Deep Q-Learning, GANs,
Autoencoders, CNNs, RNNs, GRUs, LSTM]
the graph structure or node labels. GNNs have been applied to a variety of
tasks, including node classification, link prediction, and graph classification.
However, during the testing step, deep learning algorithms are faster than
ML ones. Moreover, DL requires machines with significant computing
power and multiple GPUs because of the large volumes of data it processes,
whereas ML can function on low-end machines with CPUs.
From Table 6.1, we can also observe that deep learning layers are
able to learn and solve problems without human intervention, while
machine learning models largely depend on human intervention.
Finally, machine learning is suitable for simple applications such as
prediction and forecasting, while deep learning is used to solve complex
problems.
Some surveys are short [75, 81] or restricted to some analysis type like
static analysis [68] or dynamic analysis [26]. Urooj et al. [87] only focused
on ransomware using dynamic analysis. Four surveys [5, 32, 75, 81] are the
closest to our work as they cover deep learning and target Windows oper-
ating system. However, our work differs from the four earlier-mentioned
surveys in the following points:
• Multiple detection approaches and multiple operating system platforms
are covered in [5]. In [75, 81], only the deep learning approach
is considered, and two operating systems, i.e., Windows and Android
OS, are targeted. In contrast, our work focuses solely on deep learning
and only considers Windows OS.
• No taxonomy is provided in [75, 81]. In [32], a feature-based taxonomy
is proposed. In contrast, our work proposes a taxonomy that
classifies works with respect to different criteria.
6.5.1.1 Detection
Malware detection is the process of identifying the presence of malware on
a computer or network. It is a binary classification problem,
where the outcome indicates whether the analyzed sample is malicious or
benign. The detection task is generally the primary step in the malware
analysis process. Once a sample is identified as malicious, it needs
to be assigned to a specific category or family. This is the role of the
classification task, which is discussed in the following subsection. Among
the surveyed papers, we observed that the majority
(≈60%) propose malware classification solutions, while the rest
deal with the malware detection problem.
Fig. 6.3 Proposed taxonomy for malware analysis using deep learning
6.5.1.2 Classification
Malware classification is the process of categorizing malware according
to its type or function. It is a more general approach to dealing with
malware: rather than trying to detect every individual piece of malware,
it focuses on identifying and categorizing different types of malware.
It assigns the malware to a category or a family of malware with which
it shares specific characteristics (e.g., target, behavior, etc.). This
information can then be used to develop better detection and removal
methods.
namely, static analysis and dynamic analysis [26, 80]. Dynamic analysis
requires the execution of the program. This is carried out in a controlled
setting that is typically created using an emulator (virtual environment)
[26]. This can be useful for understanding how the malware interacts with
the system and what its purpose is. Static analysis, on the other hand,
does not require the execution of the program; instead, the analyzed
program is disassembled. Disassembly, a core step of reverse engineering,
is the process of converting a compiled program (machine code, bytecode)
into a more human-readable format (i.e., assembly code). There are pros
and cons to both static and dynamic malware analysis. For instance, static
analysis is suitable for getting a general overview of what a piece of malware
does (i.e., malicious or benign), but it can be limited in its ability to show
how the malware actually behaves when executed and may not
uncover all of the malware’s functionality, since most existing malware is
obfuscated. Dynamic analysis, on the other hand, is better for seeing how
the malware behaves when executed, but it can be more difficult to set up
and more time-consuming. These two types of analysis can therefore be
combined, resulting in what we call hybrid analysis. Based on the
chosen analysis type, we can distinguish two main categories of features,
namely, static features and dynamic features.
Operation Code
Operation code, also known as opcode, is a part of the assembly code
instructions that identifies the operation to be performed (e.g., Push,
Move, ADD, etc.) by the processor. Malware detection systems use
PE Metadata
PE (portable executable) is the common file format for Microsoft Windows
executable files [71]. A PE file is composed of several parts including
headers (optional header, file header, etc.) as depicted in Fig. 6.4. These
headers contain rich metadata regarding the file. PE metadata has been
successfully leveraged in the context of malware analysis and detection,
allowing the design of lightweight and highly accurate malware detection
systems [7, 8, 78].
PE metadata has also been used to build deep learning-based malware
detection systems, such as in [50, 73, 76, 90, 97].
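As an illustration of the kind of metadata the PE headers expose, the following minimal sketch reads the COFF file header using only the Python standard library. It relies on the documented PE layout (the `MZ` signature, the 4-byte offset stored at 0x3C, the `PE\0\0` signature, and the 20-byte file header); the function name and the synthetic test bytes are assumptions made for the example, and it is not the feature extractor used by any of the cited works.

```python
import struct


def parse_pe_file_header(data: bytes) -> dict:
    """Extract a few file-header fields from the start of a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) signature")
    # Offset 0x3C of the DOS header stores where the PE signature begins.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 24:
        raise ValueError("truncated file header")
    # COFF file header: 20 bytes right after the 4-byte PE signature.
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "time_date_stamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }


# Synthetic 88-byte buffer standing in for the start of a real executable.
dos_stub = b"MZ" + bytes(58) + struct.pack("<I", 0x40)
file_header = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
sample = dos_stub + b"PE\x00\x00" + file_header
header = parse_pe_file_header(sample)
```

Fields such as the number of sections or the characteristics flags can then be placed in a numeric feature vector.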
134 M. BELAOUED
Execution Traces
By execution traces, we mean every action that is performed by the
malware during its execution and that modifies the state of the system (i.e.,
host-based indicators). These can be file manipulations, registry updates,
etc. The solutions presented in [22, 30, 85] employed this type of feature.
Network Traffic
By analyzing network traffic, it is possible to identify malicious activities
and take appropriate steps to mitigate the threat [9]. There is a variety of
techniques that can be used for network traffic analysis, including packet
inspection, flow analysis, and log analysis. Packet inspection is the most
granular form of traffic analysis, as it allows analysts to examine each
individual packet of data that passes through the network. Flow analysis
groups together packets that are part of the same communication. Log
analysis relies on data that has already been collected by network devices.
The work of David et al. [22] and Shibahara et al. [79] employed network
traffic features.
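Flow analysis, as described above, can be sketched as grouping packets by their endpoints and protocol. The example below is a simplified illustration: the `Packet` record, the choice of a direction-independent key, and the sample addresses (taken from documentation address ranges) are assumptions, not part of any cited system.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int  # packet length in bytes


def flow_key(p: Packet) -> Tuple:
    # Sort the two endpoints so both directions share one key.
    low, high = sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)])
    return (p.protocol, low, high)


def group_into_flows(packets: Iterable[Packet]) -> Dict[Tuple, Dict[str, int]]:
    """Aggregate per-flow packet and byte counts."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        stats = flows[flow_key(p)]
        stats["packets"] += 1
        stats["bytes"] += p.size
    return dict(flows)


captured = [
    Packet("192.0.2.10", 49152, "198.51.100.7", 443, "tcp", 120),
    Packet("198.51.100.7", 443, "192.0.2.10", 49152, "tcp", 1500),
    Packet("192.0.2.10", 53000, "192.0.2.1", 53, "udp", 60),
]
flows = group_into_flows(captured)
```

Per-flow statistics of this kind are one possible input for a learning model; packet inspection would instead keep the individual packets.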
6.5.3.1 Vectors
There are many ways to represent features for deep learning, but one of
the most popular is using a vector. A vector is simply a list of numbers,
and each number in it represents a particular feature. For example, a vector
might represent the set of features extracted from a malware sample. There
are several benefits to using vector representations for deep learning. First,
they are easy to work with and can be fed directly into most deep learning
algorithms. Second, vectors are often able to capture complex relationships
between features, which can be helpful for detecting patterns in data.
Finally, vectors can be easily extended to include additional features, which
can improve the accuracy of deep learning models. The solutions presented
in [20, 31, 36, 46, 47, 49, 50, 61, 62, 79, 85] employed vector-based
feature representation.
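A small sketch of a vector representation follows. It encodes which API calls from a fixed vocabulary were observed in a sample and then appends one extra numeric feature. The four-entry vocabulary and the appended section count are illustrative assumptions; real systems use far larger vocabularies.

```python
from typing import Iterable, List, Sequence


def presence_vector(observed: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return 1 for each vocabulary entry seen in the sample, else 0."""
    seen = set(observed)
    return [1 if name in seen else 0 for name in vocabulary]


# Hypothetical vocabulary of Windows API names, in a fixed order.
VOCABULARY = ["CreateFileW", "RegSetValueExW", "VirtualAlloc", "WriteProcessMemory"]

observed_calls = ["VirtualAlloc", "WriteProcessMemory", "VirtualAlloc"]
api_vector = presence_vector(observed_calls, VOCABULARY)

# Extending the representation amounts to appending more numbers,
# here an assumed section count taken from the PE header.
extended_vector = api_vector + [3]
```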
6.5.3.3 Graphs
There are two main types of graphs that are frequently employed in
malware analysis, namely, the control flow graph (CFG) and the function
call graph (FCG). A control flow graph (CFG) is a graphical representation
of the sequence of operations in a program. It is a directed connected
graph, where each node represents an instruction of the file’s assembly
code and each edge represents an execution sequence [28]. A function
call graph (FCG), on the other hand, is a graphical representation of the
sequence of function calls in a program. Both CFGs and FCGs can be used
to visualize the behavior of a program and to help debug it. However, they
have different uses. CFGs are more useful for understanding the overall
flow of a program, while FCGs are more useful for understanding the
sequence of function calls.
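Both graph types can be stored as adjacency lists. The sketch below builds a toy function call graph and derives its edge list and the set of functions reachable from an entry point. The function names in the graph are invented for the example and do not come from a real sample.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]

# Toy FCG: each key calls the functions listed as its value.
function_call_graph: Graph = {
    "main": ["decrypt_payload", "connect_c2"],
    "decrypt_payload": ["xor_buffer"],
    "connect_c2": ["send_data"],
    "xor_buffer": [],
    "send_data": [],
}


def edges(graph: Graph) -> List[Tuple[str, str]]:
    """Flatten the adjacency list into (caller, callee) pairs."""
    return [(src, dst) for src, dsts in graph.items() for dst in dsts]


def reachable(graph: Graph, start: str) -> Set[str]:
    """Collect every node reachable from start with a depth-first walk."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


call_edges = edges(function_call_graph)
from_decrypt = reachable(function_call_graph, "decrypt_payload")
```

A CFG can use the same structure with instructions or basic blocks as nodes instead of functions.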
In the case of DL-based malware detection, many researchers have opted
for behavior graphs as the main feature representation for the analyzed
samples. For instance, Ding and Siyi [25, 40, 102] opted for a CFG
representation, while [95] opted for an FCG one. Among these previous
solutions, only that of Hua et al. [40] used CFGs in their original form,
since it employed a graph neural network, namely, the deep graph
convolutional network (DGCNN). The remaining solutions transformed the
graph into either n-grams, as in [102], or a vector, as in [95].
also to track the changes made to existing malware. There are various ways
of representing images for this purpose, including 2D matrices, histograms,
run-length encoding, and wavelets. In addition, images can be generated
from the entire file (i.e., bytecode) or specific parts of it. Each of these
approaches has benefits and drawbacks of its own, and the choice of
representation will depend on the specific application and the DL
algorithms used. Utilizing a convolutional neural network (CNN) is the most
popular approach. CNNs can automatically extract features from images
and learn to classify them. Other approaches include using a recurrent
neural network (RNN) or a long short-term memory (LSTM) network,
which can learn to detect patterns over time. Deep learning models can
also be combined with traditional machine learning methods to improve
the performance. There are a few potential advantages to this approach.
For instance, it can be much faster than traditional scanning methods. In
addition, it can be more accurate, since the entire file can be analyzed. The
solutions presented in [3, 5, 18, 19, 21, 23, 38, 42, 44, 48, 52, 54, 65–
67, 88–90, 97, 100, 101, 103] used image-based feature representation.
namely, random forest [14] and extra random trees [29]. The second
baseline approach generates byte n-grams from the first 328 bytes of the PE
file, which correspond to the location of the PE headers. These features are then
fed to a logistic regression algorithm. The last approach consists of using two
types of neural networks, namely, fully connected neural networks (FCN)
and recurrent neural network (RNN), both fed with the aforementioned
raw byte region (328 bytes). Experimental results showed that the FCN
model outperforms the three others with regard to area under ROC curve
(AUC) and the balanced accuracy (BAC).
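The byte n-gram baseline mentioned above can be sketched as follows. The 328-byte limit comes from the text, while the n-gram length `n=2` and the function name are assumptions, since the description does not state which n was used.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2, limit: int = 328) -> Counter:
    """Count overlapping byte n-grams in the first `limit` bytes."""
    region = data[:limit]
    return Counter(region[i:i + n] for i in range(len(region) - n + 1))


tiny_counts = byte_ngrams(b"MZMZ", n=2)

# A 400-byte buffer is cut to 328 bytes, giving 327 overlapping 2-grams.
header_counts = byte_ngrams(bytes(400), n=2)
```

The resulting counts would still need to be mapped to a fixed-length numeric vector before being passed to a classifier such as logistic regression.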
Choi et al. [18] introduced a malware detection approach that is based
on the grayscale image representation of malware and benign binaries.
They proposed $256 \times 256$ images to represent the binaries, meaning that
they only consider the first 64 KB of the analyzed binaries. These images are
then fed to a CNN composed of three convolutional layers, each followed
by a pooling layer, in addition to two fully connected layers. They evaluated
the model on a dataset composed of 10,000 benign and 2000 malware
samples and achieved an accuracy of 95.66%.
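The relation between the image size and the 64 KB limit follows from using one byte per pixel: 256 × 256 = 65,536 bytes = 64 KB. A minimal sketch of such a conversion is given below. The zero padding of short files and the function name are assumptions, as the text does not say how files smaller than 64 KB are handled.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 256, height: int = 256) -> List[List[int]]:
    """Map the first width*height bytes to rows of pixel values (0-255)."""
    size = width * height
    # Truncate long inputs; pad short ones with zero bytes (assumption).
    padded = data[:size].ljust(size, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# 76,800 bytes of repeating 0..255 stand in for a real binary.
fake_binary = bytes(range(256)) * 300
image = bytes_to_grayscale(fake_binary)
short_image = bytes_to_grayscale(b"\xff\x10")
```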
In [46], a lightweight deep convolutional neural network (CNN)-based method
for detecting Windows malware is proposed. The proposed system
is composed of two main components, which are the instructions analyzer
and the classifier. The first component disassembles the analyzed
binaries, extracts the set of opcodes, groups them by functionality,
and, finally, maps them to a 2D array. Experiments on a dataset
containing around 70,000 samples show an overall accuracy of 95%, with a
training time of 10 hours for the system with one convolutional layer.
The study in [97] introduced MalNet, a novel self-learner malware
detection approach, which uses CNN and LSTM networks. MalNet has
two stages; the first one aims at statically analyzing the binaries and gen-
erating three types of features, namely, the grayscale image representation
of bytecode, the opcode sequences, and various PE metadata. The second
one is the core process of MalNet, in which the CNN and LSTM networks
learn, respectively, from the grayscale images and the opcode sequences.
In addition, and in order to optimize the detection performance, the
authors used a stacking ensemble that integrates the two networks’ outputs
alongside the metadata features and outputs the final prediction result.
The model was evaluated on more than 40,000 samples collected from
online software providers and Microsoft. The evaluation to an interesting
Table 6.4 Summary of malware detection solutions that employ static features

| Ref. | Year | Features used | Feature rep. | Dataset size | DL Algo. | Results |
|---|---|---|---|---|---|---|
| [76] | 2015 | Byte/entropy, PE imp meta | Histogram | 431,926 | DNN | Acc: 95.6–96.85% |
| [36] | 2016 | API calls | 1-hot vect | 50k | SAEs | Acc: 95.64% |
| [18] | 2017 | Bytecode | Image | 12k | CNN | Acc: 91% |
| [102] | 2017 | Opcode | CFG, n-grams | 4600 | DBN | Acc: ≈98% |
| [97] | 2018 | Bytecode, Opcodes, PE Metadata | Images, sequences | 40k | CNN, LSTM | Acc: 99.88% |
| [46] | 2018 | Opcodes | 2-D array | 70k | CNN | Acc: 95% |
| [19] | 2019 | Bytecode | Grayscale image | 5k | CNN | Acc: 91.9–97.6% |
| [23] | 2019 | Bytecode | Grayscale image | ≈10,000 | CNN | Acc: 99.4 91.9–97.6% |
4 https://marcoramilli.com/2016/12/16/malware-training-sets-a-machine-learning-dataset-for-everyone/.
of the control flow graph are extracted as the input of the DGCNN for
training, and the classifier is obtained to detect the packed malware.
Ding and Zhu [102] focused on studying the following problems: (a)
how to build a malware detection system based on DBNs, (b) whether
the unlabeled data can be used to improve the accuracy of malware
classification, and (c) whether the deep representation generated using
DBNs is helpful for feature extraction and dimension reduction. To this end,
the authors represented the malware program as opcode sequences and extracted
the opcode n-grams to specify the behavioral features of malware. The
architecture of the proposed system consists of three main components: the
PE parser, the feature extractor, and the malware detection module. The
testing results show that the proposed model has better classification results
than other models: support vector machines, decision trees, and k-nearest
neighbors.
Darabian et al. [20] studied the potential of applying deep learning
techniques to detect cryptomining malware by using both static and
dynamic analysis approaches. They used long short-term memory (LSTM)
and convolutional neural network (CNN) techniques to advance the
analysis of cryptomining malware. They considered a set of hybrid features
composed of the captured system call events and opcode sequences. The
proposed system achieved an accuracy rate of 95% using static features and
an accuracy rate of 99% using dynamic ones.
An effective approach based on deep learning analysis for malware
detection and explanation is proposed by Wang et al. in [91]; they used
a classifier to predict whether the sample is malicious and an interpreter
to explain the classifier result via a system call number sequence of the
target sample with instrumentation tools in an elaborated sandbox. The
approach just needs a small amount of feature data and can reduce the input
dimension of the training model. The authors also adopted the layer-wise
relevance propagation (LRP) algorithm to save the malware analyst time
and to find which slice of a sequence makes the greatest contribution in
the decision.
Aditya et al. [1] introduced an approach for detecting malware based
on a deep neural network that utilizes API call sequences. The model is
implemented with two different recurrent neural network architectures
for comparison (LSTM and GRU). The resulting classification model
employs the LSTM architecture with the RMSProp optimizer and a
learning rate parameter; the results show that LSTM is better than GRU,
achieving an accuracy of 97.3%.
Table 6.5 Summary of malware detection solutions that employ dynamic features

| Ref. | Year | Features used | Feature rep. | Dataset size | DL Algo. | Results |
|---|---|---|---|---|---|---|
| [69] | 2015 | API calls | Sequences | 500K | RNN, ESN, MLP | TPR: 71%, FPR: 0.1% |
| [79] | 2016 | Network traffic | 1-hot vect | 29,562 | RSTNN | F-score: 96.9% |
| [85] | 2016 | Process behavior | 1-hot vect, Image | 26 | RNN (LSTM), CNN | AUC: 96% |
| [6] | 2017 | System calls | Sequence | 75k | LSTM, MLP | Acc: 95.6% |
| [60] | 2017 | API | Sequence | 157 | LSTM | 96.67% |
| [61] | 2018 | PE, Bytec, APIs, Net. traffic | Vector | 3772 | DNN | Acc: 97% |
| [95] | 2019 | Call API | Graph | 1760 | SAE | Acc: 98.6% |
| [20] | 2020 | Opcodes, system calls | Scale values, binary vectors | 1500 | LSTM, ATT-LSTM, CNN | Acc: 95.99% |
| [40] | 2020 | Functions calls | Graph | 600 | DGCNN | Acc: 96.4% |
| [1] | 2021 | API calls | Sequences | 2210 | LSTM | Acc: 97.3% |
| [91] | 2021 | API calls | Sequences | 2950 | M-Bi-LSTM | Acc: 97.39% |
| [30] | 2021 | Execution traces | Sequences | 4000 | CNN + LSTM | Acc: 91.63% |
| [56] | 2022 | API calls | n-gram, sequences | 43,007 | CNN + LSTM | Acc: 97.31% |
are used to construct the execution paths of the analyzed samples. The
redundant sequences are then removed, and the remaining ones are
represented as 1-hot vectors of length 60 (i.e., the 60 documented system
calls) indicating the presence or absence of a specific kernel API call. These
feature vectors are then fed to a convolutional network as 3-grams
($3 \times 60$). The CNN acts as a feature extractor and generates two feature
vectors, which are forwarded to the RNN part of the neural network.
API trace dependencies are then modeled, and a mean-pooling approach
is used to extract the features of highest importance from the LSTM output,
which are forwarded to the softmax layer that classifies each instance into a
family. The experimental results show that the CNN-RNN combination
achieved better performance than the two models separately as well as
two machine learning classifiers, namely, hidden Markov model (HMM)
and support vector machine (SVM).
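The 1-hot encoding and the 3-gram windows described above can be sketched as follows. The vector length of 60 and the window length of 3 come from the text. The integer call identifiers are hypothetical indices into the list of documented calls, and the sketch covers only the input preparation, not the CNN or RNN.

```python
from typing import List, Sequence

NUM_CALLS = 60  # number of documented system calls, per the text


def one_hot(call_id: int, size: int = NUM_CALLS) -> List[int]:
    """Return a vector of zeros with a single 1 at position call_id."""
    if not 0 <= call_id < size:
        raise ValueError("call_id out of range")
    vector = [0] * size
    vector[call_id] = 1
    return vector


def three_grams(call_ids: Sequence[int], n: int = 3) -> List[List[List[int]]]:
    """Slide a window of n consecutive 1-hot rows over the trace (n x 60 each)."""
    rows = [one_hot(c) for c in call_ids]
    return [rows[i:i + n] for i in range(len(rows) - n + 1)]


trace = [0, 5, 59, 5]  # hypothetical call indices for one execution path
windows = three_grams(trace)
```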
Zhang et al. [103] introduced IRMD, which is a malware families
classification method based on CNN and an image-based representation of
opcode sequences. In the proposed method, the authors first disassemble
binary executables and extract opcode sequences that are represented
as 2-D array, which is then converted to grayscale images. The authors
then applied image processing techniques on the generated images, such
as histogram normalization, dilation, and erosion. The resulting images
were then fed to a CNN. The latter has a baseline three-level archi-
tecture composed of a convolutional layer, a pooling layer, and a fully
connected layer. A softmax function is then used to classify malware variant
and benign images. IRMD was evaluated on a dataset collected from
VxHeavens repository and composed of 9168 malware samples from 10
distinct malware families and 8640 benign samples and was able to achieve
96.7% of accuracy. Mourtaji et al. [65] also used a CNN with image
representation of malware samples. They were able to achieve the highest
accuracy (i.e., 99.88%) on two distinct experiment settings on Microsoft
BIG15 dataset. Similarly, Kumari et al. [52] relied on image representation
of the analyzed binaries and CNN for malware families classification.
However, they introduced three different CNN architectures. The first one
has baseline architecture that is composed of three convolutional layers.
Each convolutional layer is followed by a max-pooling layer and a ReLU
activation layer. The second one is based on the VGG-16 architecture
which is pre-trained on the ImageNet dataset, which is composed of 1000
classes. In the last model, the authors fine-tuned the last convolutional
block of the VGG-16 model as well as the top-level classifier. Yue et al.
[101] and Rahul et al. [74] also employed CNN for malware families
classification using image representation. In the first work [101], they
trained a very deep neural network (DNN) composed of ten layers and
a complex pre-processing method on the MalImg dataset and achieved
an accuracy of 97.32%, while in the second one [74], they trained a
baseline CNN architecture, two convolutional and two dense layers on
the BIG 2015 dataset Kalash et al. [44]. Hemalatha et al. [38] used a
pretrained densely connected convolutional network (DenseNet) model
with class-balanced loss function for reweighting the categorical cross-
entropy loss in the final classification layer. The DenseNet model uses fewer
parameters and ensures information flow by connecting all the layers in
the network with their feature maps. The performance of the proposed
model was evaluated on four malware datasets, namely, Malimg, BIG 2015,
Malicia, and Malvis, achieving, respectively, 98.23%, 98.46%, 89.48%, and
98.21% of accuracy. Zhihua et al. [19] developed an approach to advance
the detection of malicious programs using convolutional neural networks
(CNNs) and non-dominated sorting genetic algorithm II (NSGA-II). The
CNNs are used to identify and classify grayscale images converted from
executable files of malicious code. NSGA-II is then employed to deal
with the imbalanced data of malware families. A series of experiments is
performed on malware image data from the Vision Research Lab, and the
results show that the proposed method is effective and maintains higher
accuracy. Ni et al. [67] considered opcode sequences instead of bytecode.
They encoded these sequences using SimHash, treated the resulting values as
pixels, and converted them to grayscale images. Kebede et al. [48] opted
for a deep learning architecture composed of multilayer neural network
with auto-encoders applied on malware images. An approach based on
visualization and fine-tuned CNN is proposed by Vasan et al. in [88]; they
used color instead of grayscale images generated from the malware binaries
to identify and detect both packed and unpacked malware. The proposed
method is called image-based malware classification using ensemble of
CNNs (IMCEC). According to the experimental results on the Malimg malware
benchmark, the proposed model demonstrated 99% accuracy for unpacked
malware and 98% accuracy for packed malware. The problem with this
approach is that it considers the entire program’s binary, which is very large
and takes considerable time to process.
Venkatraman et al. [89] presented a hybrid model by employing simi-
larity mining and deep learning architectures for accurately detecting and
classifying obfuscated malware into their malware families. The proposed
model used two types of learning approaches: CNN and LSTM. The
objectives are (1) to describe the use of image-based techniques for
identifying suspicious system behavior and (2) to suggest and research the
use of hybrid image-based approaches with deep learning architectures for
efficient malware detection. The performance of the models is evaluated
on three datasets: VX Heavens, Malimg, and Microsoft. The model
achieved an average accuracy of 96%, with the added advantage of requiring
less computational cost than classical machine learning-based
methods.
The work of [100] introduces MDMC, a byte-level malware classification
approach based on Markov images and deep learning. In contrast
to grayscale images, the first phase of MDMC does not need to deal with the
issue of resizing and instead transforms malware binaries
into Markov images according to the byte transfer probability matrix. A
deep convolutional neural network is then used to classify the Markov images.
In this procedure, only malware binaries were employed; dynamic and
reverse analysis were not used. The performance of the proposed model was
assessed on two malware datasets, the Microsoft dataset and the Drebin
dataset, on which the average MDMC accuracy rates are
99.26% and 97.36%, respectively.
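A byte transfer probability matrix can be understood as a 256 × 256 table whose entry (i, j) is the fraction of occurrences of byte value i that are immediately followed by byte value j. Because the table always has the same size, no resizing step is needed. The sketch below computes such a table under that interpretation; how MDMC scales the probabilities into pixel intensities is not described in the text and is left out.

```python
from typing import List


def byte_transfer_matrix(data: bytes) -> List[List[float]]:
    """Estimate P(next byte = j | current byte = i) from consecutive bytes."""
    counts = [[0] * 256 for _ in range(256)]
    for current, following in zip(data, data[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for byte values that never occur as a predecessor stay at zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Transitions in this toy input: 0->1, 1->0, 0->2.
markov = byte_transfer_matrix(b"\x00\x01\x00\x02")
```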
Lin and Yeh [57] presented efficient one-dimensional convolutional
neural network (1D CNN) models for malware classification. The 1D CNN
models explore both bit-level and byte-level sequences extracted from
malware executables. The authors designed a simple 1D CNN architecture
to learn the features from raw binary sequences and to convert
malware executables into images. The experiments show that the proposed
1D CNN model achieves better performance with smaller resized byte
sequences, with accuracies of 96.32% and 98.70% on two benchmark
datasets.
The comparison of the classification capabilities of convolutional neural
networks (CNN) and extreme learning machines (ELM) for malware
images classification is the main objective of Jain et al.’s [42] work. They
used both two-dimensional images and one-dimensional vectors produced
from images to view malware samples as images and apply image analysis
algorithms. Results on the Malimg dataset showed that ELMs train faster
than CNNs and produce results with higher accuracy while processing 1D
data. The authors also noted that ELMs handle 2D data more quickly than
CNNs. Finally, authors concluded that ELMs are faster to train than CNNs,
but only by a relatively small factor as compared to image-based training.
The deep learning approach in [54] is practical for real-life uses since
it has two interesting properties: it requires neither feature
engineering nor a long time to classify the malware class of a binary
file. Indeed, the malware samples are converted into grayscale image
representation and then fed to different neural network models. The latter
are a combination of convolutional layers which process the input, with
RNN and LSTM layers. The test conducted on the malware data from the
Microsoft Malware Classification Challenge (i.e., BIG 2015) available on
Kaggle (10,868 samples) shows an accuracy of 98.2% in the cross-validation
procedure through the CNN bi-directional LSTM model.
In [66], the authors proposed an ensemble learning-based classification
system based on convolutional networks to classify malware programs.
In their research, they used the nine-class Microsoft Malware Classification
Challenge (BIG 2015) dataset. For each malware file in this dataset, there
is an assembly file and a compiled file. The compiled files are rendered
as images, and convolutional neural networks (CNNs) are then used to
classify these images.
Long short-term memory (LSTM) networks are used to classify machine
language opcodes in assembly files after they have been converted into
sequences. When classifying assembly files using an LSTM network, the
accuracy is 97.2%; when classifying compiled files with a CNN
architecture, the accuracy is 99.4%.
Darem et al. [21] suggested a semi-supervised method for detecting
obfuscated malware that combines opcode analysis, feature engineering,
image processing, and deep learning approaches. The proposed approach
transforms the malware binary into image for visual analysis of the malware
executable and contrasts with well-known grayscale image-based classifica-
tion methods. As a result, the approach identifies and predicts associated
malware families with minimal running time overhead. They validated
the proposed method through comprehensive experiments and compared
it with other methods. Experimental results proved that the proposed
approach achieved the highest performances with 99.12% of accuracy.
The work of [4] presents a new malware classification framework
based on a hybrid deep learning algorithm. The framework combines
two pretrained deep neural networks, namely, ResNet and AlexNet, in
order to learn features from malware samples, which are represented as
grayscale images, and classify them into different families. The framework
accuracy was 67.60%, with six classes containing five separate malware
kinds.
A malware detection and family classification framework
based on deep neural networks and visualization is proposed by Jian et al.
in [43]; they convert executable file samples into asm files and bytes files
using disassembly. As a result, a balanced experimental dataset
containing normal software samples and malware samples is constructed.
To this end, the authors designed a new data representation approach
based on the binaries and word vectors extracted from both asm files and
bytes files and combined visualization technology with data augmentation
to build an optimized deep neural network architecture, i.e., SERLA
(SEResNet50 + Bi-LSTM + Attention), for malware detection. The
experimental results show that the proposed method is superior to the
state-of-the-art methods and can achieve 98.31% accuracy. Li et al. [56]
proposed a deep learning framework for malware detection, which is based
on multiple intrinsic features of API sequences. The proposed
method is able to distinguish between malware and goodware. The authors
first applied embedding and convolutional layers to depict the actual
software behaviors. Second, they designed an encoder to represent the semantic
information of APIs and the relationship between API calls using the
Bi-LSTM module. The experiments show that the proposed method performs
better than all the baselines in using API sequences to detect malware,
achieving an accuracy score of 97.31%. In Table 6.6, we provide a summary
of the discussed malware classification solutions that employ static features.
Table 6.6 Summary of malware classification solutions that employ static features
| Ref. | Year | Features used | Feature rep. | Dataset size | DL Algo. | Results |
|---|---|---|---|---|---|---|
| [103] | 2016 | Opcode | Image | 17,808 | CNN | Acc: 96.7% |
| [101] | 2017 | Bytecode | Image | 9435 | CNN | Acc: 97.32% |
| [52] | 2017 | Bytecode | Image | 21,741 | CNN | Acc: 97.07% |
| [48] | 2017 | Bytecode | Image | 10,826 | AE | 99.15% |
| [62] | 2017 | API calls | Word2Vect | 5647 | CNN | Acc: 98% |
| [50] | 2017 | PE meta, imp, opcode | n-gram, vector | 22,757 | FFNN, CNN | F1S: 92% |
| [54] | 2018 | Bytecode | Image | 10,860 | CNN + BiLSTM | Acc: 98.2% |
| [44] | 2018 | Bytecode | Image | ≈30k | CNN | Acc: 99.97% |
| [67] | 2018 | Opcode | Image | ≈10k | CNN | Acc: 98.862% |
| [3] | 2019 | Bytecode | Image | 19,740 | CNN | Acc: 97.19% |
| [65] | 2019 | Bytecode | Grayscale image | ≈30k | LSTM | Acc: 97.02–99.88% |
| [47] | 2019 | Opcode, API calls | Binary vectors | 10,868 | LSTM | Acc: 97.59% |
| [89] | 2019 | Opcode, API calls | Image | ≈30k | CNN, LSTM | Acc: 96% |
| [73] | 2017 | PE metadata | n-gram | ≈95k | CNN, RNN | Acc: 90.8–97.7% |
| [31] | 2020 | APIs, bytecode, opcode | Binary vectors, n-gram | ≈10k | Multimodal CNN | Acc: 99.75% |
| [88] | 2020 | Bytecode | Sequence, image | ≈10k | CNN | Acc: 98.99% |
| [66] | 2020 | Opcode | Image | ≈10k | LSTM, CNN, RNN | Acc: 97.2, 99.4, 99.8% |
| [100] | 2020 | Bytecode | Image | ≈15k | CNN | Acc: 99.26, 97.36% |
| [42] | 2020 | Bytecode | Image | 9300 | CNN, ELM | Acc: 96.3, 97.7% |
| [21] | 2021 | Opcode, bytecode | n-gram, image | 10,868 | CNN + XGBoost | Acc: 99.12% |
| [4] | 2021 | Bytecode | Image | ≈40k | AlexNet, ResNet | Acc: 97.78% |
| [43] | 2021 | Opcode | n-gram | 10,868 | CNN + RNN | Acc: 98.31% |
| [64] | 2021 | Bytecode | n-gram, image | 179,725 | CNN | Acc: 96.15% |
| [38] | 2021 | Bytecode | Image | 21,741 | DenseNet | Acc: 98.46% |
| [57] | 2022 | Bytecode | Image | 10,868 | CNN | Acc: 96.32% |
152 M. BELAOUED
6.8 CONCLUSION
In this paper, we surveyed the state-of-the-art solutions for Windows
malware analysis using deep learning. We first provided the necessary back-
ground information regarding malware analysis as well as deep learning.
We then introduced our proposed taxonomy, and we discussed the existing
solutions with regard to this taxonomy.
In conclusion, we believe that deep learning can be extremely effective
in malware analysis and detection, especially when dealing with obfuscated
and zero-day malware. However, it is important to remember that no single
solution is perfect, and there are always trade-offs to be made. For example,
deep learning models may require more resources to train and deploy than
conventional machine learning solutions or signature-based approaches.
Additionally, deep learning models may be more susceptible to adversarial
attacks. Therefore, it is important to carefully consider the risks and benefits
of deploying a deep learning model for malware analysis.
REFERENCES
1. Aditya WR, Hadiprakoso RB, Waluyo A et al (2021) Deep
learning for malware classification platform using windows API
call sequence. In: 2021 international conference on informatics,
multimedia, cyber and information system (ICIMCIS). IEEE,
Piscataway, pp 25–29
2. Alzubaidi L, Zhang J, Humaidi AJ, Al-dujaili A, Duan Y, Al-
Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L
(2021) Review of deep learning: concepts, CNN architectures,
challenges, applications, future directions. J Big Data 8:1–74
3. Andrade EDO, Viterbo J, Vasconcelos CN, Guérin J, Bernardini
FC (2019) A model based on LSTM neural networks to identify
five different types of malware. Proc Comput Sci 159:182–191
4. Aslan Ö, Yilmaz AA (2021) A new malware classification frame-
work based on deep learning algorithms. IEEE Access 9:87936–
87951
71. Pietrek M (1994) Peering inside the pe: a tour of the win32 (r)
portable executable file format. Microsoft Systems Journal-US
Edition, pp 15–38
72. Qiu J, Zhang J, Luo W, Pan L, Nepal S, Xiang Y (2020) A survey
of android malware detection with deep neural models. ACM
Comput Surv 53(6):1–36
73. Raff E, Sylvester J, Nicholas C (2017) Learning the PE
header, malware detection with minimal domain knowledge. arXiv
preprint arXiv:1709.01471
74. Rahul R, Anjali T, Menon VK, Soman K (2017) Deep learning
for network flow analysis and malware classification. In: Interna-
tional symposium on security in computing and communication.
Springer, Berlin, pp 226–235
75. Sahin M, Bahtiyar S (2020) A survey on malware detection with
deep learning. In: 13th international conference on security of
information and networks, pp 1–6
76. Saxe J, Berlin K (2015) Deep neural network based malware
detection using two dimensional binary program features. In:
2015 10th international conference on malicious and unwanted
software (MALWARE). IEEE, Piscataway, pp 11–20
77. Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining
methods for detection of new malicious executables. In: Security
and privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Sympo-
sium on. IEEE, Piscataway, pp 38–49
78. Shafiq MZ, Tabish SM, Mirza F, Farooq M (2009) Pe-miner:
mining structural information to detect malicious executables
in realtime. In: International workshop on recent advances in
intrusion detection. Springer, Berlin, pp 121–141
79. Shibahara T, Yagi T, Akiyama M, Chiba D, Yada T (2016) Effi-
cient dynamic malware analysis based on network behavior using
deep learning. In: 2016 IEEE global communications conference
(GLOBECOM). IEEE, Piscataway, pp 1–7
80. Siddiqui M, Wang MC, Lee J (2008) A survey of data mining
techniques for malware detection using file features. In: Proceed-
ings of the 46th annual southeast regional conference on xx.
ACM, New York, pp 509–510
DEEP LEARNING FOR WINDOWS MALWARE ANALYSIS 163
S. E. ud Din Arshad
National University of Sciences and Technology, Islamabad, Pakistan
M. M. Nasralla (✉) • S. B. A. Khattak
Smart Systems Engineering Lab, Department of Communications and Networks
Engineering, Prince Sultan University, Riyadh, Saudi Arabia
T. A. Alhaj
Information Assurance and Security Research Group, Faculty of Computing,
Universiti Teknologi Malaysia, UTM, Johor Bahru, Malaysia
I. ur Rehman
School of Computing and Engineering, University of West London, London, UK
7.1 INTRODUCTION
Recent years have seen a sharp rise in the usage of Internet of Things (IoT)
devices in a number of sectors, including industry, health, automation,
and education, as well as smart homes and smart cities [1, 2]. According
to current estimates, there will be 75.44 billion connected IoT devices
worldwide by 2025 [3]. According to another study, the next significant
step in achieving the Internet’s goal of linking the entire world is the
network of connected “smart” products [3]. IoT technology is closely tied
to daily life and is used in many real-world applications. It has evolved
rapidly and now covers almost every aspect of modern life, with
applications ranging from home-based services to emergency management
services and from societal and environmental applications to industrial
and technological applications [4]. Each of these domains contains
thousands of use cases, for example, smart living rooms, smart kitchens,
smart garages, smart doors, smart cooling and refrigerating systems,
healthcare applications for older people, and other monitoring, tracking,
or reporting systems [2, 5]. Intelligent transportation and traffic
management is another example of IoT that has a significant effect on our
lives. Societal applications improve the lifestyle of the general public
and put many services at their fingertips. IoT has also revolutionized
security and surveillance through, for example, intrusion detection
systems and smart surveillance systems.
Wildlife monitoring, environmental monitoring, smart farming, observing
energy consumption patterns, electricity management, water distribution,
waste management, smart marketing, and many similar applications are
an essential part of our society now. IoT devices are usually connected
through the wireless channel because of their flexibility and mobility.
Several wireless communication technologies are used for IoT deployment,
depending upon the application requirements [4]. These communication
technologies can also be classified as long and short range. The most
commonly used short-range communication technologies are RFID, Wi-
Fi, ZigBee, and Bluetooth. The widely used long-range communication
technologies for IoT are Sigfox, LoRaWAN, Weightless, Narrow Band IoT,
and Enhanced Machine Type Communication (eMTC) (Fig. 7.1).
The common features among most applications are low cost, low
processing, low power, low storage, and low bit rate. Computers, smart-
phones, communication interfaces, RFID tags, actuators, readers, cameras,
controllers, GPS, sensors, operating systems, lightweight services, and
preloaded apps generally make up the IoT infrastructure. This technology is
not as secure as it seems, and it raises additional security and privacy
issues. IoT networks have weak or no security since they rely on inexpensive
devices (such as temperature sensors, security cameras, etc.) with constrained
MALWARE ANALYSIS FOR IOT AND SMART AI-BASED APPLICATIONS 167
manipulate data or disrupt systems over the global IoT network. IoT
faults and attacks may overshadow its benefits. In addition, standard
security methods and mechanisms are inadequate due to the low scalability,
integrity, and interoperability of existing devices. New approaches and
technologies are needed to address the security, privacy, and dependability
requirements of IoT. Cybersecurity on IoT platforms and in AI-based
applications is a major global concern that requires a comprehensive
evaluation from both the research and industrial communities. This chapter
evaluates security issues that are expected to limit IoT deployment and
explores different methods for the detection and evasion of cybersecurity
threats in the IoT domain. The chapter is structured as follows: Sect. 7.2
discusses the work related to IoT cybersecurity. In Sect. 7.3, the potential
threat challenges in relation to IoT applications and services are assessed.
In Sect. 7.4, malware attacks and threats are discussed. In Sect. 7.5,
malware detection and evasion approaches are presented, and the final
section concludes the chapter.
some of these studies and provide a brief overview of the existing work
in this domain.
The authors of [12] examined home automation systems, such as magnetic
sensors, motion sensors, and industrial IoT devices, and discovered
that smart meters are susceptible to a number of attacks because of the
inadequate security measures used during their development and deployment.
They recommend using an encrypted channel as a security mechanism when
these devices communicate over a radio frequency (RF) channel. Mutual
authentication, physically unclonable functions, finite resources,
side-channel analysis, and cloning attacks were all covered in [13], which
presents security mechanisms for protocol verification, session key
formation, and mutual authentication. Mutual authentication offers useful
information about key distribution between devices and during sessions.
The suggested remedy lessens the danger of replay and man-in-the-middle
attacks. The author of [14] discussed several security difficulties and
threats, privacy concerns, IoT device integration with the blockchain, and
various security research fields. One of the most critical problems in
Android/iOS applications is application repackaging.
The authors of [15] presented their research on repackaged software. They
addressed five issues: (1) the current unfavorable repackaging practices,
(2) the way adware is embedded in the code, (3) the kinds of apps used
to repackage, (4) the motivations behind people downloading repackaged
software, and (5) the way an app’s characteristics change in the repackaged
version. The drawback of static malware detection techniques, such as
TinyDroid, DroidFDR [16], DroidEnsemble, and NsDroid, is that they
are not appropriate for dynamic analysis. NsDroid is a quick and efficient
Android malware detection tool; because it is built on a local function
graph [19], it is 20 times faster than previous graph-based techniques
[17, 18]. The author of [20] identified a conflict between various sensors
managed by a smartphone and offered a solution based on Linked Open Data
(LOD). LOD makes it possible to use the services and features of the
resident’s profile more effectively while also defining the connections
between the various services and items in the home. The authors of [21]
provided useful insight into the use of signature- and anomaly-based
methods for detecting mobile malware.
The authors in [22] discuss the cybersecurity threats in mobile ad hoc
networks (MANETs), which play a key role in many IoT settings. MANETs are
vulnerable to numerous packet-drop attacks, including gray- and black-hole
attacks. The authors looked at numerous black-hole attack types and at
learning-based, cooperative, and other detection strategies. Their study
concludes that trust-based schemes perform better than the other schemes.
To ensure the availability, security, and reliability of MANETs, the threat
of botnets must be addressed. Newly developed botnets are designed to evade
detection systems, and differentiating between normal and botnet traffic
requires processing large amounts of data at a high computational cost.
The authors in [23] addressed this problem by developing a scalable and
decentralized framework that characterizes the behavior of legitimate
hosts in order to detect unseen botnet traffic.
Cross-architecture detection of IoT malware is very challenging because
IoT devices are highly heterogeneous. Graph-based malware detection
methods have been proposed as a solution to this problem [24]; they detect
complicated and zero-day malicious code with greater accuracy. MalInsight
[25], a malware detection system, profiles malware from three aspects:
basic structure, low-level behavior, and high-level behavior. Based on
these three aspects, it covers structural features and operations on
files, networks, and registries. The framework can quickly identify
previously unseen malware and makes it simpler for future researchers to
find spyware.
Wang et al. [26] utilized lightweight network analysis and machine
learning to develop a framework for malware identification on Android
devices. The authors combine machine learning with network traffic
analysis on the server side, with minimal resource consumption and minimal
impact on the user experience. A novel machine learning-based framework
for identifying cyber vulnerabilities and threats was put forth in [27];
it uses observed attack patterns to identify and detect cyberattacks.
Another machine learning technique, based on Hamming distance, is used
for malware detection in [28]. This method makes use of k-medoid-based
nearest neighbors (KMNN), weighted all nearest neighbors (WANN),
and first nearest neighbors (FNN). These algorithms, which have high
recall and precision rates, were employed to identify malicious software.
A classification model is proposed in [29] to detect mobile malware
attacks in IoT systems. Mobile malware attacks are mainly caused by
fraudulent mobile applications and injected malicious applications. Other
machine learning techniques for malware detection adopted in IoT and
AI-based smart systems are decision trees, SVMs, random forests, and
logistic regression [30, 31].
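As an illustration of Hamming-distance-based classification, the sketch below applies a plain nearest-neighbour rule to binary feature vectors. It is a simplified stand-in and not the KMNN, WANN, or FNN algorithms of [28]; the feature vectors (imagined here as flags for the presence of particular API calls) and the labels are made-up toy data.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_label(sample, training):
    """Return the label of the training vector closest to `sample`.

    `training` is a list of (feature_vector, label) pairs.
    """
    _, label = min(training, key=lambda item: hamming_distance(sample, item[0]))
    return label


# Hypothetical training set: each bit marks whether some API call is present.
training = [
    ((1, 1, 0, 1, 0), "malware"),
    ((1, 0, 0, 1, 1), "malware"),
    ((0, 0, 1, 0, 0), "benign"),
    ((0, 1, 1, 0, 0), "benign"),
]

print(nearest_neighbor_label((1, 1, 0, 0, 0), training))
print(nearest_neighbor_label((0, 0, 1, 0, 1), training))
```

Hamming distance suits this setting because the features are binary flags, so the distance is simply the number of flags on which two samples disagree.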
Other popular techniques for malware detection are sandbox environments,
blockchain technology, and deep learning. A sandbox is a testing
environment used to investigate malware behavior [32]. Kachare et al. [33]
propose a concept for a sandbox environment that analyzes malware, produces
reports automatically, and fixes issues. The suggested model employs
multiple machine learning algorithms to study malware at three different
levels: static malware analysis, real-time malware analysis, and network
analysis. Advanced persistent threats (APTs) evade anti-malware and
anti-virus systems along with other conventional security systems, so
techniques aimed at advanced evasion are used to tackle such malware. The
work in [34] uses the Analysis Evasion Malware Sandbox (AEMS) to discover
evasive malware behavior by measuring the divergence from a program’s
typical behavior. Blockchain relies on cryptography, decentralization, and
consensus for security. In [35], a blockchain-based malware detection
technique leveraging shared signatures of suspicious malware files is
proposed. With this technique, users can quickly respond to the growing
threat of malware by sharing the signatures of dubious files. Deep learning
is part of the machine learning family and has recently been widely used
in a wide range of applications, including cybersecurity [36]. The authors
in [37] develop a tool to detect IoT malware infections in smart home
networks. It analyzes IoT traffic captured by means of a spoofing technique.
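The idea of sharing signatures of suspicious files can be sketched as follows. This is only an illustration under stated assumptions: a SHA-256 digest is assumed as the signature, and an in-memory set stands in for the shared ledger. The scheme in [35] may define signatures and storage differently, and the helper names below are hypothetical.

```python
import hashlib

# In-memory stand-in for the shared signature store (an assumption of this
# sketch; a real system would use a distributed ledger).
shared_signatures: set[str] = set()


def file_signature(content: bytes) -> str:
    """Compute a signature for file content (SHA-256 assumed here)."""
    return hashlib.sha256(content).hexdigest()


def report_suspicious(content: bytes) -> str:
    """Publish the signature of a suspicious file and return it."""
    signature = file_signature(content)
    shared_signatures.add(signature)
    return signature


def is_known_suspicious(content: bytes) -> bool:
    """Check whether another participant already reported this content."""
    return file_signature(content) in shared_signatures


report_suspicious(b"payload of a dubious file")
print(is_known_suspicious(b"payload of a dubious file"))
print(is_known_suspicious(b"an unrelated file"))
```

Exact-hash signatures only match byte-identical files, which is one reason signature sharing is usually combined with behavior-based detection.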
• Hajime: Hajime and Mirai are comparable in that both use username
and password tables to spread via unsecured open Telnet ports. Unlike
Mirai, Hajime is built on a peer-to-peer network. The controller
issues commands to its peer network, and over time, the peer network
spreads the message to all other peers. This decentralized design is
robust, making the botnet harder to take down. Aside from its design,
Hajime has a few other advantages over Mirai: it takes several steps
to conceal its running processes and its files on the file system,
making it stealthier [56].
[Figure: malware detection techniques, grouped as signature based, behavior
based, heuristic based, model checking based, deep learning based, IoT
based, and mobile devices based]
Fig. 7.4 Signature-based malware detection
Fig. 7.5 Behavior-based malware detection
7.6 CONCLUSIONS
Cybersecurity has become a global concern, calling for improved security
mechanisms to investigate and react to cyberattacks. In this chapter,
we identified several application and service domain vulnerabilities
inherent to the IoT and smart systems. Conventional security solutions are
insufficient because they are ineffective at detecting novel cyberattacks.
Numerous cybersecurity systems make use of machine learning techniques. In
this chapter, we have covered threats to IoT and smart systems, as well as
a quick overview of malware detection and evasion approaches. It is
essential to investigate novel cyberattacks while simultaneously building
and executing solutions to resist them, so that the IoT and smart systems
can be utilized to their full potential.
Fig. 7.10 Bayesian model to detect abnormal data traffic and discriminate DDoS
attacks from FC [8]
REFERENCES
1. Nobakht M, Sivaraman V, Boreli R (2016) A host-based intrusion
detection and mitigation framework for smart home IoT using
openflow. In: 2016 11th International conference on availability,
reliability and security (ARES). IEEE, pp 147–156
2. Nasralla MM (2021) Sustainable virtual reality patient rehabilita-
tion systems with iot sensors using virtual smart cities. Sustainability
13(9):4716
3. Bendiab G, Shiaeles S, Alruban A, Kolokotronis N (2020) IoT malware
network traffic classification using visual representation and deep
learning. In: 2020 6th IEEE Conference on Network Softwarization
(NetSoft). IEEE, pp 444–449
4. Khattak SBA, Jia M, Marey M, Nasralla MM, Guo Q, Gu X (2022)
A novel single anchor localization method for wireless sensors in 5G
satellite-terrestrial network. Alexandria Eng J 61(7):5595–5606
60. Barriga JJ, Yoo SG (2017) Malware detection and evasion with
machine learning techniques: a survey. Int J Appl Eng Res
12(18):7207–7214
61. Shaukat K, Luo S, Varadharajan V, Hameed IA, Xu M (2020) A
survey on machine learning techniques for cyber security in the last
decade. IEEE Access 8:222310–222354
62. Patel C, Vyas S, Saikia P et al (2022) A futuristic survey on learning
techniques for internet of things (IoT) security: developments,
applications, and challenges.
63. Chen Z, Liu J, Shen Y, Simsek M, Kantarci B, Mouftah HT, Djukic
P (2022) Machine learning-enabled IoT security: open issues and
challenges under advanced persistent threats. ACM Comput Surv
(CSUR) 55(5):1–37
CHAPTER 8
8.1 INTRODUCTION
The concept of the Internet of Things (IoT) revolves around a time when there
will be more objects linked to the Internet than there will be humans.
Under the current Internet infrastructure, the Internet of Things refers to
the normal and attack traffic [21]. In the datasets used for intrusion
detection, the feature variables that influence the prediction of each
class must be identified so that valid testing and evaluation reflect the
trends and evident diversity in the data [22].
Put together, these stages form a pipeline for classifying IoT intrusions
using feature selection, oversampling, and feature importance. Although
such a pipeline has been adopted before, it has not been applied to one of
the most recent imbalanced datasets using new, popular classification
techniques. Hence, the main contributions of this chapter are as follows:
• Development of a multiclass classification model that predicts the
category label of IoT intrusion attacks.
• Application of a heuristic approach to feature selection to obtain valid
predictions for the detection of IoT intrusion attacks.
• Application of an oversampling technique to address the imbalanced class
distribution of the multiclass dataset.
• Identification of the feature variables that contribute most to the
predictive power.
• Application of the pipeline to the recent IoTID2020 dataset using new,
popular algorithms.
The rest of this chapter covers the literature review, security system
framework, background, research methodology, experimental results, and
discussion, followed by the conclusion and future work of this research.
of the data in the classes. The procedure was evaluated on the IoTID2020
dataset by dividing the training data into three clusters using k-means
clustering; each cluster was reduced by 10%, and the three reduced clusters
were then aggregated. SVM-SMOTE was then applied with an oversampling
ratio of 0.9, and the result was aggregated into an enlarged set used to
build the classification model on the oversampled data with the
single-hidden layer feed-forward neural network (SLFN) classification
method. When evaluated, the approach outperformed other approaches in terms
of G-mean (GM), precision (PREC), accuracy (ACC), and recall (REC) [23].
The same authors also proposed a multi-layer approach for IoT intrusion
detection on the IoTID2020 dataset that predicts both the intrusion
identification and the category label using SLFN and long short-term memory
(LSTM) with oversampling [17].
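The oversampling step can be illustrated with a simplified SMOTE-style interpolation. This sketch is not SVM-SMOTE as used in [23]: it creates each synthetic point on the line segment between a randomly chosen minority sample and its single nearest minority neighbour, and it assumes that an oversampling ratio of 0.9 means the minority class is grown to 0.9 times the majority class size. The data points and function name are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, rng):
    """Create `n_new` synthetic points by interpolating between a minority
    sample and its nearest minority neighbour (needs at least two samples)."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Index of the nearest other minority sample.
        j = min((k for k in range(len(minority)) if k != i),
                key=lambda k: squared_distance(base, minority[k]))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, minority[j])))
    return synthetic


majority_count = 10
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
target = round(0.9 * majority_count)  # assumed meaning of the 0.9 ratio
new_points = smote_like_oversample(minority, target - len(minority),
                                   random.Random(0))
print(len(minority) + len(new_points), "minority samples after oversampling")
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.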
Research on detecting IoT intrusions is ongoing, with many aspects,
theories, and techniques to investigate. Finding approaches for classifying
IoT intrusions can be quite challenging because of the many issues
surrounding the task. For example, class imbalance is a major issue in
classification, since some classes are underrepresented compared to others.
This reduces effectiveness, particularly in minority class prediction, when
using machine learning algorithms. There are numerous approaches to
tackling this matter, although the majority focus on two-class imbalance
problems. It has been shown that applying these algorithms to multiclass
problems is less efficient and has negative consequences. Wang and Yao [21]
examined why multiclass problems tend to be difficult for these approaches.
They studied the effect of multiple classes on random oversampling and
undersampling in the multi-majority and multi-minority cases and found the
results unsatisfactory. They therefore proposed using their ensemble
algorithm, AdaBoost.NC, along with oversampling to handle the multiclass
problem and improve the recognition of minority classes, which can improve
classification performance [21].
Furthermore, Abdi and Hashemi [20] addressed the multiclass imbalance
problem using the Mahalanobis distance-based oversampling technique (MDO)
to minimize the risk of overlap that can occur between regions of
different classes in the detection
202 Z. AMIERH ET AL.
8.4 BACKGROUND
This section discusses the preliminary information needed to understand
the main parts of the proposed methodology. It mainly includes a discus-
sion on the XGBoost and CatBoost classifier algorithms.
8.4.1 XGBoost
The XGBoost algorithm, like many other ensemble learning algorithms, is used
for regression and classification in supervised learning problems and on
large datasets where there are multiple features in the training data to
predict a target variable. It was developed by Chen and Guestrin [30] and
is built on the gradient boosting framework. XGBoost is a tree boosting
system that is popular for its scalability, as it can run ten times faster
than existing solutions on a single machine. This is due to its algorithmic
optimizations: sparse data is handled by a novel tree learning procedure,
and instance weights in approximate tree learning are handled by a
theoretically justified weighted quantile sketch procedure, which helps
the split-finding algorithms [31]. Learning is sped up with the use of
parallel and distributed computing. This helps in solving complex machine
learning problems by allowing rapid and accurate model generation [32].
It works by integrating the estimates of several simpler, weaker models to
try to accurately predict
Fig. 8.1 Security system framework to classify and alert IoT intrusion attacks
A MULTICLASS CLASSIFICATION APPROACH FOR IOT INTRUSION… 207
8.4.2 CatBoost
CatBoost is a depth-wise gradient boosting algorithm that was developed
by Yandex. In gradient boosting, trees are built one after another;
previous trees cannot be altered, but their results are used to improve
the next trees. The CatBoost algorithm uses past decision trees to grow a
balanced tree in which the left and right splits at each level of the tree
are made from the same features, as shown in Fig. 8.3. The algorithm
handles categorical features and numerical values and reduces overfitting,
with very low prediction time and high accuracy, so it can be used for
classification and regression in complex problems with large datasets.
CatBoost accepts the indices of categorical columns, so one-hot encoding
can be applied through the one_hot_max_size parameter. To reduce
overfitting, CatBoost uses an encoding method that permutes the set of
inputs in a random order, converts label values from floating point or
category to integer, and transforms the categorical feature values to
numeric values using the formula:
$$avg\_target = \frac{countInClass + prior}{totalCount + 1}$$
where countInClass is the number of times the label value was equal to 1
for objects with the present categorical feature value, prior is the
preliminary value for the numerator, determined by the starting parameters,
and totalCount is the total number of objects, up to the current one, whose
categorical feature value matches the current one. Furthermore, minimal
variance sampling (MVS) is a weighted sampling form of stochastic gradient
boosting that CatBoost employs. In this technique, weighted sampling
happens at the tree level instead of the split level. The observations for
each boosting tree are sampled in such a way that split scoring accuracy is
maximized. CatBoost is popular because it provides great results with the
default parameters, so less time is needed for parameter tuning; it reduces
overfitting, which improves accuracy; and the CatBoost model applier allows
fast prediction [34].
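The categorical encoding described above can be sketched in a few lines. The sketch assumes the statistic (countInClass + prior) / (totalCount + 1), where both counts only consider objects that appear earlier in the (already permuted) input order; the function name, the prior of 0.5, and the toy protocol values are illustrative assumptions, and the random permutation step is omitted.

```python
def ordered_target_statistics(categories, labels, prior=0.5):
    """Encode each categorical value using only the objects seen before it.

    For every object: (countInClass + prior) / (totalCount + 1), where
    countInClass counts earlier objects of the same category with label 1
    and totalCount counts all earlier objects of the same category.
    """
    count_in_class = {}
    total_count = {}
    encoded = []
    for category, label in zip(categories, labels):
        seen_positive = count_in_class.get(category, 0)
        seen_total = total_count.get(category, 0)
        encoded.append((seen_positive + prior) / (seen_total + 1))
        # Update the running counts only after encoding the current object.
        count_in_class[category] = seen_positive + label
        total_count[category] = seen_total + 1
    return encoded


# Hypothetical categorical feature (protocol) with binary labels.
protocols = ["tcp", "udp", "tcp", "tcp", "udp"]
labels = [1, 0, 0, 1, 1]
print(ordered_target_statistics(protocols, labels))
```

Using only earlier objects means an object's own label never leaks into its encoding, which is the overfitting safeguard that the permutation-based scheme aims for.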
8.5 METHODOLOGY
In this research, the multiclass approach to detecting intrusion attacks
follows a primary quantitative research method. The experiments analyze
specific data from a dataset using sets of variables, holding one constant
and measuring the differences against the other (training and testing
sets), which means that the information is gathered through self-conducted
research. Secondary research was used in the literature review, where
information was taken from different studies to support the proposed
research. Other quantitative approaches, such as surveys, and qualitative
research (e.g., interviews) are not used, as they are not necessary for
this project and do not meet our objectives: both focus on human
experiences, behavior, and opinions, while this experiment examines the
traffic of a home network and its connected devices for malicious
activity. No known ethical issues are associated with this work, since no
humans participate in the research and the data, which contains no human
data, is readily available online [35]. Thus, the quantitative experimental
research approach is the most effective way to reach the objectives.
Caution is nevertheless required, as quantitative research has limitations:
it can be difficult to understand the context of a phenomenon and to
explain complex issues when the data is not robust enough, and it can be
costly in time and money [36].
Furthermore, this methodology focuses on the following aspects, as shown
in Fig. 8.4:
• Problem understanding and formulation
• Data collection
• Data preprocessing
• Model development
• Evaluation and assessment
210 Z. AMIERH ET AL.
Table 8.3 Number of features selected and not selected based on the variance
threshold technique
Feature selection (80 features overall, variance threshold 0.8)
# of features selected | # of features not selected
60 | 20
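The variance threshold selection summarized in Table 8.3 can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name, the toy values, and the columns Constant_Flag and Low_Spread are assumptions made for the example, and only the 0.8 threshold and the feature names Flow_Duration and Src_Port come from the chapter.

```python
from statistics import pvariance

def variance_threshold_select(columns, threshold):
    """Return the names of the columns whose variance exceeds the threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]

# Toy feature matrix stored column-wise; all values are made up.
features = {
    "Flow_Duration": [0.0, 2.0, 4.0, 6.0],      # population variance 5.0
    "Src_Port":      [10.0, 12.0, 10.0, 12.0],  # population variance 1.0
    "Constant_Flag": [1.0, 1.0, 1.0, 1.0],      # population variance 0.0
    "Low_Spread":    [0.0, 1.0, 0.0, 1.0],      # population variance 0.25
}

selected = variance_threshold_select(features, threshold=0.8)
print(selected)
```

Columns whose values barely change carry little information for separating classes, which is why the low-variance columns are dropped here.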
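\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8.2} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{8.3} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{8.4} \]
\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8.5} \]
\[ \text{G-Mean} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \tag{8.6} \]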
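Equations (8.2) to (8.6) translate directly into code. The sketch below is a plain illustration: the function names and the confusion-matrix counts are made up for the example and are not results from the experiments.

```python
from math import prod  # available from Python 3.8

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

def g_mean(values):
    """Geometric mean: the n-th root of the product of n values."""
    return prod(values) ** (1 / len(values))

# Made-up counts for one class treated as "positive".
accuracy, precision, recall, f_measure = binary_metrics(tp=40, tn=45, fp=10, fn=5)
print(accuracy, precision, recall, f_measure)
print(g_mean([0.25, 1.0]))
```

Because the geometric mean multiplies its inputs, a single low value pulls the result down sharply, which makes it sensitive to a poorly detected class.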
CatBoost Feature Importance In CatBoost, the standard feature importance
is calculated as the difference between the loss function metric obtained
using the original model with the feature and the metric obtained with
the feature removed from all the trees in the model. The higher the value,
the higher the feature’s importance and relevance in predicting the target
value [41]. But SHAP is a technique that can be used to measure the impact
0.025. Figure 8.6 shows the metric results for the top 3 performing models
for this experiment.
Table 8.9 Experiment performance results with feature selection and oversampling
Metric | XGBoost | CatBoost | RandomForest Classifier | KNN | GaussianNB | LogisticRegression | DecisionTreeClassifier
ACC | 0.966 | 0.950 | 0.919 | 0.814 | 0.587 | 0.742 | 0.946
GM | 0.932 | 0.924 | 0.850 | 0.733 | 0.625 | 0.721 | 0.911
DoS PREC | 1.000 | 1.000 | 1.000 | 0.998 | 1.000 | 0.993 | 1.000
DoS REC | 1.000 | 0.990 | 0.995 | 0.985 | 0.966 | 0.995 | 0.998
DoS F1 | 1.000 | 1.000 | 0.998 | 0.991 | 0.982 | 0.994 | 0.999
MITM ARP Spoofing PREC | 0.781 | 0.710 | 0.575 | 0.387 | 0.162 | 0.199 | 0.645
MITM ARP Spoofing REC | 0.794 | 0.780 | 0.580 | 0.525 | 0.920 | 0.395 | 0.739
MITM ARP Spoofing F1 | 0.788 | 0.750 | 0.577 | 0.446 | 0.275 | 0.264 | 0.689
Mirai PREC | 0.978 | 0.970 | 0.948 | 0.912 | 0.980 | 0.961 | 0.972
Mirai REC | 0.980 | 0.960 | 0.940 | 0.843 | 0.547 | 0.709 | 0.955
Mirai F1 | 0.979 | 0.970 | 0.944 | 0.876 | 0.702 | 0.816 | 0.964
Scan PREC | 0.945 | 0.920 | 0.846 | 0.590 | 0.207 | 0.441 | 0.915
Scan REC | 0.939 | 0.940 | 0.894 | 0.724 | 0.241 | 0.847 | 0.949
Scan F1 | 0.942 | 0.930 | 0.869 | 0.650 | 0.223 | 0.580 | 0.932
Normal PREC | 0.989 | 0.980 | 0.934 | 0.612 | 0.565 | 0.657 | 0.942
Normal REC | 0.960 | 0.960 | 0.913 | 0.671 | 0.812 | 0.823 | 0.939
Normal F1 | 0.974 | 0.970 | 0.923 | 0.640 | 0.667 | 0.731 | 0.940
than the Src_Port for each of the classes but was quite similar in predicting
the DoS category.
Experiment (2) Feature selection was added to the procedure, and the
feature variables that influenced the prediction power differed.
For XGBoost, Fig. 8.10 shows that the most influential feature variable is
Flow_Duration, as it has the highest impact on the prediction of the
category type classes, while the second most influential is Src_Port,
although it did not have much impact on the prediction of the DoS class.
CatBoost differs slightly from XGBoost: Fig. 8.11 shows the Src_Port
feature variable as having the highest impact on the prediction of the
category types, while Flow_Duration came second, although it had a higher
impact on predicting the DoS class than Src_Port. Src_Port was
nevertheless more impactful on the other classes.
category type classes. Figure 8.12 reveals that, for XGBoost, the
Flow_Duration feature had the highest impact on the prediction of all
category classes, while Src_Port had the second highest impact, although
it had little impact on predicting the DoS category.
Figure 8.13, which represents the feature importance for CatBoost,
indicates that Src_Port had the highest impact on the classification of
the category type classes overall, whereas Flow_Duration had the next
highest impact, with a higher impact on detecting DoS, while Src_Port
predicted all other categories better.
In general, feature importance interpretations in all three experiments
show that the two most influential feature variables for the prediction
power in both XGBoost and CatBoost are Src_Port and Flow_Duration.
The reason is that the source port identifies the process that sent the
data to the network, so it can indicate whether the packets came from a
malicious source, since it shows the origin of a given flow in the
network, while the flow duration feature shows the total duration of a
flow in seconds, indicating whether the flow pattern is suspicious or not.
8.6.5 Discussion
In summary, the classification of the category label was tested on the
IoTID20 dataset with the XGBoost and CatBoost classifiers and compared
with other basic classifiers. The effect of adding oversampling and
feature selection using a variance threshold was examined in two
experiments. The XGBoost and CatBoost classifiers showed only a small
improvement in the presence of oversampling and feature selection, and
their results stayed approximately the same in all experiments, indicating
that these two algorithms can achieve highly accurate results without
heavy data preprocessing. Also, XGBoost has automatic feature selection
and internal features that address imbalanced distributions. The
oversampling and feature selection are nevertheless needed for the simple
classifiers, as oversampling especially helps improve recall.
Furthermore, although all features are important for detecting and
preventing IoT intrusions, it is essential to know which features have
the greatest influence on predicting the category types in order to get
accurate results quickly, and the experiments showed that the two features
Src_Port and Flow_Duration play a fundamental role in prediction. Overall,
XGBoost performed best under all conditions in the experiments, confirming
how powerful and reliable the algorithm is in predicting the category
labels of the intrusion attacks.
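The idea behind the oversampling discussed above can be illustrated with a simplified interpolation scheme. The sketch below is not SVMSMOTE, which the experiments used; it is a plain SMOTE-style illustration that places each synthetic minority sample on the line segment between two randomly chosen minority samples. The function name, the toy points and the random pairing (instead of a nearest-neighbour search) are assumptions made for the example.

```python
import random

def interpolate_oversample(minority, n_new, seed=0):
    """Create n_new synthetic samples by interpolating between minority pairs."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority samples
        gap = rng.random()              # position along the segment a -> b
        synthetic.append([x + gap * (y - x) for x, y in zip(a, b)])
    return synthetic

# Toy two-feature minority class; the values are made up.
minority = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0]]
synthetic = interpolate_oversample(minority, n_new=5)
print(synthetic)
```

Every synthetic point stays inside the range spanned by the real minority samples, so the minority class grows without copying rows verbatim.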
The experiments were limited to the classification of the category labels
and could be extended to the subcategory labels of the IoTID20 dataset.
Also, only SVMSMOTE oversampling was considered; it was not compared with
other oversampling methods or with different ratios. The experiments also
did not consider automatic clustering and data reduction, although these
could provide more insight into the behavior in different regions of the
data distribution and undersample the data. Additionally, other feature
selection techniques and other techniques for identifying the most
important feature variables were not taken into account. Moreover, the
IoTID20 dataset has a specific distribution of activities, so the approach
should be validated on datasets with a different distribution of
activities. Another limitation is the lack of prior experience and of
repetition: the experiments should be run about 30 times to obtain the
mean and standard deviation for reliable results.
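The repetition described above could be organised as in the following sketch. The function run_experiment is a hypothetical placeholder that returns a simulated score; in a real study it would train and evaluate a classifier with the given seed. The numbers it produces are synthetic and are not results from this chapter.

```python
import random
from statistics import mean, stdev

def run_experiment(seed):
    """Hypothetical stand-in for one train/test run; returns a simulated score."""
    rng = random.Random(seed)
    return 0.5 + rng.uniform(-0.05, 0.05)  # synthetic value, not a real result

N_RUNS = 30  # number of repetitions suggested in the text
scores = [run_experiment(seed) for seed in range(N_RUNS)]
score_mean = mean(scores)
score_std = stdev(scores)  # sample standard deviation over the runs
print(f"mean={score_mean:.3f} std={score_std:.3f}")
```

Changing the seed on every run and reporting both statistics shows how much of a difference between two classifiers could be due to chance alone.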
REFERENCES
1. Ezechina M, Okwara K, Ugboaja C (2015) The internet of things
(IoT): a scalable approach to connecting everything. Int J Eng Sci
4(1):09–12
2. Khan R, Khan SU, Zaheer R, Khan S (2012) Future internet:
the internet of things architecture, possible applications and key
challenges. In: 2012 10th International Conference on Frontiers
of Information Technology, pp 257–260
3. Ashton K et al (2009) That ‘internet of things’ thing. RFID J
22(7):97–114
4. Evans D (2011) How the next evolution of the internet is changing
everything, p 11
5. Chaudhary S, Johari R, Bhatia R, Gupta K, Bhatnagar A (2019)
Craiot: concept, review and application(s) of IoT. In: 2019 4th
International Conference on Internet of Things: Smart Innovation
and Usages (IoT-SIU), pp 1–4
6. Reddy AN, Marks AM, Prabaharan SRS, Muthulakshmi S (2017)
IoT augmented health monitoring system. In: 2017 International
Conference on Nextgen Electronic Technologies: Silicon to
Software (ICNETS2), pp 251–254
7. Razalli H, Alkawaz MH, Suhemi AS (2019) Smart IoT surveillance
multi-camera monitoring system. In: 2019 IEEE 7th Conference
on Systems, Process and Control (ICSPC), pp 167–171
8. Krasniqi X, Hajrizi E (2016) Use of IoT technology to drive the
automotive industry from connected to full autonomous vehicles.
IFAC-PapersOnLine 49(29):269–274
9. Kim T-H, Ramos C, Mohammed S (2017) Smart city and IoT
10. Kumar CS (2017) Correlating internet of things. Int J Manag
(IJM) 8(2):68–76
11. Williams R, McMahon E, Samtani S, Patton MW, Chen H (2017)
Identifying vulnerabilities of consumer internet of things (IoT)
devices: a scalable approach. In: 2017 IEEE International
Conference on Intelligence and Security Informatics (ISI), pp 179–181
12. Meneghello F, Calore M, Zucchetto D, Polese M, Zanella A (2019)
IoT: Internet of threats? A survey of practical security vulnerabilities
in real IoT devices. IEEE Internet Things J 6(5):8182–8201
S. K. Medaram
CTI, De Montfort University, Leicester, UK
L. Maglaras
School of Computing, Edinburgh Napier University, Edinburgh, Scotland
9.1 INTRODUCTION
Cloud computing is one of the decade’s most discussed topics in
information technology (IT). Most of the IT sector has either integrated
or plans to adopt products and services built around the cloud computing
paradigm. Cloud computing is defined as a model for providing on-demand,
convenient and ubiquitous network access to a shared pool of configurable
computing resources (such as storage, networks, servers, services and
applications) that may be provisioned rapidly and released with little
management effort or interaction with the service provider. The “cloud”
itself is a shared resource that is widely influential because it is not
merely shared among a large volume of users but also offers dynamic
access that depends on demand (Ou, 2015). The cloud
is not simply an array of software, hardware or services; it is an
integration of a vast range of information technology provisions. In the
cloud environment, users do not need to own the infrastructure that
enables the various computing services. Moreover, the services become
accessible to a computer from any location in the world. The features of
the environment include support for multi-tenancy, high scalability and
greater flexibility than older computing methodologies. The cloud can be
used to allocate, reallocate or deploy resources dynamically while
continuously monitoring their performance.
Cloud computing has four deployment models (public, private, hybrid
and community) and three service models (infrastructure as a service,
IaaS; software as a service, SaaS; and platform as a service, PaaS), which
describe the relationship between cloud service producers and cloud
service consumers. Thus, a user can access one or multiple cloud
deployment models. However, the increased adoption of cloud services and
products has been met with a growth in malicious activities, code and
programs targeted at the infrastructure. Even though the potential of
cloud computing is yet to be fully tapped, public opinion already
identifies security as its most critical flaw at the moment. Many of
these activities and attacks are generically described as security
threats that dissuade users from exploring the benefits of the cloud. The
number and severity of cyber-related attacks are increasing drastically.
Commonly reported security threats in cloud computing (CC) infrastructure
include data loss and breaches, malicious insiders, account or service
hijacking, identity theft, phishing attacks, man-in-the-middle attacks,
denial of service (DoS) attacks, distributed denial of service (DDoS)
attacks, cookie poisoning attacks, wrapping attacks, etc. [1]. In
general, variants of malware are behind these attacks. Malware is any
type of software that has harmful and malicious effects on the operating
system (OS), software or other components. It is designed with the
intention of causing harm or damage to its target system. Trojan horses,
worms, backdoors, viruses, spyware, rootkits, ransomware and botnets are
typical examples of malware [2]. Each variant and family of malicious
code is designed for a particular purpose. While some variants of malware
steal sensitive data, many others initiate DDoS attacks or allow remote
code execution [3]. Highly sophisticated attacks employ more than one
type and family of malware. The number of malware samples has increased
rapidly over the years.
MALWARE MITIGATION IN CLOUD COMPUTING ARCHITECTURE 237
REMICS, etc. [13], many companies since then have been migrating to the
CC model, while others are evaluating their transition. In collaboration
with the Economist Intelligence Unit in 2011, IBM carried out a survey of
572 business and technology executives from all over the world to
identify how establishments use CC today and what their plans for the
future are [13]. Nearly 75% of establishments had adopted, piloted or
substantially implemented CC in their operations, while the remainder
planned to adopt it within 3 years. The survey also showed that cloud
adoption is not exclusive to big companies: 67% of organisations with
revenues below US$1 billion and 76% of companies with revenues between
US$1 billion and US$20 billion have adopted cloud computing at some
point. As for quality attributes, over 31% of executives replied that
cost flexibility was a strong justification for adopting cloud computing.
Cost flexibility was followed by security, scalability, masked complexity
and adaptability [13]. According to Pathak et al. [15], the primary goal
of CC is to enable inexpensive and scalable on-demand computing
infrastructure that provides high service levels.
taking the burden of software maintenance off the customer. Software as
a service uses a convergence coherence mechanism and a divided cloud, in
which every data item has either a “Write Lock” or a “Read Lock” [17].
SaaS adopts two kinds of servers: the domain consistence server (DCS)
and the main consistence server (MCS). Cache coherence is achieved
through agreement between the DCS and the MCS [18]. In this
infrastructure, if the main consistence server is compromised or damaged,
control over the cloud environment is lost. Therefore, the security of
the MCS is a vital requirement.
(b) Platform as a Service (PaaS) This enables the user of the service
to deploy apps on the cloud infrastructure, where the apps are built
using the libraries, tools, languages and services of the service
provider. The provision also comprises an environment for software
execution. For instance, a Platform as a Service app server allows
individual developers to deploy web-based applications without having
to buy and set up physical servers. This model targets the protection
of data, which is paramount, especially in storage as a service.
Congestion can cause an outage of the cloud environment. Therefore,
security measures that prevent outages are vital to ensuring a
load-balanced service. For security reasons, data must be encrypted
whenever it is hosted on a platform. CC architectures that employ
multiple cryptographic techniques have been proposed to provide
cryptographic cloud storage.
(c) Infrastructure as a Service (IaaS) This is concerned with sharing
hardware resources for executing services, typically through
virtualisation technology. With IaaS, several consumers can use the
available resources. These resources can be scaled up easily depending
on the user’s demand, and payment is ideally on a pay-per-use basis.
Because the resources are virtual machines, they all require
management. Therefore, a governance framework is required to regulate
the creation and usage of virtual machines. This also helps to prevent
unsanctioned access to users’ sensitive information [19]. IaaS gives
the consumer access to network, storage, processing and similar
services, on which the consumer can run the applications and operating
systems that the infrastructure supports [20].
which refers to guaranteeing that clients’ data is not changed without
their consent or approval. To guarantee data integrity, from the
perspectives of both the customer and the provider, secure encryption
algorithms are most often adopted. Nevertheless, encryption alone does
not fully protect data against malicious alteration [24]. Because of
the distributed and dynamically shared nature of the cloud, privacy is
another fundamental requirement for cloud clients. This refers to the
data security that ensures sensitive and private information is kept
so. Availability means that the cloud framework is expected to be
accessible to authorised, validated clients anywhere, at any time and
across any platform. Cloud service availability faces several
cybersecurity threats, which are mainly network-based attacks, e.g.
DDoS attacks [25]. Cloud providers should therefore maintain an
appropriate action plan to handle these threats and dangers.
(b) Cloud Network Infrastructure Security A cloud service provider
must accept trusted network traffic and have provisions for blocking
malicious traffic [23]. The security infrastructure of the cloud
network should be able to identify and prevent intrusions, protect
against DoS attacks, and enable notification and logging. Denial of
service defences are anchored on network security, which must
efficiently filter queries and recognise attackers in order to prevent
harmful attacks [23]. Intrusion prevention systems (IPS) and intrusion
detection systems (IDS) respectively block or detect malware attacks,
spam signatures and virus signatures, but some also report positive
results. Moreover, logging and notification give cloud users some
insight into the cybersecurity health of the network.
(c) Cloud Applications Security Companies are expected to protect
their cloud-based apps against a vast array of cybersecurity attacks
and threats. The security of cloud apps resembles the security of web
applications hosted in server centres. Several businesses offer single
sign-on (SSO), which allows clients to access different individual
cloud services [26]. In general, it is hard to update SSO arrangements
accurately because they are anchored on a secure software layer, which
is a requirement for different authentication strategies. The International
Standards Organization gave a definition of information security as
concerns or bothers, which may also be guided as it relates to the CC
principal security requirements for a secure and effective technology
248 S. K. MEDARAM AND L. MAGLARAS
traffic hijacking. Although several vulnerabilities and risks exist, the
threats listed are the common ones in the CC environment; the environment
also has risk factors. Tang [16] reported these risks: inherent platform,
virtualisation, storage data sharing, human resources management,
security, operational management, misuse, network security,
interoperability, multi-directional audit and multiparty audit. Security
risks and threats are major sources of concern in cloud computing for
many organisations, largely because the physical infrastructure is
dispersed and the data resides in geographically spread locations. Data
protection laws generally depend on the country, so data location is an
issue: when data resides in a country without adequate laws to protect
sensitive data, user data becomes vulnerable [30].
Worm This is a program with features similar to those of a virus, but it
affects the network rather than the host machine. It is designed to
infect another machine after reproducing itself [39]. Worms spread across
a computer network and depend on security failures to penetrate their
target machine. The majority of worms are designed to steal data, delete
data and ultimately spread to other systems.
Virus A virus is malware built by cybercriminals to infect files on the
target machine [39]. This sort of program self-replicates on the host
machine and then attaches itself to documents, which eventually become
its carriers. A virus is designed to spread from one machine to another.
Adware Unlike Trojans and many other types of malware, adware is not
directly harmful code, but it slows down the host machine by constantly
displaying ads that lead users to harmful pages or sources. Adware aims
to display advertisements and redirect search requests to the websites
being advertised. Adware can exploit functions like cookies to collect
information about the user, e.g. the websites visited. Customised ads
are then displayed based on this information.
Bots and botnets These are harmful programs designed to invade a computer
and carry out instructions as soon as they are received from a remote
controlling server. Just like viruses and Trojans, bots can replicate
themselves. A collection of bots, described as a botnet, may be employed
to launch distributed denial of service (DDoS) attacks that render
communication across a network temporarily unavailable.
Key logger This malware tool records all the activities carried out on a
machine, similar to a monitoring tool. It regularly executes without the
user’s permission. A key logger is predominantly used to obtain
confidential data, security phrases, passwords and usernames.
code, or they input junk data, which changes the file’s hash and
renders it undetectable. These agents use password-protected approaches
or encryption tools to escape detection. Conventional approaches for
detecting attacks on cloud infrastructures or the virtual machines they
host are insufficient for addressing cloud-related issues, in spite of
the great effort that previous studies have put into the behaviour of
some kinds of malicious programs on the Internet [42].
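The hash-based evasion mentioned above is easy to demonstrate. The sketch below assumes a detector that only compares SHA-256 digests against a set of known-bad digests; the byte strings and the names KNOWN_BAD and is_known_malware are made up for illustration and stand in for real samples and a real signature database.

```python
import hashlib

# Hypothetical signature database holding one known-bad digest.
KNOWN_BAD = {hashlib.sha256(b"malicious payload").hexdigest()}

def is_known_malware(content: bytes) -> bool:
    """Exact-match signature check on the SHA-256 digest of the content."""
    return hashlib.sha256(content).hexdigest() in KNOWN_BAD

original = b"malicious payload"
padded = original + b"\x00junk"  # same payload with junk bytes appended

print(is_known_malware(original))  # the unmodified sample matches
print(is_known_malware(padded))    # the padded sample no longer matches
```

Any change to the bytes, however small, yields a different digest, which is why exact-hash signatures alone are insufficient.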
Techniques for malware detection basically utilise two inputs for detec-
tion: (i) malware signature, rules or behaviour from the database and (ii)
the target program to be evaluated for malicious intent. For higher security
in the cloud, MDPS also employ real-time malware analysis. This real-time
technique for prevention is very vital in dealing with the daily growing array
of malware since it shields the user from unknown attacks and malware that
may compromise the host machine and affect the user. The following is a
detailed description of various approaches employed in malware detection
in cloud infrastructures today.
infrastructure. The hosts that the cloud system supports are supported
by the IaaS clients [52]. Web 2.0, a vital technology that enables the
use of SaaS, takes tasks such as software installation and maintenance
away from users. With the increasing use of Web 2.0, the need to secure
this environment is more urgent than ever [53].
SQL Injection Attacks This involves the insertion of malicious code into
a standard SQL statement. In this way, attackers gain unsanctioned access
to the database and can then reach sensitive information [54].
Techniques such as avoiding dynamically generated SQL in the code and
filtering to sanitise user input help to mitigate SQL injection attacks.
A proxy-based architecture has also been proposed that dynamically
detects and extracts SQL control sequences from user input [52].
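The difference between dynamically generated SQL and a parameterised statement can be shown with Python's built-in sqlite3 module. This is a minimal sketch: the table, the rows and the function names are made up for illustration, and a real cloud application would use its own database driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a-secret"), ("bob", "b-secret")])

def find_user_unsafe(name):
    """Builds the SQL text from user input: vulnerable to injection."""
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    """Passes user input as a bound parameter: treated as data, not as SQL."""
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "x' OR '1'='1"  # classic injection string
print(find_user_unsafe(payload))  # the OR clause matches every row
print(find_user_safe(payload))    # no user has this literal name
```

The bound parameter never becomes part of the SQL text, so the quote characters in the payload cannot change the structure of the query.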
Also, testing the data communication between authorised parties and
configuring SSL properly have been found helpful in mitigating the risk
of MITM attacks [55].
Sniffer Attacks These attacks use malicious software that captures a
network’s packets; if the data transferred in the packets is not
encrypted, vital data can be traced, read or captured. A platform based
on the address resolution protocol (ARP) and round trip time (RTT) can
be deployed on the network to detect running sniffing systems [52].
BGP Prefix Hijacking This network-level attack involves making a false
announcement about the IP addresses that belong to an autonomous system
(AS). Malicious attackers are thereby able to gain access to untraceable
IP addresses. An AS can broadcast the IP addresses under its control to
all of its neighbours [53].
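One simple way to reason about false announcements is to compare each announced prefix with the AS expected to originate it. The sketch below is an assumption-laden illustration, not a description of how BGP routers or any monitoring product work: the expected-origin table, the function name and the values are hypothetical, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical table: which AS is expected to originate which prefix.
EXPECTED_ORIGINS = {ipaddress.ip_network("203.0.113.0/24"): 64500}

def is_suspicious(prefix: str, origin_as: int) -> bool:
    """Flag an announcement covering a known prefix from an unexpected AS."""
    announced = ipaddress.ip_network(prefix)
    for known, expected_as in EXPECTED_ORIGINS.items():
        # subnet_of is true for the same prefix or a more specific one.
        if announced.subnet_of(known) and origin_as != expected_as:
            return True
    return False

print(is_suspicious("203.0.113.0/24", 64500))  # expected origin
print(is_suspicious("203.0.113.0/25", 64511))  # more specific, wrong origin
```

Checking more specific prefixes matters because routers prefer the longest matching prefix, so a hijacker who announces a sub-prefix can attract traffic even while the legitimate announcement is still present.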
9.6.1 Recommendations
From our findings, various recommendations that may be considered for
further studies or help to guide the decision and activities of stakeholders
are presented below:
• There is a need to design a comprehensive framework for mitigating
multiple malware and other security attacks in the cloud. This frame-
work will be able to interface with any kind of cloud environment
and have the capacity to detect and address predefined and tailored
security threats. This will be a cost-effective approach and will help
to secure the system against attackers who launch multiple attacks
simultaneously, which would otherwise have overwhelmed the cloud
and hurt the services provided.
• Since there is as yet no single approach that addresses all security
challenges, this study recommends considering approaches that combine
multiple detectors and/or mitigation techniques.
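The recommendation to combine multiple detectors can be sketched as a simple voting scheme. Everything in the example is hypothetical: the three toy detectors, their rules and the sample byte strings are assumptions chosen to keep the illustration short, and a real deployment would combine far more capable components.

```python
import hashlib

KNOWN_BAD_HASHES = {hashlib.sha256(b"known bad sample").hexdigest()}

def signature_detector(sample: bytes) -> bool:
    """Exact-match check against known-bad SHA-256 digests."""
    return hashlib.sha256(sample).hexdigest() in KNOWN_BAD_HASHES

def keyword_detector(sample: bytes) -> bool:
    """Toy content rule: flags a suspicious SQL fragment."""
    return b"DROP TABLE" in sample

def size_detector(sample: bytes) -> bool:
    """Toy heuristic: flags unusually large inputs."""
    return len(sample) > 1000

DETECTORS = [signature_detector, keyword_detector, size_detector]

def combined_verdict(sample, detectors, min_votes=1):
    """Flag the sample when at least min_votes detectors raise an alert."""
    votes = sum(1 for detect in detectors if detect(sample))
    return votes >= min_votes

print(combined_verdict(b"known bad sample", DETECTORS))
print(combined_verdict(b"x; DROP TABLE users", DETECTORS))
```

Raising min_votes trades detection coverage for fewer false alarms, which is the main design decision when detectors are combined.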
REFERENCES
1. Amara N, Zhiqui H, Ali A (2017) Cloud computing security threats
and attacks with their mitigation techniques. In: 2017 International
Conference on Cyber-Enabled Distributed Computing and
Knowledge Discovery (CyberC). IEEE, pp 244–251
2. Aslan V, Ozkan-Okay M, Gupta D (2021) Intelligent
behavior-based malware detection system on cloud computing environment.
IEEE Access 9:83252–83271
3. Aslan Ö, Samet R (2017) Investigation of possibilities to detect
malware using existing tools. In: 2017 IEEE/ACS 14th
International Conference on Computer Systems and Applications
(AICCSA). IEEE, pp 1277–1284
4. Aslan ÖA, Samet R (2020) A comprehensive review on malware
detection approaches. IEEE Access 8:6249–6271
5. Ferrag MA, Friha O, Maglaras L, Janicke H, Shu L (2021)
Federated deep learning for cyber security in the internet of things:
concepts, applications, and experimental analysis. IEEE Access
9:138509–138542
6. Ferrag MA, Maglaras L, Moschoyiannis S, Janicke H (2020)
Deep learning for cyber security intrusion detection: approaches,
datasets, and comparative study. J Inf Secur Appl 50:102419
7. Mazumdar A, Alharahsheh H (2019) Insights of trends and devel-
opments in cloud computing. South Asian Res J Eng Tech 1(3):98–
107
8. NIST Cloud Computing Security Working Group et al (2013)
NIST cloud computing security reference architecture, National
Institute of Standards and Technology, Technical Report
© The Editor(s) (if applicable) and The Author(s), under exclusive
license to Springer Nature Switzerland AG 2024
I. Almomani et al. (eds.), Cyber Malware, Security Informatics and
Law Enforcement, https://doi.org/10.1007/978-3-031-34969-0