Development of Malware Detection and Analysis Mode

The document discusses the development of malware detection and analysis models using machine learning. It outlines how malware poses a significant threat and traditional detection methods are insufficient. The research aims to apply convolutional neural networks and recurrent neural networks to detect and analyze malware using datasets. It seeks to develop advanced approaches to identify malware and enhance accuracy. Keywords include malware detection, neural networks, machine learning, and cybersecurity.

Uploaded by

jamessabraham2
Copyright
© All Rights Reserved

DEVELOPMENT OF MALWARE DETECTION AND ANALYSIS MODE

ABSTRACT

Malware threats represent an ongoing and escalating challenge to the security and integrity
of digital systems worldwide. Malicious software programs target various platforms,
including computers, mobile devices, and networks. Their
objectives range from data breaches and identity theft to financial fraud and disruption of
critical services. The impact of malware is far-reaching, affecting individuals, businesses,
and even national security. The consequences of successful malware attacks can be severe.
They can lead to the compromise of sensitive information, such as personal and financial
data, intellectual property, and trade secrets. Malware can also disrupt operations, causing
system failures, network outages, and loss of productivity. Furthermore, malware attacks
often result in financial losses, both in terms of direct damages and the costs associated with
incident response, recovery, and reputation management. Given the pervasive and evolving
nature of malware, it is crucial to develop effective methods for its detection and analysis.
Traditional signature-based approaches are often insufficient to keep pace with the rapidly
evolving malware landscape. As malware authors continuously modify their code and employ
sophisticated evasion techniques, the need for advanced detection models becomes
paramount. In this research, Convolutional Neural Network (CNN) and Recurrent Neural
Network (RNN) algorithms are applied to detect and analyze malware. The model is
evaluated using medium-sized datasets comprising clean and malware files, which are
integral to the framework. Furthermore, a scaling-up process is employed to extend
detection to large-scale datasets containing both clean and malware files. In addition to CNN
and RNN, the study utilizes the decision tree and random forest methods, which further
enhance the accuracy and reliability of malware detection. The combination of these methods
provides a comprehensive approach to combat malware threats effectively.

Keywords: Malware Detection, Malware Analysis, Convolutional Neural Network, Recurrent
Neural Network, Scaling-up Process, Medium-sized Datasets, Decision Tree, Random Forest,
Security.

CHAPTER ONE

GENERAL INTRODUCTION

1.1 Background of the Study

Malware threats continue to pose significant challenges to the security of computer systems
and networks. Malicious software programs, including viruses, worms, Trojan horses,
ransomware, spyware, and adware, exploit vulnerabilities to compromise the integrity,
confidentiality, and availability of digital assets, and they employ various techniques to
evade detection and infiltrate systems. Traditional detection techniques, including
signature-based detection, heuristic analysis, and behavior-based analysis, have limitations
in detecting emerging and polymorphic malware. Malware detection is therefore a critical
security concern with significant legal, reputational, and economic implications for
organizations. Because signature-based methods in particular have become less effective
against the ever-evolving malware landscape, machine learning techniques have emerged as a
promising approach to the challenges of malware detection.

The detection of malware using machine learning techniques has gained significant

popularity due to its ability to achieve high levels of detection accuracy. Machine learning

algorithms have been utilized in previous studies to make decisions based on learned data

patterns, minimizing the need for human intervention in computing systems (Bassel et al.,

2022). Supervised and unsupervised learning methods are commonly employed to analyze

features and train models in malware detection. In supervised learning, the machine learning

model is provided with input and target labels, enabling it to differentiate between malware

and normal activities. The training process continues until the model accurately predicts all

samples (Mat, 2022). Various machine learning algorithms, such as support vector machines

(SVM), K-nearest neighbors (KNN), Bayesian estimation, and genetic algorithms, have been

used to develop effective malware detection systems. Additionally, some studies have
combined supervised and unsupervised learning methodologies (Arora et al., 2018). Various

ML models, including KNN, SVM, Random Forests (RF), AdaBoost, Logistic Regression

(LR), Naïve Bayes (NB), and Deep Neural Network (DNN), have been applied to malware

datasets, resulting in high accuracy rates (Vinayakumar et al., 2019). Additionally, studies

have focused on specific datasets, such as those related to desktop or mobile malware. For

instance, Jeon and Moon (2020) introduced a DL-based malware detection system that

utilized a convolutional encoder to translate opcode sequences extracted from Windows

executable files. The subsequent RNN-based malware detection achieved a 96% detection

accuracy and a 95% true positive rate (Jeon & Moon, 2020). Yazdinejad et al. (2020) applied the

LSTM model to opcodes extracted from a dataset of 200 benign and 500 malware records,

achieving a detection accuracy of 98%. Other studies have focused on Android malware

detection using CNN models, achieving accuracy rates ranging from 94% to 98% (Ban, 2022;

Hwang, 2018).

This research therefore aims to explore and evaluate the application of machine learning

methods for malware detection. By leveraging the power of machine learning algorithms,

which can learn from large datasets and detect intricate patterns, we seek to develop more

advanced and proactive approaches to identify and classify malware.

1.2 Statement of the Problem

Malware poses a significant and ever-growing threat to computer systems and networks, with

potentially devastating consequences for individuals, organizations, and society as a whole.

Malicious software, such as viruses, worms, trojans, and ransomware, can infiltrate systems,

compromise data integrity, disrupt operations, and even facilitate unauthorized access to

sensitive information. Existing security systems, including traditional signature-based

antivirus software and rule-based detection methods, often fall short in effectively detecting

and mitigating the rapidly evolving and sophisticated nature of malware.

The failures of existing systems are evident in their inability to keep pace with the continuous

emergence of new and polymorphic malware variants. Signature-based approaches rely on


known patterns and signatures, making them ineffective against previously unseen or

modified malware samples. Rule-based detection methods, on the other hand, struggle to

cope with the complexity and diversity of malware behaviors, often resulting in high false

positive rates and missed detections.

To address this problem, our research aims to explore and evaluate the application of

machine learning methods for malware detection. By leveraging the power of machine

learning algorithms, which can learn from large datasets and detect intricate patterns, we seek

to develop more advanced and proactive approaches to identify and classify malware. By

addressing the limitations of existing systems and harnessing the capabilities of machine

learning, our research endeavors to enhance the detection accuracy, reduce false positives,

and improve the overall resilience of systems against malware threats. The proposed research

aims to contribute to the development of more robust and proactive cybersecurity solutions,

mitigating the risks posed by malware and fostering a safer digital environment for

individuals and organizations.

1.3 Justification of the Study

The justification for conducting this research lies in the urgent need to address the escalating

threat of malware and the critical role that machine learning methods can play in enhancing

cybersecurity. Malware has become increasingly sophisticated, with attackers constantly

evolving their techniques to bypass traditional security measures. As a result, there is a

pressing need for advanced detection mechanisms that can effectively identify and classify

malware in real-time. Machine learning techniques have shown promise in this regard,

leveraging their ability to learn from vast amounts of data and detect complex patterns that

may be indicative of malicious behavior.

By comprehensively analyzing and evaluating different machine learning methods for

malware detection, this research can provide valuable insights into their effectiveness,

limitations, and applicability. Understanding the strengths and weaknesses of these

techniques is crucial for developing robust and proactive defense mechanisms. Moreover,
investigating the impact of malware on hardware systems is essential for understanding the

potential vulnerabilities and risks associated with malware attacks, enabling the development

of targeted mitigation strategies.

1.4 Aim and Objectives

This research aims to comprehensively investigate and analyze machine learning methods for

malware detection, with a focus on enhancing our understanding of their effectiveness,

limitations, and applicability in addressing the evolving malware threats.

The objectives of this research are as follows:

1. Analyze different machine learning methods for malware detection in detail, with a

focus on their effectiveness and limitations.

2. Investigate the impact of malware on hardware systems, including vulnerabilities

exploited and potential damage caused.

3. Explore and analyze the application of Convolutional Neural Networks (CNNs) and

Recurrent Neural Networks (RNNs) in malware detection, examining their

effectiveness in identifying and classifying malware samples.

4. Evaluate the role of one-sided perceptron, a type of artificial neural network, in

detecting malware and investigate its strengths and weaknesses in this context.

5. Examine the use of decision trees, a popular machine learning algorithm, for malware

detection, assessing their ability to capture complex patterns and make accurate

predictions.

6. Assess the effectiveness of random forest, an ensemble learning method that

combines multiple decision trees, in detecting malware, exploring its advantages and

limitations in this domain.


1.5 Research Methodology

This research will follow a systematic approach consisting of the following steps:

1. Review and analyze existing literature on malware threats, detection techniques, and

machine learning in the context of cybersecurity.

2. Collect and preprocess malware datasets, including known and unknown samples, for

training and evaluation.

3. Design and implement a machine learning model, such as a deep learning neural

network, for malware detection and analysis.

4. Train the model using the collected datasets and evaluate its performance using

appropriate metrics, such as accuracy, precision, recall, and F1 score.

5. Compare the performance of the developed model with existing malware detection

techniques using benchmark datasets.

6. Conduct experiments, analyze the results, and validate the effectiveness of the

proposed approach.
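
The evaluation metrics named in step 4 can be computed directly from predicted and true labels. The following is a minimal sketch, not the evaluation code used in this study; the function name `classification_metrics` and the sample labels are hypothetical, and label 1 is assumed to mean malware.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary detector."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts, treating `positive` as the malware class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = malware, 0 = clean.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Precision falls when clean files are flagged (false positives), while recall falls when malware is missed, so reporting both alongside accuracy gives a fuller picture than accuracy alone.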

1.6 Significance of the Study

The findings of this research will have several significant implications including:

• Improved Malware Detection: The developed machine learning-based model has the potential to enhance the accuracy of malware detection by effectively identifying both known and unknown malware variants.

• Reduced False Positives: The proposed model aims to minimize false positives, providing more reliable detection results and reducing unnecessary alerts.

• Proactive Defense: Machine learning-based malware detection models can enable proactive defense mechanisms by identifying and blocking malware before it can cause significant harm.

• Enhanced Cybersecurity Solutions: The outcomes of this research can contribute to the development of more robust and efficient cybersecurity solutions, strengthening defense mechanisms against rapidly evolving malware threats.

• Benefit to the Cybersecurity Community: The research findings, methodologies, and insights generated through this study will be valuable to the broader cybersecurity community, fostering further research, collaboration, and innovative approaches to combat malware threats.

1.7 Scope of the Study

This research will encompass a comprehensive examination of machine learning methods for

malware detection, with a focus on their application and effectiveness in addressing the

dynamic landscape of malware threats. The scope of the study includes a thorough analysis of

various machine learning techniques. The research will focus on the impact of malware on

hardware systems, investigating the vulnerabilities exploited and potential consequences for

system integrity and performance. Additionally, the study will evaluate the strengths,

limitations, and suitability of the examined machine learning methods for accurately

detecting and classifying malware. The research will acknowledge the inherent challenges in

this domain, such as evolving and polymorphic malware, scalability issues, and

interpretability concerns. Through this scope, the research aims to contribute to the

advancement of effective and resilient malware detection techniques, thereby enhancing

cybersecurity practices and mitigating the risks posed by malicious software.

1.8 Definition of Terms

To ensure clarity, the following terms will be defined as related to the study:

• Malware: Malicious software designed to infiltrate, damage, or gain unauthorized access to computer systems or networks.

• Machine Learning: A subfield of artificial intelligence that involves the development of algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed.

• Signature-based Detection: A method of malware detection that involves comparing the signature or unique characteristics of known malware samples with the files or processes being analyzed.

• Heuristic Analysis: A technique that identifies potentially malicious behavior based on predefined rules and patterns.

• Behavior-based Analysis: A method that monitors the behavior of files or processes to identify suspicious or malicious activities.

• False Positives: Instances where a detection system incorrectly identifies a benign file or process as malicious.

• Deep Learning Neural Network: A type of machine learning model that consists of multiple layers of interconnected nodes (neurons) that can learn complex patterns and representations from data.
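
To make the definition of signature-based detection concrete, the sketch below treats a signature as a SHA-256 digest and checks a file's bytes against a set of known digests. This is an illustrative assumption only: real antivirus signatures are usually richer than whole-file hashes, and the names `KNOWN_BAD` and `is_known_malware` and the sample byte strings are hypothetical.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious content.
KNOWN_BAD = {hashlib.sha256(b"malicious-payload-v1").hexdigest()}


def is_known_malware(data: bytes, signatures=KNOWN_BAD) -> bool:
    """Flag data only if its digest exactly matches a known signature."""
    return hashlib.sha256(data).hexdigest() in signatures


original = b"malicious-payload-v1"   # matches the stored signature
variant = b"malicious-payload-v2"    # one byte changed by the author
benign = b"hello world"

results = {
    "original": is_known_malware(original),
    "variant": is_known_malware(variant),
    "benign": is_known_malware(benign),
}
```

The modified variant is not flagged even though it differs by a single byte, which illustrates why exact-match signatures struggle with previously unseen or modified samples.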



CHAPTER TWO

LITERATURE REVIEW

2.1. Introduction

The proliferation of malware and malicious code on the internet has become a critical

security concern in recent years. With millions of websites launching attacks through exploit

downloads targeting vulnerable hosts, the risk of malware infections and their subsequent

consequences has increased significantly (Kolbitsch et al., 2009). Exploits targeting

vulnerable hosts are commonly used to download and execute malware programs, with the

resulting compromised machines often becoming part of a botnet. These botnets are then

utilized by malicious actors to carry out activities such as launching denial-of-service (DoS)

attacks, sending spam emails, and hosting scam pages. The prevalence of malware poses a

significant threat to individuals, organizations, and even entire networks. The consequences

of malware infections can range from the loss of sensitive data and financial losses to the

disruption of critical services and reputational damage. As a result, there is a pressing need

for effective techniques and strategies to detect, analyze, and mitigate the impact of malware

attacks.

In this literature review, we will explore various research studies, methodologies, and

approaches employed in the field of malware analysis. By examining the existing body of

knowledge, we aim to gain a comprehensive understanding of the current state of malware

analysis techniques, their strengths and limitations, and emerging trends. This review will

provide valuable insights into the advancements made in malware analysis and help identify

areas that require further research and development to combat the ever-evolving malware

threat.

2.2 Malware Analysis Techniques

Malware analysis techniques involve various methods and approaches to understand and

analyze malicious software (malware) in order to identify its behavior, characteristics, and

potential threats. These techniques are essential for detecting, classifying, and mitigating the

impact of malware attacks. In this section, we will discuss some commonly used malware

analysis techniques and their significance in cybersecurity. These techniques include:

Static Analysis: Static analysis involves examining the malware without executing it. It

focuses on analyzing the binary or source code of the malware to identify patterns, signatures,

and characteristics associated with malicious behavior (Gandotra et al., 2014). This technique

relies on examining file headers, disassembling code, and inspecting function calls to gain

insights into the malware's purpose and potential impact. Static analysis can be effective in

detecting known malware patterns but may struggle with polymorphic or obfuscated malware

(Sikorski & Honig, 2012).

Dynamic Analysis: Dynamic analysis involves executing malware in a controlled

environment, such as a virtual machine or sandbox, to observe its behavior and interactions

with the system and network (Egele et al., 2012). It monitors system calls, network traffic,

file access, and other runtime activities to identify malicious actions. Dynamic analysis can

provide valuable insights into the malware's behavior, including code injection, data

exfiltration, or attempts to exploit vulnerabilities (Gandotra et al., 2014).

Behavioral Analysis: Behavioral analysis focuses on observing the actions and interactions

of malware during runtime. It aims to identify malicious behavior based on the actions

performed by the malware, such as modifying system settings, creating new processes, or

accessing sensitive data (Shalaginov et al., 2018). By monitoring the behavior of malware,
analysts can identify indicators of compromise (IOCs) and potential security risks (Gandotra

et al., 2014).

Code Analysis: Code analysis involves examining the code of the malware to identify

vulnerabilities, exploits, or specific techniques employed by the malware to evade detection

or perform malicious activities (Sikorski & Honig, 2012). This technique helps in

understanding the inner workings of the malware and enables the development of

countermeasures or signatures for detection.

Machine Learning-Based Analysis: Machine learning techniques have gained prominence

in malware analysis due to their ability to identify patterns and classify malware based on

features extracted from the samples (Hardy et al., 2016). Machine learning models can be

trained on large datasets of known malware and benign files to learn the characteristics of

malicious software. These models can then be used to classify new samples as either

malicious or benign (Kolter & Maloof, 2004).

Hybrid Analysis: Hybrid analysis combines multiple analysis techniques, such as static and

dynamic analysis, to gain a more comprehensive understanding of the malware (Damodaran

et al., 2015). By leveraging the strengths of different analysis approaches, hybrid analysis can

provide a more accurate and detailed assessment of the malware's behavior and potential

threats.

These are just a few examples of malware analysis techniques used in the field of

cybersecurity. Each technique has its strengths and limitations, and the choice of technique

depends on the specific goals of the analysis, available resources, and the nature of the

malware being analyzed. By employing a combination of these techniques, analysts can

enhance their ability to detect, analyze, and mitigate the impact of malware attacks.
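
As an illustration of static analysis, the sketch below extracts a few features from a file's bytes without executing it. It is a minimal example under stated assumptions, not a method taken from the cited works: the feature set and the name `static_features` are hypothetical, the `MZ` check relies on the fact that Windows executables begin with those two bytes, and byte entropy is included as a commonly used hint that content may be packed or encrypted.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


def static_features(data: bytes) -> dict:
    """Features computed from raw bytes only; the sample is never run."""
    return {
        "size": len(data),
        "has_mz_header": data[:2] == b"MZ",  # DOS/PE executable magic bytes
        "entropy": shannon_entropy(data),
    }


# Hypothetical sample: an MZ header followed by every byte value once.
sample = b"MZ" + bytes(range(256))
features = static_features(sample)
```

Features of this kind could serve as inputs to the machine learning classifiers discussed in this chapter, while a dynamic analysis would instead record what the sample does when run in a sandbox.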

2.2.1 Challenge of Malware Detection

According to Kuntz et al. (2017), when a system is infected by malware, IT staff typically
re-image the computer while trying to understand the malware and how to prevent it. Although
cost-effective, this may not be the best solution. Some malicious software attaches directly
to the system BIOS and can remain on the device even after it is re-imaged. A software company
named "Hacking Team" developed a tool that attaches itself to a computer's UEFI and reinstalls
itself even if the hard drive is wiped clean. As a result, malware can be present without the
end user or IT staff being aware of it. Some types of viruses modify the firmware during
installation and therefore cannot be detected by any anti-virus software. Re-imaging may be
cost-effective in the short term, but it does not work well in the long term, because it can
leave organizations exposed to fines for not being HIPAA compliant. When malware is detected
and handled in an organization, proper documentation is necessary so that the incident does
not recur and, if it does, staff know what measures to take. The underlying flaw is that
software development takes more time than finding bugs (Kuntz et al., 2017).

2.3 Review of Machine Learning Classification Techniques

A number of investigations have applied ML approaches to diagnose and predict chronic kidney

disease (CKD). These approaches are widely used for prediction and classification in biomedical fields.

2.3.1. k-Nearest Neighbor

k-Nearest Neighbor (k-NN) is a machine learning prediction technique that is widely used for

classification and regression tasks. It is a non-parametric algorithm that makes predictions

based on the similarity between input data points. The mathematical representation of k-NN

can be described as follows:

Given a labeled training dataset D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, where each data point xᵢ

is associated with a corresponding class label yᵢ, and a new input data point x, the k-NN

algorithm aims to predict the class label or value for x.

Fig 1 depicts the pictorial representation of k-NN.


Fig 1 k-NN classifier
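
The prediction step can be sketched as follows: find the k training points closest to x and return the majority label among them. This is a minimal illustration assuming Euclidean distance and k = 3; the two-dimensional feature vectors and the name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors.

    `train` is a list of (feature_vector, label) pairs, i.e. the dataset D.
    """
    # Sort the training points by Euclidean distance to x and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical dataset D with two numeric features per sample.
D = [
    ((0.1, 0.2), "benign"), ((0.2, 0.1), "benign"), ((0.15, 0.3), "benign"),
    ((0.9, 0.8), "malware"), ((0.8, 0.9), "malware"), ((0.85, 0.7), "malware"),
]
prediction = knn_predict(D, (0.8, 0.8))
```

Because k-NN stores the training data rather than fitting parameters, all of the computation happens at prediction time, which is why it is described as non-parametric.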

2.3.2. Decision Tree

Decision tree is a machine learning prediction method that uses a hierarchical tree-like

structure to make decisions based on a set of rules learned from training data. It starts with a

root node that represents the entire dataset and recursively splits the data based on the most

informative attributes at each internal node. The splitting process continues until the

algorithm reaches leaf nodes, where predictions are made. In classification tasks, the majority

class label of instances falling into a leaf node is assigned as the predicted label. For

regression tasks, the predicted value is often the mean or median of the target values of the

instances in the leaf node. Decision trees provide an interpretable model that allows us to

understand the decision-making process and identify important features. They can handle

both categorical and numerical data and are robust to missing values. Decision trees can be

prone to overfitting, but techniques such as pruning can be used to mitigate this issue. Fig 2

depicts the pictorial representation of DT.


Fig 2 Basic structure of Decision Tree (DT)
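
Choosing the "most informative attribute" at a node can be sketched as a search over candidate splits. The example below assumes Gini impurity as the split criterion (information gain is another common choice) and numeric features; the feature meanings and the names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of labels: 0 means the set is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Hypothetical features per sample: (byte entropy, number of imported functions).
rows = [(7.9, 3), (7.5, 40), (2.1, 5), (3.0, 35)]
labels = ["malware", "malware", "benign", "benign"]
split = best_split(rows, labels)
```

A full tree applies this search recursively to each child until a stopping rule is met, and pruning then removes splits that do not generalize.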

2.3.3. Artificial Neural Network (ANN)

ANNs consist of interconnected nodes, called artificial neurons or units, organized in layers:

an input layer, one or more hidden layers, and an output layer. Each neuron receives input

signals, applies a weighted sum and an activation function, and passes the result to the next

layer. The weights represent the strength of the connections between neurons and are adjusted

during the training process to optimize the network's performance. ANNs learn from labeled

training data through a process called backpropagation, where the network's output is

compared to the desired output, and the error is used to update the weights iteratively. This

iterative training process aims to minimize the difference between the predicted output and

the true output. ANNs can handle both classification and regression tasks, and their ability to

model complex relationships makes them suitable for various domains. While ANNs can

capture intricate patterns and exhibit high predictive accuracy, they can be computationally

intensive and require significant amounts of training data. Fig 3 depicts the pictorial

representation of ANN.
Fig 3 Artificial Neural Networks (ANN)
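
The forward computation described above, in which each neuron applies a weighted sum followed by an activation function, can be sketched in a few lines. This is an illustration only: the sigmoid activation, the 2-2-1 layer sizes, and the hand-picked weights are assumptions, and the weight updates performed by backpropagation are not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One dense layer: each neuron computes activation(weighted sum + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Pass the input through every layer in order and return the final output."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x


# Hypothetical 2-2-1 network with hand-picked (untrained) weights.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer: two neurons
    ([[1.0, -1.0]], [0.0]),                    # output layer: one neuron
]
output = forward([1.0, 1.0], layers)
```

Training would compare `output` with the desired label and adjust every weight in proportion to its contribution to the error.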

2.3.4. Naïve Bayes

Naïve Bayes is a probability-based classification approach that has been used for diverse

disease prediction tasks. It is a popular machine learning prediction method based on the

principles of Bayesian probability. It is particularly well-suited for text classification tasks

and is known for its simplicity and efficiency. Naïve Bayes assumes that the features are

conditionally independent given the class variable, which is a strong assumption but often

holds reasonably well in practice. The algorithm works by calculating the posterior

probability of each class given the input features and then selecting the class with the highest

probability as the predicted class. Naïve Bayes leverages Bayes' theorem, which states that

the posterior probability of a class given the data is proportional to the prior probability of the

class multiplied by the likelihood of the data given the class. Fig 4 depicts the pictorial

representation of Naïve Bayes.

Fig 4 Simple Bayesian network structure
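
The posterior computation can be sketched for categorical features as below. This is a minimal illustration, not a production classifier: Laplace (add-one) smoothing is an added assumption so that unseen values do not produce zero probabilities, log-probabilities are summed instead of multiplying raw probabilities, and the features and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> observed values
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(y, j)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values


def predict_nb(model, row):
    """Return the class with the highest (log) posterior score."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        log_p = math.log(n_c / total)  # log prior P(c)
        for j, v in enumerate(row):
            # Laplace-smoothed likelihood P(x_j = v | c), features assumed independent.
            count = feature_counts[(c, j)][v]
            log_p += math.log((count + 1) / (n_c + len(values[j])))
        scores[c] = log_p
    return max(scores, key=scores.get)


# Hypothetical features per sample: (uses network, is packed).
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "no"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
model = train_nb(rows, labels)
prediction = predict_nb(model, ("yes", "yes"))
```

The scores are proportional to prior times likelihood, so the normalizing constant in Bayes' theorem can be skipped when only the most probable class is needed.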


2.3.5. Random Forest

Random Forest is an ensemble learning method, used for classification and regression problems,

that combines the predictions of multiple

decision trees to make accurate and robust predictions. It constructs an ensemble of decision

trees by randomly sampling the training data and selecting a subset of features at each tree's

node. This randomness introduces diversity among the trees and reduces overfitting. During

prediction, Random Forest aggregates the individual tree predictions using majority voting

for classification tasks or averaging for regression tasks. It offers advantages such as handling

high-dimensional data, robustness to outliers, and feature importance measures. Random

Forest is widely used due to its strong predictive performance and interpretability. Fig 5

depicts the pictorial representation of Random Forest.


Fig 5 Random Forest model
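
The two sources of randomness and the majority vote can be sketched as follows. To keep the example short, each "tree" here is a one-level stump that splits a single randomly chosen feature at its mean, which is a deliberate simplification of a real random forest (where full trees choose among a random feature subset at every node); the data and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """One-level tree on one feature: split at the mean, predict each side's majority."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the individual predictions by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical, well-separated two-feature data.
rows = [(0.05, 0.10), (0.10, 0.05), (0.00, 0.08),
        (0.90, 0.95), (0.95, 0.90), (1.00, 0.92)]
labels = ["benign", "benign", "benign", "malware", "malware", "malware"]
forest = train_forest(rows, labels)
prediction = predict_forest(forest, (0.97, 0.93))
```

An individual stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps smooths out such errors, which is the intuition behind the reduced overfitting noted above.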

2.3.6. Support Vector Machine (SVM)

Support Vector Machine (SVM) is a powerful machine learning prediction method that is

commonly used for both classification and regression tasks. SVM seeks to find an optimal

hyperplane that maximally separates the data points of different classes while maintaining a

margin of separation. It is a binary classifier by nature but can be extended to handle multi-

class classification problems as well. Support Vector Machine (SVM) is recognized as a

highly influential learning method that builds upon recent advancements in statistical theories

applied to machine learning. In a study by Jyothi et al. (2015), SVM was employed to

classify patient data related to liver conditions using the UCI Machine Learning repository.

The original dataset yielded an accuracy of 71%, and after employing sampling techniques,

the accuracy was still a respectable 68%. Figure 6 provides a visual representation of SVM in

action.
Fig 6 Support Vector Machine (SVM)

2.3.7. Logistic Regression (LR)

Logistic regression (LR) models the association between one or more independent variables
and a categorical dependent variable, using probability scores as the predicted values of the
dependent variable. The odds are the ratio of the probability of success to the probability of
failure, that is, odds = p / (1 − p), where p is the model's probability for class 0. When
p > 0.5, the instance is assigned to class 0; otherwise, it is assigned to class 1. The output
probability is computed from the attributes, with one coefficient for each attribute. Changes
in the odds are governed by exponential weights: the coefficients weight each attribute before
the attributes are combined. Equivalently, a new instance is placed in the class "yes" when its
predicted probability exceeds 0.5. Fig 7 depicts the pictorial representation of LR.

Fig 7 Logistic regression
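
The relationship between the coefficients, the predicted probability, and the odds can be sketched as follows. This is a minimal illustration using the standard logistic (sigmoid) form; the coefficient values, the intercept, and the function names are hypothetical, and no training procedure is shown.

```python
import math


def predict_proba(x, coefficients, intercept):
    """Probability from a weighted sum of attributes passed through the sigmoid."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Ratio of success probability to failure probability."""
    return p / (1.0 - p)


coefficients = [2.0, -1.0]  # hypothetical weights, one per attribute
intercept = -0.5

p = predict_proba([1.0, 0.5], coefficients, intercept)
assigned = p > 0.5  # decision rule: threshold the probability at 0.5

# Raising the first attribute by one unit multiplies the odds by exp(2.0).
p_shifted = predict_proba([2.0, 0.5], coefficients, intercept)
odds_ratio = odds(p_shifted) / odds(p)
```

This is why coefficients in logistic regression are often read through their exponentials: each one gives the multiplicative change in the odds for a one-unit change in its attribute.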

2.3.8. Multi-Layer Perceptron (MLP)

Multi-Layer Perceptron (MLP) is regarded as one of the most crucial categories within neural

networks, comprising an input layer, an output layer, and at least a single hidden layer. This
architecture has found effective application across various domains to address diverse and

challenging problems, often in a supervised manner, utilizing well-established methods such

as backpropagation techniques (Vijayarani et al., 2015). These techniques rely on error

correction and learning principles. In essence, MLP is considered a versatile and adaptive

modeling framework. Fig 8 depicts the pictorial representation of MLP.

Fig 8 Multi-layer perceptron

2.4. Proposed Models and Methods

Malware detection is of paramount importance in the face of escalating cyber-attacks,

including botnets, denial-of-service (DoS) attacks, and other forms of malware. These

attacks not only compromise sensitive information but also cause significant damage to

critical structures, resulting in substantial financial losses (Dan Lo et al., 2016). With the

rapid growth of the internet, the proliferation of malware has become more prevalent, with

approximately 317 million new pieces of malware created in recent years, equating to an

average of one million new threats released every day (Dan Lo et al., 2016). The increasing

trend of malware poses significant security threats that computer users must contend with.
Consequently, there is a pressing need for automatic malware detection and classification

tools, such as the Norman Sandbox and CWSandbox, to mitigate these risks (Dan Lo et al.,

2016).

Figure 9: Malware detection (Samantray & Tripathy et al., 2018)

In the realm of bot malware detection, Shin et al. (2012) propose an approach that is both

effective and efficient. Traditional malware detection frameworks for bot detection have

limitations and advantages when focusing on either the host or network level. To overcome

these shortcomings, the authors present a detection framework that combines the strengths of

both approaches while leveraging the intrinsic characteristics of bots. By analyzing human-

process-network interactions and examining interactions between processes, as well as active

DNS connections and related file indicators, the authors achieve efficient detection by

filtering out benign programs and focusing on suspicious automated programs that interact

with DNS servers (Shin et al., 2012). This approach allows for a more targeted and effective

bot malware detection process.

These proposed models and methods for efficient and effective malware detection, including

automatic malware detection and classification tools, as well as the integration of host and

network-level analysis for bot detection, offer promising avenues for enhancing cybersecurity

measures and mitigating the risks associated with malware threats.

2.5 Empirical Review


S.No | Year | Title | Algorithms used | Focus of work
1 | 2023 | Malware detection using deep learning and correlation-based feature selection | Feature selection | Two different malware datasets are used to detect malware and differentiate it from benign activities. The datasets are preprocessed, and then correlation-based feature selection is applied to produce different feature-selected datasets. The results indicate that some feature-selected scenarios preserve almost the same original dataset performance. The different nature of the used datasets shows different levels of performance changes.
2 | 2022 | Malware Detection Using Machine Learning | CNN, RNN, Decision Tree, Random Forest | Focusing on the usage of RNN and CNN algorithms for malware detection and comparing with other algorithms such as Decision Tree and Random Forest. Also focusing on detecting malware in different formats, not only PE files.
3 | 2021 | MLDroid - framework for Android malware detection using machine learning techniques | Deep learning algorithm, farthest first clustering, Y-MLP, nonlinear ensemble decision tree forest approach | Empirical results reveal that a model developed by considering all four distinct machine learning algorithms in parallel (i.e., deep learning algorithm, farthest first clustering, Y-MLP, and nonlinear ensemble decision tree forest approach) and rough set analysis as a feature subset selection algorithm achieved the highest detection rate of 98.8% to detect malware from real-world apps.
4 | 2020 | Classification Of Malware Detection using Machine Learning Algorithms | Naive Bayes, Support Vector Machine, Random Forest, K-nearest neighbor | Only focuses on the use of machine learning algorithms for malware detection.
5 | 2019 | Detection of advanced malware by Machine Learning techniques | Decision Tree, Random Forest, Naive Bayes, J48 Graft | Use of signature-based method, which is traditional and does not provide the best accuracy.
6 | 2019 | Comparison of malware detection techniques using machine learning algorithm | KNN, Decision Tree, SVM | The result indicates that the Decision Tree algorithm has the best detection accuracy compared to other classifiers, with 99% accuracy and 0.021% False Positive Rate (FPR) on a relatively small dataset.
7 | 2017 | Malware detection using Machine Learning Algorithms | SVM, Decision Tree, Naive Bayes, Multi-Naive Bayes Algorithm | Use of signature-based method, which is traditional.
8 | 2017 | Malware Detection and Evasion with Machine Learning Techniques: A Survey | Heuristic, Artificial Intelligence, Behavior, Signature-Based Methods | Use of traditional methods for malware detection.
9 | 2017 | Malware Detection using Machine Learning | SVM, Decision Tree, Naive Bayes | Use of signature-based method, which is traditional.
10 | 2017 | Malware Detection and Evasion with Machine Learning Techniques: A Survey | Heuristic, Artificial Intelligence, Behavior, Signature-Based Methods | Use of traditional methods for malware detection.
11 | 2012 | Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks | Decision Tree, Random Forest, Naive Bayes | Some methods of machine learning are not appropriate due to heavy processors.

CHAPTER THREE

METHODOLOGY

3.1 Introduction

The increasing prevalence and sophistication of malware pose significant challenges to

computer systems' security. Traditional signature-based methods are often insufficient to

detect new and evolving malware variants. Therefore, the utilization of machine learning

techniques has emerged as a promising approach to address the problem of malware

detection. This chapter presents the methodology for developing a malware detection and

analysis model using machine learning methods with a one-sided perceptron. The objective is

to detect malware from different files present in computer systems by applying machine

learning algorithms on the required dataset.

3.2 Overview of the Methodology

The proposed methodology encompasses several stages that collectively contribute to the

development of an effective malware detection and analysis model. These stages include

dataset preparation, algorithm selection, model training, and evaluation. The following

subsections provide an overview of each stage.

3.2.1 Dataset Preparation

The first step in the methodology involves the acquisition and preparation of a

comprehensive dataset that contains both benign and malicious files. This dataset serves as

the foundation for training and evaluating the machine learning models. It should encompass

various types of malware and cover a diverse range of file formats commonly encountered in
computer systems. Additionally, the dataset should be labeled to indicate the presence of

malware accurately.

3.2.2 Algorithm Selection

Once the dataset is prepared, the next step is to select suitable machine learning algorithms.

In this research, the focus will be on utilizing the one-sided perceptron algorithm for malware

detection. The one-sided perceptron is a binary classification algorithm that can effectively

distinguish between benign and malicious files. Its simplicity and efficiency make it a viable

choice for this application. Furthermore, the algorithm's ability to update its weights based on

misclassified samples enhances its adaptability to evolving malware threats.
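The weight update on misclassified samples can be illustrated with a standard perceptron. The sketch below is not the one-sided variant itself, whose additional constraints are not specified in this chapter; it shows only the basic update rule. The two binary features and the toy labels are hypothetical.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Standard perceptron: adjust the weights only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation > 0 else 0
            if predicted != y:
                # Move the decision boundary toward the correct side.
                step = learning_rate * (y - predicted)
                weights = [w + step * xi for w, xi in zip(weights, x)]
                bias += step
                errors += 1
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (malicious) or 0 (benign) for one feature vector."""
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical toy data: two binary indicators per file; a file is labeled
# malicious (1) only when both indicators are present.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Because the toy data is linearly separable, the loop stops as soon as an epoch finishes without any misclassification.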

3.2.3 Model Training

With the algorithm selected, the dataset is divided into training and testing sets. The training

set is used to train the one-sided perceptron model by iteratively adjusting the weights to

minimize classification errors. During the training process, the model learns the

distinguishing characteristics of malware and develops a decision boundary that separates

benign and malicious files effectively. The training phase aims to optimize the model's

performance and enhance its ability to generalize to unseen data.

3.2.4 Model Evaluation

Following model training, the performance and effectiveness of the developed malware

detection model are evaluated using the testing set. Various metrics such as accuracy,

precision, recall, and F1 score are computed to assess the model's performance in correctly

identifying malware instances while minimizing false positives. Additionally, other

evaluation techniques, such as cross-validation, can be employed to validate the model's

robustness and generalizability.
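The metrics named above follow directly from the counts of true and false positives and negatives. A minimal sketch is given below; the label vectors are made up for illustration and are not results from this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Precision falls as false positives increase, and recall falls as malware samples are missed, which is why both are reported alongside accuracy.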

3.3 Machine Learning Techniques

Machine learning offers several advantages that make it a compelling approach for malware

detection. Firstly, it enables cybersecurity specialists to rapidly detect and categorize threats,

providing valuable insights for further investigation and mitigation. Machine learning models
can analyze clusters of requests or network traffic with similar characteristics, facilitating the

identification of anomalous patterns that may indicate malware activities.

Machine learning techniques form the core of the proposed methodology for malware

detection. By leveraging algorithms for data analysis and pattern detection, machine learning

enables the identification of distinguishing features that differentiate malware from benign

files. One advantage of machine learning is its capability to detect zero-day malware, which

refers to previously unknown malicious software. By analyzing a large number of benign and

malicious files, the algorithm can learn the underlying patterns and make accurate

predictions.

In the context of PE files, machine learning approaches can be categorized into four groups:

1. Recurrent Neural Network (RNN) Algorithm: RNNs are particularly suitable for

sequential data analysis, making them well-suited for detecting malware in PE files.

Their ability to capture temporal dependencies and memory-like behavior enables

them to uncover intricate patterns that may be indicative of malicious activities.

2. Convolutional Neural Network (CNN) Algorithm: CNNs excel at extracting spatial

features from data, making them effective in detecting visual patterns. In the context

of malware detection, CNNs can analyze the structural elements of PE files and

identify suspicious patterns that indicate the presence of malware.

3. Decision Tree: Decision trees provide a transparent and interpretable framework for

classifying data. By constructing a tree-like structure of decisions based on different

features, decision trees can effectively separate benign and malicious files.

Furthermore, decision trees offer insights into the features that contribute most to the

classification, aiding in the understanding of malware characteristics.

4. Random Forest: Random forest is an ensemble learning method that combines

multiple decision trees to improve classification performance. By aggregating the

predictions of individual trees, random forest models can enhance accuracy and
mitigate the risk of overfitting. This makes them suitable for robust and reliable

malware detection.
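The aggregation step of a random forest can be sketched as a majority vote over individual trees. In the illustration below, each "tree" is a hand-written one-rule function over hypothetical features (`entropy`, `num_sections`, `has_signature`); a real random forest would learn its trees from bootstrap samples of the training data.

```python
from collections import Counter


def majority_vote(trees, sample):
    """Return the class predicted by most trees (1 = malicious, 0 = benign)."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical single-rule "trees", for illustration only.
def tree_entropy(sample):
    return 1 if sample["entropy"] > 7.0 else 0


def tree_sections(sample):
    return 1 if sample["num_sections"] > 8 else 0


def tree_signature(sample):
    return 0 if sample["has_signature"] else 1


trees = [tree_entropy, tree_sections, tree_signature]

suspicious = {"entropy": 7.6, "num_sections": 10, "has_signature": False}
ordinary = {"entropy": 7.6, "num_sections": 4, "has_signature": True}

print(majority_vote(trees, suspicious))
print(majority_vote(trees, ordinary))
```

The second sample is flagged by one tree but outvoted by the other two, which shows how aggregation dampens the effect of a single overfitted rule.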

3.5 Analysis and Design

In the analysis and design phase, the main focus is on extracting relevant features from the

imported files and encoding them for analysis. Virus extraction techniques can be utilized to

isolate and capture key characteristics associated with malware. These features, obtained

from the previously altered files, are then applied to the projected dataset.

The dataset is subsequently subjected to various machine learning algorithms, including the

one-sided perceptron, RNN, CNN, decision trees, and random forest. Through the analysis of

the dataset using these algorithms, valuable insights and results can be obtained regarding the

effectiveness of different approaches for detecting malware.

In summary, the main features are extracted from the imported file and encoded as a feature vector that represents the file data, which makes it straightforward to apply them to the dataset. The machine learning algorithms are then run on the dataset, and their results on detecting malware are compared.
CHAPTER FOUR

DATASET

4.1. Malware Datasets

The importance and value of data cannot be overstated. Without access to a significant amount of data, it is not feasible to develop a model using machine learning. In the context of malware, data is both vital and hazardous to handle. Malware can be collected in binary format, but because the samples are executable, doing so carries inherent risk. When dealing with executable files, the analyst needs to set up a virtual machine and carefully inspect each sample or extract features from it. Although 30,386,102 malware samples are available for download from VirusShare.com, "Access to the site is authorized only via invitation" [21].

In 2015, Microsoft made a sizeable dataset available to the general public as part of the "Microsoft Malware Classification Challenge" hosted on Kaggle [5]. The dataset contains 20,000 malicious samples from 9 different families. The samples are provided in binary form as well as in disassembled assembly format (.asm) produced with the IDA Pro disassembler. Although a great number of research papers have made use of this dataset, we were unable to include it in our investigation because of its enormous size (400 GB) and because it contains no benign files. The publication describing the challenge (Ronen et al., 2018) lists more than fifty research publications and theses that make use of the dataset.


The majority of publicly accessible malware datasets are fairly small, although various datasets of statically and dynamically extracted features are available. ClaMP (Classification of Malware using PE Headers) [7] served as a good starting point for our testing because it comprises 5210 samples, of which 2722 are malicious and the rest are benign. ClaMP was released in 2016 and has 69 extracted features, including md5, size, entropy, fileInfo, VirusTotal report, file type, and more.
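Entropy is one of the features listed above. As an illustration of how such a feature can be computed, the sketch below calculates the Shannon entropy of a byte string. Whether ClaMP computes its entropy feature in exactly this way is an assumption; the function name and the test inputs are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, ranging from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# A constant byte string carries no information; a string that uses every
# byte value equally often reaches the maximum of 8 bits per byte.
low = shannon_entropy(b"\x00" * 1024)
high = shannon_entropy(bytes(range(256)) * 4)
print(low, high)
```

Packed or encrypted sections tend to have high entropy, which is why this value is commonly used as a static feature for PE files.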

4.2. Eliminating Noisy Features

Data mining algorithms take real-world data as input, and that data may be affected by a number of factors. Noise is a major contributor to these problems. It can never be eliminated entirely, but any data-driven organization must find a way to deal with it. Human error and the fallibility of data-gathering instruments both contribute to inaccuracies in collected data. Noise refers to these unintended fluctuations. Noise can be problematic for machine learning algorithms if it is not properly handled, as the algorithm may mistake it for a pattern and generalize incorrectly.

The effects of various forms of noise on datasets are shown in the following figure.

Figure 2: Eliminating Noisy Features

Because of this, the quality of the analysis process as a whole may be jeopardized if the

dataset in question was noisy. The signal-to-noise ratio is the primary metric that analysts and

data scientists use to measure the quality of data. The following is a diagram that

demonstrates how noise lowers the quality of a signal.


Source: machinecurve.com

Therefore, it is necessary for every data scientist to handle the noise in the dataset, regardless

of the method that they choose.

4.3. Techniques for Cleaning the Data Used in Machine Learning

Noise Removal with Auto-encoders

Auto-encoders, and more specifically their stochastic (de-noising) variant, are very useful for de-noising. Because they can be trained to recognize particular noise in a signal or collection of data, they can be used as de-noisers. In this application, the noisy data serves as the input, and the de-noised data is produced as the output. An auto-encoder consists of an encoder and a decoder. The encoder converts incoming data into an encoded form, while the decoder reconstructs the data from that encoded form. De-noising auto-encoders are built to force the hidden layer to learn more robust features by deliberately corrupting the input. The auto-encoder is then trained to recover the original data from the corrupted version while minimizing the reconstruction loss.

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical technique that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables, called the principal components. In this context, the objective of PCA is to improve the quality of a signal or image by reducing or removing noise while preserving the key information. PCA is a geometric and statistical approach that projects an input signal or data set onto a smaller number of axes in order to reduce its dimensionality. To understand the idea, consider a point in the XY plane projected onto the X-axis: the Y-axis component, which carries the noise, can then be disregarded. This procedure is called dimensionality reduction. Because it removes the axes that contain the noisy data, PCA can be used to clean up noisy input data. In this investigation, PCA is used to carry out a two-stage noise reduction technique: it takes in noisy input and generates clean outputs.
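The projection idea can be sketched for two-dimensional data, where the direction of the first principal component has a closed form. The points below are hypothetical: they vary mainly along the X-axis and carry small fluctuations along the Y-axis. This illustrates the projection only and is not the two-stage procedure used in this investigation.

```python
import math


def pca_first_component(points):
    """Return the unit direction of largest variance and the 1-D projections."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n

    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    direction = (math.cos(theta), math.sin(theta))

    # Projecting onto this direction keeps one coordinate per point.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Hypothetical data: signal along X, small noise along Y.
points = [(-2.0, 0.1), (-1.0, -0.1), (1.0, 0.1), (2.0, -0.1)]
direction, scores = pca_first_component(points)
print(direction)
print(scores)
```

The recovered direction is almost parallel to the X-axis, so keeping only the projected scores discards most of the Y-axis fluctuation.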

Separating signal from noise is a major concern for data scientists because of the performance problems that noise can cause. These include overfitting, which alters the behavior of a machine learning algorithm: an algorithm may start to generalize from noise as if it were a pattern. Since noise degrades the quality of a signal or dataset, removing or reducing it is the best way to improve results. A variety of solutions have been proposed for the problem of noisy data, including feature selection and dimensionality reduction.


CHAPTER FIVE

TESTING
5.1. Introduction

We used the EMBER dataset to train deep learning and machine learning algorithms. For deep learning, we used a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN); for machine learning, we used the Random Forest and Decision Tree algorithms. We then compared the accuracy of all algorithms on the test data.

After training the models, we tested them using normal and malware files that we downloaded from the VirusTotal website. The model predicted both types of files correctly.

5.2. Testing results

We used three files to test normal and malware behaviour.

In the screen above, the selected files are the testing files: "antialias.exe" is the normal file, and "trojan.exe" and the bot file are malware files. The screen below shows the project output.
In the screen above, the selected text shows the Python packages being imported.

We read data from the ember folder, split the dataset into train and test parts, and then find
the total number of classes available in the dataset. After executing the code, we get the screen
below.
We can see the total dataset size and then the size of the train and test data. The dataset
contains 3 different types of classes, where 0 is the normal class and the others are the
malware classes. In the graph, the x-axis represents class names and the y-axis represents
their count in the dataset. In the next screen, we train an RNN model on the dataset.
We check to see if the model is trained, and if it is, we load the RNN model. If not, the model

is built from scratch. After the model is trained, we get predictions and model accuracy from

the test data.
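The load-or-train logic described above can be sketched as follows. The sketch uses Python's `pickle` module and a plain dictionary as a stand-in for the model; the function name `load_or_train`, the file name, and the stand-in training function are assumptions, and a deep learning framework would use its own save and load routines instead.

```python
import os
import pickle
import tempfile


def load_or_train(model_path, train_fn):
    """Load a previously saved model if it exists; otherwise train and save one.

    Returns the model and a flag that is True when the model came from disk.
    """
    if os.path.exists(model_path):
        with open(model_path, "rb") as handle:
            return pickle.load(handle), True
    model = train_fn()
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    return model, False


def train_stand_in_model():
    """Hypothetical placeholder for the real training routine."""
    return {"weights": [0.1, 0.2, 0.3]}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "model.pkl")
        _, from_disk_first = load_or_train(path, train_stand_in_model)
        _, from_disk_second = load_or_train(path, train_stand_in_model)
        print(from_disk_first, from_disk_second)
```

On the first call the model is built from scratch and saved; on the second call the saved copy is loaded, which avoids repeating the training step.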

On the above screen, we print the RNN model summary with feature details and then print
the training model accuracy and accuracy on test data. Below, the screen shows the RNN
LOSS and ACCURACY for each epoch.
The x-axis in the graph above represents epochs and the y-axis represents accuracy and loss
values. It is evident from the graph that as accuracy increases, loss values decrease. The
screen below shows training with the machine learning algorithms.

We train Random Forest and Decision Tree models on the EMBER dataset and then calculate
their accuracy on the test data.
Looking at the screen, we can see that both algorithms have the same accuracy of 60%. The
graph below shows the accuracy comparison between RNN, Random Forest, and Decision
Tree. We then took test PE files and performed predictions.

We check to see if the model is trained, and if it is, we load the CNN model. If not, the model

is built from scratch. After the model is trained, we get predictions and model accuracy from

the test data.


On the above screen, we print the CNN model summary with feature details and then print
the training model accuracy and accuracy on test data. Below, the screen shows the CNN
LOSS and ACCURACY for each epoch.

The x-axis in the graph above represents epochs and the y-axis represents accuracy and loss
values. It is evident from the graph that as accuracy increases, loss values decrease. The
screen below shows training with the machine learning algorithms.
Looking at the screen, we can see that both algorithms have the same accuracy of 60%. The
graph below shows the accuracy comparison between CNN, Random Forest, and Decision
Tree. We then took test PE files and performed predictions.

In the screen above, we gave the 'antialias.exe' file as input and used the classifier to predict its class. The output showed that the file was 'NORMAL'. In the screen below, we gave a malware PE file as input, and the model produced the prediction result shown.
We tested the model with a malware file, and it reported that the file was a MALWARE file. Other files can be uploaded and tested in the same way.

Comparison of Accuracy Results Between CNN, RNN, Decision Tree and Random Forest

CNN achieves higher accuracy (86%, as listed in Table 2) than the remaining algorithms, namely RNN, Decision Tree and Random Forest.

Algorithm | Accuracy
CNN | 86%
RNN | 70%
Decision Tree | 60%
Random Forest | 60%

Table 2: Accuracy comparison for different algorithms


CHAPTER SIX

CONCLUSION AND RECOMMENDATIONS

6.1 Conclusion

This research focused on the development of a robust malware detection and analysis model.

The primary objective was to leverage machine learning techniques to accurately detect and

classify malware, thereby mitigating the substantial risks posed by malicious attacks. While

significant progress has been made towards achieving the ultimate goal of zero false

positives, further refinement is required to reduce the existing false positive rate. The findings

highlight the efficacy of machine learning algorithms, with the CNN algorithm emerging as

the most accurate among the evaluated options.

The research outcomes underscore the critical importance of continuous efforts to refine and

enhance machine learning algorithms. Specifically, the decision tree and Random Forest

algorithms should be subjected to further exploration and optimization. By incorporating

advanced techniques, optimizing feature selection, and diversifying the dataset, these

algorithms can be fine-tuned to achieve higher levels of performance and accuracy. This

refinement process is essential in fortifying the malware detection model and ensuring its

reliability and effectiveness in real-world scenarios.

Moreover, it is recommended to adopt an integrated approach to malware detection. By

combining signature-based, rule-based, and machine learning-based techniques, a

comprehensive and multifaceted detection system can be developed. This integration allows

for the synergistic utilization of the strengths of each approach, enhancing the overall

accuracy and robustness of the model. Additionally, continuous research and adaptation are

imperative in this field. The landscape of malware is dynamic, with new variants and attack

vectors emerging regularly. Staying abreast of the latest trends, continuously monitoring new

threats, and integrating threat intelligence are critical for maintaining an up-to-date and

effective malware detection and analysis model.


In conclusion, the research findings present promising progress towards the development of a

malware detection and analysis model. By further refining machine learning algorithms,

integrating diverse detection approaches, and staying vigilant in the face of evolving threats,

the model's accuracy and reliability can be significantly enhanced. These advancements will

contribute to bolstering cybersecurity measures, safeguarding computer systems and

networks, and mitigating the detrimental impacts of malware attacks.

6.2 Recommendations

Based on the findings of the research, the following recommendations are proposed:

1. Refinement of machine learning algorithms: Further refinement of machine learning

algorithms, including the decision tree and Random Forest algorithms, is

recommended. This can be achieved by exploring advanced techniques, optimizing

feature selection, and incorporating more diverse datasets. Improving the performance

of these algorithms will enhance the overall accuracy and reliability of the malware

detection model.

2. Integration of multiple detection approaches: To enhance the effectiveness of the

malware detection and analysis model, it is recommended to integrate multiple

detection approaches. This includes combining signature-based, rule-based, and

machine learning-based techniques. By leveraging the strengths of each approach and

developing a comprehensive detection system, a higher level of accuracy and

robustness can be achieved.

3. Continuous research and adaptation: The field of malware detection and analysis is

constantly evolving, as new malware variants and attack techniques emerge. It is

crucial to stay updated with the latest trends and developments in malware, and to

continuously adapt the detection model accordingly. This involves ongoing research,

monitoring of new threats, and the integration of threat intelligence to ensure the

model remains effective and up to date.
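The integrated approach in the second recommendation can be sketched as a staged check: a signature lookup first, then a rule check, then a learned score. Everything in the sketch is hypothetical, including the signature set, the rule string, the stand-in scoring function, and the threshold; it shows only how the three kinds of evidence could be combined.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious files.
KNOWN_BAD_SHA256 = {hashlib.sha256(b"known-malicious-sample").hexdigest()}

# Hypothetical rule set: byte strings treated as suspicious in this sketch.
SUSPICIOUS_STRINGS = (b"CreateRemoteThread",)


def ml_score(data: bytes) -> float:
    """Stand-in for a trained model: the fraction of non-printable bytes."""
    if not data:
        return 0.0
    non_printable = sum(1 for byte in data if byte < 32 or byte > 126)
    return non_printable / len(data)


def classify(data: bytes, threshold: float = 0.5):
    """Return (verdict, stage), where stage names the check that decided."""
    if hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256:
        return "malware", "signature"
    if any(pattern in data for pattern in SUSPICIOUS_STRINGS):
        return "malware", "rule"
    if ml_score(data) > threshold:
        return "malware", "model"
    return "normal", "model"


print(classify(b"known-malicious-sample"))
print(classify(b"call CreateRemoteThread now"))
print(classify(b"hello world"))
```

Ordering the checks from cheapest to most expensive lets known threats be rejected quickly, while the learned score handles files that match no signature or rule.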


By implementing these recommendations, the development of a malware detection and

analysis model can be further improved, leading to more accurate and efficient identification

of malware threats. These advancements will contribute to strengthening cybersecurity

measures and protecting computer systems and networks from the detrimental effects of

malware attacks.
REFERENCES
Abdullah, M., Agal, A., Alharthi, M., & Alrashidi, M. (2019). Arabic Handwriting

Recognition Model based on. International Journal of Advanced Trends in Computer

Science and Engineering, 8(1.1).

Alomari, E. S., Nuiaa, R. R., Alyasseri, Z. A. A., Mohammed, H. J., Sani, N. S., Esa, M. I., &

Musawi, B. A. (2023). Malware detection using deep learning and correlation-based

feature selection. Symmetry, 15(1), 123.

Anderson, S., & Roth, P. (2018). EMBER: An Open Dataset for Training Static PE

Malware (arXiv No. 1804.04637).

Arora, A., Peddoju, S., Chouhan, V., & Chaudhary, A. (2018). Hybrid Android malware

detection by combining supervised and unsupervised learning. In Proceedings of the

24th Annual International Conference on Mobile Computing and Networking (pp.

798–800). New Delhi, India.

Ban, Y., Lee, S., Song, D., Cho, H., & Yi, J. H. (2022). FAM: Featuring Android Malware

for Deep Learning-Based Familial Analysis. IEEE Access, 10, 20008–20018.

Bassel, A., Abdulkareem, A., Alyasseri, Z., Sani, N., & Mohammed, H. J. (2022). Automatic

Malignant and Benign Skin Cancer Classification Using a Hybrid Deep Learning

Approach. Diagnostics, 12, 2472.

Beaucamps, P., & Filiol, E. (2007). On the possibility of practically obfuscating programs -

Towards a unified perspective of code protection. Journal in Computer Virology, 3(1),

3-21.

Christodorescu, M., & Jha, S. (2003). Static analysis of executables to detect malicious

patterns. In SSYM’03: Proceedings of the 12th Conference on USENIX Security

Symposium, 12.

Damodaran, A., Troia, F. D., Visaggio, C. A., Austin, T. H., & Stamp, M. (2015). A

comparison of static, dynamic, and hybrid analysis for malware detection. Journal of

Computer Virology and Hacking Techniques, 13(1), 1–12.


Egele, M., Scholte, T., Kirda, E., & Kruegel, C. (2012). A survey on automated dynamic

malware-analysis techniques and tools. ACM Computing Surveys, 44.

Gandotra, E., Bansal, D., & Sofat, S. (2014). Malware analysis and classification: A

survey. Journal of Information Security, 5(2), 56-64.


Hardy, W., Chen, L., Hou, S., Ye, Y., & Li, X. (2016). DL4MD: A deep learning framework

for intelligent malware detection. In International Conference on Data Mining

(DMIN).


Hunting for Malware with Machine Learning. (2016). EndGame.

Hwang, C., Hwang, J., Kwak, J., & Lee, T. (2020). Platform-independent malware analysis

applicable to Windows and Linux environments. Electronics, 9, 793.

Jeon, S., & Moon, J. (2020). Malware-detection method with a convolutional recurrent neural

network using opcode sequences. Information Sciences, 535, 1–15.

Kolosnjaji, B., Zarras, A., Webster, G., & Eckert, C. (2016). Deep learning for classification

of malware system call sequences. Australasian Joint Conference on Artificial

Intelligence, 137-149.

Kolter, J., & Maloof, M. (2004). Learning to detect malicious executables in the wild. In

Proceedings of the 10th ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining.

Mahindru, A., & Sangal, A. L. (2021). MLDroid—framework for Android malware detection

using machine learning techniques. Neural Computing and Applications, 33(10),

5183-5240.
Selamat, N., & Ali, F. (2019). Comparison of malware detection techniques using

machine learning algorithm. Indones. J. Electr. Eng. Comput. Sci, 16, 435.

Majid, A. A. M., Alshaibi, A. J., Kostyuchenko, E., & Shelupanov, A. (2023). A review of

artificial intelligence based malware detection using deep learning. Materials Today:

Proceedings, 80, 2678-2683.

Mat, S. R. T., Razak, M. A., Kahar, M., Arif, J., & Firdaus, A. (2022). A Bayesian

probability model for Android malware detection. ICT Express, 8, 424–431.

Microsoft Malware Classification Challenge. (2015). Kaggle. Retrieved

from https://www.kaggle.com/c/malware-classification

Morgenstern, T. (2016). Malware Terms for Non-Techies: Code Entropy. Retrieved

from https://www.cyberbit.com/blog/endpoint-security/malware-terms-code-entropy/

Ronen, R., Radu, M., Feuerstein, C., Yom-Tov, E., & Ahmadi, M. (2018). Microsoft

Malware Classification Challenge (arXiv No. 1802.10135).

Shalaginov, A., Banin, S., Dehghantanha, A., & Franke, K. (2018). Machine learning aided

static malware analysis: A survey and tutorial. Cyber Threat Intelligence, 7-45.


Siddiqui, M., Wang, M. C., & Lee, J. (2009). Detecting Internet Worms Using Data Mining

Techniques.

Sikorski, M., & Honig, A. (2012). Practical Malware Analysis. No Starch Press.

Sung, A., Xu, J., Chavez, P., & Mukkamala, S. (2004). Static analyzer of vicious executables

(save). In Proceedings of the 20th Annual Computer Security Applications

Conference (ACSAC '04), 326-334.

Tian, R., Batten, L., & Versteeg, S. (2008). Function length as a tool for malware

classification. In Proceedings of the 3rd International Conference on Malicious and

Unwanted Software, 57-64.


Tomar, R., & Awasthi, Y. (2019). Prevention Techniques Employed In Wireless Ad-Hoc

Networks. International Journal of Advanced Trends in Computer Science and

Engineering, 8(1.2).

Vinayakumar, R., Alazab, M., Soman, K., Poornachandran, P., & Venkatraman, S. (2019).

Robust intelligent malware detection using deep learning. IEEE Access.

Yazdinejad, A., HaddadPajouh, H., Dehghantanha, A., Parizi, R., Srivastava, G., & Chen,

M.-Y. (2020). Cryptocurrency malware hunting: A deep recurrent neural network

approach. Applied Soft Computing, 96, 106630.

Ye, Y., Li, T., Zhu, S., Zhuang, W., Tas, E., Gupta, U., et al. (2011). Combining file content

and file relations for cloud-based malware detection. In Proceedings of ACM

International Conference on Knowledge Discovery and Data Mining (ACM

SIGKDD), 222-230.
