A Honeypot with Machine Learning based Detection Framework for defending IoT based Botnet DDoS Attacks
Abstract— With the tremendous growth of IoT botnet DDoS attacks in recent years, IoT security has become one of the most pressing topics in the field of network security. Many security approaches have been proposed in the area, but they still fall short in dealing with newly emerging variants of IoT malware, known as Zero-Day Attacks. In this paper, we present a honeypot-based approach which uses machine learning techniques for malware detection. The data generated by the IoT honeypot is used as a dataset for the effective and dynamic training of a machine learning model. The approach can be taken as a productive first step towards combating Zero-Day DDoS Attacks, which have emerged as an open challenge in defending IoT against DDoS Attacks.

Keywords— Zero-Day DDoS Attack; Machine Learning; IoT Honeypots; IoT Botnets

I. INTRODUCTION

IoT, a network of interconnected things that operates without human intervention, has now also become a source for propagating DDoS attacks [1]. IoT devices can be compromised more easily than desktop computers. Therefore, there is a significant increase in the occurrence of IoT-based botnet attacks [7]. A botnet, a network of bots (compromised IoT devices), is the result of malware infections in an IoT network [2]. According to a recent survey, there are over 6 billion IoT devices on the planet, and such a huge number of potentially vulnerable gadgets cannot easily go unnoticed by cybercriminals. Thousands of malware samples have been detected in previous years, and about half of them were detected in 2017 alone [5].

A honeypot, as its name suggests, is used for luring in attackers with the intention of observing and analyzing their method of launching an attack by capturing information about the attacking agent, such as the malware used for a DDoS attack [9]. It is a device that can be compromised on behalf of the main server by simulating a vulnerability that an attacker can easily exploit. The information it can capture by monitoring the activity between the attacker and itself includes IP addresses, MAC addresses, port numbers, the types of devices targeted, and the malware executables and their commands [27]. In past years, honeypots have proved to be a great source for researching malware and its variants in the field of computer security. The honeypot first came into existence in the late 1990s as ‘The Deception Toolkit’, which was developed by Fred Cohen in 1998 [28] and later became publicly and commercially available, especially to tackle the self-replicating programs called worms.

Nowadays, there are different types of honeypots available for use by various applications. They can be classified by the level of interaction they allow with the attacker. The level of interaction depends on the amount of data that needs to be collected. Accordingly, they are categorized into low-interaction honeypots and high-interaction honeypots [9]. They can also be classified by objective: Research Honeypots are used to carry out research and gain knowledge of possible threats and shortcomings in the system, whereas Production Honeypots are used to protect a company's assets from attacks in real time and improve overall security. Thus, honeypots are quite effective in dealing with Zero-Day DDoS Attacks without compromising IoT devices [29].

However, there is a difference between traditional honeypots and IoT honeypots. Traditional honeypots have similar architectures (mainly x86 and x86-64), whereas the architectures of IoT honeypots are heterogeneous due to the different types of IoT devices.

In our proposed solution, we use a honeypot framework to catch malware installation attempts on the IoT device. The collected information, in the form of log files, can be used as input for training the machine learning model. The advantage of using a honeypot over fixed datasets to train the model is that the model can also learn from unknown variants of malware families instead of only from limited known data [13].

The concept of machine learning is used in our solution to automate the process of detection and prediction of the
incoming security threats to the IoT devices by using appropriate learning algorithms and techniques [17]. Learning algorithms are generally categorized into supervised and unsupervised. Supervised learning requires classification labels to be assigned during the training process; the trained model can then predict the label of a new sample whose features are relatively similar. On the other hand, unsupervised learning [6] does not require such labels; instead, it groups samples on the basis of similarities among the features of the training dataset. In our solution, we prefer an unsupervised learning algorithm because we want to avoid human intervention in the process: with supervised learning, an expert is needed to form the rules and assign the labels. Some of the most common unsupervised learning methods are clustering, anomaly detection, and neural networks. Malware detection can be characterized as a classification problem or a clustering problem [10, 11]. A classification problem has known instances of data and hence uses supervised learning to assign samples to classes. In the clustering problem, unknown malware types are grouped into several clusters based on properties identified by an unsupervised learning algorithm [8].

Moreover, the advantage of using machine learning for the detection of malware lies in its ability to generate fewer false positives and false negatives than other anomaly detection methods [4].

II. RELATED WORK

Several honeypot-based approaches for defending against DDoS are present in the literature. Signature matching has been used as the detection framework in some of these approaches [16]. Malware is detected on the basis of signatures obtained from the log files generated by the honeypot [18]. This type of detection can deal only with stored signatures and their variations, which limits its ability to handle unknown and wider ranges of malware families. Another solution is anomaly-based detection [12], which does not make use of rules; instead, a threshold is set for normal user behavior, and any deviation from it is declared possible malicious behavior. Such systems suffer from high false positive rates because attackers can now imitate normal behavior too. A machine learning based solution is capable of dealing with this problem due to its ability to learn over time. Thus, a more accurate classification with fewer false positives can be achieved by training the model with effective and updated data. Machine learning is used to better utilize the dynamic data produced by the honeypot and to improve the prediction of future attacks.

Many machine learning methods have also been proposed to identify DDoS based on the selection of statistical features using supervised learning algorithms such as SVM and Naïve Bayes [15, 17]. However, these methods require extensive network expertise for selecting appropriate features from the dataset and are usually limited to only one or a few DDoS vectors. In addition, they require regular updates of the system to keep it functioning in diverse situations.

Another machine learning based solution was proposed to detect DDoS using deep learning models such as Convolutional Neural Network (CNN) [22], Recurrent Neural Network (RNN) [25], Long Short-Term Memory Neural Network (LSTM) [23], and Gated Recurrent Unit Neural Network (GRU) [24]. A network-based anomaly detection method was also proposed, which extracts behavior snapshots of the network and uses deep autoencoders to detect anomalous network traffic emanating from compromised IoT devices [26]. However, deep learning models need a large amount of training data to produce accurate outcomes. Besides that, they have an extremely computationally expensive and complex training procedure and often require a significant amount of time to learn. IoT devices cannot afford such extensive procedures, as they are quite constrained in terms of resources and must provide real-time services to the user. Moreover, there is a need to develop new methods for detecting attacks launched from compromised IoT devices and to differentiate between hour-long and millisecond-long IoT-based attacks [26].

III. METHODOLOGY

In our proposed solution, we concentrate not only on detecting the malware but also on identifying the unknown malware families responsible for Zero-Day DDoS attacks. Zero-Day attacks are caused by variants of malware infections that have not yet been identified well enough to build a complete DDoS defense against them [19]. We address this issue with a honeypot approach combined with a machine learning based detection framework. A honeypot is used to intentionally lure in attackers in order to capture the properties of the malware and the way it invades the security of IoT devices, recording all of this information in log files [16]. A machine learning based detection framework is then used to predict the possibility of abnormal activity from the log files generated by the honeypot, using a lightweight classification algorithm, preferably an unsupervised one, as it does not require an expert to label the training tuples as malicious or normal [20].

The architecture of our proposed solution is as follows. The process starts when an attacker attempts to inject malware through an open port (Telnet port 23 or 2323) by logging into an IoT device using several combinations of IDs and passwords. The honeypot comes into the picture here by intentionally allowing the attacker to gain access through its own protection wall. The main intention is to get information about the malware as well as about the attacker by recording each activity between the device and the invader in log files. These log files capture information that gives us an idea of the nature of new malware families, their variants, the types of targeted devices, and the C&C server IP address, port number, etc. Since we have to train our machine learning model, we need to transform the log file data into a proper tabular format that will serve as the dataset. For classification, we prefer a memory-efficient algorithm that uses the minimum possible training data to predict useful information, so that it does not become an overhead for an IoT device [20]. Finally, based on the classification result, appropriate action is performed. Fig. 1 represents the whole
process flow for the proposed solution. The process of training repeats every time the allowable size of training data is exceeded, to keep the process dynamic and easily runnable on resource-constrained IoT devices.

Fig. 1. Process flow for the honeypot-based solution with machine learning based detection framework.

IV. IMPLEMENTATION ASPECTS

Implementation is a necessary part of any novel approach or idea in order to check its feasibility and evaluate its efficiency against currently available similar solutions. As discussed in the section above, our proposed approach consists of several successive steps. At each step, we can apply the latest methods for the underlying concept to keep our solution up to date with current IoT challenges. The following are recent developments in the fields of IoT honeypots and real-time machine learning detection, which are the two most important steps in our approach for carrying out the desired implementation:

A. IoT Virtual Honeypot:

The first step in our proposed approach is to attract attackers into deliberately exploiting a vulnerability present in IoT devices. To emulate this behavior, we need a system or device that can act exactly like an exploitable IoT device and prompt the attacker to make a malicious move without a second thought about the genuineness of the target. Such systems are widely known as IoT honeypots. As discussed in the introduction, based on the level of interaction, honeypots can be classified as High Interaction Honeypots (HIH), Low Interaction Honeypots (LIH), and Medium Interaction Honeypots (MIH), which are a combination of both. Since it is infeasible to set up a high interaction honeypot (HIH) for resource-constrained IoT devices, it is preferable to select a medium interaction honeypot (MIH) over the other two. This is why it is named an IoT ‘Virtual’ honeypot: in this case we would be implementing it virtually by simulating the IoT platform using IoT communication protocols. Attack data such as network traffic, payloads, malware samples, and the toolkit used by the attacker can then be recorded by the honeypot. Some recently developed IoT honeypots for DDoS detection are listed below:

IoTPOT [32]: This honeypot emulates Telnet services of various IoT devices and consists of a frontend low-interaction responder cooperating with a backend high-interaction virtual environment called IoTBOX, capable of operating on different CPU architectures.
Telnet IoT honeypot [30]: A Telnet server is used for implementing the trap for IoT.
HoneyThing [31]: This honeypot emulates a vulnerable modem/router (having a RomPager embedded web server) and is TR-069 (CPE WAN Management Protocol) specific.
Dionaea [33]: This honeypot uses the MQTT protocol to simulate IoT behavior.
ZigBee Honeypot [34]: This honeypot simulates a ZigBee gateway.
Multi-purpose IoT honeypot [35]: This IoT honeypot focuses on Telnet, SSH, HTTP, and CWMP.
ThingPot [29]: This IoT honeypot is capable of simulating an entire IoT platform, rather than a single application-layer communication protocol (e.g., Telnet, HTTP, etc.).

The most appropriate IoT honeypot should emulate IoT devices not just by emulating selected IoT communication protocols; it should be capable of simulating the whole IoT platform along with all the supported application layer protocols. Some of the most popular application protocols used for IoT communication are MQTT (Message Queue Telemetry Transport) by IBM, XMPP (Extensible Messaging and Presence Protocol), which provides basic instant messaging (IM) and presence
functionality, AMQP (Advanced Message Queuing Protocol), which arose from the financial industry, CoAP (Constrained Application Protocol), designed for resource-constrained devices, UPnP (Universal Plug and Play), a set of network protocols used for the discovery of network devices, and HTTP REST. REST is an architectural style that has been widely used in Machine-to-Machine (M2M) communications and IoT platforms. Among all the honeypots listed above, we can use ThingPot for our purpose of luring in a range of possible malware attacks.

B. Real Time Machine Learning Detection Framework

The machine learning based detection framework is another important step in our whole process of DDoS detection. A number of machine learning algorithms are available for carrying out the desired classification. However, we are not interested in just the classification of malware; we want a machine learning solution that can be implemented in real time and can classify the malware features accurately without generating many false positives. Recent research in the field of real-time machine learning based detection in IoT devices includes a solution proposed by R. Doshi et al., 2018 [17], which was reported to classify the malware with an accuracy of 0.99. The solution specifically targets IoT botnet attacks, which have shown a drastic increase in recent years.

IoT traffic behaves differently from that of traditional laptops and smartphones, as the devices communicate with a small, limited set of endpoints rather than a large variety of web servers. This kind of behavior of IoT traffic can be observed closely via a machine learning process. The process consists of several steps: data collection, feature extraction, and finally binary classification. The extracted features mainly capture IoT-specific network behaviors and network flow characteristics such as packet length, inter-packet intervals, and protocol. A variety of classifiers for attack detection, including random forests, K-nearest neighbors, support vector machines, decision trees, and neural networks, are compared against each other. The random forest, K-nearest neighbors, and neural net classifiers were found to be particularly effective [17]. IoT-specific network behaviors such as the limited number of endpoints and the regular time interval between packets can be used in feature selection to achieve higher accuracy in detecting DDoS in IoT network traffic with the assistance of various machine learning algorithms, including neural networks.

Anomaly detection is a process that goes through different phases: Traffic Capture, Grouping of packets by device and time, Feature Extraction, and finally Binary Classification. The traffic capture process records the source IP address, source port, destination IP address, destination port, packet size, and timestamp of all IP packets sent from an IoT device that is part of a smart home application. Collecting DDoS traffic is quite challenging due to the security risks and complexity involved. For this reason, the three most common variations of DDoS attack, i.e., a TCP SYN flood, a UDP flood, and an HTTP GET flood, were simulated to capture new variants in the malware properties.

Grouping is performed on packets from IoT devices based on source IP address; the packets from each device are further divided into non-overlapping time windows using the timestamps recorded at the earlier stage.

The feature extraction process is responsible for generating stateless and stateful features for each packet depending on the IoT device behavior. Stateless features are lightweight features derived from flow-independent characteristics of each sent packet, i.e., they are generated without splitting the incoming traffic stream by IP source. On the other hand, stateful features capture aggregated flow information in the network traffic over short time spans. Packet size and inter-packet interval are considered stateless features, whereas bandwidth and IP address cardinality and novelty are stateful features. Finally, binary classification is performed using different classification algorithms such as K-nearest neighbors, random forests, support vector machines, and deep neural networks to distinguish normal traffic from DDoS traffic [36]. Fig. 2 shows the complete flow of the processes involved. Moreover, deep learning classifiers should be quite effective, as they can work on the additional data generated from real-world deployments.

Fig. 2. Process flow for the machine learning based detection framework.

To summarize, the proposed solution can be implemented using an IoT honeypot inspired by ThingPot [29], an IoT virtual honeypot capable of catching various botnet binaries by emulating different IoT communication protocols along with the behavior of the entire IoT platform. To keep it isolated from the original IoT platform, a virtual box should be used to deploy it for every IoT device in a network. Since, due to IoT constraints, it is not possible to implement classifiers on each device, the classifier should be implemented at the router level. Also, the amount of traffic coming to a particular IoT device is insufficient to perform any training over a machine learning
model. To generate an adequate amount of IoT traffic, any IoT network simulator can be used. IoT simulators are known for generating an IoT environment for testing IoT-based applications and can add cloud storage if required. However, if we use the preferred honeypot, there is no need for IoT simulators, as the honeypot itself is responsible for this. The transformation of log files into the format required as input to the machine learning model can be done using bash scripts on Linux. For carrying out the machine learning tasks, machine learning tools such as Microsoft Azure, MATLAB, etc. can be used in a virtualized environment. We can use the approach discussed above for the real-time machine learning detection framework.

V. CONCLUSION

The Internet of Things is the biggest reason for the modernization of the real world in terms of technology. But it is also the main reason for the increasing number of cyber attacks, especially DDoS attacks. That is why defending against attacks that use IoT as a medium to harm network security has become the primary concern in the field of Internet security. A number of defense mechanisms have been proposed to make IoT networks immune to such attacks, but they are incapable of handling new variants of IoT botnet attacks. We came up with a honeypot-based solution for DDoS detection which uses a real-time machine learning detection framework. The use of honeypots ensures the logging of newly arriving malware features, which are utilized by the ML-based detection framework to train its classifiers effectively. As future work, we need to extend this approach to the next level, where we can find the open challenges or issues by implementing it in real-time scenarios. There is also scope for employing a cloud server to deal with extremely resource-constrained IoT devices. Finally, we can produce a comparative analysis of our proposed solution by evaluating its performance against other proposed models.

REFERENCES

[1] K. Chen, S. Zhang, Z. Li, Yi Zhang, Q. Deng, Sandip Ray, Yier Jin, “Internet-of-Things Security and Vulnerabilities: Taxonomy, Challenges, and Practice,” Journal of Hardware and Systems Security, vol. 2, Issue 2, pp. 97–110, (2018).
[2] W. Zhou, Y. Jia, A. Peng, Y. Zhang and P. Liu, "The Effect of IoT New Features on Security and Privacy: New Threats, Existing Solutions, and Challenges Yet to Be Solved," IEEE Internet of Things Journal. 2018.
[3] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, and W. Zhao, "A Survey on Internet of Things: Architecture, Enabling Technologies, Security and Privacy, and Applications," IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1125-1142 (2017).
[4] Honeypots and the Internet of Things. Available at https://securelist.com/honeypots-and-the-internet-of-things/78751.
[5] Hastie, T., Tibshirani, R., & Friedman, J. Unsupervised learning. In The elements of statistical learning (pp. 485-585). Springer, New York, NY (2009).
[6] C. Kolias, G. Kambourakis, A. Stavrou and J. Voas, "DDoS in the IoT: Mirai and Other Botnets," in Computer, vol. 50, no. 7, pp. 80-84 (2017).
[7] Dougherty, J., Kohavi, R., & Sahami, M. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995, pp. 194-202 (1995).
[8] Sommer, R., & Paxson, V. (2010, May). Outside the closed world: On using machine learning for network intrusion detection. In Security and Privacy (SP), IEEE Symposium on (pp. 305-316). IEEE (2010).
[9] M. Anirudh, S. A. Thileeban and D. J. Nallathambi, "Use of Honeypots for Mitigating DoS Attack Targeted on IoT Networks," 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP), Chennai, pp. 1-4, (2017).
[10] Rieck, K., Holz, T., Willems, C., Düssel, P., & Laskov, P. (2008, July). Learning and classification of malware behavior. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (pp. 108-125). Springer, Berlin, Heidelberg.
[11] Bailey, M., Oberheide, J., Andersen, J., Mao, Z. M., Jahanian, F., & Nazario, J. Automated classification and analysis of internet malware. In International Workshop on Recent Advances in Intrusion Detection, Springer, Berlin, Heidelberg, pp. 178-197 (2007).
[12] Binkley, J. R., & Singh, S. An Algorithm for Anomaly-based Botnet Detection. SRUTI, 6, 7-7. (2006).
[13] Song, Y., Keromytis, A. D., & Stolfo, S. J. U.S. Patent No. 8,844,033. Washington, DC: U.S. Patent and Trademark Office. (2014).
[14] The New Threat: The IoT DDoS Invasion. https://www.a10networks.com/sites/default/files/resource-files/A10-TPS-GR-The_New_Threat_The_IoT_DDoS_Invasion.pdf.
[15] Zammit, D. A machine learning based approach for intrusion prevention using honeypot interaction patterns as training data. University of Malta, 1-55 (2016).
[16] Pa, Y. M. P., Suzuki, S., Yoshioka, K., Matsumoto, T., Kasama, T., & Rossow, C. IoTPOT: analysing the rise of IoT compromises. EMU, 9, 1 (2015).
[17] Doshi, R., Apthorpe, N., & Feamster, N. Machine Learning DDoS Detection for Consumer Internet of Things Devices, arXiv preprint arXiv:1804.04159 (2018).
[18] Pa, Y. M. P., Suzuki, S., Yoshioka, K., Matsumoto, T., Kasama, T., & Rossow, C., IoTPot: A novel honeypot for revealing current IoT threats. Journal of Information Processing, 24(3), 522-533 (2016).
[19] Musca, C., Mirica, E., & Deaconescu, R. Detecting and analyzing zero-day attacks using honeypots. In Control Systems and Computer Science (CSCS), 2013 19th International Conference on (pp. 543-548). IEEE. (2013).
[20] Hofmann, T. Unsupervised learning by probabilistic latent semantic analysis. Machine learning, 42(1-2), 177-196 (2001).
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105 (2012).
[22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, (1997).
[23] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, (2014).
[24] Yuan, X., Li, C., & Li, X. DeepDefense: Identifying DDoS Attack via Deep Learning. In 2017 IEEE International Conference on Smart Computing (SMARTCOMP) (pp. 1-8). IEEE. (2017).
[25] Meidan, Y., Bohadana, M., Mathov, Y., Mirsky, Y., Shabtai, A., Breitenbacher, D., & Elovici, Y. N-BaIoT—Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders. IEEE Pervasive Computing, 17(3), 12-22. (2018).
[26] Nawrocki, M., Wählisch, M., Schmidt, T. C., Keil, C., & Schönfelder, J. A survey on honeypot software and data analysis. arXiv preprint arXiv:1608.06249. (2016).
[27] Cohen, F. Special feature: A note on the role of deception in information protection. Computers and Security, 17(6), 483-506. (1998).
[28] Syversen, J., U.S. Patent Application No. 11/632,669 (2008).
[29] Wang, Meng, Javier Santillan, and Fernando Kuipers. "ThingPot: an interactive Internet-of-Things honeypot." arXiv preprint arXiv:1807.04114 (2018).
[30] Phype. Telnet IoT honeypot. https://github.com/Phype/telnet-iot-honeypot.
[31] Omererdem. Honeything. https://github.com/omererdem/honeything.
[32] Yin Minn Pa Pa, Shogo Suzuki, Katsunari Yoshioka, Tsutomu Matsumoto, Takahiro Kasama, and Christian Rossow. IoTpot: A novel honeypot for revealing current iot threats. Journal of Information Processing, 24(3):522–533, 2016.
[33] DinoTools. dionaea - catches bugs. https://github.com/DinoTools/dionaea/blob/master/README.md.
[34] S. Dowling, M. Schukat, and H. Melvin. A zigbee honeypot to assess iot cyberattack behaviour. In 2017 28th Irish Signals and Systems Conference (ISSC), pages 1–6, June 2017.
[35] Roy T. Fielding and Richard N. Taylor. Architectural styles and the design of network-based software architectures. University of California, Irvine Doctoral dissertation, 2000.
[36] Singh, K., Guntuku, S. C., Thakur, A., & Hota, C. (2014). Big data analytics framework for peer-to-peer botnet detection using random forests. Information Sciences, 278, 488-497.
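
As an illustration of the grouping and feature extraction phases described in Section IV-B, the following minimal sketch groups packets by source IP address and non-overlapping time window, then computes one feature row per group. It is not the implementation of [17] or of the proposed framework. The packet record layout, the 10-second window, and every function and field name are assumptions made for illustration. As a simplification, the stateless quantities (packet size and inter-packet interval) are reported as per-window means rather than per packet.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; the field names are illustrative assumptions."""
    timestamp: float  # capture time in seconds
    src_ip: str       # sending IoT device
    dst_ip: str       # remote endpoint
    size: int         # packet size in bytes


def group_packets(packets, window=10.0):
    """Group packets by (source IP, index of non-overlapping time window)."""
    groups = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.timestamp):
        groups[(p.src_ip, int(p.timestamp // window))].append(p)
    return groups


def extract_features(packets, window=10.0):
    """Return one feature row (a dict) per (device, window) group."""
    rows = []
    seen_dsts = defaultdict(set)  # endpoints already contacted by each device
    groups = group_packets(packets, window)
    # Visit windows in time order so that novelty reflects earlier windows only.
    for src, idx in sorted(groups, key=lambda k: (k[1], k[0])):
        pkts = groups[(src, idx)]
        sizes = [p.size for p in pkts]
        gaps = [b.timestamp - a.timestamp for a, b in zip(pkts, pkts[1:])]
        dsts = {p.dst_ip for p in pkts}
        rows.append({
            "src_ip": src,
            "window": idx,
            # Stateless quantities, averaged over the window (simplification).
            "mean_size": sum(sizes) / len(sizes),
            "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
            # Stateful quantities aggregated over the window.
            "bandwidth": sum(sizes) / window,           # bytes per second
            "dst_cardinality": len(dsts),               # distinct endpoints
            "dst_novelty": len(dsts - seen_dsts[src]),  # endpoints not seen before
        })
        seen_dsts[src] |= dsts
    return rows
```

Rows of this kind could then serve as the tabular input to whichever classifier or clustering algorithm is chosen; the choice of window length and of which features to keep is left open here.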