Fault-Tolerance in The Scope of Software-Defined Networking SDN

This document summarizes fault tolerance techniques in software-defined networking (SDN). It discusses fault tolerance concepts, highlights issues in the SDN architecture, and reviews state-of-the-art fault tolerance research according to the SDN data, control and application planes. The document concludes by outlining future research directions to improve SDN fault tolerance.

Received July 20, 2019, accepted July 31, 2019, date of publication September 2, 2019, date of current version September 13, 2019.


Digital Object Identifier 10.1109/ACCESS.2019.2939115

Fault-Tolerance in the Scope of Software-Defined


Networking (SDN)
A. U. REHMAN, RUI L. AGUIAR, AND JOÃO PAULO BARRACA
Instituto de Telecomunicações, P-3810-193 Aveiro, Portugal
Departamento de Eletrónica, Telecomunicações e Informática, Universidade de Aveiro, Campus Universitário de Santiago, P-3810-193 Aveiro, Portugal
Corresponding author: A. U. Rehman ([email protected])
This work is funded by Fundação para a Ciência e a Tecnologia/Ministério da Educação e Ciência (FCT/MEC) through national funds and
when applicable co-funded by FEDER - PT2020 partnership agreement under the project UID/EEA/50008/2019, and Fundação para a
Ciência e Tecnologia under Grant PD/BD/113822/2015.

ABSTRACT Fault-tolerance is an essential aspect of network resilience. Fault-tolerance mechanisms
are required to ensure high availability and high reliability in systems. The advent of software-defined
networking (SDN) has both presented new challenges and opened new paths to develop novel strategies,
architectures, and standards to support fault-tolerance. In this survey, we address SDN fault-tolerance
and discuss the OpenFlow fault-tolerance support for failure recovery. We highlight the mechanism used
for failure recovery in Carrier-grade networks that includes detection and recovery phases. Furthermore,
we highlight SDN-specific fault-tolerance issues and provide a comprehensive overview of the state-of-
the-art SDN fault-tolerance research efforts. We then discuss and structure SDN fault-tolerance research
according to three distinct SDN planes (i.e., data, control, and application). Finally, we conclude by
enumerating future research directions for SDN fault-tolerance development.

INDEX TERMS Software-defined networking, fault-tolerance, OpenFlow, failure detection, failure recovery, fault-tolerance issues, network programmability, network softwarization, mission-critical communications.

I. INTRODUCTION
Due to the lack of software programmability in today's networks, it is quite challenging to modify (program) networks. Traditionally, there was no underlying programming abstraction provided to deal with the inherent complexity of distributed system failures. One of the primary features that software-defined networking (SDN) provides is data and control plane separation, laying the ground for simple network programmability. Although there is an extensive set of SDN research, most of the research performed so far focuses on exploring SDN as a programmable technology, without considering fault-tolerance aspects [1]–[4].

Fault-tolerance is a broad area of knowledge, and covering all aspects of fault-tolerance concepts in a single paper is difficult. Hence, in this paper, we briefly discuss key fault-tolerance concepts and focus more on fault-tolerance in the scope of SDN. It is important to note that fault-tolerance and fault-management concepts are different. On the one hand, fault-tolerance is a characteristic of a system, which is designed in such a way that it can minimize service failures in the presence of system component faults. On the other hand, "fault management" is a term used in network management, describing the overall processes and infrastructure associated with detecting, diagnosing, and fixing faults, and returning to normal operations in telecommunication systems [5].

Generally, fault-tolerance is an essential part of the design of any communication system/network. Computer networks are built on physical infrastructure or virtualized versions of the physical infrastructure. These infrastructures are critical because business applications rely on their proper operation. However, such infrastructures are prone to a wide range of challenges and attacks, such as natural disasters or Denial of Service (DoS) attacks, and to major issues such as faults, failures, and errors, all of which cause failure and disruption in network service. Therefore, to overcome these network service issues, resilience procedures and fault-tolerance mechanisms are essential to identify and heal the system/network in the presence of such failures [6].

SDN provides network flexibility through a clear separation of control and data planes, inherently simplifying network management [7], although SDN fault-tolerance is still in

The associate editor coordinating the review of this article and approving it for publication was Mubashir Husain Rehmani.

124474 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ VOLUME 7, 2019
A. U. Rehman et al.: Fault-Tolerance in the Scope of SDN

FIGURE 1. Condensed structure of this survey.

its infancy. SDN is exposed to new sets of failures and issues at each layer of its architecture, as discussed in Section V. It is necessary to address these issues and safeguard each layer of the SDN architecture to provide enhanced fault-tolerance. We overview fault-tolerance, its techniques, and its typical phases. We then highlight fault-tolerance issues according to the three main SDN layers (data, control, and application) and classify SDN fault-tolerance research according to these three layers. The condensed structure of this survey is depicted in Fig. 1.

The rest of the paper is structured as follows. Section II discusses previous studies, both generic and specific to SDN fault-tolerance. Section III provides a comprehensive overview of fault-tolerance. Section IV summarizes the SDN layered architecture and fault-tolerance support in SDN. Section V briefly discusses SDN architecture-based fault-tolerance issues, focusing on the three principal planes of the SDN architecture: data plane, control plane, and application plane. Section VI surveys state-of-the-art SDN fault-tolerance research efforts. Section VII discusses SDN fault-tolerance challenges and future research directions. Section VIII concludes the paper. The list of abbreviations/acronyms is provided after the conclusion.

II. RELATED WORK
Some previous surveys have explored fault management [8], [9] in SDN. We provide further discussion on these efforts as follows:

Fonseca and Mota [8] addressed fault management in SDN. They presented an SDN fault management overview and focused more on issues associated with each layer of the SDN architecture. They also discussed general approaches, trade-offs, major contributions, and research gaps. They further extended the discussion on SDN fault management issues to optical and wireless networks.

Yu et al. [9] also addressed SDN fault management and provided a systematic survey by evaluating existing SDN fault management solutions. Furthermore, they presented an in-depth analysis of SDN fault management focusing on system monitoring, fault diagnosis, fault recovery, and repair, and


TABLE 1. Scope and technical contributions of previous and this work.

briefly discussed SDN fault-tolerance. They compared and analyzed existing solutions in the context of SDN fault management over the period 2008-2017.

Chen et al. [31] addressed traditional fault-tolerance approaches and analyzed their connections with SDN. Data-plane failure detection and recovery mechanisms (link/node) were compared and briefly discussed. Furthermore, they discussed traditional fault-tolerance approaches and compared restoration and protection methods for link/node failure recovery.

In this paper, we focus on one of the most important disciplines of resilience, namely, fault-tolerance. We characterize it according to the SDN planes (data, control, and application). Furthermore, Table 1 locates the present work in the context of other technical contributions.

A. CONTRIBUTION AND SCOPE OF THIS SURVEY
As mentioned in Section II, previous studies have not discussed fault-tolerance phases, techniques, and basic topics in the context of the SDN architecture. This survey addresses fault-tolerance specific to SDN. We can distinguish our contribution in this paper in comparison to the other related work as follows:
• We focus on SDN fault-tolerance rather than fault-management of SDN.
• We structure SDN fault-tolerance according to the SDN layer-based architecture (i.e., data, control, and application planes).
• We discuss in detail SDN fault-tolerance support in the context of traditional approaches that can be applied to address fault-tolerance in the data, control, and application planes.
• We provide and organize a comprehensive overview of fault-tolerance issues on each SDN plane and discuss state-of-the-art research efforts addressing these highlighted SDN fault-tolerance issues.
• We highlight SDN fault-tolerance challenges according to the data, control, and application planes and outline important future research directions from the perspective of programmability and network softwarization.

In summary, in this paper, we present a fault-tolerance overview, as well as techniques and phases in the scope of SDN, and traditional fault-tolerance support for SDN. We then highlight issues that can affect fault-tolerance in each layer of the SDN architecture itself, after which we structure and organize state-of-the-art research efforts addressing these SDN fault-tolerance issues and identify solutions to safeguard fault-tolerance in each layer of the SDN architecture. We also outline important future research directions for each SDN layer, i.e., the programmable data plane for network softwarization (data plane), controller architectures for mission-critical scenarios (control plane), and software tools for fault-tolerant SDN application development (application plane).


FIGURE 2. Relationship: Fault, Error, and Failure.

A fault is the hypothesized cause of an error, for instance, a software bug, a human-made error, or a hardware power failure [11]. The relationship between fault, error, and failure is depicted in Fig. 2 [12].

Fault-tolerance is the outcome of a design process of building a reliable system from unreliable components [13]. Faults can be classified into two main categories [14], [15]: Crash faults and Byzantine faults. Crash faults can cause system-fatal errors (for instance, process and machine power-related failures), while Byzantine faults can cause the system to deviate from normal operation [14]. Fault-tolerant systems are equipped with several mechanisms that not only respond to these issues but also continuously offer and maintain correct system operation. In practice, it is hard to design a fault-tolerant system that can guarantee flawless communication, but even in worst-case scenarios, fault-tolerant systems still offer graceful degradation of services. Nevertheless, we can always design efficient mechanisms for the faults and errors that are most likely to happen and affect any system. Such an approach can improve and enhance fault-tolerant communication systems.

FIGURE 3. Typical phases in Fault-tolerance.

B. FAULT-TOLERANCE PHASES
The four typical stages of fault-tolerance are as follows [16]:
1) Error Detection: In this stage, faults are first detected and then reported to determine the root cause of failure (observing failures).
2) Damage Confinement and Assessment: In this stage, the damaged or corrupted state of the system is assessed to determine the extent of the damage caused by faulty components.
3) Error Recovery: In this stage, recovery strategies are imposed to restore the system to a consistent and fault-free state. There are two different kinds of recovery techniques used:
• Backward Recovery: In this technique, system states are recorded and stored so that a corrupted state can be discarded and the system can be restored to the previous fault-free (correct) state.
• Forward Recovery: In this technique, the system is brought to a new correct fault-free state from the current corrupted state.
4) Fault Treatment and Service Continuation: In this stage, the location of faults is identified first, and then faults are either repaired or the system is reconfigured to avoid faults. Service continuation is essential to ensure that the system will perform its operation normally and without immediate manifestation of faults.

C. FAULT-TOLERANCE TECHNIQUES
Several fault-tolerance techniques are used to avoid service failure in the presence of faults [17]. Fault-tolerance is carried out through error detection and system recovery, or simply detection and recovery mechanisms. Error detection identifies the presence of an error, while "recovery transforms a system state that contains one or more errors and (possibly) faults into a state without detected errors and faults that can be activated again" [18]. Recovery techniques can be further classified into two main categories: i) recovery with error handling, which eliminates errors from the system state; and ii) recovery with fault handling, which prevents faults from being activated again. The choice of error detection and recovery techniques is made based upon the underlying fault assumption. In the context of SDN, these fault-tolerance techniques must be explored in order to enhance fault-tolerance in future SDN environments.

The taxonomy of fault-tolerance techniques can be seen in Fig. 4; Table 2 [11] summarizes the details of these fault-tolerance techniques.

FIGURE 4. Fault-tolerance techniques.
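The backward recovery technique described in the phases above can be illustrated with a small checkpointing sketch. This is an illustrative example only (the route-table state, the error condition, and all names are hypothetical), not code from any of the surveyed systems:

```python
import copy

class CheckpointedService:
    """Toy service illustrating backward recovery: known-good states are
    recorded so a corrupted state can be discarded and rolled back."""

    def __init__(self):
        self.state = {"routes": {}, "epoch": 0}
        self._checkpoints = []

    def checkpoint(self):
        # Record a deep copy of the current (assumed correct) state.
        self._checkpoints.append(copy.deepcopy(self.state))

    def apply_update(self, dst, next_hop):
        self.state["routes"][dst] = next_hop
        self.state["epoch"] += 1

    def detect_error(self):
        # Placeholder detector: here, a None next-hop marks a corrupted entry.
        return any(nh is None for nh in self.state["routes"].values())

    def backward_recover(self):
        # Discard the corrupted state and restore the last fault-free one.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint available")
        self.state = self._checkpoints.pop()

svc = CheckpointedService()
svc.apply_update("10.0.0.0/24", "s1")
svc.checkpoint()                       # last known-good state
svc.apply_update("10.0.1.0/24", None)  # a faulty update corrupts the state
if svc.detect_error():
    svc.backward_recover()
print(svc.state["routes"])             # {'10.0.0.0/24': 's1'}
```

A forward-recovery variant would instead repair the state in place (e.g., dropping the corrupted entry) rather than rolling back to a stored copy.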


TABLE 2. Details of Fault-tolerance techniques.

IV. FAULT-TOLERANCE IN SDN
In this section, we provide an overview of the SDN architecture and discuss SDN fault-tolerance based on the SDN architecture as divided into three main layers: data plane, control plane, and application plane.

A. SDN ARCHITECTURE OVERVIEW
SDN is a hot research topic, but there is increasing confusion regarding SDN concepts: architecture, multiple SDN networking planes, and the interaction between layers through interfaces. Briefly, we discuss the SDN architecture and the abstracted view of the SDN planes.

The SDN architecture is shown in Fig. 5, and comprises several abstraction layers (abstractions of well-defined planes), interfaces (standardized Application Programmable Interfaces (APIs) between planes), and well-defined planes (collections of functions and resources with the same functionality) [19].

The three distinct SDN planes are as follows:
1) Data Plane: The data plane (also known as the forwarding plane) is responsible for handling data packets sent by the end-user through network devices that are responsible for traffic forwarding (based on instructions received from the control plane). The Forwarding Information Base (FIB) is the forwarding table of a router, with the Medium Access Control (MAC) table playing the equivalent role in a switch. The FIB is used in the data plane to perform IP forwarding of unlabeled packets [20].
2) Control Plane: The control plane is responsible for deciding how packets must be handled and forwarded at network devices to properly cross the network. The primary purpose of the control plane is to synchronize and update forwarding tables, while packet handling policies reside in the forwarding plane.
3) Application Plane: The plane where applications and services that define network behavior reside. Applications that directly (or primarily) support the operation of the forwarding plane (such as routing processes within the control plane) are not considered part of the application plane.

B. CONTROLLER
In SDN architectures, the controller is a logically centralized entity. It is responsible for translating the SDN applications' requirements, via a Northbound API, down to the SDN data layer. Furthermore, it is also responsible for providing SDN applications with an abstracted view of the network (including statistics and events).

Networks in SDN are managed by an external controller that processes the flow of packets. This enables the programming of the network to be centrally controlled. Hence, the entire network and its devices can be managed efficiently regardless of the complexity of the underlying infrastructure. Moreover, SDN offers the flexibility, through programming, to separate the data and control planes with a logically centralized controller, and by this it is possible to modify packet forwarding as per the network needs [1]. The OpenFlow standard has been proposed to manage the communication flow between the controller and network entities [21].

It is important to understand that different multi-controller architectures can exist with SDN. Bilal et al. [22] describe different types of SDN architecture and existing implementations. They further classify multi-controller SDN architectures into two broad categories (logically centralized and logically distributed architectures) and discuss them in detail with implementation examples of such designs.

SDN controller fault-tolerance issues still exist and are addressed in current research, but solutions are still far from optimal. Later, in Section VI, we structure the SDN controller (centralized and distributed architecture) fault-tolerance research efforts.

C. SOUTHBOUND AND NORTHBOUND APIS
All the SDN networking planes are connected through specific interfaces that standardize and simplify intercommunication between them. Intercommunication between SDN networking planes can be achieved in two different ways, depending on the SDN architecture design and the location of network entities. On one hand, if they are placed in different locations, a network protocol is used to provide communication interaction between them. On the other hand, if the network entities reside inside the same physical or virtual location, a communication interaction between network
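As an illustration of the data plane's forwarding role, the FIB lookup can be sketched as a longest-prefix-match table. The prefixes, port names, and the `install`/`lookup` interface below are hypothetical; a table miss is shown returning None, where a real SDN switch might punt the packet to the controller:

```python
import ipaddress

class ForwardingTable:
    """Minimal FIB sketch: longest-prefix match as performed in the data
    plane, with entries installed by a (hypothetical) control plane."""

    def __init__(self):
        self._entries = []  # list of (network, out_port)

    def install(self, prefix, out_port):
        # Mimics a control-plane instruction installing a forwarding rule.
        self._entries.append((ipaddress.ip_network(prefix), out_port))

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [(net, port) for net, port in self._entries if addr in net]
        if not matches:
            return None  # table miss
        # Longest-prefix match: the most specific network wins.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

fib = ForwardingTable()
fib.install("10.0.0.0/8", "port1")
fib.install("10.1.0.0/16", "port2")
print(fib.lookup("10.1.2.3"))     # port2 (more specific prefix wins)
print(fib.lookup("10.9.9.9"))     # port1
print(fib.lookup("192.168.0.1"))  # None (table miss)
```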


FIGURE 5. SDN high-level architecture: SDN planes and communication interfaces.

entities is possible with APIs. This enables the flexibility to design and implement intercommunication between network entities either through network protocols and/or APIs.

The SDN architecture has two primary interfaces (which use either APIs and/or protocols), as depicted in Fig. 5 [23], to enable intercommunication between two different SDN planes: the Southbound and the Northbound. In SDN terminology they are often referred to as Southbound APIs and Northbound APIs.

The Southbound API is a communication interface between the data and control planes. Currently, OpenFlow is the de facto standard for this communication. Nevertheless, OpenFlow is not the only available protocol for the Southbound interface [24]. Other protocols and/or APIs for the Southbound interface are Forwarding and Control Element Separation (ForCES) [25], the Network Configuration Protocol (NETCONF) [26], and the Extensible Messaging and Presence Protocol (XMPP) [27], but they are more rarely used.

The Northbound API is a communication interface between the control plane and the application plane. Currently, there is no standardized Northbound API. Because of this, the development of network applications for SDN has not been accelerated [28]. Nevertheless, most implementations use a REpresentational State Transfer (REST)-based API because it is platform and language independent [29].

D. OPENFLOW FAULT-TOLERANCE SUPPORT IN SDN
This section discusses OpenFlow fault-tolerance from the point of view of the requirements of Carrier-grade networks. Carrier-grade networks usually provide fast recovery against service failure (i.e., recovery within 50 ms) [30]. If a service is not able to recover within this time, then service providers may be jeopardizing their business. OpenFlow was developed to support communication with non-proprietary FIBs. The OpenFlow protocol provides an abstraction of the FIB through the OpenFlow group table concept. Moreover, the OpenFlow protocol communicates with the controller, which can trigger modifications in packet forwarding rules. This makes the FIB programmable through OpenFlow.

In SDN networks, operations rely on the proper functioning of the controller. The control plane in SDN manages the control logic of switches, and this control logic is critical in SDN-based networks. This problem is minimized in the latest versions of the OpenFlow protocol by a master-slave configuration at the control layer, to increase resiliency. However, we argue that tight synchronization is required between the master and slave configurations to maintain an identical state (the slave holding the same copy of the master controller's state), and this causes extra overhead in networks and sophisticated network management demands. It is quite
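As a concrete sketch of how the group table abstraction can support fast data-plane failover, the fragment below mimics an OpenFlow fast-failover-style group: ordered buckets each watch a port, and the first bucket whose watched port is live forwards the packet, so recovery does not wait for the controller. This is a conceptual Python sketch rather than an OpenFlow implementation; the port numbers and actions are invented for illustration:

```python
class FastFailoverGroup:
    """Sketch of an OpenFlow fast-failover-style group: each bucket watches
    a port, and the first bucket whose watched port is up forwards the
    packet, enabling data-plane failover without controller involvement."""

    def __init__(self, buckets):
        # buckets: ordered list of (watch_port, action) pairs
        self.buckets = buckets
        self.port_up = {port: True for port, _ in buckets}

    def set_port_status(self, port, up):
        # Mirrors the switch noticing a port going down (e.g., loss of signal).
        self.port_up[port] = up

    def apply(self, packet):
        for watch_port, action in self.buckets:
            if self.port_up[watch_port]:
                return action(packet)
        return None  # all watched ports down: drop (or punt to controller)

group = FastFailoverGroup([
    (1, lambda pkt: f"{pkt} out port 1"),   # primary path
    (2, lambda pkt: f"{pkt} out port 2"),   # pre-installed backup path
])
print(group.apply("pkt"))        # pkt out port 1
group.set_port_status(1, False)  # primary link fails
print(group.apply("pkt"))        # pkt out port 2
```

Because the backup bucket is installed in advance, the switchover cost is a local liveness check, which is what makes this mechanism attractive for Carrier-grade recovery targets.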


challenging to reach a recovery time equivalent to the standards set by Carrier-grade networks; therefore, in order to enhance OpenFlow fault-tolerance support, mechanisms that not only maintain a persistent controller state but also provide efficient recovery in case of controller fail-over must be developed. Another research challenge is the optimization of recovery time as per the Carrier-grade requirement, as well as scalability. Indeed, Carrier-grade networks are a network of networks, and scalability is a critical aspect. This excludes any solution that is not scalable, as such a solution is not of interest for such Carriers.

E. SDN DATA PLANE FAULT-TOLERANCE SUPPORT
SDN data plane fault-tolerance is related to issues already present in traditional architectures (e.g., Multiprotocol Label Switching technology). Due to the static nature of traditional networks, these approaches can achieve good performance upon link and node failures. However, failure detection and recovery approaches in dynamic networks such as SDN must be re-designed to adapt to the dynamics of rapidly changing networks. Traditionally, reactive and proactive approaches were used to provide fault-tolerance [31]. In the reactive approach, an alternative path is calculated after the fault becomes active. In proactive techniques, the resources and backup paths are pre-programmed before the occurrence of a fault (when a fault is dormant); if the fault becomes active, the pre-programmed logic starts to defend immediately and recover the system from faults. In this section, we address such failure detection and recovery approaches.

1) FAILURE DETECTION APPROACHES
The high availability of the data plane plays an important role in maintaining the required communication from source to destination. To achieve high resiliency in the data plane, two steps are required: first, to design and analyze the topology in the presence of known and unknown failures; and, second, to design an alternative path according to the type of failures that occur in the network. In Carrier-grade networks, two well-known mechanisms exist to detect failures in the data plane, namely Loss of Signal (LOS) and Bidirectional Forwarding Detection (BFD) [32]. LOS detects failures on a specific port of a forwarding device, while BFD can detect path failure between any two forwarding devices. Both methods provide failure detection at an accelerated rate, independent of the media type and routing protocols (such as Open Shortest Path First (OSPF) and the Enhanced Interior Gateway Routing Protocol (EIGRP)).

2) FAILURE RECOVERY APPROACHES
In Carrier-grade networks the recovery mechanism must guarantee the recovery process within 50 ms [33]. For this purpose, restoration and protection are widely used to recover from network service failures; these methods are based on reactive and proactive approaches, respectively. Restoration is classified as a reactive technology, while protection is classified as a proactive technology. In restoration, an alternative path is only established after the occurrence of a failure: resources are not reserved before the occurrence of the failure, and the paths are pre-assigned or allocated dynamically. In the case of protection, by contrast, the alternative paths are already reserved and assigned before the occurrence of a failure, which requires no added processing (signaling) to recover from failure. In restoration, additional signaling is needed to recover from failure; in large networks, this is often not possible within the set requirement of Carrier-grade networks, and thus it is not scalable. In protection, additional signaling is not required, and the recovery process is fast when compared to restoration, with recovery possible within 50 ms and thus suitable for Carrier-grade networks.

F. SDN CONTROL PLANE FAULT-TOLERANCE SUPPORT
Control plane resilience is a requirement for the proper operation of networks: the controller is vital, which means that the controller must be able to process all required traffic commands in all situations. There are several approaches to enhance SDN control plane fault-tolerance. The first approach is to replicate the controller on a different control network; in the case of failure, the replicated controller takes over and manages traffic. In another approach, the controller is embedded with mechanisms (built-in modules) to self-heal from targeted attacks such as Denial of Service (DoS), flooding, fake traffic routing, and other network-related targeted attacks. However, the control plane's time to recover from such attacks is critical, and ideally, recovery mechanisms must be developed to mitigate failures within the set network requirements. In addition, the recovery process must be efficient and must be able to self-heal during a failure event with minimum overhead. In-band and out-of-band signaling solutions have been adopted to offer SDN control plane reliability [34]. In practice, most SDN deployments use out-of-band control, where control packets are managed by a dedicated management network [35].

G. SDN APPLICATION PLANE FAULT-TOLERANCE SUPPORT
In an SDN network, the application plane is the layer that hosts applications and services that make requests for network functions provided by the control plane and the data plane. In traditional networks, security, management, and monitoring devices or applications reside in this layer.

The application layer allows business applications to modify and influence the way the network behaves in order to provide services to customers. This requires the definition of an API to allow third-party developers to build and sell network applications to the network operator. The development of such an API has not yet been properly addressed by the Open Networking Foundation (ONF) but is required in order to guarantee interoperability between a business application and network controllers from different suppliers.

Existing SDN programming languages offer several features such as flow installation, policy definition, programming paradigms, and abstractions for developing and enabling network and application fault-tolerance in SDN [36]–[39].
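The BFD-style detection and protection-style recovery discussed in this section can be sketched together: a detector declares the path down after a number of consecutive missed hello intervals, after which traffic moves to a backup path that was reserved in advance, with no extra signaling. The interval and multiplier values below are illustrative only, not the 50 ms Carrier-grade figures themselves, and the class and path names are invented:

```python
import time

class HeartbeatDetector:
    """BFD-style liveness sketch: the path is declared down after
    detect_mult consecutive hello intervals pass with no hello received,
    independent of any routing protocol running on top."""

    def __init__(self, interval_s=0.01, detect_mult=3):
        self.interval_s = interval_s    # hello transmit interval
        self.detect_mult = detect_mult  # missed intervals before "down"
        self.last_seen = time.monotonic()

    def on_hello(self):
        # Called whenever a hello/control packet arrives from the peer.
        self.last_seen = time.monotonic()

    def path_up(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_seen) < self.interval_s * self.detect_mult

# Protection-style recovery: the backup path is reserved in advance,
# so switchover needs no extra signaling once the detector fires.
primary, backup = "path-A", "path-B"
det = HeartbeatDetector(interval_s=0.01, detect_mult=3)
t0 = det.last_seen
print(det.path_up(now=t0 + 0.02))  # True: fewer than 3 intervals missed
print(det.path_up(now=t0 + 0.05))  # False: 3+ intervals missed
active = primary if det.path_up(now=t0 + 0.05) else backup
print(active)                      # path-B
```

With restoration, the `backup` path would instead be computed and signaled only after `path_up()` returns False, which is precisely the extra delay that makes restoration harder to fit within Carrier-grade recovery targets.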


V. SDN ARCHITECTURE FAULT-TOLERANCE ISSUES
In this section, we first highlight SDN fault-tolerance issues and then present state-of-the-art research efforts focusing on such fault-tolerance issues in SDN. Furthermore, we structure them based on the three main layers of the SDN architecture. These are later summarized in Table 3, Table 4, and Table 5.

A. DATA PLANE ISSUES
There are two main data plane layer issues, namely network failure detection and recovery. These issues arise due either to link or to node failure. As discussed, in traditional networks, particular mechanisms, such as LOS and BFD, are used to detect network failures [40]. Also, to recover from network failure, restoration and protection approaches are widely used. However, resolving these issues in the SDN environment is challenging due to the centralized nature of the controller. For instance, in an SDN-based environment, the controller can take a longer time to detect and recover from link or node failure due to the rapidly changing abstracted view of the network (dynamic topology). Therefore, there is a need to develop mechanisms for SDN that can provide faster recovery [40].

B. CONTROL PLANE ISSUES
There are multiple SDN control plane issues. The three main issues that are critical to SDN control plane fault-tolerance can be classified as follows:

1) CONTROLLER CHANNEL RELIABILITY
In SDN, the controller's communications with the underlying devices are critical. Therefore, their availability is a necessary condition to protect the proper operation of a network. The controller channel must be fault-tolerant (reliable) in case of failure due to loss of switch connections, or of errors due to the communication protocols between the controller and the underlying devices. These issues can disrupt the network and lead to several failures in the SDN network. In order to cope with these issues, controller redundancy [41], [42] and path backup are considered essential.

2) CONTROLLER PLACEMENT AND ASSIGNMENT
Controller placement (how to choose the location of controllers) and assignment (how to assign controllers to switches) are two significant issues [43], generally known as the controller placement problem. To deal with the controller placement issue, one of the strategies is to develop algorithms that can provide optimal controller placement in dynamic SDN-based networks, which is in itself also challenging [45].

3) INTER-CONTROLLER CONSISTENCY
In order to avoid a single point of failure in SDN networks, multi-controller architecture approaches are pursued (either physically centralized and logically distributed, or fully distributed with the coordination of different SDN controllers) [46]. It is important to note that these practices increase resiliency, but there is a strict requirement for controller consistency [47]. The level of consistency depends on stateful or stateless backup settings. The controller must maintain a persistent state to guarantee controller consistency.

4) MULTI-CONTROLLER ARCHITECTURE FAIL-OVER
In SDN, multi-controller architectures can follow flat/horizontal or hierarchical/vertical designs. On the one hand, in a flat architecture, the control plane has just one layer, and each controller has the same responsibilities [22]. The advantage of such an architecture is that it provides more resilience against failure, but the task of managing controllers is difficult. On the other hand, in a hierarchical architecture, the control plane has multiple layers, and each controller has different responsibilities (due to multiple levels of partitioning). The advantage of such a design is that it provides a more straightforward way to manage controllers. Both of these multi-controller architecture approaches can be used to improve switch-to-controller latency. In both designs, it is important to consider that controllers must respond to any fail-over [48] request efficiently and without affecting performance.

C. APPLICATION PLANE ISSUES
SDN enables programmability to control network devices more efficiently, but this is highly dependent on the quality of software development. In order to develop reliable SDN applications, debugging (the process followed to fix bugs) and testing (verification) tools can not only advance software
the controller placement problem [44]. (continuous development process) [49]. To ensure the quality
The controller’s assignment issue (balance of controllers) of software network troubleshooting, debugging and testing
in SDN is important, not only from the point of view of are consider essential [37].
fault-tolerant controller design but also from the point of view Network visualization, network provisioning, and applica-
of network optimization. Improper controller assignment can tion monitoring can be conceptualized as an SDN application
lead to two main problems: i) under-provisioning: When a layer. For this reason, fault-tolerance of both network and
small number of controllers are placed to handle more traffic applications can be supported at the application plane. More-
than its capacity of processing. In this case, the controller over, in order to develop fault-tolerant network applications,
is overloaded and possibly increases downtime and affects all the phases from application design to final application
network performance, and 2) Over-provisioning: When more deployments must undergo proper testing. Currently, there
than the required controllers are placed to handle comparably are certain languages proposed that enable the construction
low traffic environment. In this case, costly controllers are of fault-tolerant programs to write SDN-based fault-tolerant
underutilized. systems. Since fault-tolerance of both network and applica-
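The under- and over-provisioning trade-off in controller placement and assignment can be made concrete with a toy exhaustive search. The topology, latency values, and function names below are our own illustration (not from any cited work): we pick the k controller sites that minimize the worst-case switch-to-controller latency.

```python
from itertools import combinations

# Toy topology: link latencies (ms) between switches s0..s4.
LINKS = {("s0", "s1"): 2, ("s1", "s2"): 3, ("s2", "s3"): 2,
         ("s3", "s4"): 4, ("s0", "s4"): 10}

def all_pairs_latency(links):
    """Floyd-Warshall over the undirected latency graph."""
    nodes = sorted({n for edge in links for n in edge})
    inf = float("inf")
    d = {(a, b): 0 if a == b else inf for a in nodes for b in nodes}
    for (a, b), w in links.items():
        d[a, b] = d[b, a] = w
    for k in nodes:
        for i in nodes:
            for j in nodes:
                d[i, j] = min(d[i, j], d[i, k] + d[k, j])
    return nodes, d

def place_controllers(links, k):
    """Choose k controller sites minimizing the worst-case latency from
    any switch to its nearest controller (brute force; fine at toy sizes)."""
    nodes, d = all_pairs_latency(links)
    best = min(
        combinations(nodes, k),
        key=lambda sites: max(min(d[s, c] for c in sites) for s in nodes),
    )
    worst = max(min(d[s, c] for c in best) for s in nodes)
    return set(best), worst
```

Real placement formulations add controller capacity, reliability, and inter-controller latency constraints, and rely on heuristics because the exhaustive search grows combinatorially with network size.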

VOLUME 7, 2019 124481


A. U. Rehman et al.: Fault-Tolerance in the Scope of SDN
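As a minimal illustration of the timeout-based failure detection discussed under the data plane issues above, the sketch below declares a link down after a number of consecutive missed hellos. This is our own simplification of the idea behind BFD, not an implementation of the protocol (RFC 5880); the function and parameter names are invented.

```python
# BFD-flavoured liveness check: a link is declared "down" once
# `detect_mult` consecutive expected hellos have been missed.
def link_state(hello_times, now, interval=0.05, detect_mult=3):
    """hello_times: timestamps of hellos received on the link.
    Returns 'up' or 'down' as seen at time `now` (all values in seconds)."""
    last = max(hello_times, default=float("-inf"))
    missed = (now - last) / interval  # expected hellos missed since `last`
    return "down" if missed >= detect_mult else "up"
```

Note the detection-time arithmetic this implies: failures are noticed only after roughly interval x detect_mult, so a Carrier-grade 50 ms recovery budget forces that product well below 50 ms.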

Since fault-tolerance of both the network and its applications can be supported at the application plane, the two main SDN application layer issues are as follows:

1) SOFTWARE TESTING
The network behavior in an SDN-based network is controlled by a set of software programs. For proper network troubleshooting support, SDN applications must be resilient [50]. A resilient design helps identify the root cause of a bug, so that the administrator can track and isolate faults and restore the system to a correct operating state.

2) POLICIES CONFIGURATION
In SDN, network management becomes more dependent on software development due to programmability. There is a risk that policies across the network can be violated due to untested errors (bugs) in an application, which can propagate to SDN controllers, protocols, and routing policies, and eventually lead to network service failure. Therefore, constant application monitoring is essential to avoid any violation of network policies [51].

VI. SDN FAULT-TOLERANCE RESEARCH EFFORTS
SDN offers greater flexibility and network automation when compared with traditional distributed systems, at the risk of the controller being a single point of failure.

Most of the research carried out has focused on exploring these technologies rather than evaluating the associated reliability aspects. In recent years, a shift has been made towards the evaluation of SDN fault-tolerance. We reviewed the research studies that address SDN data, control, and application plane fault-tolerance. The details are summarized in Table 3, Table 4, and Table 5. Furthermore, a classification of state-of-the-art SDN research efforts according to the SDN principal planes and controller architecture is depicted in Fig. 6.

A. DATA AND CONTROL PLANES
In this section, we discuss the main research efforts that have addressed data and control plane fault-tolerance in the context of SDN.

Current fault-tolerance techniques are not yet proven to meet the Carrier-grade fault-tolerance requirement (50 ms recovery time) [30]. A research study carried out by Sharma et al. [33] provided experimental evidence that protection provides faster recovery than restoration and is thus more suitable to guarantee resilience in large, scalable networks [52], [53].

Adrichem et al. [40] argued that time is a critical metric in the recovery process during network failures, and that it is still difficult to develop mechanisms that guarantee efficient recovery. In their research study, they demonstrated that current failure recovery approaches (restoration and protection) suffered from long delays. They introduced a failover scheme based on a per-link BFD approach and showed that its implementation reduced recovery time. They performed experiments to evaluate different network topologies and showed that recovery time was consistent irrespective of network size.

Mohan et al. [54] carried out a research study to provide fault-tolerance in the specific case of Ternary Content Addressable Memory (TCAM)-limited SDN. They argue that proactive fault-tolerance policies provide faster failure recovery based on pre-installed re-routing paths. This requires large sets of forwarding rules to be installed in the TCAM, but TCAM memory is limited. Based on these challenges, they developed an optimized programming formulation that determines the set of backup paths to protect a flow and minimizes the number of forwarding rules for the backup paths to be installed in the switch TCAM, meaning that fewer rules are required for backup paths. They proposed two algorithms, Backward Local Rerouting (BLR) and Forward Local Rerouting (FLR) [54], to improve TCAM and bandwidth usage efficiency for single link failures in SDN systems.

Li et al. [55] carried out research studies to enhance failure recovery in SDN with customized control. They developed a Declarative Failure Recovery System (DFRS) based on three algorithms: backup path construction, add, and subtract. The backup path construction algorithm creates safe backup paths based on the recovery demands; the add and subtract algorithms then find the minimum number of paths to be allocated to guarantee network services during failure with minimum memory overhead [55]. Three different topologies were evaluated to test the effectiveness and scalability of DFRS. They achieved performance similar to the traditional failure protection algorithm, but with 5% fewer backup rules. In the event of failure, many switches are allocated hundreds of forwarding rules for backup; this burdens the switch, affects performance, and delays failure recovery. The authors argue that the DFRS system only allocates dozens of forwarding rules to switches, compared to the usual hundreds, leading to effective memory utilization and improved stability.

Kuźniar et al. [56] proposed Automatic Failure Recovery (AFRO) for SDN, an automated runtime system that recovers from system failures in OpenFlow systems. They argue that they extend the basic functionality of the controller program with additional controller-agnostic modules that provide efficient recovery.

Kim et al. [57] proposed the SDN fault-tolerant system CORONET, and argue that their proposed system provides recovery against multipath failures in the data plane. However, since the initial published work in 2012, no significant contribution has been made, although many evolutions have appeared in SDN architectures and protocols.

Schiff et al. [58] presented a model to design self-stabilizing distributed control planes for SDN and argue that their proposed technique provides a mechanism to deal with key challenges of a distributed system, such as bootstrapping and in-band control. Further, they implemented a plug-and-play SDN distributed control plane to support automatic topology discovery and management in dynamic networks.
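The protection-versus-restoration distinction that runs through these studies can be sketched in a few lines. In the toy model below (our own illustration, not code from any of the surveyed systems), restoration recomputes a path only after a failure is detected, whereas protection switches immediately to a link-disjoint backup computed in advance:

```python
from collections import deque

# Toy topology as an adjacency map.
TOPO = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "e"],
        "d": ["a", "e"], "e": ["c", "d"]}

def bfs_path(topo, src, dst, banned=frozenset()):
    """Shortest path by BFS, skipping any link in `banned`."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in topo[path[-1]]:
            link = frozenset((path[-1], nxt))
            if nxt not in seen and link not in banned:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def links_of(path):
    return {frozenset(pair) for pair in zip(path, path[1:])}

# Protection: primary and a link-disjoint backup are computed BEFORE failure.
primary = bfs_path(TOPO, "a", "e")
backup = bfs_path(TOPO, "a", "e", banned=links_of(primary))

def recover(failed_link, mode):
    """On failure, protection just switches paths; restoration recomputes."""
    if failed_link not in links_of(primary):
        return primary
    if mode == "protection":
        return backup  # no path computation at failure time
    return bfs_path(TOPO, "a", "e", banned={failed_link})  # restoration
```

The latency difference the surveyed studies measure comes precisely from the work done inside `recover`: protection performs none at failure time, while restoration must run a new computation (and, in a real network, install the resulting rules) before traffic flows again.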


TABLE 3. Selected work on SDN data and control plane Fault-tolerance efforts.

FIGURE 6. SDN state-of-the-art research efforts: Principal Planes and controller architecture.

However, it is important to note that the self-stabilizing distributed plane is still at a very early stage, and much more effort is needed to step forward to a proof-of-concept stage. The authors also affirm that a feasibility study is needed to further validate their proposed model of the self-stabilizing SDN control plane. Formal proofs are required for this plug-and-play distributed model to be shown effective in meeting fault-tolerance requirements for SDN and future networks.

Chen et al. [59] proposed a method of protection-based recovery in SDN using Virtual Local Area Network (VLAN) tags.


They argue that their proposed method provides faster recovery with low memory usage, without the participation of the controller in switching to backup paths. In their system, protection takes 20 ms, while recovery on average takes 50 ms to restore from failures. Similarly, Thorat et al. [61] proposed a proactive policy to achieve fast failure recovery using VLAN tags and claim that a 99% reduction in flow storage is achieved, as well as failure recovery as fast as that set for Carrier-grade networks.

Jain et al. [60] carried out a study to address network outages and failures. They evaluated three years of production experience with B4, their own SDN-enabled Wide Area Network (WAN) that connects Google's data centers. They implemented fault-tolerance policies such as customized forwarding, dynamic relocation of bandwidth, and alternative link recovery using OpenFlow. Generally, control plane protection is achieved through resource replication, with replicas placed on different physical servers. They observed that the SDN-enabled WAN served more traffic than the public WAN and offered cost-effective bandwidth and nearly 100% link utilization, enabling high availability of resources. However, they admit that bottlenecks exist in the bridging protocol from the control plane to the data plane and need to be optimized to further improve performance. Improving this will offer superior fault-tolerance in future SDN-based networks.

B. CONTROLLER ARCHITECTURE
In this section, we discuss key research efforts that have addressed controller architecture fault-tolerance in the context of SDN.

Katta et al. [62] studied the fault-tolerance of the controller under crash failures. They argued that to offer a logically centralized controller, it is necessary to maintain a consistent controller state and to keep switch state consistent during controller failure. Therefore, they proposed Ravana, an SDN-based fault-tolerant protocol that provides an abstraction of the logically centralized controller. Ravana handles the entire event-processing cycle and ensures total event ordering across the entire system. This enables Ravana to correctly handle switch state and replicas without the need to resort to rollbacks. Moreover, it mitigates control messages during controller failures, which helps in extending the control channel interface. Ravana provides reliable distributed control for SDN. However, it does not provide support for richer fault models such as Byzantine failures, it is limited to multithreaded control applications, and scalability is one of the aspects not evaluated in the Ravana protocol.

Botelho et al. [63], [64] carried out research studies and implemented a prototype that integrates a Floodlight-based distributed controller architecture with BFT-SMaRt (Byzantine Fault-Tolerant (BFT) and State Machine Replication (SMR)), a replicated state machine library. This enables consistency between an SDN controller and its redundant backups stored in a shared database. In their work, three SDN applications (learning switch, load balancer, and device manager), with slight modifications, were tested to analyze the workloads these applications generate and to measure performance. The results of their study show that the data store is capable of handling large workloads, but maintaining strong data consistency increased latency, which impacted performance. Thus, although the solution does not appear scalable, they argue that an acceptable level of fault-tolerance was easy to achieve. Moreover, the authors also proposed a practical fault-tolerant SDN controller design for small and medium networks, in which a replicated shared database saves all network state. This database is created using a Replicated State Machine (RSM), and in their previous research studies they argue that the database meets the performance requirements for small and medium networks. They incorporate a cache in the controller that handles failures smoothly without any additional coordination service.

A master and slave controller configuration was implemented by Fonseca et al. [65], in which control plane resilience is provided by integrating a Control Plane Recovery (CPR) module into a standard OpenFlow controller built upon the NOX OpenFlow controller. CPR is a two-phase process, consisting of replication and recovery, and offers resilience against several types of failure in SDN-enabled, centrally controlled networks. Similarly, the research studies carried out by Tootoonchian and Ganjali [67] introduce HyperFlow to provide control plane resilience. HyperFlow is a distributed event-based control plane, which is physically distributed but logically centralized. This enables scalability while preserving the benefits of centralized network control. They argue that HyperFlow [70] offers a scalable solution for control plane resilience in SDN-enabled networks.
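The replicated-state-machine idea behind BFT-SMaRt-style designs can be illustrated minimally: if every replica deterministically applies the same commands in the same total order, all replicas hold identical state, so a backup controller can take over after a crash without losing network state. The sketch below is our own simplification (no consensus protocol and no real controller API; the command format is invented):

```python
class ControllerReplica:
    """Deterministic state machine: state is a switch -> flow-rules mapping."""
    def __init__(self):
        self.flow_tables = {}

    def apply(self, cmd):
        op, switch, rule = cmd
        table = self.flow_tables.setdefault(switch, [])
        if op == "add" and rule not in table:
            table.append(rule)
        elif op == "del" and rule in table:
            table.remove(rule)

# A totally ordered command log (what a consensus/SMR layer would provide).
LOG = [("add", "s1", "dst=10.0.0.2 -> port2"),
       ("add", "s2", "dst=10.0.0.2 -> port1"),
       ("del", "s1", "dst=10.0.0.2 -> port2"),
       ("add", "s1", "dst=10.0.0.2 -> port3")]

def replay(log):
    replica = ControllerReplica()
    for cmd in log:
        replica.apply(cmd)
    return replica

# Master and backup apply the same ordered log, so their state is identical
# and the backup can take over transparently if the master crashes.
master, backup = replay(LOG), replay(LOG)
assert master.flow_tables == backup.flow_tables
```

The hard part that the surveyed systems actually solve is producing that totally ordered log under failures (and, for BFT designs, under malicious replicas), which is what libraries such as BFT-SMaRt provide.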

TABLE 4. Selected work on SDN Controller Fault-tolerance Efforts.


Aly and Kotb [66] used an SDN-centralized architecture in which a master controller is connected to a set of slave controllers. Based on this set-up, they proposed a new Petri-net-based mathematical framework for SDN fault-tolerance and named the model FTPNSDN. They claim that, in order to avoid service disruption, Petri-net capability functions were used to identify the next backup controller in the event of controller failure. They also showed that the transition time needed to take over another controller was reduced by 10%. They evaluated the performance of their proposed model, comparing it with the HyperFlow reference model, and claim that they were able to reduce packet delay by 12%. Clearly, a single controller point of failure limits scalability, and we argue that the several recent research studies carried out do not yet provide a mechanism to achieve high-level performance or fault-tolerance at scale in SDN-based networks.

To deal with the challenge of SDN controller consistency and performance, Gonzalez et al. [68] proposed a method to improve consistency and performance using some of the approaches from the recent study carried out by Katta et al. [62]. They designed a mechanism to provide better consistency and performance in a master-slave SDN configuration, considering performance metrics for the SDN controller based on controller latency and throughput. Their proposed solution provides consistency and performance close to that offered by a single SDN controller. However, they emphasize that a very reliable communication channel between the master controller and the data store is a must.

ElDefrawy and Kaczmarek [69] proposed a fault-tolerant SDN controller design that tolerates Byzantine faults. However, their controller design has not yet achieved high-level performance for large-scale deployments. Further, they argued that their controller design is feasible for constructing resilient networks. In this research study, they designed and prototyped a Byzantine-fault-tolerant distributed SDN controller to tolerate malicious faults both in the control and in the data plane, as described in Kreutz et al. [2]. Further, they integrated two existing Byzantine-vulnerable SDN controllers with BFT-SMaRt, a tool for creating Byzantine fault-tolerant systems [71].

C. APPLICATION PLANE
In this section, we discuss key research efforts that have addressed application plane fault-tolerance in the context of SDN.

SDN offers the flexibility of network programmability, but this raises issues of software-based troubleshooting and debugging that need to be addressed, as discussed.

Heller et al. [72] proposed a structured troubleshooting approach that exploits the SDN layered architecture. They aim to develop a tool that would identify bugs by systematically tracking the root cause of detected failures. This would save diagnosis time and enable the network administrator to directly fix the problems. However, they have not proposed any system or framework. In a similar way, Scott et al. [73] also studied SDN troubleshooting and proposed the SDN Troubleshooting System (STS). This system aims to optimize debugging time by filtering events not correlated with the source of failure. They demonstrated the feasibility of their proposed system and tested five open-source SDN control platforms: ONOS (Java) [78], POX (Python) [79], NOX (C++) [80], Pyretic (Python) [81], and Floodlight (Java) [82]. They were able to identify seven new bugs in real time and debugged them using their proposed STS system, showing that STS improves the time-consuming process of debugging in SDN. Likewise, Canini et al. [74] built NICE, a troubleshooting tool for SDN. The state-space of the entire SDN system is explored through model checking, which provides a systematic way to test unmodified controller programs. The tool automates the testing of OpenFlow applications based on model checking combined with symbolic execution.

Reitblatt et al. [75] proposed FatTire, a high-level declarative language for writing fault-tolerant network programs in SDN. This high-level language aims to provide policy-based network management in which SDN programmers can construct specific policies (for instance, data security and customized forwarding). Earlier work by Liu et al. [83] emphasizes that connectivity must be realized as a data plane service; this work fits together with FatTire for implementing policy abstractions. Similarly, a study by Suchara et al. [84] on integrating fault-tolerance and traffic engineering could possibly be used with FatTire. Likewise, the Flow-based Management Language (FML) [85] specifies policies using a declarative language to enforce them within the enterprise, for instance Access Control Lists (ACLs), Virtual Local Area Networks (VLANs), and policy-based routing. This differs from FatTire in that it does not provide fault-tolerance policies. Similarly, Kazemian et al. [51] introduced NetPlumber, a real-time tool for policy checking based on Header Space Analysis (HSA).
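The kind of check such policy-checking tools perform can be caricatured on a single concrete header (real HSA reasons symbolically over sets of headers, which is what makes it powerful; the topology and rule format below are invented for illustration):

```python
# Each switch maps a destination header to an action: forward to a next
# hop or deliver locally. Drastically simplified: one concrete header.
RULES = {
    "s1": {"10.0.0.2": ("fwd", "s2")},
    "s2": {"10.0.0.2": ("deliver", "h2")},
}

def trace(rules, start, dst, max_hops=8):
    """Follow forwarding rules for header `dst` from switch `start`."""
    node, hops = start, []
    for _ in range(max_hops):  # bound the walk so forwarding loops terminate
        action = rules.get(node, {}).get(dst)
        if action is None:
            return hops, "drop"  # no matching rule installed
        kind, nxt = action
        hops.append((node, kind, nxt))
        if kind == "deliver":
            return hops, "delivered"
        node = nxt
    return hops, "loop"

def check_policy(rules, start, dst):
    """Policy: traffic to `dst` injected at `start` must be delivered."""
    return trace(rules, start, dst)[1] == "delivered"
```

A checker in this style re-runs the verification on every rule update, flagging black holes ("drop") and forwarding loops the moment an application installs an inconsistent rule.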

TABLE 5. Selected work on SDN application plane Fault-tolerance efforts.


The authors argue that they applied this tool to Google's SDN and the Stanford backbone and found that 50-500 µs on average were required to check a rule update against a single policy.

Chandrasekaran et al. [76], [77] claim to have developed a fault-tolerant SDN controller framework called LegoSDN. The authors aim to achieve recovery of SDN applications from both deterministic and non-deterministic service failures. Extending this work further, the authors developed a prototype that isolates SDN applications from one another, as well as from the controller, by running each application securely in a sandbox. Thus, all failures are restricted to the application's virtual isolated space.

VII. SDN FAULT-TOLERANCE: CHALLENGES AND FUTURE RESEARCH DIRECTIONS
In this section, we outline future directions for SDN fault-tolerance development from the perspective of its role in future intelligent programmable networks. The term programmability refers to controlling a set of actions and rules, or enforcing a policy, intelligently through software. Programmability makes it possible to utilize multi-vendor hardware/devices with enhanced flexibility. Moreover, it enables customized scripting through programming languages, facilitating network administrators in enforcing policy-based configuration on network devices/functions through APIs. Therefore, programmability is a pre-requisite for enabling network automation (a practice in which software automatically configures and tests network devices) in a communications network [86].

A programmable network is flexible and re-configurable because most of the protocol stacks are implemented in software. Therefore, network upgrades to replace or configure network protocols are possible without interrupting network operations [87], [88].

The term network softwarization refers to the "networking industry transformation for designing, deploying, implementing and maintaining network devices/network elements through software programming" [89].

A. DATA PLANE PROGRAMMABILITY FOR NETWORK SOFTWARIZATION
A few research studies included in Table 3 focused on SDN data plane fault-tolerance using traditional failure detection approaches (i.e., BFD and LOS) and recovery approaches (i.e., restoration and protection) that were able to ensure Carrier-grade reliability. However, with the emergence of new data plane specifications such as Programming Protocol-Independent Packet Processors (P4) [90] and Protocol-Oblivious Forwarding (POF) [91], new paths have opened up toward the development of novel strategies and standards to support fault-tolerance. Data plane programmability in SDN is the next step towards supporting the fast-growing trend of network programmability [92] and network softwarization (softwarization of future networks and services) [93]. Traditionally, the network data plane was designed to be configurable but with fixed forwarding logic (packet processing with pre-defined logic). A programmable SDN data plane, in contrast, should provide the flexibility to modify the forwarding logic (customized packet processing). Concerning the SDN data plane, there are still error detection and recovery issues that require careful consideration. For instance, current probe-based testing solutions take a long time to generate probe packets [94], [95], making consistency between control plane policies and data plane forwarding behaviors difficult to verify. Furthermore, additional new pipelines are required in the switch data path for collecting traffic statistics [96]; this process itself can cause errors.

Due to these challenges, the idea of data plane programmability has attracted significant interest from both academia and industry [97]. Recent research studies have addressed SDN data plane programmability, and new data plane specifications (e.g., P4 and POF) have evolved that extend the features of SDN beyond the OpenFlow specifications [98]. These new data plane specifications can optimize fault-management in SDN and thus improve the SDN architecture in its fault-tolerance and reliability aspects. We believe that data plane programmability is an important area for future SDN development.

B. CONTROLLER ARCHITECTURE FOR MISSION-CRITICAL COMMUNICATIONS
The controller fault-tolerance research studies included in Table 4 focused on designing fault-tolerant SDN controllers for scenarios where parameters such as throughput, packet loss, latency, jitter, and redundancy are more flexible than in mission-critical communications (industrial networks and intelligent systems) [99], [100], where these parameters have more stringent demands. Mission-critical applications are common in different sectors, including military, hospital, automotive safety, and air-traffic control systems [101]. Unfortunately, the research and development of fault-tolerant SDN controllers for mission-critical applications has been overlooked, and even the SDN fault-tolerant controller research efforts outside mission-critical communications are not yet fully developed. Scalability, performance, and data consistency in SDN multi-controller architectures are still areas of intense investigation [70], [102]. There is a need to develop fault-tolerant SDN control networks for mission-critical applications; designing SDN controllers for such applications is of significant importance and quite challenging, hence we believe this topic should be addressed comprehensively in future research.

C. SOFTWARE TOOLS FOR SDN APPLICATIONS DEVELOPMENT
The SDN fault-tolerance research studies included in Table 5 focused on developing software tools for troubleshooting, writing fault-tolerant programs, and detecting network policy violations in the application plane.
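The crash-isolation idea behind sandboxed frameworks such as LegoSDN can be sketched in miniature. The code below is our own simplification (LegoSDN's actual mechanism is far more involved; the app names and event format are invented): each application's event handler runs behind a fault boundary, so one buggy application cannot take down the controller or its sibling applications.

```python
# Run each SDN app's event handler in isolation so a fault in one app
# is contained instead of crashing the shared controller process.
def dispatch(apps, event):
    results = {}
    for name, handler in apps.items():
        try:
            results[name] = ("ok", handler(event))
        except Exception as exc:  # fault boundary: contain the failing app
            results[name] = ("failed", type(exc).__name__)
    return results

def learning_switch(event):
    return f"install rule for {event['src']}"

def buggy_monitor(event):
    raise KeyError("stats")  # simulated deterministic application bug

APPS = {"switch": learning_switch, "monitor": buggy_monitor}
```

Catching exceptions is only the first layer; the surveyed frameworks additionally roll back or replay the failed application's state so the network is not left half-configured by the crashed handler.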


and high overhead for recovery [9]. Due to the diversity MAC Medium Access Control
of network protocols for SDN Southbound and Northbound NETCONF Network Configuration Protocol
APIs, and underway standardization of these diverse proto- ONF Open Network Foundation
cols, the new SDN applications development has not been OSPF Open Shortest Path First
accelerated. Hence, the developed software tools have not P4 Programming Protocol Independent Packet
been comprehensively tested and developed to support the Processors
diversity of network protocols used in SDN networks. There POF Protocol-Oblivious Forwarding
is a need to develop improved software tools in order to enable REST REpresentaional State Transfer
application plane fault-tolerance in future SDN deployments. RSM Replicated State Machine
SDN Software-defined Networking
SMR State Machine Replication
VIII. CONCLUSION
STS SDN Troubleshooting System
This work presents a survey on fault-tolerance in the
TCAM Ternary Content Addressable Memory
scope of SDN. Also, we provided a simple background on
VLAN Virtual Local Area Network
fault-tolerance and related concepts to develop a complete
WAN Wide Area Network
understanding of the topic. Our goal was to identify SDN
XMPP Extensible Messaging and Presence Protocol
fault-tolerance requirements specific to the SDN architec-
ture and discuss approaches that can be used to improve
fault-tolerance in SDN. REFERENCES
Current SDN research efforts were structured according to [1] B. A. A. Nunes, M. Mendonca, X.-N. Nguyen, K. Obraczka, and
T. Turletti, ‘‘A survey of software-defined networking: Past, present,
the three main layers of SDN architecture and categorized and future of programmable networks,’’ IEEE Commun. Surveys Tuts.,
them according to data, control, and application planes. vol. 16, no. 3, pp. 1617–1634, 3rd Quart., 2014.
While exploring the topic of fault-tolerance in SDN, [2] D. Kreutz, F. Ramos, P. E. Veríssimo, C. E. Rothenberg, S. Azodolmolky,
and S. Uhlig, ‘‘Software-defined networking: A comprehensive survey,’’
we have identified that each layer has its faults and Proc. IEEE, vol. 103, no. 1, pp. 14–76, Jan. 2015.
fault-tolerance issues. This means that in order to achieve [3] F. Hu, Q. Hao, and K. Bao, ‘‘A survey on software-defined network
fault-tolerance different aspects and features are needed to and OpenFlow: From concept to implementation,’’ IEEE Commun. Surv.
Tuts., vol. 16, no. 4, pp. 2181–2206, Nov. 2014.
be targeted, and no single-focused technology will be able [4] T. Bakhshi, ‘‘State of the art and recent research advances in software
to provide the reliability expected in commercial networks. defined networking,’’ Wireless Commun. Mobile Comput., Jan. 2017,
Recent research studies show that SDN can play a pivotal Art. no. 7191647. doi: 10.1155/2017/7191647.
[5] G. Stanley. (2019). Fault Management—The Overall Process and
role in shaping and managing future dynamic networking Life Cycle of a Fault, Accessed: Dec. 12, 2019. [Online]. Available:
environments, such as cloud-native networks, Fifth Genera- https://gregstanleyandassociates.com/whitepapers/FaultDiagnosis/Fault-
tion (5G) mobile networks [103], wireless networks [104] and optical networks [105]. However, SDN fault-tolerance is still in its infancy, and there is a broad spectrum of opportunities for the research community to develop new fault-tolerance mechanisms, standards, monitoring, debugging, and testing tools that enforce fault-tolerance in such dynamic networking environments and ensure carrier-grade reliability.

LIST OF ABBREVIATIONS/ACRONYMS
ACLs Access Control Lists
AFRO Automatic Failure Recovery
APIs Application Programming Interfaces
BFD Bidirectional Forwarding Detection
BFT Byzantine Fault-Tolerant
BLR Backward Local Rerouting
BBF Broadband Forum
BSS Business Support System
CPR
DFRS Declarative Failure Recovery System
DoS Denial of Service
EIGRP Enhanced Interior Gateway Routing Protocol
FIB Forwarding Information Base
FLR Forward Local Rerouting
FML Flow-based Management Language
ForCES Forwarding and Control Element Separation
HSA Header Space Analysis
LOS Loss of Signal

Management/fault-management.htm
[6] K. Nørvåg, "An introduction to fault-tolerant systems," Dept. Comput. Inf. Sci., Norwegian Univ. Sci. Technol., Trondheim, Norway, Tech. Rep. 6/99, 2000.
[7] A. Lara, A. Kolasani, and B. Ramamurthy, "Simplifying network management using software defined networking and OpenFlow," in Proc. IEEE Int. Conf. Adv. Netw. Telecommun. Syst. (ANTS), Dec. 2012, pp. 24–29.
[8] P. C. da Rocha Fonseca and E. S. Mota, "A survey on fault management in software-defined networks," IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2284–2321, 4th Quart., 2017.
[9] Y. Yu, X. Li, X. Leng, L. Song, K. Bu, Y. Chen, J. Yang, L. Zhang, K. Cheng, and X. Xiao, "Fault management in software-defined networking: A survey," IEEE Commun. Surveys Tuts., vol. 21, no. 1, pp. 349–392, 1st Quart., 2018.
[10] M. van Steen and A. S. Tanenbaum, Distributed Systems, 3rd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2017.
[11] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. Depend. Sec. Comput., vol. 1, no. 1, pp. 11–33, Jan. 2004.
[12] S. Hukerikar and C. Engelmann, "Resilience design patterns—A structured approach to resilience at extreme scale (version 1.0)," 2016, arXiv:1611.02717. [Online]. Available: https://arxiv.org/abs/1611.02717
[13] J. H. Saltzer and M. F. Kaashoek, Principles of Computer System Design: An Introduction. San Mateo, CA, USA: Morgan Kaufmann, 2009.
[14] R. Jhawar and V. Piuri, "Fault tolerance and resilience in cloud computing environments," in Computer and Information Security Handbook, J. R. Vacca, Ed., 3rd ed. Boston, MA, USA: Morgan Kaufmann, 2017, ch. 9, pp. 165–181. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780128038437000090
[15] M. Hasan and M. S. Goraya, "Fault tolerance in cloud computing environment: A systematic survey," Comput. Ind., vol. 99, pp. 156–172, Aug. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0166361517304438
VOLUME 7, 2019 124487

A. U. Rehman et al.: Fault-Tolerance in the Scope of SDN
[16] (2018). Fault Tolerant Software Systems: Techniques (Part 4a). [Online]. Available: https://slideplayer.com/slide/10093304/. Accessed: Dec. 12, 2018.
[17] A. Bucchiarone, H. Muccini, and P. Pelliccione, "Architecting fault-tolerant component-based systems: From requirements to testing," Electron. Notes Theor. Comput. Sci., vol. 168, pp. 77–90, Feb. 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1571066107000291
[18] J. Dobson, I. Sommerville, and G. Dewsbury, Introduction: Dependability and Responsibility in Context. London, U.K.: Springer, 2007, pp. 1–17. doi: 10.1007/978-1-84628-626-1_1.
[19] E. Haleplidis, K. Pentikousis, S. Denazis, J. H. Salim, D. Meyer, and O. Koufopavlou, Software-Defined Networking (SDN): Layers and Architecture Terminology, document RFC 7426, Internet Requests for Comments, RFC Editor, Jan. 2015. [Online]. Available: https://www.rfc-editor.org/rfc/pdfrfc/rfc7426.txt.pdf
[20] G. Warnock and A. Nathoo, Alcatel-Lucent Network Routing Specialist II (NRS II) Self-Study Guide: Preparing for the NRS II Certification Exams. Hoboken, NJ, USA: Wiley, 2011.
[21] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling innovation in campus networks," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Apr. 2008.
[22] O. Blial, M. Ben Mamoun, and R. Benaini, "An overview on SDN architectures with multiple controllers," J. Comput. Netw. Commun., Apr. 2016, Art. no. 9396525. doi: 10.1155/2016/9396525.
[23] A. S. da Silva, P. Smith, A. Mauthe, and A. Schaeffer-Filho, "Resilience support in software-defined networking: A survey," Comput. Netw., vol. 92, pp. 189–207, Dec. 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1389128615003229
[24] W. Braun and M. Menth, "Software-defined networking using OpenFlow: Protocols, applications and architectural design choices," Future Internet, vol. 6, no. 2, pp. 302–336, 2014. [Online]. Available: https://www.mdpi.com/1999-5903/6/2/302
[25] E. Haleplidis, J. H. Salim, J. M. Halpern, S. Hares, K. Pentikousis, K. Ogawa, W. Wang, S. Denazis, and O. Koufopavlou, "Network programmability with ForCES," IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1423–1440, 3rd Quart., 2015.
[26] R. Enns, M. Bjorklund, J. Schoenwaelder, and A. Bierman, Network Configuration Protocol (NETCONF), document RFC 6241, IETF, 2011.
[27] P. Saint-Andre, Extensible Messaging and Presence Protocol (XMPP): Core, document RFC 6120, IETF, 2011.
[28] R. Jain and S. Paul, "Network virtualization and software defined networking for cloud computing: A survey," IEEE Commun. Mag., vol. 51, no. 11, pp. 24–31, Nov. 2013.
[29] W. Stallings, Foundations of Modern Networking: SDN, NFV, QoE, IoT, and Cloud, 1st ed. Reading, MA, USA: Addison-Wesley, 2015, pp. 80–85.
[30] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester, "OpenFlow: Meeting carrier-grade recovery requirements," Comput. Commun., vol. 36, no. 6, pp. 656–665, 2013. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0140366412003349
[31] J. Chen, J. Chen, F. Xu, M. Yin, and W. Zhang, "When software defined networks meet fault tolerance: A survey," in Algorithms and Architectures for Parallel Processing, G. Wang, A. Zomaya, G. Martinez, and K. Li, Eds. Cham, Switzerland: Springer, 2015, pp. 351–368.
[32] D. Katz and D. Ward, Bidirectional Forwarding Detection (BFD), document RFC 5880, Internet Requests for Comments, RFC Editor, Jun. 2010. [Online]. Available: https://www.rfc-editor.org/rfc/pdfrfc/rfc5880.txt.pdf
[33] D. Staessens, S. Sharma, D. Colle, M. Pickavet, and P. Demeester, "Software defined networking: Meeting carrier grade requirements," in Proc. 18th IEEE Workshop Local Metropolitan Area Netw. (LANMAN), Oct. 2011, pp. 1–6.
[34] A. U. Rehman, R. L. Aguiar, and J. P. B. Barraca, "A proposal for fault-tolerant and self-healing hybrid SDN control network," in Proc. 9th Simpósio de Informática (INForum), Oct. 2017, pp. 47–52.
[35] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer, "Achieving high utilization with software-driven WAN," ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 15–26, 2013. doi: 10.1145/2534169.2486012.
[36] C. Trois, M. D. Del Fabro, L. C. E. de Bona, and M. Martinello, "A survey on SDN programming languages: Toward a taxonomy," IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2687–2712, 4th Quart., 2016.
[37] G. N. Nde and R. Khondoker, "SDN testing and debugging tools: A survey," in Proc. 5th Int. Conf. Informat., Electron. Vis. (ICIEV), May 2016, pp. 631–635.
[38] J. Reich, C. Monsanto, N. Foster, J. Rexford, and D. Walker, "Modular SDN programming with Pyretic," in Proc. USENIX, 2013, pp. 1–7.
[39] R. Beckett, X. K. Zou, S. Zhang, S. Malik, J. Rexford, and D. Walker, "An assertion language for debugging SDN applications," in Proc. 3rd Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2014, pp. 91–96. doi: 10.1145/2620728.2620743.
[40] N. L. Van Adrichem, B. J. Van Asten, and F. A. Kuipers, "Fast recovery in software-defined networks," in Proc. IEEE EWSDN, Sep. 2014, pp. 61–66.
[41] L. Sidki, Y. Ben-Shimol, and A. Sadovski, "Fault tolerant mechanisms for SDN controllers," in Proc. IEEE Conf. Netw. Function Virtualization Softw. Defined Netw. (NFV-SDN), Nov. 2016, pp. 173–178.
[42] K. Kuroki, N. Matsumoto, and M. Hayashi, "Scalable OpenFlow controller redundancy tackling local and global recoveries," in Proc. 5th Int. Conf. Adv. Future Internet, Barcelona, Spain, 2013, pp. 25–31.
[43] Y. Tingting, H. Xiaohong, M. Maode, and Y. Jie, "Balance-based SDN controller placement and assignment with minimum weight matching," in Proc. IEEE Int. Conf. Commun. (ICC), May 2018, pp. 1–6.
[44] G. Wang, Y. Zhao, J. Huang, and W. Wang, "The controller placement problem in software defined networking: A survey," IEEE Netw., vol. 31, no. 5, pp. 21–27, Sep./Oct. 2017.
[45] Y. Jiménez, C. Cervelló-Pastor, and A. J. Garcia, "On the controller placement for designing a distributed SDN control layer," in Proc. IEEE Netw. Conf. (IFIP), Jun. 2014, pp. 1–9.
[46] T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama, and S. Shenker, "Onix: A distributed control platform for large-scale production networks," in Proc. 9th USENIX Conf. Operating Syst. Design Implement. (OSDI). Berkeley, CA, USA: USENIX Association, 2010, pp. 351–364. [Online]. Available: http://dl.acm.org/citation.cfm?id=1924943.1924968
[47] A. S. Muqaddas, P. Giaccone, A. Bianco, and G. Maier, "Inter-controller traffic to support consistency in ONOS clusters," IEEE Trans. Netw. Service Manage., vol. 14, no. 4, pp. 1018–1031, Dec. 2017.
[48] V. Pashkov, A. Shalimov, and R. Smeliansky, "Controller failover for SDN enterprise networks," in Proc. Int. Sci. Technol. Conf., Modern Netw. Technol. (MoNeTeC), Oct. 2014, pp. 1–6.
[49] F. Németh, R. Steinert, P. Kreuger, and P. Sköldström, "Roles of DevOps tools in an automated, dynamic service creation architecture," in Proc. IFIP/IEEE Int. Symp. Integr. Netw. Manage. (IM), May 2015, pp. 1153–1154.
[50] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown, "Where is the debugger for my software-defined network?" in Proc. 1st Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2012, pp. 55–60. doi: 10.1145/2342441.2342453.
[51] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte, "Real time network policy checking using header space analysis," in Proc. 10th USENIX Conf. Netw. Syst. Design Implement. (NSDI). Berkeley, CA, USA: USENIX Association, 2013, pp. 99–112. [Online]. Available: http://dl.acm.org/citation.cfm?id=2482626.2482638
[52] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester, "Fast failure recovery for in-band OpenFlow networks," in Proc. 9th Int. Conf. Design Reliable Commun. Netw. (DRCN), Mar. 2013, pp. 52–59.
[53] S. Sharma, D. Staessens, D. Colle, M. Pickavet, and P. Demeester, "In-band control, queuing, and failure recovery functionalities for OpenFlow," IEEE Netw., vol. 30, no. 1, pp. 106–112, Jan./Feb. 2016.
[54] P. M. Mohan, T. Truong-Huu, and M. Gurusamy, "Fault tolerance in TCAM-limited software defined networks," Comput. Netw., vol. 116, pp. 47–62, Apr. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1389128617300476
[55] H. Li, Q. Li, Y. Jiang, T. Zhang, and L. Wang, "A declarative failure recovery system in software defined networks," in Proc. IEEE Int. Conf. Commun. (ICC), May 2016, pp. 1–6.
[56] M. Kuzniar, P. Perešíni, N. Vasić, M. Canini, and D. Kostić, "Automatic failure recovery for software-defined networks," in Proc. 2nd ACM SIGCOMM Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2013, pp. 159–160. doi: 10.1145/2491185.2491218.
[57] H. Kim, M. Schlansker, J. R. Santos, J. Tourrilhes, Y. Turner, and N. Feamster, "CORONET: Fault tolerance for software defined networks," in Proc. 20th IEEE Int. Conf. Netw. Protocols (ICNP), Oct./Nov. 2012, pp. 1–2.
[58] L. Schiff, S. Schmid, and M. Canini, "Ground control to major faults: Towards a fault tolerant and adaptive SDN control network," in Proc. 46th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. Workshop (DSN-W), Jun./Jul. 2016, pp. 90–96.
[59] J. Chen, J. Chen, J. Ling, and W. Zhang, "Failure recovery using VLAN-tag in SDN: High speed with low memory requirement," in Proc. IEEE 35th Int. Perform. Comput. Commun. Conf. (IPCCC), Dec. 2016, pp. 1–9.
[60] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat, "B4: Experience with a globally-deployed software defined WAN," ACM SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 3–14, 2013.
[61] P. Thorat, S. M. Raza, D. T. Nguyen, G. Im, H. Choo, and D. S. Kim, "Optimized self-healing framework for software defined networks," in Proc. 9th Int. Conf. Ubiquitous Inf. Manage. Commun. (IMCOM). New York, NY, USA: ACM, 2015, pp. 7:1–7:6. doi: 10.1145/2701126.2701235.
[62] N. Katta, H. Zhang, M. Freedman, and J. Rexford, "Ravana: Controller fault-tolerance in software-defined networking," in Proc. 1st ACM SIGCOMM Symp. Softw. Defined Netw. Res. (SOSR). New York, NY, USA: ACM, 2015, pp. 4:1–4:12. doi: 10.1145/2774993.2774996.
[63] F. A. Botelho, F. M. V. Ramos, D. Kreutz, and A. N. Bessani, "On the feasibility of a consistent and fault-tolerant data store for SDNs," in Proc. 2nd Eur. Workshop Softw. Defined Netw. (EWSDN), Washington, DC, USA, Oct. 2013, pp. 38–43. doi: 10.1109/EWSDN.2013.13.
[64] F. Botelho, A. Bessani, F. M. V. Ramos, and P. Ferreira, "On the design of practical fault-tolerant SDN controllers," in Proc. 3rd Eur. Workshop Softw. Defined Netw. (EWSDN), Sep. 2014, pp. 73–78.
[65] P. Fonseca, R. Bennesby, E. Mota, and A. Passito, "A replication component for resilient OpenFlow-based networking," in Proc. IEEE Netw. Oper. Manage. Symp., Apr. 2012, pp. 933–939.
[66] W. H. F. Aly and Y. Kotb, "Towards SDN fault tolerance using Petri-nets," in Proc. 28th Int. Telecommun. Netw. Appl. Conf. (ITNAC), Nov. 2018, pp. 1–3.
[67] A. Tootoonchian and Y. Ganjali, "HyperFlow: A distributed control plane for OpenFlow," in Proc. Internet Netw. Manage. Conf. Res. Enterprise Netw. (INM/WREN). Berkeley, CA, USA: USENIX Association, 2010, p. 3. [Online]. Available: http://dl.acm.org/citation.cfm?id=1863133.1863136
[68] A. J. Gonzalez, G. Nencioni, B. E. Helvik, and A. Kamisinski, "A fault-tolerant and consistent SDN controller," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Dec. 2016, pp. 1–6.
[69] K. ElDefrawy and T. Kaczmarek, "Byzantine fault tolerant software-defined networking (SDN) controllers," in Proc. IEEE 40th Annu. Comput. Softw. Appl. Conf. (COMPSAC), vol. 2, Jun. 2016, pp. 208–213.
[70] S. H. Yeganeh, A. Tootoonchian, and Y. Ganjali, "On scalability of software-defined networking," IEEE Commun. Mag., vol. 51, no. 2, pp. 136–141, Feb. 2013.
[71] A. Bessani, J. Sousa, and E. E. P. Alchieri, "State machine replication for the masses with BFT-SMaRt," in Proc. 44th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2014, pp. 355–362.
[72] B. Heller, C. Scott, N. McKeown, S. Shenker, A. Wundsam, H. Zeng, S. Whitlock, V. Jeyakumar, N. Handigol, J. McCauley, K. Zarifis, and P. Kazemian, "Leveraging SDN layering to systematically troubleshoot networks," in Proc. 2nd ACM SIGCOMM Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2013, pp. 37–42. doi: 10.1145/2491185.2491197.
[73] C. Scott, A. Wundsam, B. Raghavan, A. Panda, A. Or, J. Lai, E. Huang, Z. Liu, A. El-Hassany, S. Whitlock, H. B. Acharya, K. Zarifis, and S. Shenker, "Troubleshooting blackbox SDN control software with minimal causal sequences," ACM SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 395–406, 2014. doi: 10.1145/2740070.2626304.
[74] M. Canini, D. Venzano, P. Perešíni, D. Kostić, and J. Rexford, "A NICE way to test OpenFlow applications," presented at the 9th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2012, pp. 127–140.
[75] M. Reitblatt, M. Canini, A. Guha, and N. Foster, "FatTire: Declarative fault tolerance for software-defined networks," in Proc. 2nd ACM SIGCOMM Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2013, pp. 109–114. doi: 10.1145/2491185.2491187.
[76] B. Chandrasekaran and T. Benson, "Tolerating SDN application failures with LegoSDN," in Proc. 13th ACM Workshop Hot Topics Netw. (HotNets-XIII). New York, NY, USA: ACM, 2014, pp. 22:1–22:7. doi: 10.1145/2670518.2673880.
[77] B. Chandrasekaran, B. Tschaen, and T. Benson, "Isolating and tolerating SDN application failures with LegoSDN," in Proc. Symp. SDN Res. (SOSR). New York, NY, USA: ACM, 2016, pp. 7:1–7:12. doi: 10.1145/2890955.2890965.
[78] Open Networking Foundation. Open Network Operating System (ONOS). Accessed: Jun. 20, 2019. [Online]. Available: https://www.opennetworking.org/onos/
[79] The POX Network Software Platform. Accessed: Jun. 20, 2019. [Online]. Available: https://github.com/noxrepo/pox
[80] N. Gude, T. Koponen, J. Pettit, B. Pfaff, M. Casado, N. McKeown, and S. Shenker, "NOX: Towards an operating system for networks," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 3, pp. 105–110, 2008. doi: 10.1145/1384609.1384625.
[81] N. Foster, R. Harrison, M. J. Freedman, C. Monsanto, J. Rexford, A. Story, and D. Walker, "Frenetic: A network programming language," in Proc. 16th ACM SIGPLAN Int. Conf. Funct. Program. (ICFP). New York, NY, USA: ACM, 2011, pp. 279–291. doi: 10.1145/2034773.2034812.
[82] Project Floodlight. Accessed: Jun. 20, 2019. [Online]. Available: http://www.projectfloodlight.org/floodlight/
[83] H. H. Liu, S. Kandula, R. Mahajan, M. Zhang, and D. Gelernter, "Traffic engineering with forward fault correction," SIGCOMM Comput. Commun. Rev., vol. 44, no. 4, pp. 527–538, Aug. 2014. doi: 10.1145/2740070.2626314.
[84] M. Suchara, D. Xu, R. Doverspike, D. Johnson, and J. Rexford, "Network architecture for joint failure recovery and traffic engineering," in Proc. ACM SIGMETRICS Joint Int. Conf. Meas. Modeling Comput. Syst. (SIGMETRICS). New York, NY, USA: ACM, 2011, pp. 97–108. doi: 10.1145/1993744.1993756.
[85] T. L. Hinrichs, N. S. Gude, M. Casado, J. C. Mitchell, and S. Shenker, "Practical declarative network management," in Proc. 1st ACM Workshop Res. Enterprise Netw. (WREN). New York, NY, USA: ACM, 2009, pp. 1–10. doi: 10.1145/1592681.1592683.
[86] S. Lowe, J. Edelman, and M. Oswalt, Network Programmability and Automation, 1st ed. Champaign, IL, USA: O'Reilly Media, 2017, pp. 1–35.
[87] D. F. Macedo, D. Guedes, L. F. M. Vieira, M. A. M. Vieira, and M. Nogueira, "Programmable networks—From software-defined radio to software-defined networking," IEEE Commun. Surveys Tuts., vol. 17, no. 2, pp. 1102–1125, 2nd Quart., 2015.
[88] X. Foukas, M. K. Marina, and K. Kontovasilis, "Software defined networking concepts," in Software Defined Mobile Networks (SDMN): Beyond LTE Network Architecture. Hoboken, NJ, USA: Wiley, 2015, ch. 3, pp. 21–44. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118900253.ch3
[89] A. U. Rehman, R. L. Aguiar, and J. P. Barraca, "Network functions virtualization: The long road to commercial deployments," IEEE Access, vol. 7, pp. 60439–60464, 2019.
[90] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker, "P4: Programming protocol-independent packet processors," SIGCOMM Comput. Commun. Rev., vol. 44, pp. 87–95, Jul. 2014. doi: 10.1145/2656877.2656890.
[91] H. Song, "Protocol-oblivious forwarding: Unleash the power of SDN through a future-proof forwarding plane," in Proc. 2nd ACM SIGCOMM Workshop Hot Topics Softw. Defined Netw. (HotSDN). New York, NY, USA: ACM, 2013, pp. 127–132. doi: 10.1145/2491185.2491190.
[92] R. Tischer and J. Gooley, Programming and Automating Cisco Networks, 1st ed. Indianapolis, IN, USA: Cisco Press, 2016, pp. 1–64.
[93] A. Galis, S. Clayman, L. Mamatas, J. R. Loyola, A. Manzalini, S. Kuklinski, J. Serrat, and T. Zahariadis, "Softwarization of future networks and services—Programmable enabled networks as next generation software defined networks," in Proc. IEEE SDN Future Netw. Services (SDN4FNS), Nov. 2013, pp. 1–7.
[94] P. Perešíni, M. Kuźniar, and D. Kostić, "Monocle: Dynamic, fine-grained data plane monitoring," in Proc. 11th ACM Conf. Emerg. Netw. Exp. Technol. (CoNEXT). New York, NY, USA: ACM, 2015, pp. 32:1–32:13. doi: 10.1145/2716281.2836117.
[95] K. Bu, X. Wen, B. Yang, Y. Chen, L. E. Li, and X. Chen, "Is every flow on the right track?: Inspect SDN forwarding with RuleScope," in Proc. 35th Annu. IEEE Int. Conf. Comput. Commun. (INFOCOM), Apr. 2016, pp. 1–9.
[96] P. Zhang, H. Li, C. Hu, L. Hu, L. Xiong, R. Wang, and Y. Zhang, "Mind the gap: Monitoring the control-data plane consistency in software defined networks," in Proc. 12th Int. Conf. Emerg. Netw. Exp. Technol. (CoNEXT). New York, NY, USA: ACM, 2016, pp. 19–33. doi: 10.1145/2999572.2999605.
[97] H. Farhad, H. Lee, and A. Nakao, "Data plane programmability in SDN," in Proc. IEEE 22nd Int. Conf. Netw. Protocols, Oct. 2014, pp. 583–588.
[98] W. L. da Costa Cordeiro, J. A. Marques, and L. P. Gaspary, "Data plane programmability beyond OpenFlow: Opportunities and challenges for network and service operations and management," J. Netw. Syst. Manage., vol. 25, no. 4, pp. 784–818, Oct. 2017. doi: 10.1007/s10922-017-9423-2.
[99] M. Bouet, K. Phemius, and J. Leguay, "Distributed SDN for mission-critical networks," in Proc. IEEE Mil. Commun. Conf., Oct. 2014, pp. 942–948.
[100] V. Gkioulos, H. Gunleifsen, and G. K. Weldehawaryat, "A systematic literature review on military software defined networks," Future Internet, vol. 10, no. 9, p. 88, 2018. [Online]. Available: https://www.mdpi.com/1999-5903/10/9/88
[101] R. Carreras Ramirez, Q.-T. Vien, R. Trestian, L. Mostarda, and P. Shah, "Multi-path routing for mission critical applications in software-defined networks," in Industrial Networks and Intelligent Systems, T. Q. Duong and N.-S. Vo, Eds. Cham, Switzerland: Springer, 2019, pp. 38–48.
[102] M. Karakus and A. Durresi, "A survey: Control plane scalability issues and approaches in software-defined networking (SDN)," Comput. Netw., vol. 112, pp. 279–293, Jan. 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S138912861630411X
[103] A. Hakiri and P. Berthou, Leveraging SDN for the 5G Networks. Hoboken, NJ, USA: Wiley, 2015, ch. 5, pp. 61–80. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118900253.ch5
[104] I. T. Haque and N. Abu-Ghazaleh, "Wireless software defined networking: A survey and taxonomy," IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2713–2737, 4th Quart., 2016.
[105] A. S. Thyagaturu, A. Mercian, M. P. McGarry, M. Reisslein, and W. Kellerer, "Software defined optical networks (SDONs): A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2738–2786, 4th Quart., 2016.

A. U. REHMAN received the bachelor's degree (Hons.) in telecommunications engineering from Mohammad Ali Jinnah University, Pakistan, in 2009, and the master's degree (Hons.) in telecommunications engineering from The University of Sunderland, U.K., in 2011. He is currently pursuing the Ph.D. degree in telecommunications with MAP-tele (a joint Doctoral Program of the Universidade do Porto, the Universidade de Aveiro, and the Universidade do Minho, Portugal, all three universities with a strong tradition in the area of telecommunications engineering). He was a Visiting Instructor with Telecom Foundation, Pakistan, and a Teaching Assistant with Mohammad Ali Jinnah University for several years. He is currently involved in the research areas of telecommunications and the Internet at the Instituto de Telecomunicações, Portugal, where he is an active member of the Network Application and Services Group. His research interests include software-defined networking (SDN), network functions virtualization (NFV), and the reliability and resilience of future networks. He is also a member of the Communications Society (ComSoc) and the IEEE Software Defined Networks Community.

RUI L. AGUIAR received the degree in telecommunication engineering and the Ph.D. degree in electrical engineering from the Universidade de Aveiro, in 1990 and 2001, respectively, where he is currently a Full Professor, responsible for the networking area. He has been an Adjunct Professor with INI, Carnegie Mellon University, and a Visiting Research Scholar with the Universidade Federal de Uberlândia, Brazil. He is coordinating a research line nationwide in the area of networks and multimedia with the Instituto de Telecomunicações. His current research interests include the implementation of 5G networks and the future Internet. He has over 450 published articles in his research areas, including standardization contributions to the IEEE and the IETF. He is a Senior Member of the IEEE, the Portugal ComSoc Chapter Chair, and a member of the ACM. He has served as the Technical and General Chair of several IEEE, ACM, and IFIP conferences and as an IEEE ComSoc Distinguished Lecturer. He is the current Chair of the Steering Board of the Networld 2020 ETP and is regularly invited for keynotes on 5G and future Internet subjects. He sits on the TPC of most major IEEE ComSoc conferences. He is also an Associate Editor of ETT (Wiley) and Wireless Networks (Springer), and has helped in the launch of ICT Express (Elsevier).

JOÃO PAULO BARRACA received the Ph.D. degree in informatics engineering from the Universidade de Aveiro, in 2012, where he is currently an acting Assistant Professor. He conducts research with the Instituto de Telecomunicações, having led the TN-AV Group from 2015 to 2016. He has close to 100 peer-reviewed publications and reports related to solutions for the Internet of Things and software for cloud environments, with a focus on software-defined networking and 5G networks. Having participated in many review panels, he has also organized workshops and conferences. He has participated in more than 20 projects, either developing novel concepts or applying these concepts in innovative products and solutions. He leads the FCT/CAPES DEVNF Project in Portugal, devoted to NFV orchestration, and the local teams of EU LIFE-PAYT; participates in the European Science Cloud for Astronomy (EU AENEAS), the P2020 CRUISE Project, the security team at P2020-Social, the EU Interreg CISMOB smart cities pilot, the Engage SKA research infrastructure, and the Square Kilometer Array System (SKA) Team; and has led activities for TM-LINFRA, among a dozen other innovation projects. Recently, he received third place in the INCM Innovation Challenge for the development of a project targeting smarter environments for public transport in smart cities, using blockchain technologies.