A Survey on Deep Multimodal Learning for Computer Vision: Advances, Trends, Applications, and Datasets
https://doi.org/10.1007/s00371-021-02166-7
SURVEY
Abstract
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer
vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing
universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the
multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities,
often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for
researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep
multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration
and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from
the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both
traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and
zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving
problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and
provide insights and directions for future research.
Keywords Applications · Computer vision · Datasets · Deep learning · Sensory modalities · Multimodal learning
1 Introduction
trends in deep learning. In practice, the extraction and synthesis of rich information from a multidimensional data space require the use of an intermediate mechanism to facilitate decision making in intelligent systems. Deep learning has been used in many practices, and it has been shown that its performance can be greatly improved in several disciplines, including computer vision. This line of research is part of the rich field of deep learning, which typically deals with visual information of different types and scales to perform complex tasks. Currently, deep learning algorithms have demonstrated their potential and applicability in other active areas such as natural language processing, machine translation, and speech recognition, performing comparably or even better than humans.

A large number of computer vision researchers focus each year on developing vision systems that enable machines to mimic human behavior. For example, some intelligent machines can use computer vision technology to simultaneously map their behavior, detect potential obstacles, and track their location. By applying computer vision to multimodal applications, complex operational processes can be automated and made more efficient. Here, the key challenge is to extract visual attributes from one or more data streams (also called modalities) with different shapes and dimensions by learning how to fuse the extracted heterogeneous features and project them into a common representation space, which is referred to as deep multimodal learning in this work.

In many cases, a set of heterogeneous cues from multiple modalities and sensors can provide additional knowledge that reflects the contextual nature of a given task. In the arena of multimodality, a given modality depends on how specific media and related features are structured within a conceptual architecture. Such modalities may include textual, visual, and auditory modalities, involving specific ways or mechanisms to encode heterogeneous information harmoniously.

In this study, we mainly focus on visual modalities, such as images as a set of discrete signals from a variety of image sensors. The environment in which we live generally includes many modalities in which we can see objects, hear tones, feel textures, smell aromas, and so on. For example, the audiovisual modalities are complementary to each other, where the acoustic and visual attributes come from two different physical entities. However, combining different modalities or data sources to improve performance is attractive from one standpoint, but in practice it is difficult to distinguish between noise, concepts, and conflicts between data sources. Moreover, the lack of labeled multimodal data in the current literature can lead to reduced flexibility and accuracy, often requiring cooperation between different modalities. In this paper, we review recent deep multimodal learning techniques to put forward typical frameworks and models to advance the field. These networks show the utility of learning hierarchical representations directly from raw data to achieve maximum performance on many heterogeneous datasets. Thus, it will be possible to design intelligent systems that can quickly answer questions, reason, and discuss what is seen in different views in different scenarios. Classically, there are three general approaches to multimodal data fusion: early fusion, late fusion, and hybrid fusion.

In addition to surveying recent advances in deep multimodal learning itself, we also discuss the main methods of multimodal fusion and review the latest advanced applications and multimodal datasets popular in the computer vision community.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the differences between similar previous studies and our work. Section 3 reviews recent advances in deep multimodal algorithms, the motivation behind them, and commonly used fusion techniques, with a focus on deep learning-based algorithms. In Sects. 4 and 5, we present more advanced multimodal applications and benchmark datasets that are very popular in the computer vision community. In Sect. 6, we discuss the limitations and challenges of vision-based deep multimodal learning. The final section then summarizes the whole paper and points out a roadmap for future research.

2 Comparison with previous surveys

In recent years, the computer vision community has paid more attention to deep learning algorithms due to their exceptional capabilities compared to traditional handcrafted methods. A considerable amount of work has been conducted under the general topic of deep learning in a variety of application domains. In particular, these include several excellent surveys of global deep learning models, techniques, trends, and applications [4,180,182], a survey of deep learning algorithms in the computer vision community [179], a survey that focuses directly on the problem of deep object detection and its recent advances [181], and a survey of deep learning models including the generative adversarial network and its related challenges and applications [19]. Nonetheless, the applications discussed in these surveys include only a single modality as a data source for data-driven learning. However, most modern machine learning applications involve more than one modality (e.g., visual and textual modalities), such as embodied question answering, vision-and-language navigation, etc. Therefore, it is of vital importance to learn more complex and cross-modal information from different sources, types, and data distributions. This is where deep multimodal learning comes into play.

From the early works on speech recognition to recent advances in language- and vision-based tasks, deep multimodal learning technologies have demonstrated significant
progress in improving cognitive performance and interoperability of prediction models in a variety of ways. To date, deep multimodal learning has been the most important evolution in the field of multimodal machine learning, drawing on the deep learning paradigm and multimodal big data computing environments. In recent years, many pieces of research based on multimodal machine learning have been proposed [37], but to the best of our knowledge, there is no recent work that directly addresses the latest advances in deep multimodal learning particularly for the computer vision community. A thorough review and synthesis of existing work in this domain, especially for researchers pursuing this topic, is essential for further progress in the field of deep learning. However, there is still relatively little recent work directly addressing this research area [32–37]. Since multimodal learning is not a new topic, there is considerable overlap between this work and the surveys of [32–37], which needs to be highlighted and discussed.

Recently, the valuable works of [32,33] considered several multimodal practices that apply only to specific multimodal use cases and applications, such as emotion recognition [32] and human activity and context recognition [33]. More specifically, they highlighted the impact of multimodal feature representation and multilevel fusion on system performance and the state of the art in each of these application areas. Furthermore, some cutting-edge works [34,36] have been proposed in recent years that address the mechanism of integrating and fusing multimodal representations inside deep learning architectures by showing the reader the possibilities this opens up for the artificial intelligence community. Likewise, Guo et al. [35] provided a comprehensive overview of deep multimodal learning frameworks and models, focusing on one of the main challenges of multimodal learning, namely multimodal representation. They summarized the main issues, advantages, and disadvantages for each framework and typical model. Another excellent survey paper was recently published by Baltrušaitis et al. [37], which reviews recent developments in multimodal machine learning and expresses them in a general taxonomic way. Here, the authors identified five levels of multimodal data combination: representation, translation, alignment, fusion, and co-learning. It is important to note here that, unlike our survey, which focuses primarily on computer vision tasks, the study published by Baltrušaitis et al. [37] was aimed mainly at both the natural language processing and computer vision communities. In this article, we review recent advances in deep multimodal learning and organize them into six topics: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. Beyond the above work, we focus primarily on cutting-edge applications of deep multimodal learning in the field of computer vision and related popular datasets. Moreover, most of the papers we reviewed are recent and have been published in high-quality conferences and journals such as The Visual Computer, ICCV, and CVPR. A comprehensive overview of multimodal technologies—their limitations, perspectives, trends, and challenges—is also provided in this article to deepen and improve the understanding of the main directions for future progress in the field. In summary, our survey is similar to the closest works [35,37], which discuss recent advances in deep multimodal learning with a special focus on computer vision applications. The surveys we discussed are summarized in Table 1.

3 Deep multimodal learning architectures

In this section, we discuss deep multimodal learning and its main algorithms. To do so, we first briefly review the history of deep learning and then focus on the main motivations behind this research to answer the question of how to reduce heterogeneity biases across different modalities. We then outline the perspective of multimodal representation and what distinguishes it from the unimodal space. We next introduce recent approaches for combining modalities. Next, we highlight the difference between multimodal learning and multitask learning. Finally, we discuss multimodal alignment, multimodal transfer learning, and zero-shot learning in detail in Sects. 3.6, 3.7, and 3.8, respectively.

3.1 Brief history of deep learning

Historically, artificial neural networks date back to the 1950s and the efforts of psychologists to gain a better understanding of how the human brain works, including the work of F. Rosenblatt [8]. In 1960, Rosenblatt [8] proposed the perceptron as part of supervised learning algorithms; it is used to compute a set of activations, meaning that for a given neuron and input vector, it performs the sum weighted by a set of weights, adds a bias, and applies an activation function. An activation function (e.g., sigmoid, tanh, etc.), also called a nonlinearity, uses the derived patterns to perform its nonlinear transformation. As a deep variant of the perceptron, the multilayer perceptron, originally designed by [9] in 1986, is a special class of feed-forward neural networks. Structurally, it is a stack of single-layer perceptrons. In other words, this structure gives the meaning of "deep": a network can be defined by its depth (i.e., the number of hidden layers). Typically, a multilayer perceptron with one or two hidden layers does not require much data to learn informative features due to the reduced number of parameters to be trained. A multilayer perceptron can be considered a deep neural network if the number of hidden layers is greater than one, as confirmed by [10,11]. In this regard, many more advances in the
field are likely to follow, such as the convolutional neural networks of LeCun et al. [21] in 1998 and the spectacular deep network results of Krizhevsky et al. [7] in 2012, opening the door to many real-world domains including computer vision.

3.2 Motivation

Recently, the amount of visual data has exploded due to the widespread use of available low-cost sensors, leading to superior performance in many computer vision tasks (see Fig. 1). Such visual data can include still images, video sequences, etc., which can be used as the basis for constructing multimodal models. Unlike the static image, the video stream provides a large amount of meaningful information that takes into account the spatiotemporal appearance of successive frames, so it can be easily used and analyzed for various real-world use cases, such as video synthesis and description [68] and facial expression recognition [123]. The spatiotemporal concept refers to the temporal and spatial processing of a series of video sequences with variable duration. In multimodal learning analytics, audio-visual-textual features are extracted from a video sequence to learn joint features covering the three modalities. Efficient learning of large datasets at multiple levels of representation leads to faster content analysis and recognition of the millions of videos produced daily. The main reason for using multimodal data sources is that it is possible to extract complementary and richer information coming from multiple sensors, which can provide much more promising results than a single input. Some monomodal learning systems have significantly increased their robustness and accuracy, but in many use cases, there are shortcomings in terms of the universality of different feature levels and inaccuracies due to noise and missing concepts. The success of deep multimodal learning techniques has been driven by many factors that have led many researchers to adopt these methods to improve model performance. These factors include large volumes of widely usable multimodal datasets, more powerful computers with fast GPUs, and high-quality feature representation at multiple scales. Here, a practical challenge for the deep learning community is to strengthen correlation and redundancy between modalities through typical models and powerful mechanisms.

3.3 Multimodal representation

Multi-sensory perception primarily encompasses a wide range of interacting modalities, including audio and video.
Fig. 1 An example of a multimodal pipeline that includes three different modalities
For simplicity, we consider the following temporal multimodal problem, where both audio and video modalities are exploited in a video recognition task (emotion recognition). First, let us consider two input streams of different modalities: $X_a = \{\chi_1^n, \ldots, \chi_T^n\}$ and $X_v = \{\chi_1^m, \ldots, \chi_T^m\}$, where $\chi_t^n$ and $\chi_t^m$ refer to the $n$- and $m$-dimensional feature vectors of the $X_a$ and $X_v$ modalities occurring at time $t$, respectively. Next, we combine the two modalities at time $t$ and consider the two unimodal output distributions at different levels of representation. Given ground truth labels $Z = \{Z_1, \ldots, Z_T\}$, we aim here to train a multimodal learning model $M$ that maps both $X_a$ and $X_v$ into the same categorical set of $Z$. Each parameter of the input audio stream $\chi_a^T$ and video stream $\chi_v^T$ is synchronized differently in time and space, where $\chi_a^T \in \mathbb{R}^i$ and $\chi_v^T \in \mathbb{R}^j$, respectively. Here, we can construct two separate unimodal networks from $X_a$ and $X_v$, denoted, respectively, by $N_a$ and $N_v$, where $N_a: X_a \rightarrow Y$, $N_v: X_v \rightarrow Y$, and $M = N_a \oplus N_v$. $Y$ denotes the predicted class label of the training samples generated by the output of the constructed networks, and $\oplus$ indicates the fusion operation. The generated multimodal network $M$ can then recognize the most discriminating patterns in the streaming data by learning a common representation that integrates relevant concepts from both modalities. Figure 2 shows a schematic diagram of the application of the described multimodal problem to the video emotion recognition task.

Therefore, it is necessary to consider the extent to which any such dynamic entity will be able to take advantage of this type of information from several redundant sources. Learning multimodal representation from heterogeneous signals poses a real challenge for the deep learning community. Typically, inter- and intra-modal learning involves the ability to represent an object of interest from different perspectives, in a complementary and semantic context where multimodal information is fed into the network. Another crucial advantage of inter- and intra-modal interaction is the discriminating power of the perceptual model for multisensory stimuli, achieved by exploiting the potential synergies between modalities and their intrinsic representations [112]. Furthermore, multimodal learning involves a significant improvement in perceptual cognition, as many of our senses are involved in processing information from several modalities. Nevertheless, it is essential to learn how to interpret the input signals and summarize their multimodal nature to construct aggregate feature maps across multiple dimensions. In multimodality theory, obtaining a contextual representation from more than one modality has become a vital challenge, which is termed in this study the multimodal representation.

Typically, monomodal representation involves a linear or nonlinear mapping of an individual input stream (e.g., image, video, or sound) into a high-level semantic representation. The multimodal representation leverages the correlation power of each monomodal sensation by aggregating their spatial outputs. Thus, the deep learning model must be adapted to accurately represent the structure and representation space of the source and target modality. For example,
Fig. 2 A schematic illustration of the method used: The visual modality (video) involves the extraction of facial regions of interest followed by a visual mapping representation scheme. The obtained representations are then temporally fused into a common space. Additionally, the audio descriptions are also generated. The two modalities are then combined using a multimodal fusion operation to predict the target class label (emotion) of the test sample
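To make the audio-visual formulation above (and Fig. 2) concrete, the following minimal sketch builds two unimodal networks, corresponding to $N_a$ and $N_v$, and fuses their representations by concatenation before a shared classifier. This is our illustration rather than the implementation used in the surveyed works; the feature dimensions, layer sizes, fusion operator, and number of emotion classes are assumptions.

```python
# Minimal sketch of the two-stream setup described above: an audio network N_a,
# a video network N_v, and a fusion operation mapping their representations to
# a shared emotion label space. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, hidden=128, num_classes=7):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())   # N_a
        self.video_net = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())   # N_v
        self.classifier = nn.Linear(2 * hidden, num_classes)                      # fused prediction

    def forward(self, x_audio, x_video):
        h_a = self.audio_net(x_audio)          # unimodal audio representation
        h_v = self.video_net(x_video)          # unimodal video representation
        joint = torch.cat([h_a, h_v], dim=-1)  # fusion operation (concatenation)
        return self.classifier(joint)          # predicted label Y

# Example with random tensors standing in for per-frame audio/video descriptors
model = AudioVisualFusion()
logits = model(torch.randn(8, 40), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 7])
```

Concatenation is only one possible choice for the fusion operator; swapping in element-wise addition, a weighted sum, or an attention-based combination only changes the line containing `torch.cat`.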
Fig. 4 Conventional methods for multimodal data fusion: a Early fusion, b Late fusion, c Hybrid fusion

decision, which can reduce the overall performance of the integration process. In the case of intermediate fusion, the spatial combination of intermediate representations of the different data streams is usually produced with varying scales and dimensions, making them more challenging to merge. To overcome this challenge, the authors of [124] designed a simple fusion scheme, called the multimodal transfer module (MMTM), to transfer and hierarchically aggregate shared knowledge from multiple modalities in CNN networks.

The support vector machine (SVM) classifier [40] was introduced long ago as a learning algorithm for a wide range of classification tasks. Indeed, the SVM is one of the most popular linear classifiers based on learning a single kernel function for linear tasks such as discrimination and regression problems. The main idea of an SVM is to separate the feature space into two classes of data with a hard margin. Kernel-based methods are among the most commonly used techniques for performing fusion due to their proven robustness and reliability. For more details, we invite the reader to consult the work of Gönen et al. [41], which focused on the taxonomy of multi-kernel learning algorithms. These kernels are intended to make use of the similarities and discrepancies across training samples as well as a wide variety of data sources. In other words, these modular learning methods are used for multimodal data analysis. Recently, a growing number of studies have focused, in particular, on the potential of these kernels for multi-source-based learning to improve performance. In this sense, a wide range of kernel-based methods have been proposed to summarize information from multiple sources using a variety of input data. In this regard, Gönen et al. [41] pioneered multiple kernel learning (MKL) algorithms that seek to combine multimodal data that have distinct representations of similarity. MKL is the process of learning a classifier through multiple kernels and data sources. It also aims to extract the joint correlation of several kernels in a linear or nonlinear manner. Similarly, Aiolli et al. [42] proposed an MKL-based algorithm, called EasyMKL, which combines a series of kernels to maximize the segregation of representations and extract the strong correlation between feature spaces to improve the performance of the classification task. An alternative model, called convolutional recurrent multiple kernel learning (CRMKL), based on the MKL framework for emotion recognition and sentiment analysis, is reported by Wen et al. [43]. In [43], the MKL algorithm is used to combine multiple features that are extracted from deep networks.

3.4.1.3 Graphical models based

One of the most common probabilistic graphical models (PGMs) is the hidden Markov model (HMM) [44]. It is an unsupervised and generative model. It has a series of potential states and transition probabilities. In the Markov chain, the transition from one state to another leads to the generation of observed sequences in which the observations are part of a state set. A transition formalizes how it is possible to move from one state to another, and each transition has a probability of being taken. The states are hidden, but each hidden state generates a visible observation. The main property of Markov chains is that the probabilities depend only on the previous state of the model.
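As a concrete illustration of the generative process just described (hidden states forming a first-order Markov chain, each emitting a visible observation), here is a minimal sketch; the two-state transition matrix, three-symbol emission matrix, and sequence length are arbitrary assumptions for illustration only.

```python
# Minimal sketch of the HMM generative process: hidden states follow a
# first-order Markov chain (the next state depends only on the current one),
# and each hidden state emits a visible observation.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.7, 0.3],        # state transition probabilities P(s_t | s_{t-1})
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # emission probabilities P(o_t | s_t)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

T = 10
states, observations = [], []
s = rng.choice(2, p=pi)
for _ in range(T):
    states.append(s)
    observations.append(rng.choice(3, p=B[s]))   # visible symbol emitted by hidden state s
    s = rng.choice(2, p=A[s])                    # Markov property: next state depends only on s
print(states, observations)
```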
In HMM, a kind of generalization of mixing densities defined by each state is involved, as confirmed by Ghahramani et al. [45]. Specifically, Ghahramani et al. [45] introduced the factorial HMM (FHMM), which consists of combining the state transition matrix of HMMs with the distributed representations of the vector quantizer (VQ) [46]. According to [46], VQ is a conventional technique for quantifying and generalizing dynamic mixing models. FHMM addresses the limited representational power of the latent variables of the HMM by presenting the hidden state under a certain weighted appearance. Likewise, Gael et al. [47] proposed the non-parametric FHMM, called iFHMM, by introducing a new stochastic process for latent feature representation of time series.

In summary, the PGM model can be considered a robust tool for generating missing channels by learning the most representative inter-modal features in an unsupervised manner. One of the drawbacks of the graphical model is the high cost of the training and inference process.

3.4.1.4 Canonical correlation analysis based

In general, a fusion scheme can construct a single multimodal feature representation for each processing stage. However, it is also straightforward to place constraints on the extracted unimodal features [37]. Canonical correlation analysis (CCA) [201] is a very popular statistical method that attempts to maximize the semantic relationship between two unimodal representations so that complex nonlinear transformations of the two data perspectives can be effectively learned. Formally, it can be formulated as follows:

$$(v_1^*, v_2^*) = \underset{v_1,\, v_2}{\arg\max}\ \operatorname{corr}(v_1^T X_1,\, v_2^T X_2), \quad (1)$$

where $X_1$ and $X_2$ stand for unimodal representations, $v_1$ and $v_2$ for two vectors of a given length, and corr for the correlation function. A deep variant of CCA can also be used to maximize the correlation between unimodal representations, as suggested by the authors of [202]. Similarly, Chandar et al. [203] proposed a correlation neural network, called CorrNet, which is based on a constrained encoder/decoder structure to maximize the correlation of internal representations when projected onto a common subspace. Engilberge et al. [204] introduced a weaker constraint on the joint embedding space using a cosine similarity measure. Besides, Shahroudy et al. [205] constructed a unimodal representation using a hierarchical factorization scheme that is limited to representing redundant feature parts and other completely orthogonal parts.

3.4.2 Deep learning methods

3.4.2.1 Deep belief networks based

Deep belief networks (DBNs) are part of the family of graphical generative deep models [15]. They form a deeper variant of the restricted Boltzmann machine (RBM) obtained by combining several RBMs together. In other words, a DBN consists of stacking a series of RBMs where the hidden layer of the first RBM is the visible layer of the higher hierarchies. Structurally, a DBN model has a dense structure similar to that of a shallow multilayer perceptron. The first RBM is designed to systematically reconstruct its input signal, and its hidden layer is handled as the visible layer for the second one. However, all hidden representations are learned globally at each level of the DBN. Note that the DBN is one of the strongest alternatives to overcome the vanishing gradient problem through a stack of RBM units. Like a single RBM, a DBN involves discovering latent features in the raw data. It can be further trained in a supervised fashion to perform the classification of the detected hidden representations.

Compared to other supervised deep models, a DBN requires only a very small set of labeled data to perform weight training, which leads to a high level of usefulness in many multimodal tasks. For instance, Srivastava et al. [206] proposed a multimodal generative model based on the concept of the deep Boltzmann machine (DBM), which learns a set of multimodal features by filling in the conditional distribution of data on a space of multimodal inputs such as image, text, and audio. Specifically, the purpose of training a multimodal DBN model is to improve the prediction accuracy of both unimodal and multimodal systems by generating a set of multimodal features that are semantically similar to the original input data, so that they can be easily derived even if some modalities are missing. Figure 5 illustrates a multimodal DBN architecture that takes as input two different modalities (image and text) with different statistical distributions to map the original data from a high-dimensional space to a high-level abstract representation space. After extracting the high-level representation from each modality, an RBM network is then used to learn the joint distribution. The image and text modalities are modeled using two DBMs, each consisting of two hidden layers. Formally, the joint representation can be expressed as follows:

$$P(v_i \mid \theta) = \sum_{h_1, h_2} P(v_i, h_1, h_2 \mid \theta), \quad (2)$$

where $v_i$ refers to the input visual and textual modalities, $\theta$ to the network parameters, and $h$ to the hidden layer of each modality.

In a multimodal context, the advantage of using multimodal DBN models lies in their sensitivity and stability in supervised, semi-supervised, and unsupervised learning protocols. These models allow for better modeling of very complex and discriminating patterns from multiple input modalities. Despite these advantages, these models have a few limitations. For instance, they largely ignore the spatiotemporal cues of multimodal data streams, making the inference process computationally intensive.
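Returning briefly to the CCA objective of Eq. (1), it can be illustrated with a short, self-contained sketch using scikit-learn; the feature dimensions and the synthetic "visual" and "textual" features below are assumptions made purely for illustration, not data or code from the surveyed works.

```python
# Minimal sketch of CCA-based coordination of two modalities (cf. Eq. (1)).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 64))                                          # stand-in visual features
X2 = X1 @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(500, 32))   # correlated textual features

cca = CCA(n_components=8)            # learn v1, v2 maximizing corr(v1^T X1, v2^T X2)
Z1, Z2 = cca.fit_transform(X1, X2)   # coordinated (correlated) projections

# Correlation of the first pair of canonical variates
print(np.corrcoef(Z1[:, 0], Z2[:, 0])[0, 1])
```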
Fig. 5 A multimodal DBN: each input modality passes through its own hidden layers h(1) and h(2), which are then joined in a shared representation
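A rough sketch of the greedy, layer-wise idea behind the multimodal DBN/DBM of Fig. 5 and Eq. (2) is given below: each modality is modeled by its own RBM (a single hidden layer per modality here for brevity), and a joint RBM over the concatenated hidden activations plays the role of the shared representation. This is an assumption-laden illustration, not the training procedure of [206]; the feature dimensions, layer sizes, and synthetic inputs are placeholders.

```python
# Minimal sketch of a multimodal RBM stack: modality-specific RBMs followed by
# a joint RBM that learns a shared representation over both modalities.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
img_feats = rng.random((1000, 256))   # stand-in image features scaled to [0, 1]
txt_feats = rng.random((1000, 100))   # stand-in text features scaled to [0, 1]

# Modality-specific RBMs
rbm_img = BernoulliRBM(n_components=64, n_iter=20, random_state=0).fit(img_feats)
rbm_txt = BernoulliRBM(n_components=64, n_iter=20, random_state=0).fit(txt_feats)

h_img = rbm_img.transform(img_feats)  # hidden activations for the image modality
h_txt = rbm_txt.transform(txt_feats)  # hidden activations for the text modality

# Joint RBM over the concatenated hidden activations -> shared representation
rbm_joint = BernoulliRBM(n_components=128, n_iter=20, random_state=0)
shared = rbm_joint.fit_transform(np.hstack([h_img, h_txt]))
print(shared.shape)  # (1000, 128)
```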
fully connected layer (also called a dense layer) is used that concatenates all previous activation maps.

Since its introduction by Krizhevsky et al. [7] in 2012, the CNN model has been successfully applied to a wide range of multimodal applications, such as image dehazing [239,240] and human activity recognition [241]. An adaptive multimodal mapping between two visual modalities (e.g., images and sentences) typically requires strong representations of the individual modalities [213]. In particular, CNNs have demonstrated powerful generalization capabilities to learn how to represent visual appearance features from static data. Recently, with the advent of robust and low-cost RGB-D sensors such as the Kinect, the computer vision community has turned its attention to integrating RGB images and corresponding depth maps (2.5D) into multimodal architectures, as shown in Fig. 7. For instance, Couprie et al. [214] proposed a bimodal CNN architecture for multiscale feature extraction from RGB-D datasets, which are taken as four-channel frames (blue, green, red, and depth). Similarly, Madhuranga et al. [215] used CNN models for video recognition purposes by extracting silhouettes from depth sequences and then fusing the depth information with audio descriptions for activity of daily living (ADL) recognition. Zhang et al. [217] proposed to use multicolumn CNNs to extract visual features from face and eye images for the gaze point estimation problem. Here, the regression depth of the facial landmarks is estimated from the facial images, and the relative depth of facial keypoints is predicted by global optimization. To perform image classification directly, the authors of [217,218] suggested the possibility of using multi-stream CNNs (i.e., two or more stream CNNs) to extract robust features from a final hidden layer and then project them onto a common representation space. However, the most commonly adopted approaches involve concatenating a set of pre-trained features derived from the huge ImageNet dataset to generate a multimodal representation [216].

Formally, let $f_i^j$ be the feature map of modality $j$ at the current spatial location $i$, where $j \in \{1, 2, \ldots, N\}$. As shown in Fig. 7, in our case $N = 2$, since the feature maps FC2(RGB) and FC2(D) were taken separately from the RGB and depth paths. The fused feature map $F_i^{fusion}$, which is a weighted sum of the unimodal representations, can be calculated as follows:

$$F_i^{fusion} = \sum_{j=1}^{N} w_i^j f_i^j. \quad (4)$$

Here, $w_i^j$ denotes the weight vectors, which can be computed as follows:

$$w_i^j = \frac{\exp(f_i^j)}{\sum_{k=1}^{N} \exp(f_i^k)}. \quad (5)$$

In summary, a multimodal CNN serves as a powerful feature extractor that learns local cross-modal features from the visual modalities. It is also capable of modeling spatial cues from multimodal data streams with an increased number of parameters. However, it requires a large-scale multimodal dataset to converge optimally during training, and the inference process is time-consuming.

3.4.2.4 Recurrent neural networks based

Recurrent neural networks (RNNs) [12] are a popular type of deep neural network architecture for processing sequential data of varying lengths. They learn to map input activations to the next hierarchy level and then transfer hidden states to the outputs using recurrent feedback, which gives them the capacity to learn useful features from previous states, unlike other deep feedforward networks such as CNNs, DBNs, etc. They can also handle time series and dynamic media such as text and video sequences. By using the backpropagation algorithm, the RNN function takes an input vector and a previous hidden state as input to capture the temporal dependence between objects. After training, the RNN function is fixed at a certain level of stability and can then be used over time.

However, the vanilla RNN model is typically incapable of capturing long-term dependencies in sequential data since it has no internal memory. To this end, several popular variants have been developed to efficiently handle this constraint and the vanishing gradient problem with impressive results, including long short-term memory (LSTM) [13] and gated recurrent units (GRU) [14]. In terms of computational efficiency, the GRU is a lightweight variant of LSTM since it can modulate the information flow without using internal memory units.

In addition to their use in unimodal tasks, RNNs have proved useful in many multimodal problems that require modeling long- and short-range dependencies across the input sequence, such as semantic segmentation [219] and image captioning [220]. For instance, Abdulnabi et al. [219] proposed a multimodal RNN architecture designed for semantic scene segmentation using RGB and depth channels. They integrated two parallel RNNs to efficiently extract robust cross-modal features from each modality. Zhao et al. [220] proposed an RNN-based multimodal fusion scheme to generate captions by analyzing distributional correlations between images and sentences. Recently, several new multimodal approaches based on RNN variants have been proposed and have achieved outstanding results in many vision applications. For example, Li et al. [221] designed a GRU-based embedding framework to describe the content of an image. They used the GRU to generate a description of variable length from a given image. Similarly, Sano et al. [222] proposed a multimodal BiLSTM for ambulatory sleep
detection. In this case, BiLSTM was used to extract features from the wearable device and synthesize temporal information.

Fig. 7 A two-stream CNN taking RGB and depth inputs: each modality passes through its own convolutional path (Conv1-Conv5) before fusion and prediction

Fig. 8 The multimodal m-RNN: at each step, the input word passes through two embedding layers and, together with the visual features, through a multimodal layer; a softmax then predicts the next word (bidirectional recurrent structure)

Figure 8 illustrates a multimodal m-RNN architecture that incorporates both word embeddings and visual features using a bidirectional recurrent mechanism and a pre-trained CNN. As can be seen, the m-RNN consists of three components: a language network component, a vision network component, and a multimodal layer component. The multimodal layer here maps semantic information across sub-networks by temporally learning word embeddings and visual features. Formally, it can be expressed as follows:

$$m(t) = f(v_w \cdot w(t),\ v_r \cdot r(t),\ v_I \cdot I), \quad (6)$$

where $f(\cdot)$ denotes the activation function, $w$ and $r$ consist of the word embedding feature and the hidden states in both directions of the recurrent layer, and $I$ represents the visual features.

In summary, the multimodal RNN model is a robust tool for analyzing both short- and long-term dependencies of multimodal data sequences using the backpropagation algorithm. However, the model has a slow convergence rate due to the high computational cost of the hidden state transfer function.

3.4.2.5 Generative adversarial networks based

Generative adversarial networks (GANs) are part of deep generative architectures, designed to learn the data distribution through adversarial learning. Historically, they were first developed by Goodfellow et al. [16], which
ably impractical representations from noisy data domains. Structurally, a GAN is a unified network consisting of two sub-networks, a generator network (G) and a discriminator network (D), which interact continuously during the learning process. The principle of its operation is as follows: the generator network takes as input the latent distribution space (i.e., a random noise $z$) and generates an artificial sample. The discriminator takes the true sample and those generated by the generator and tries to predict whether the input sample is true or not. Hence, it is a binary classification problem, where the output must lie between 0 (generated) and 1 (true). In other words, the generator's main task is to generate a realistic image, while the discriminator's task is to determine whether the generated image is true or false. Subsequently, they should use an objective function to represent the distance between the distribution of generated samples ($p_z$) and the distribution of real ones ($p_{data}$). The adversarial training strategy consists of using a minimax objective function $V(G, D)$, which can be expressed as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}}\left[\log(D(x))\right] + \mathbb{E}_{z \sim p_z}\left[\log(1 - D(G(z)))\right] \quad (7)$$

Since their development in 2014, generative adversarial training algorithms have been widely used in various unimodal applications such as scene generation [17], image-to-image translation [18], and image super-resolution [224,225]. For the latest advances in super-resolution algorithms for a variety of remote sensing applications, we invite the reader to refer to the excellent survey article by Rohith et al. [226].

In addition to its use in unimodal applications, the generative adversarial learning paradigm has recently been widely adopted in multimodal arenas, where two or more modalities are involved, such as image captioning [227] and image retrieval [228]. In recent years, GAN-based schemes have been receiving a lot of attention and interest in the field of multimodal vision. For example, Xu et al. [229] proposed a fine-grained text-to-image generation framework using an attentional GAN model to create high-quality images from text. Similarly, Huang et al. [230] proposed an unsupervised image-to-image translation architecture that is based on the idea that the image style of one domain can be mapped into the styles of many domains. In [231], Toriya et al. addressed the task of image alignment between a pair of multimodal images by mapping the appearance features of the first modality to the other using GAN models. Here, GANs were used as a means to apply keypoint-mapping techniques to multimodal images. Figure 9 shows a simplified diagram of a multimodal GAN.

Fig. 9 A schematic illustration of multimodal GAN

In summary, the unsupervised GAN is one of the most powerful generative models and can address scenarios where training data is lacking or some hidden concepts are missing. However, it is extremely tricky to train the network when generating discrete distributions, and the process itself is unstable compared to other generative networks. Moreover, the function that this network seeks to optimize is an adversarial loss function without any normalization.

3.4.2.6 Attention mechanism based

In recent years, the attention mechanism (AM) has become one of the most challenging tasks in computer vision and machine translation [232]. The idea of the AM is to focus on a particular position in the input image by computing the weighted sum of feature vectors and mapping them into a final contextual representation. In other words, it learns how to reduce some irrelevant attributes from a set of feature vectors. In multimodal analysis, an attentional model can be designed to combine multiple modalities, each with its internal representation (e.g., spatial features, motion features, etc.). That is, when a set of features is derived from spatiotemporal cues, these variable-length vectors are semantically combined into a single fixed-length vector. Furthermore, an AM can be integrated into RNN models to improve the generalization capability of the latter by capturing the most representative and discriminating patterns from heterogeneous datasets. A formalism for integrating an AM into the basic RNN model was developed by Bahdanau et al. [1]. Since the encoding side of an RNN generates a fixed-length feature vector from its input sequence, this can lead to very tedious and time-consuming parameter tuning. Therefore, the AM acts as a contextual bridge between the encoding and decoding sides of an RNN to pay attention only to a particular position in the input representation.
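The additive (Bahdanau-style) attention step summarized in Eqs. (10) and (11) below can be sketched as follows; the scoring network $a(\cdot)$, the dimensions, and the batch layout are illustrative assumptions rather than the exact design of [1].

```python
# Minimal sketch of additive attention: alignment scores e_ij = a(s_{i-1}, h_j)
# are normalized by a softmax (Eq. (11)), and the context vector c_i is the
# weighted sum of the encoder annotations h_j (Eq. (10)).
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim=128, enc_dim=128, attn_dim=64):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, h):                # s_prev: (B, dec_dim), h: (B, T_x, enc_dim)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(h)))  # scores (B, T_x, 1)
        sigma = torch.softmax(e, dim=1)          # alignment weights, Eq. (11)
        c = (sigma * h).sum(dim=1)               # context vector c_i, Eq. (10)
        return c, sigma.squeeze(-1)

attn = AdditiveAttention()
c, weights = attn(torch.randn(4, 128), torch.randn(4, 12, 128))
print(c.shape, weights.shape)  # torch.Size([4, 128]) torch.Size([4, 12])
```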
Fig. 10 A schematic illustration of the attention-based machine translation model

$$c_i = \sum_{j=1}^{T_x} \sigma_{ij} h_j, \quad (10)$$

where the alignment weight $\sigma_{ij}$ of each annotation $h_j$ can be calculated as

$$\sigma_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \quad (11)$$

and $e_{ij} = a(s_{i-1}, h_j)$, where $s_{i-1}$ is the hidden state at the $(i-1)$-th position of the input sequence.

Since its introduction, the AM has gained wide adoption in the computer vision community due to its spectacular capabilities for many multimodal applications such as video description [233,234], salient object detection [235], etc. For example, Hori et al. [233] proposed a multimodal attention framework for video captioning and sentence generation based on the encoder–decoder structure using RNNs. In particular, the multimodal attention model was used as a way to integrate audio, image, and motion features by selecting the most relevant context vector from each modality. In [236], Yang et al. suggested the use of stacked attention networks to search for image regions that correlate with a query answer and identify representative features of a given question more precisely. More recently, Guo et al. [237] introduced a normalized variant of the self-attention mechanism, called normalized self-attention (NSA), which aims to encode and decode the image and caption features and normalize the distribution of internal activations during training.

In summary, the multimodal AM provides a robust solution for cross-modal data fusion by selecting the local fine-grained salient features in a multidimensional space and filtering out any hidden noise. However, the main weakness of the AM is that the training algorithm is unstable, which may affect the predictive power of the decision-making system. Furthermore, the number of parameters to be trained is huge compared to other deep networks such as RNNs, CNNs, etc.

3.5 Multitask learning

More recently, multitask learning (MTL) [108,109] has become an increasingly popular topic in the deep learning community. Specifically, the MTL paradigm frequently arises in a context close to multimodal concepts. In contrast to single-task learning, the idea behind this paradigm is to learn a shared representation that can be used to respond to several tasks in order to ensure better generalizability. Although there are some similarities between the fusion methods discussed in Sect. 3.4 and the methods used to perform multiple tasks simultaneously, what they have in common is that the sharing of structure between all tasks can be learned jointly to improve performance. The conventional typology of the MTL approach consists of two schemes:

– Hard parameter sharing [110]: It consists of extracting a generic representation for different tasks using the same parameters. It is usually applied to avoid overfitting problems (see the sketch below).
– Soft parameter sharing [111]: It consists of extracting a set of feature vectors and simultaneously drawing similarity relationships between them.

Figure 11 shows a meta-architecture for the two-task case. As can be seen, there are six intermediate layers in total, one shared input layer (bottom), two task-specific output layers
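As referenced in the list above, here is a minimal sketch of hard parameter sharing for two tasks: a single shared trunk (the same parameters for both tasks) feeds two task-specific heads, and the task losses are summed during training. The layer sizes and the choice of tasks are assumptions made for illustration.

```python
# Minimal sketch of hard parameter sharing: one shared trunk, two task heads.
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes_task1=10, n_out_task2=1):
        super().__init__()
        self.shared = nn.Sequential(                      # shared layers (same parameters for all tasks)
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head1 = nn.Linear(hidden, n_classes_task1)   # task 1: classification
        self.head2 = nn.Linear(hidden, n_out_task2)       # task 2: regression

    def forward(self, x):
        z = self.shared(x)
        return self.head1(z), self.head2(z)

model = HardSharingMTL()
logits, value = model(torch.randn(8, 64))
# Joint training sums the task losses so the shared trunk is learned for both tasks
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,))) + \
       nn.functional.mse_loss(value.squeeze(-1), torch.randn(8))
loss.backward()
```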
(Figure: multimodal transfer learning, in which a source domain and a second domain/modality are fused and transfer learning yields fine-tuned models)
synthesize and reconstruct the visual features of the unseen classes, resulting in high-accuracy classification and ensuring a balance between seen and unseen class labels [136,137].

4.1 Generic computer vision tasks

4.1.1 Object detection
deep learning community has pioneered a new generation of CNN-based frameworks. Recent literature has focused on this challenging task: in [67], Jiao et al. studied a variety of deep object detectors, ranging from one-stage detectors to two-stage detectors.

4.1.1.1 One-stage detectors

Monomodal based The OverFeat architecture [24] consists of several processing steps, each of which is dedicated to the extraction of multi-scale feature maps by applying the dense sliding-window (SW) method to efficiently perform the object detection task. To significantly increase the processing speed of object detection pipelines, Redmon et al. [25] implemented a one-stage lightweight detection strategy called YOLO (You Only Look Once). This approach treats the object detection task as a regression problem, analyzing the entire input image and simultaneously predicting the bounding box coordinates and associated class labels. However, in some vision applications, such as autonomous driving, security, video surveillance, etc., real-time operation becomes necessary. In this respect, two-stage detectors are generally slow in terms of real-time processing. In contrast, SSD (single-shot multibox detector) [78] has reduced the need for a patch proposal network and thus accelerated the object detection pipeline. It can learn multi-scale feature representations from multi-resolution images. Its capability to detect objects at different scales enables it to enhance the robustness of the entire chain. Like most object detectors, the SSD detector consists of two processing stages: extracting the feature map through the VGG16 model and detecting the object by applying a convolutional filter through the Conv4-3 layer. Similar in principle to the YOLO and SSD detectors, RetinaNet [79] takes only one stage to detect dense objects by producing multi-scale semantic feature maps using a feature pyramid network (FPN) backbone and the ResNet model. To deal with the class imbalance in the training phase, a novel loss function called "focal loss" is introduced in [79]. This function allows training a one-stage detector with high accuracy by reducing the level of artifacts.

Multimodal based High-precision object recognition systems with multiple sensors are sensitive to external noise and environmental conditions (e.g., lighting variations, occlusion, etc.). More recently, the availability of low-cost and robust sensors (e.g., RGB-D sensors, stereo, etc.) has encouraged the computer vision community to focus on combining the RGB modality with other sensing modalities. According to experimental results, it has been shown that the use of depth information [183,184], optical flow information [185], and LiDAR point clouds [186] in addition to conventional RGB data can improve the performance of one-stage detection systems.

4.1.1.2 Two-stage detectors

Monomodal based The R-CNN detector [74] employs a patch proposal procedure using the selective search [80] strategy and applies an SVM classifier to classify any potential proposals. Fast R-CNN was introduced in [75] to improve the detection efficiency of R-CNN. The principle of Fast R-CNN is as follows: it first feeds the input image into the CNN network, extracts a set of feature vectors, applies a patch proposal mechanism, generates potential candidate regions using the RoI pooling layer, reshapes them to a fixed size, and then performs the final object detection prediction. As an efficient extension of Fast R-CNN, Faster R-CNN [76] uses a deep CNN as a proposal generator. It has an internal strategy for proposing patches called the region proposal network (RPN). Simultaneously, the RPN carries out classification and localization regression to generate a set of RoIs. The primary objective is to improve the localization task and the overall performance of the decision system. In other words, the first network uses prior information about being an object, and the second one (at the end of the classifier) deals with this information for each class. The feature pyramid network (FPN) detector [77] consists of a pyramidal structure that allows the learning of hierarchical feature maps extracted at each level of representation. According to [77], learning multi-scale representations is very slow and requires a lot of memory. However, FPN can generate pyramidal representations with a higher semantic resolution than traditional pyramidal designs.

Multimodal based As mentioned before, two-stage detectors are generally based on a combination of a CNN model to perform classification and a patch proposal module to generate candidate regions, like RPNs. These techniques have proven effective for the accurate detection of multiple objects under normal and extreme environmental conditions. However, multi-object detection in both indoor and outdoor environments under varying environmental and lighting conditions remains one of the major challenges facing the computer vision community. Furthermore, a better trade-off between accuracy and computational efficiency in two-stage object detection remains an open question [84]. The question may be addressed more effectively by combining two or more sensory modalities simultaneously. However, the most common approach is to concatenate heterogeneous features from different modalities to generate an artificial multimodal representation. The recent literature has shown that it is attractive to learn shared representations from the complementarity and synergies between several modalities for increasing the discriminatory power of models [190]. Such modalities may
include visual RGB-D [187], audio-visual data [188], visi- RGB and thermal data [196], RGB and infrared data [197],
ble and thermal data [189], etc. etc.
4.1.1.3 Multi-stage detectors
4.1.3 Semantic segmentation
Monomodal based Cascade R-CNN [26] is one of the most
effective multi-stage detectors that have proven their robust- In image processing, image segmentation is a process of
ness over one and two-stage methods. It is a cascaded version grouping pixels of the image together according to partic-
of R-CNN aimed at achieving a better compromise between ular criteria. Semantic segmentation consists of assigning a
object localization and classification. This framework has class label to each pixel of a segmented region. Several stud-
proven its capability in overcoming some of the main chal- ies have provided an overview of the different techniques
lenges of object detection, including overtraining problems used for semantic segmentation of visual data, including the
[5,6] and false alarm distribution caused by the patches’ works of [27,28]. Scene segmentation is a subtask of seman-
proposal stage. In other words, the trained model may be tic segmentation that enables intelligent systems to perceive
over-specialized on the training data and can no longer gener- and interact in their surrounding environment [27,66]. The
alize on the test data. The problem can be solved by stopping image can be split into non-overlapping regions according
the learning process before reaching a poor convergence rate, to particular criteria, such as pixel and edge detection and
increasing the data distribution in various ways, etc. points of interest. Some algorithms are then used to define
Multimodal based More recently, only a few multimodal- inter-class correlations for these regions.
based multi-stage detection frameworks [191–193] have
been developed and have achieved outstanding detection per- Monomodal based Over the last few years, the fully convo-
formance on benchmark datasets. lutional network (FCN) [29] has become one of the robust
models for a wide range of image types (multimedia, aerial,
medical, etc.). The network consists of replacing the final
dense layers with convolution layers, hence the reason for
4.1.2 Visual tracking its name “FCN”. However, the convolutional side (i.e., the
feature extraction side) of the FCN generates low-resolution
For decades, visual tracking has been one of the major chal- representations which lead to fairly fuzzy object boundaries
lenges for the computer vision community. The objective and noisy segmentations. Consequently, this requires the
is to observe the motion of a given object in real time. A use of a posteriori regularizations to smooth out the seg-
tracker can predict the trajectory of a given rigid object from mentation results, such as conditional random field (CRF)
a chronologically ordered sequence of frames. The task has networks [69]. As a light variant of semantic segmentation,
attracted a lot of interest because of its enormous relevance instance segmentation yields a semantic mask for each object
in many real-world applications, including video surveil- instance in the image. For this purpose, some methods have
lance [82], autonomous driving [83], etc. Over the last few been developed, including Mask-RCNN [30], Hybrid Task
decades, most deep learning-based object tracking systems Cascade (HTC) [31], etc. For instance, the Mask R-CNN
have been based on CNN architectures [84,139]. For exam- model offers the possibility of locating instances of objects
ple, in 1995, Nowlan et al. [85] implemented the first tracking with class labels and segmenting them with semantic masks.
system that tracks hand gestures in a sequence of frames Scene parsing is a visual recognition process that is based on
using a CNN model. Multi-object tracking (MOT) has been semantic segmentation and deep architectures. A scene can
extensively explored in recent literature for a wide range of be parsed into a series of regions labeled for each pixel that
applications [86,138]. Indeed, MOT (tracking-by-detection) is mapped to semantic classes. The task is highly useful in
is another aspect of the generic object tracking task. However, several real-time applications, such as self-driving cars, traf-
MOT methods are mainly designed to optimize the dynamic fic scene analysis, etc. However, fine-grained visual labeling
matching of the objects of interest detected in each frame. To and multi-scale feature distortions pose the main challenges
date, the majority of the existing tracking algorithms have in scene parsing.
yet to be adapted to various factors, such as illumination and
4.1.3 Semantic segmentation

Semantic segmentation partitions an image into regions according to particular criteria, such as pixel and edge detection and points of interest. Some algorithms are then used to define inter-class correlations for these regions.

Monomodal based Over the last few years, the fully convolutional network (FCN) [29] has become one of the robust models for a wide range of image types (multimedia, aerial, medical, etc.). The network replaces the final dense layers with convolution layers, hence the name "FCN". However, the convolutional side (i.e., the feature extraction side) of the FCN generates low-resolution representations, which lead to fairly fuzzy object boundaries and noisy segmentations. Consequently, a posteriori regularizations, such as conditional random field (CRF) networks [69], are required to smooth out the segmentation results. As a variant of semantic segmentation, instance segmentation yields a semantic mask for each object instance in the image. For this purpose, several methods have been developed, including Mask R-CNN [30], Hybrid Task Cascade (HTC) [31], etc. For instance, the Mask R-CNN model can locate object instances with class labels and segment them with semantic masks. Scene parsing is a visual recognition process that builds on semantic segmentation and deep architectures: a scene is parsed into a series of regions labeled at each pixel and mapped to semantic classes. The task is highly useful in several real-time applications, such as self-driving cars, traffic scene analysis, etc. However, fine-grained visual labeling and multi-scale feature distortions pose the main challenges in scene parsing.
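To make the FCN idea concrete — a convolutional backbone whose final dense layers are replaced by convolutions so that the output is a per-pixel class map — the PyTorch sketch below builds a toy fully convolutional model; the backbone, channel sizes, number of classes, and bilinear upsampling are illustrative assumptions, not the original FCN configuration.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy fully convolutional network: small conv backbone + 1x1 conv 'classifier'."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(              # downsamples the input by 4
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # A 1x1 convolution plays the role of the former dense layers,
        # producing one score map per class instead of a single score vector.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h = self.classifier(self.backbone(x))       # low-resolution score maps
        # Upsample back to the input resolution; boundaries stay fuzzy, which is
        # why CRF-style post-processing is often applied, as discussed above.
        return nn.functional.interpolate(h, size=x.shape[-2:], mode="bilinear",
                                         align_corners=False)

logits = TinyFCN()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 21, 64, 64])
```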
Multimodal based More recently, it has been shown in the literature that the accuracy of scene parsing can be improved by combining several detection modalities instead of a single one [91]. Many different methods are available, such as soft correspondences [94] and 3D scene analysis from RGB-D data [95], to ensure dense and accurate scene parsing of indoor and outdoor environments.
4.2 Multimodal applications

4.2.1 Human recognition

In recent years, a wide range of deep learning techniques have been developed that focus on human recognition in videos. Human recognition seeks to identify the same target at different points in space-time derived from complex scenes. Some studies have attempted to enhance the quality of person recognition from two data sources (audio-visual data) using DBN and DBM [72] models, which allow several types of representation to be combined and coordinated; some of these works include [48] and [73]. According to Salakhutdinov and Hinton [72], a DBM is a generative model that includes several layers of hidden variables. In [48], the structure of the deep multimodal Boltzmann machine (DMBM) [71] is similar to that of the DBM, but it can admit more than one modality. Each modality is therefore processed individually using adaptive approaches; after the multi-domain features are joined, high-level classification is performed by a single classifier. In [73], Koo et al. developed a multimodal human recognition framework based on face and body information extracted from deep CNNs. They employed a late fusion policy to merge the high-level features across the different modalities.
tures across the different modalities.
4.2.4 Gesture recognition
4.2.2 Face recognition Gesture recognition is one of the most sophisticated tasks
of computer vision. The task has already gained the atten-
Face recognition has long been extremely important, rang- tion of the deep learning community for many reasons. In
ing from conventional approaches that involve the extraction particular, its potential is to facilitate human–computer inter-
and selection of handcrafted features, such as Viola and Jones action and detect motion in real time. As gestures become
detectors [49] to the automatic extraction and training of end- more diversified and enriched, our instinctive intelligence
to-end hierarchical features from raw data. This process has will recognize basic actions and associate them with generic
been widely used in biometric systems for control and mon- behaviors. The challenge of action recognition is mainly
itoring purposes. The most biometric systems rely on three related to the difficulty of extracting body silhouettes from
modes of operation: enrolment, authentication (verification), foreground rigid objects to focus on their emotions [96].
and identification [92]. However, most facial recognition sys- Occlusions that occur between different object parts can lead
tems, including biometric systems, suffer from a restriction to a significant decrease in performance. However, various
in terms of universality and variations in the appearance factors, such as variations in speed, scale, noise, and object
of visual patterns. End-to-end training of multimodal facial position, can significantly affect the recognition process.
representations can effectively help to overcome this limi- Some real-world applications of gesture recognition include
tation. Multimodal facial recognition systems can integrate driver assistance, smart surveillance, human–machine inter-
complex representations derived from multiple modalities action, etc. Regarding the multimodal dimensions of gesture
at different scales and levels (e.g., feature level, decision recognition, the authors of [97] proposed a multi-stream
level, score level, rank level, etc.). Note that face detec- architecture based on the RNN (LSTM) model to capture
tion, face identification, and face reconstruction are subtasks spatial-temporal features from gesture data. In [98], the
of face recognition [50]. Numerous works in the literature authors developed a multimodal gesture recognition system
have demonstrated the benefits of multimodal recognition using the 3D Residual CNN (ResC3D) model [99] trained
systems. In [51], Ding et al. proposed a new late fusion pol- on an RGB-D dataset. The features extracted by the ResC3D
icy using CNNs for multimodal facial feature extraction and model are then combined with a canonical correlation scheme
SAEs for dimensional reduction. The authors of [93] intro- to ensure consistency in the fusion process. Likewise, Abav-
duced a biometric system that combines biometric traits from isani et al. [200] developed a fusion approach to derive
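To make the two CBIR phases concrete, the sketch below indexes image signatures offline and answers a query online by nearest-neighbour search over those signatures; the color-histogram descriptor and the cosine-similarity matcher are simplifying assumptions (real systems typically use learned deep features).

```python
import numpy as np

def signature(image, bins=8):
    # Offline descriptor: a per-channel color histogram, L1-normalized.
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-9)

# --- Offline indexing phase: compute and store a signature per database image.
database = [np.random.randint(0, 256, (64, 64, 3)) for _ in range(100)]
index = np.stack([signature(img) for img in database])

# --- Online retrieval phase: match the query signature against the index.
def retrieve(query_image, top_k=5):
    q = signature(query_image)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-9)
    return np.argsort(-sims)[:top_k]   # indices of the most similar images

print(retrieve(database[3]))  # the query itself should rank first
```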
4.2.4 Gesture recognition

Gesture recognition is one of the most sophisticated tasks of computer vision. The task has already gained the attention of the deep learning community for many reasons, in particular its potential to facilitate human–computer interaction and to detect motion in real time. As gestures become more diversified and enriched, our instinctive intelligence will recognize basic actions and associate them with generic behaviors. The challenge of action recognition is mainly related to the difficulty of extracting body silhouettes from foreground rigid objects to focus on their emotions [96]. Occlusions that occur between different object parts can lead to a significant decrease in performance. Moreover, various factors, such as variations in speed, scale, noise, and object position, can significantly affect the recognition process. Some real-world applications of gesture recognition include driver assistance, smart surveillance, human–machine interaction, etc. Regarding the multimodal dimension of gesture recognition, the authors of [97] proposed a multi-stream architecture based on the RNN (LSTM) model to capture spatial-temporal features from gesture data. In [98], the authors developed a multimodal gesture recognition system using the 3D residual CNN (ResC3D) model [99] trained on an RGB-D dataset. The features extracted by the ResC3D model are then combined with a canonical correlation scheme to ensure consistency in the fusion process. Likewise, Abavisani et al. [200] developed a fusion approach to derive
knowledge from multiple modalities in individual unimodal 3D CNN networks.
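As an illustration of correlation-based fusion in the spirit of the canonical correlation scheme mentioned for [98] (an illustrative sketch, not their exact pipeline), the code below projects RGB and depth feature vectors into a shared space with scikit-learn's CCA and concatenates the projections before classification; the feature dimensions and the classifier are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 200
rgb_feats   = rng.normal(size=(n_samples, 64))   # features from an RGB stream
depth_feats = rng.normal(size=(n_samples, 32))   # features from a depth stream
labels      = rng.integers(0, 5, size=n_samples) # gesture classes

# Project both modalities into a shared, maximally correlated subspace.
cca = CCA(n_components=10)
rgb_c, depth_c = cca.fit_transform(rgb_feats, depth_feats)

# Fuse the projected views by concatenation and train a simple classifier.
fused = np.concatenate([rgb_c, depth_c], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))  # training accuracy on this toy data
```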
4.2.5 Image captioning

Recently, image captioning has become an active research topic in the field of multimodal vision, i.e., the automatic generation of text captions to describe the content of images. In a supervised setting, the model parameters are trained on a set of labeled examples, each consisting of an image and its related captions. The task has also demonstrated its applicability in a variety of real-world systems, including social media recommendation, image indexing, image annotation, etc. Most recently, Biten et al. [52] combined both visual and textual data to generate captions in two stages: a template caption generation stage and an entity insertion stage. Similarly, Peri et al. [53] proposed a multimodal framework that encodes both images and captions using a CNN and an RNN as an intermediate-level representation and then decodes these multimodal representations into a new caption that is similar to the input. The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc. from successive frames to assign a caption to each one. We invite the reader to consult the survey of Liu et al. [177] to learn more about the methods, techniques, and challenges of image captioning.
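A minimal encoder–decoder sketch in the spirit of the CNN + RNN captioning pipelines discussed above (a toy illustration, not the architecture of any of the cited works); the vocabulary size, embedding dimension, and the choice of feeding the image embedding as the first token are assumptions.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """CNN encoder produces an image embedding; an LSTM decodes it into words."""
    def __init__(self, vocab_size=1000, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, embed_dim))
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, captions):
        img = self.encoder(images).unsqueeze(1)           # (B, 1, D) image token
        words = self.word_embed(captions)                 # (B, T, D) word tokens
        seq = torch.cat([img, words], dim=1)              # image conditions the LSTM
        hidden, _ = self.decoder(seq)
        return self.out(hidden)                           # next-word logits per step

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```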
4.2.6 Vision-and-language navigation

Vision-and-language navigation (VLN) [87,88,118–121] is a multimodal task that has become increasingly popular in recent years. The idea behind VLN is to combine several active domains (i.e., natural language, vision, and action) to enable robots (intelligent agents) to navigate easily in unstructured environments. A key innovation in this area is the synthesis of heterogeneous data into multiple modalities, using natural language commands to navigate through crowded locations and visual cues to perceive the surroundings. It seeks to establish an interaction between visual patterns and natural language concepts by merging these modalities into a single representation.
4.2.7 Embodied question answering

Embodied question answering (EQA) [89,90,122] is an emerging multimodal task in which an intelligent agent acts in a three-dimensional environment in order to respond to a given question. To this end, the agent must first explore its environment, capture visual information, and then answer the question posed. In [90], the authors proposed the multi-target embodied question answering (MT-EQA) task as a generalization of EQA. In contrast to EQA, MT-EQA considers questions related to multiple targets, where an agent has to navigate toward various locations to answer the question asked (Fig. 13a).

4.2.8 Video question answering

Currently, video question answering (VQA) [125–127,129,143] is one of the most promising lines of research for reasoning about the correct answer to a particular question based on the spatiotemporal visual content of video sequences. To answer such a question, the correlation between features in the spatial and temporal dimensions must be considered (Fig. 13b). The VQA task can be conceptually divided into three subtasks: the first is to identify the endpoints of the problem in the natural domain, the second is to capture the correlation of the problem in the spatial domain, and the third consists of reasoning about how this correlation varies in space over time. Typically, video sequences contain audio-visual information of substantially different structures and visual appearance, which requires reasoning schemes that take into account the spatiotemporal nature of the data. To this end, increased attention has been paid to these challenges by developing a wide range of spatiotemporal reasoning mechanisms. Currently, the most common existing methods use attention [125,127,129] and memory [126] mechanisms to efficiently learn visual artifacts and the semantic correlations that allow questions to be answered accurately. These techniques are more effective for spatial-temporal video representation and reasoning as they increase the memorization and discrimination capacity of models.
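The attention mechanisms mentioned above can be illustrated with a generic question-guided temporal attention module that weights per-frame features before pooling them into a single video representation; this is a simplified sketch, not a specific published model, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Question-guided soft attention over per-frame features."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)   # scores each (frame, question) pair

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, D) per-frame features; question_feat: (B, D).
        B, T, D = frame_feats.shape
        q = question_feat.unsqueeze(1).expand(B, T, D)
        attn = torch.softmax(self.score(torch.cat([frame_feats, q], dim=-1)), dim=1)
        return (attn * frame_feats).sum(dim=1)   # (B, D) attended video feature

pool = TemporalAttentionPool(dim=256)
video = pool(torch.randn(4, 20, 256), torch.randn(4, 256))
print(video.shape)  # torch.Size([4, 256])
```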
4.2.9 Style transfer

Neural style transfer (NST), also known as style transfer, has recently gained momentum following the publication of the work of Gatys et al. [156], who demonstrated that the visual features of deep models could be combined to represent image styles. NST has emerged in a context of strong growth in DNNs for several applications, including art and painting [157,158]. For example, Lian et al. [157] proposed a style transfer-based method that takes any natural portrait of a human and transforms it into Picasso's cubism style. Informally, style transfer is an optimization-based technique that renders the content of an existing image (content image) in the style of another image (style image). Figure 14 depicts an example of transferring the style of a specific painting to a scene image using the DeepArts tool [162]. In practice, style transfer involves applying a particular artistic style to a content image.
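The optimization view of style transfer can be made concrete with the content loss and Gram-matrix style loss introduced by Gatys et al. [156]; the sketch below assumes the feature maps have already been extracted from some network layer and leaves out layer selection and loss weighting, which are design choices.

```python
import torch

def gram_matrix(feats):
    # feats: (C, H, W) feature maps from one network layer.
    c, h, w = feats.shape
    f = feats.reshape(c, h * w)
    return f @ f.t() / (c * h * w)          # channel-by-channel correlations

def style_transfer_loss(gen, content, style, content_w=1.0, style_w=1e3):
    # All three arguments are feature maps of the same layer for the generated,
    # content, and style images respectively.
    content_loss = torch.mean((gen - content) ** 2)
    style_loss = torch.mean((gram_matrix(gen) - gram_matrix(style)) ** 2)
    return content_w * content_loss + style_w * style_loss

gen = torch.randn(64, 32, 32, requires_grad=True)   # features of the image being optimized
loss = style_transfer_loss(gen, torch.randn(64, 32, 32), torch.randn(64, 32, 32))
loss.backward()                                      # gradients drive the image update
print(float(loss))
```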
Fig. 13 Difference in results between EQA and VQA tasks: a EQA [90], b VQA [129]
fusion of optical flow and depth information with RGB yields the best performance [242,243]. A selection of RGB-D and RGB-flow datasets and their detailed information is given in Table 3, so that researchers can easily choose the right dataset for their needs. Table 3 lists typical computer vision tasks, such as object recognition and semantic segmentation, along with their respective benchmark datasets.

All datasets listed in Table 3 are detailed in the following paragraphs:

– RGB-D Object: According to the original paper [55], this large-scale RGB-D object dataset consists of RGB videos and depth sequences of 300 object instances in 51 categories captured from multiple view angles, for a total of 250,000 images.
– BigBIRD: The dataset was originally introduced in [56]. It contains 125 objects, 600 RGB-D point clouds, and 600 12-megapixel images taken by two sensors: Kinect and DSLR cameras.
– A large dataset of object scans: It includes more than 10,000 scanned and reconstructed objects in nine categories, acquired with PrimeSense Carmine cameras.
– RGB-D Semantic Segmentation: The dataset was originally proposed in [58] and was acquired with a Kinect RGB-D sensor. It contains six categories, such as juice bottles, coffee cans, and boxes of salt. The training set contains three 3D models for each category, while the testing set includes 16 object scenes.
– RGB-D Scenes v.1: The dataset contains eight scenes, each corresponding to a single video sequence of several RGB-D images.
– RGB-D Scenes v.2: The dataset contains 14 scenes of video sequences, including furniture, acquired with the Kinect device.
– NYU: There are two versions of the dataset (NYU-v1 and NYU-v2), both recorded with the Kinect sensor. NYU-v1 contains 64 different indoor scenes and 108,617 unlabeled images, while NYU-v2 includes 464 different indoor scenes and 407,024 unlabeled images.
– RGB-D People: This dataset was initially introduced in [60] and consists of more than 3000 RGB-D images captured with Kinect sensors.
– SceneNet RGB-D: This dataset contains 5M RGB-D images extracted from a total of 16,895 configurations.
– Kinetics-400: A massive dataset of YouTube video URLs covering a diverse set of human actions. It includes more than 300,000 video sequences across 400 classes of human action.
– Scene Flow: The dataset includes over 39,000 high-resolution frames from synthetic video sequences. It combines a wide range of data types, such as RGB stereo renderings, optical flow maps, and so on.
– MPI-Sintel: The dataset consists of 1040 annotated optical flow maps and corresponding RGB images from very long sequences.

6 Discussion, limitations, and challenges

Over the last few decades, the deep learning paradigm has proven its ability to outperform human expertise in many practices. Deep learning algorithms involve a sequence of multiple layers of nonlinear processing units that are used to extract and transform feature vectors coming from raw data. The deep learning community is still seeking a better trade-off between complex model structuring, computational power requirements, and real-time processing capability. Computer vision seeks to give machines the visual capabilities of human beings thanks to deep learning algorithms fed with information from a wide range of sensors. In recent years, the trend toward its use in a fairly wide range of applications has become increasingly evident. Therefore, it is necessary to develop applications that can automatically predict the target information. However, most current scene-content analysis methods are still limited in their ability to deal with information that is not usable in real-life contexts. Nevertheless, this field remains of great interest to the scientific and industrial communities. This uncertainty underlines the need for innovative and practical methods evaluated under conditions very similar to those encountered in practice. In general, capturing multimodal data streams under different acquisition conditions and increasing the data volume makes it easier to recognize visual content. Deep learning models are often robust strategies for dealing with the linear and nonlinear combination of multimodal data. Despite the impressive results of deep multimodal learning, no absolute conclusions can be drawn in this regard. Considering this exponential growth, the main challenges of multimodal learning methods are the following:

– Dimensionality and data conflict: Confusion between various data sources is a challenge for future analysis. Multimodal data are usually available in various formats, and this variation makes it difficult to extract valuable information from the data. Moreover, multimodal information generally has a large dimension; in other words, acquiring and processing a large amount of multimodal data is costly in terms of computational complexity and memory consumption. Moreover, the synchronization of temporal data allows maximizing the correlation between the features of several levels of representation. However, feature-level fusion is more flexible than decision-level fusion due to the homogeneity of data samples (a minimal sketch contrasting the two fusion levels follows this list). As mentioned before, some dimensionality reduction algorithms can be applied to compress the multimodal feature space before further processing.
– Data availability: One of the most significant challenges of deep multimodal learning is the large amount of data required to learn discriminative feature maps.
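As announced in the first challenge above, the following minimal sketch contrasts feature-level (early) fusion with decision-level (late) fusion for two toy modality encoders; the dimensions, classifiers, and score averaging are illustrative assumptions rather than a prescribed design.

```python
import torch
import torch.nn as nn

enc_a = nn.Linear(32, 16)   # encoder for modality A (e.g., visual features)
enc_b = nn.Linear(48, 16)   # encoder for modality B (e.g., textual features)

# Feature-level (early) fusion: concatenate representations, one shared classifier.
early_head = nn.Linear(32, 5)
def early_fusion(xa, xb):
    return early_head(torch.cat([enc_a(xa), enc_b(xb)], dim=1))

# Decision-level (late) fusion: one classifier per modality, average the scores.
head_a, head_b = nn.Linear(16, 5), nn.Linear(16, 5)
def late_fusion(xa, xb):
    return 0.5 * head_a(enc_a(xa)) + 0.5 * head_b(enc_b(xb))

xa, xb = torch.randn(8, 32), torch.randn(8, 48)
print(early_fusion(xa, xb).shape, late_fusion(xa, xb).shape)  # both (8, 5)
```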
Table 2 Summary of the multimodal applications reviewed, their related technical details, and best results achieved

References | Year | Application | Sensing modality/data sources | Fusion scheme | Dataset/best results
[73] | 2018 | Person recognition | Face and body information | Late fusion (score-level fusion) | DFB-DB1 (EER = 1.52%); ChokePoint (EER = 0.58%)
[51] | 2015 | Face recognition | Holistic face + rendered frontal pose data | Late fusion | LFW (ACC = 98.43%); CASIA-WebFace (ACC = 99.02%)
[93] | 2020 | Face recognition | Biometric traits (face and iris) | Feature concatenation | CASIA-ORL (ACC = 99.16%); CASIA-FERET (ACC = 99.33%)
[100] | 2016 | Image retrieval | Visual + textual | Joint embeddings | Flickr30K (mAP = 47.72%; R@10 = 79.9%); MSCOCO (R@10 = 86.9%)
[101] | 2016 | Image retrieval | Photos + sketches | Joint embeddings | Fine-grained SBIR Database (R@5 = 19.8%)
[102] | 2015 | Image retrieval | Cross-view image pairs | Alignment | A dataset of 78k pairs of Google street-view images (AP = 41.9%)
[103] | 2019 | Image retrieval | Visual + textual | Feature concatenation | Fashion-200k (R@50 = 63.8%); MIT-State (R@10 = 43.1%); CSS (R@1 = 73.7%)
[97] | 2015 | Gesture recognition | RGB + D | Recurrent fusion, late fusion, and early fusion | SKIG (ACC = 97.8%)
[98] | 2017 | Gesture recognition | RGB + D | A canonical correlation scheme | Chalearn LAP IsoGD (ACC = 67.71%)
[200] | 2019 | Gesture recognition | RGB + D + Opt. flow | A spatio-temporal semantic alignment loss (SSA) | VIVA hand gestures (ACC = 86.08%); EgoGesture (ACC = 93.87%)
[87] | 2018 | Vision-and-language navigation | Visual + textual (instructions) | Attention mechanism + LSTM | R2R (SPL = 18%)
[88] | 2019 | Vision-and-language navigation | Visual + textual | Attention mechanism + language encoder | R2R (SPL = 38%)
[118] | 2020 | Vision-and-language navigation | Visual + textual (instructions) | Domain adaptation | R2R (performance gap = 8.6); R4R (performance gap = 23.9); CVDN (performance gap = 3.55)
[119] | 2020 | Vision-and-language navigation | Visual + textual (instructions) | Early fusion + late fusion | R2R (SPL = 59%)
[120] | 2020 | Vision-and-language navigation | Visual + textual (instructions) | Attention mechanism + feature concatenation | VLN-CE (SPL = 35%)
[121] | 2019 | Vision-and-language navigation | Visual + textual (instructions) | Encoder-decoder + multiplicative attention mechanism | ASKNAV (success rate = 52.26%)
[89] | 2018 | Embodied question answering | Visual + textual (questions) | Attention mechanism + alignment | EQA-v1 (MR = 3.22)
[90] | 2019 | Embodied question answering | Visual + textual (questions) | Feature concatenation | EQA-v1 (ACC = 61.45%)
[122] | 2019 | Embodied question answering | Visual + textual (questions) | Alignment | VideoNavQA (ACC = 64.08%)
[125] | 2019 | Video question answering | Visual + textual (questions) | Bilinear fusion | TDIUC (ACC = 88.20%); VQA-CP (ACC = 39.54%); VQA-v2 (ACC = 65.14%)
[126] | 2019 | Video question answering | Visual + textual (questions) | Alignment | TGIF-QA (ACC = 53.8%); MSVD-QA (ACC = 33.7%); MSRVTT-QA (ACC = 33.00%); Youtube2Text-QA (ACC = 82.5%)
[127] | 2020 | Video question answering | Visual + textual (questions) | Hierarchical Conditional Relation Networks (HCRN) | MSRVTT-QA (ACC = 35.6%); MSVD-QA (ACC = 36.1%)
[129] | 2019 | Video question answering | Visual + textual (questions) | Dual-LSTM + spatial and temporal attention | TGIF-QA (l2 distance = 4.22)
[159] | 2019 | Style transfer | Content + style | Graph-based matching | A dataset of images from MSCOCO and WikiArt (PV = 33.45%)
[160] | 2017 | Style transfer | Content + style | Hierarchical feature concatenation | A dataset of images from MSCOCO (PS = 0.54 s)
Table 2 continued (abridged): further rows cover adversarial MR–TRUS image registration [144] (2018), COVID-19 detection from multimodal imaging data [145] (2020), and autonomous systems [155,199] (2019); the recoverable results in these rows include TRE = 3.48 mm, Color fundus-FA (ACC = 90.10%), and AirSim (trel = 4.53%; rrel = 8.75%).

ACC accuracy, MR mean rank, SPL success weighted by path length, mAP mean average precision, AP average precision, R@i recall for setting i, PREC precision, PV percentage of the votes, PS processing speed, TRE target registration error, NDS nuScenes detection score, ATE absolute trajectory error, trel average translational error percentage, rrel rotational error
7 Conclusion
Table 3 Benchmark RGB-D and RGB + optical flow datasets for typical computer vision tasks

References | Year | Dataset | Modalities | Task(s) | Description
[55] | 2011 | RGB-D Object | RGB + D | Object recognition | Contains 300 object instances under 51 categories from different angles, for a total of 250,000 RGB-D images
[56] | 2014 | BigBIRD | RGB + D | Object recognition | Contains 125 objects, 600 RGB-D point clouds, and 600 12-megapixel images
[57] | 2016 | A large dataset of object scans | RGB + D | Object recognition | Contains more than 10,000 scanned and reconstructed objects in 9 categories
[58] | 2011 | RGB-D Semantic Segmentation | RGB + D | Semantic segmentation | Contains 3 3D models for each of 6 categories and 16 test object scenes
[55] | 2011 | RGB-D Scenes v.1 | RGB + D | Object recognition; semantic segmentation | Contains 8 video scenes composed of several RGB-D images
[55] | 2014 | RGB-D Scenes v.2 | RGB + D | Object recognition; semantic segmentation | Contains 14 scenes of video sequences
[59] | 2011 | NYU v1–v2 | RGB + D | Semantic segmentation | NYU-v1 contains 64 different indoor scenes and 108,617 unlabeled images; NYU-v2 contains 464 different indoor scenes and 407,024 unlabeled images
[60] | 2011 | RGB-D People | RGB + D | Object recognition | Contains more than 3000 RGB-D images
[61] | 2016 | SceneNet RGB-D | RGB + D | Semantic segmentation; instance segmentation; object detection | Contains 5M RGB-D images
[62] | 2017 | Kinetics-400 | RGB + Opt. flow | Motion recognition | Contains more than 300,000 video sequences in 400 classes
[63] | 2016 | Scene Flow | RGB + Opt. flow | Object segmentation | Contains over 39,000 high-resolution images
[64] | 2012 | MPI-Sintel | RGB + Opt. flow | Semantic segmentation; object recognition | Contains 1040 annotated optical flow maps and matching RGB images

References
6. Bilbao, I., Bilbao, J.: Overfitting problem and the over-training 31. Chen, K. et al.: Hybrid task cascade for instance segmentation.
in the era of data: particularly for artificial neural networks. In: arXiv:1901.07518 (2019)
2017 Eighth International Conference on Intelligent Computing 32. Marechal, C. et al.: Survey on AI-based multimodal methods for
and Information Systems (ICICIS), pp. 173–177 (2017) emotion detection. In: High-Performance Modelling and Simu-
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classifica- lation for Big Data Applications: Selected Results of the COST
tion with deep convolutional neural networks. Commun. ACM Action IC1406 cHiPSet, pp. 307–324 (2019)
60, 84–90 (2017) 33. Radu, V., et al.: Multimodal deep learning for activity and con-
8. Rosenblatt, F.: Perceptron simulation experiments. Proc. IRE 48, text recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous
301–309 (1960) Technol. 1, 157:1–157:27 (2018)
9. Van Der Malsburg, C.: Frank Rosenblatt: principles of 34. Ramachandram, D., Taylor, G.W.: Deep multimodal learning: a
neurodynamics–perceptrons and the theory of brain mechanisms. survey on recent advances and trends. IEEE Signal Process. Mag.
Brain Theory, 245–248 (1986) 34, 96–108 (2017)
10. Huang, Y, Sun, S, Duan, X, Chen, Z.: A study on deep neural net- 35. Guo, W., Wang, J., Wang, S.: Deep multimodal representation
works framework. In: IEEE Advanced Information Management, learning: a survey. IEEE Access 7, 63373–63394 (2019)
Communicates, Electronic and Automation Control Conference 36. Lahat, D., Adali, T., Jutten, C.: Multimodal data fusion: an
(IMCEC), pp. 1519–1522 (2016) overview of methods, challenges, and prospects. Proc. IEEE
11. Sheela, K.G. Deepa, S.N.: Review on methods to fix number 103(9), 1449–1477 (2015)
of hidden neurons in neural networks. Math. Problems. Eng. 37. Baltrušaitis, T., Ahuja, C., Morency, L.-P.: Multimodal Machine
2013(25740) (2013) Learning: A Survey and Taxonomy. arXiv:1705.09406 (2017)
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning represen- 38. Morvant, E., Habrard, A., Ayache, S.: Majority vote of diverse
tations by back-propagating errors. Nature 323, 533–536 (1986) classifiers for late fusion. In: Structural, Syntactic, and Statistical
13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Pattern Recognition, pp. 153–162 (2014)
Comput. 9, 1735–1780 (1997) 39. Liu, Z. et al.: Efficient Low-Rank Multimodal Fusion with
14. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent Modality-Specific Factors. arXiv:1806.00064 (2018)
neural network (IndRNN): building a longer and deeper RNN. 40. Zhang, D., Zhai, X.: SVM-based spectrum sensing in cognitive
arXiv:1803.04831 (2018) radio. In: 7th International Conference on Wireless Communica-
15. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm tions, Networking and Mobile Computing, pp. 1–4 (2011)
for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 41. Gönen, M., Alpaydın, E.: Multiple Kernel learning algorithms. J.
16. Goodfellow, I.J., et al.: Generative adversarial networks. Mach. Learn. Res. 12, 2211–2268 (2011)
arXiv:1406.2661 (2014) 42. Aiolli, F., Donini, M.: EasyMKL: a scalable multiple kernel learn-
17. Turkoglu, M.O., Thong, W., Spreeuwers, L., Kicanaoglu, B.: ing algorithm. Neurocomputing 169, 215–224 (2015)
A layer-based sequential framework for scene generation with 43. Wen, H., et al.: Multi-modal multiple kernel learning for accurate
GANs. arXiv:1902.00671 (2019) identification of Tourette syndrome children. Pattern Recognit.
18. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image trans- 63, 601–611 (2017)
lation with conditional adversarial networks. arXiv:1611.07004 44. Rabiner, L.R.: A tutorial on hidden Markov models and selected
(2018) applications in speech recognition. Proc. IEEE 77, 257–286
19. Creswell, A., et al.: Generative adversarial networks: an overview. (1989)
IEEE Signal Process. Mag. 35, 53–65 (2018) 45. Ghahramani, Z., Jordan, M.I.: Factorial hidden Markov models.
20. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the Mach. Learn. 29, 245–273 (1997)
recent architectures of deep convolutional neural networks. Artif 46. Linde, Y., Buzo, A., Gray, R.: An algorithm for vector quantizer
Intell Rev (2020) design. IEEE Trans. Commun. 28, 84–95 (1980)
21. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based 47. Gael, J.V., Teh, Y.W., Ghahramani, Z.: The infinite factorial hid-
learning applied to document recognition. Proc. IEEE 86, 2278– den Markov model. In: Proceedings of the 21st International
2324 (1998) Conference on Neural Information Processing Systems, pp. 1697–
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks 1704 (2008)
for large-scale image recognition. arXiv:1409.1556 (2015) 48. Alam, M. R., Bennamoun, M., Togneri, R., Sohel, F.: A deep
23. Stone, J.V.: Principal component analysis and factor analysis. neural network for audio-visual person recognition. In: IEEE 7th
In: Independent Component Analysis: A Tutorial Introduction, International Conference on Biometrics Theory, Applications and
MITP, pp. 129–135 (2004) Systems (BTAS), pp. 1–6 (2015)
24. Sermanet, P. et al.: OverFeat: integrated recognition, localiza- 49. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Com-
tion and detection using convolutional networks. arXiv:1312.6229 put. Vis. 57, 137–154 (2004)
(2014) 50. Wang, M., Deng, W.: Deep Face Recognition: A Survey.
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only arXiv:1804.06655 (2019)
look once: unified, real-time object detection. arXiv:1506.02640 51. Ding, C., Tao, D.: Robust face recognition via multimodal deep
(2016) face representation. IEEE Trans. Multimed. 17, 2049–2058 (2015)
26. Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object 52. Biten, A.F., Gomez, L., Rusiñol, M., Karatzas, D.: Good News,
detection and instance segmentation. arXiv:1906.09756 (2019) Everyone! Context driven entity-aware captioning for news
27. Thoma, M.: A survey of semantic segmentation. images. arXiv:1904.01475 (2019)
arXiv:1602.06541 (2016) 53. Peri, D., Sah, S., Ptucha, R.: Show, Translate and Tell.
28. Guo, Y., Liu, Y., Georgiou, T., Lew, M.S.: A review of semantic arXiv:1903.06275 (2019)
segmentation using deep neural networks. Int. J. Multimed. Infom. 54. Duan, G., Yang, J., Yang, Y.: Content-based image retrieval
Retr. 7, 87–93 (2018) research. Phys. Proc. 22, 471–477 (2011)
29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks 55. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-
for semantic segmentation. arXiv:1411.4038 (2015) view RGB-D object dataset. In: IEEE International Conference
30. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. on Robotics and Automation, pp. 1817–1824 (2011)
arXiv:1703.06870 (2018)
56. Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: Big- 80. Uijlings, J.R., Sande, K.E., Gevers, T., Smeulders, A.W.: Selective
BIRD: A large-scale 3D database of object instances. In: IEEE search for object recognition. Int. J. Comput. Vis. 104, 154–171
International Conference on Robotics and Automation (ICRA), (2013)
pp. 509–516 (2014) 81. Xiao, Y., Codevilla, F., Gurram, A., Urfalioglu, O., López, A.M.:
57. Choi, S., Zhou, Q.-Y., Miller, S., Koltun, V.: A Large Dataset of Multimodal end-to-end autonomous driving. arXiv:1906.03199
Object Scans. arXiv:1602.02481 (2016) (2019)
58. Tombari, F., Di Stefano, L., Giardino, S.: Online learning for 82. 1.Mohanapriya, D., Mahesh, K.: Chapter 5—an efficient frame-
automatic segmentation of 3D data. In: IEEE/RSJ International work for object tracking in video surveillance. In: The Cognitive
Conference on Intelligent Robots and Systems, pp. 4857–4864 Approach in Cloud Computing and Internet of Things Technolo-
(2011) gies for Surveillance Tracking Systems, pp. 65–74 (2020)
59. Silberman, N., Fergus, R.: Indoor scene segmentation using a 83. Rangesh, A., Trivedi, M.M.: No blind spots: full-surround multi-
structured light sensor. In: International Conference on Computer object tracking for autonomous vehicles using cameras and
Vision Workshops (2011) LiDARs. IEEE Trans. Intelli. Veh. 4, 588–599 (2019)
60. Spinello, L., Arras, K.O.: People detection in RGB-D data. In: 84. Liu, L., et al.: Deep learning for generic object detection: a survey.
Intelligent and Robotic Systems (2011) Int. J. Comput. Vis. 128, 261–318 (2020)
61. Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., Cipolla, 85. Nowlan, S., Platt, J.: A convolutional neural network hand tracker.
R.: SceneNet: Understanding Real World Indoor Scenes With In: Advances in Neural Information Processing Systems, pp. 901–
Synthetic Data. arXiv:1511.07041 (2015) 908 (1995)
62. Kay, W. et al.: The Kinetics Human Action Video Dataset. 86. Ciaparrone, G., et al.: Deep learning in video multi-object track-
arXiv:1705.06950 (2017) ing: a survey. Neurocomputing 381, 61–88 (2020)
63. Mayer, N. et al.: A large dataset to train convolutional networks for 87. Anderson, P. et al.: Vision-and-language navigation: interpreting
disparity, optical flow, and scene flow estimation. In: IEEE Con- visually-grounded navigation instructions in real environments.
ference on Computer Vision and Pattern Recognition (CVPR), pp. In: IEEE/CVF Conference on Computer Vision and Pattern
4040–4048 (2016) Recognition, pp. 3674–3683 (2018)
64. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalis- 88. Wang, X. et al.: Reinforced Cross-Modal Matching and Self-
tic open source movie for optical flow evaluation. Comput. Vis. Supervised Imitation Learning for Vision-Language Navigation.
ECCV 2012, 611–625 (2012) arXiv:1811.10092 (2019)
65. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. 89. Das, A. et al.: Embodied question answering. In: Proceedings of
Intell. 17, 185–203 (1981) the IEEE Conference on Computer Vision and Pattern Recogni-
66. Wang, W., Fu, Y., Pan, Z., Li, X., Zhuang, Y.: Real-time driv- tion, pp. 1–10 (2018)
ing scene semantic segmentation. IEEE Access 8, 36776–36788 90. Yu, L. et al.: Multi-target embodied question answering. In: Pro-
(2020) ceedings of the IEEE Conference on Computer Vision and Pattern
67. Jiao, L., et al.: A survey of deep learning-based object detection. Recognition, pp. 6309–6318 (2019)
IEEE Access 7, 128837–128868 (2019) 91. Wang, A., Lu, J., Wang, G., Cai, J., Cham, T.-J.: Multi-modal
68. Dilawari, A., Khan, M.U.G.: ASoVS: abstractive summarization unsupervised feature learning for RGB-D scene labeling. In: Com-
of video sequences. IEEE Access 7, 29253–29263 (2019) puter Vision—ECCV, pp. 453–467 (2014)
69. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random 92. Dargan, S., Kumar, M.: A comprehensive survey on the biomet-
fields: probabilistic models for segmenting and labeling sequence ric recognition systems based on physiological and behavioral
data. In: Proceedings of the Eighteenth International Conference modalities. Expert Syst. Appl. 143, 113114 (2020)
on Machine Learning, pp. 282–289 (2001) 93. Ammour, B., Boubchir, L., Bouden, T., Ramdani, M.: Face-Iris
70. Shao, L., Zhu, F., Li, X.: Transfer learning for visual catego- multimodal biometric identification system. Electronics 9, 85
rization: a survey. IEEE Trans. Neural Netw. Learn. Syst. 26, (2020)
1019–1034 (2015) 94. Namin, S.T., Najafi, M., Salzmann, M., Petersson, L.: Cutting
71. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep edge: soft correspondences in multimodal scene parsing. In: IEEE
Boltzmann machines. J. Mach. Learn. Res. 15(1), 2949–2980 International Conference on Computer Vision (ICCV), pp. 1188–
(2014) 1196 (2015)
72. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: 95. Zou, C., Guo, R., Li, Z., Hoiem, D.: Complete 3D scene parsing
Artificial Intelligence and Statistics, pp. 448–455 (2009) from an RGBD image. Int. J. Comput. Vis. 127, 143–162 (2019)
73. Koo, J.H., Cho, S.W., Baek, N.R., Kim, M.C., Park, K.R.: 96. Escalera, S., Athitsos, V., Guyon, I.: Challenges in multimodal
CNN-based multimodal human recognition in surveillance envi- gesture recognition. J. Mach. Learn. Res. 17, 1–54 (2016)
ronments. Sensors 18, 3040 (2018) 97. Nishida, N., Nakayama, H.: Multimodal gesture recognition
74. Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich feature hier- using multi-stream recurrent neural network. In: Revised Selected
archies for accurate object detection and semantic segmentation. Papers of the 7th Pacific-Rim Symposium on Image and Video
arXiv:1311.2524 (2014) Technology, pp. 682–694 (2015)
75. Girshick, R.: Fast R-CNN. arXiv:1504.08083 (2015) 98. Miao, Q. et al.: Multimodal gesture recognition based on the
76. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards ResC3D network. In: IEEE International Conference on Com-
real-time object detection with region proposal networks. puter Vision Workshops (ICCVW), pp. 3047–3055 (2017)
arXiv:1506.01497 (2016) 99. Tran, D., Ray, J., Shou, Z., Chang, S.-F., Paluri, M.: Con-
77. Lin, T.-Y. et al.: Feature pyramid networks for object detection. vNet architecture search for spatiotemporal feature learning.
arXiv:1612.03144 (2017) arXiv:1708.05038 (2017)
78. Liu, W. et al.: SSD: single shot multibox detector, pp. 21–37. 100. Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving
arXiv:1512.02325 (2016) image-text embeddings. In: IEEE Conference on Computer Vision
79. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss and Pattern Recognition (CVPR), pp. 5005–5013 (2016)
for dense object detection. arXiv:1708.02002 (2018) 101. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database:
learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35,
119:1–119:12 (2016)
102. Lin, T.-Y., Yin Cui, Belongie, S., Hays, J.: Learning deep represen- 126. Fan, C. et al.: Heterogeneous memory enhanced multimodal
tations for ground-to-aerial geolocalization. In: IEEE Conference attention model for video question answering. In: CVPR, pp.
on Computer Vision and Pattern Recognition (CVPR), pp. 5007– 1999–2007 (2019)
5015 (2015) 127. Le, et al.: Hierarchical Conditional Relation Networks for Video
103. Vo, N. et al.: Composing text and image for image retrieval—an Question Answering. arXiv:2002.10698 (2020)
empirical odyssey. In: Proceedings of the IEEE Conference on 128. Laina, I., et al.: Towards unsupervised image captioning with
Computer Vision and Pattern Recognition, pp. 6439–6448 (2019) shared multimodal embeddings. In: ICCV, pp. 7414–7424 (2019)
104. Xu, Y.: Deep learning in multimodal medical image analysis. In: 129. Jang, Y., et al.: Video question answering with spatio-temporal
Health Information Science, pp. 193–200 (2019) reasoning. Int. J. Comput. Vis. 127, 1385–1412 (2019)
105. Shi, F., et al.: Review of artificial intelligence techniques in imag- 130. Wang, W., et al.: A survey of zero-shot learning: settings, methods,
ing data acquisition, segmentation and diagnosis for COVID-19. and applications. ACM Trans. Intell. Syst. Technol. 10, 13:1–
IEEE Rev. Biomed. Eng. 1, 2020 (2020) 13:37 (2019)
106. Santosh, K.C.: AI-driven tools for coronavirus outbreak: need of 131. Wei, L., et al.: A single-shot multi-level feature reused neural
active learning and cross-population train/test models on multitu- network for object detection. Vis. Comput. (2020). https://doi.
dinal/multimodal data. J. Med. Syst. 44, 93 (2020) org/10.1007/s00371-019-01787-3
107. Wang, X., et al.: Convergence of edge computing and deep learn- 132. Hascoet, T., et al.: Semantic embeddings of generic objects for
ing: a comprehensive survey. IEEE Commun. Surv. Tutorials 1, zero-shot learning. J. Image Video Proc. 2019, 13 (2019)
2020 (2020) 133. Liu, Y., et al.: Attribute attention for semantic disambiguation in
108. Ruder, S.: An Overview of Multi-Task Learning in Deep Neural zero-shot learning. In: ICCV, pp. 6697–6706 (2019)
Networks. arXiv:1706.05098 (2017) 134. Li, K., et al.: Rethinking zero-shot learning: a conditional visual
109. Ruder, S., Bingel, J., Augenstein, I., Søgaard, A.: Latent Multi- classification perspective. In: ICCV, pp. 3582–3591 (2019)
task Architecture Learning. arXiv:1705.08142 (2018) 135. Liu, Y., Tuytelaars, T.: A: deep multi-modal explanation model
110. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 for zero-shot learning. IEEE Trans. Image Process. 29, 4788–4803
(1997) (2020)
111. Duong, L., Cohn, T., Bird, S., Cook, P. low resource depen- 136. Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating
dency parsing: cross-lingual parameter sharing in a neural network networks for zero-shot learning. In: CVPR, pp. 5542–5551 (2018)
parser. In: Proceedings of the 53rd Annual Meeting of the Asso- 137. Kumar, Y. et al.: Harnessing GANs for Zero-shot Learning of New
ciation for Computational Linguistics and the 7th International Classes in Visual Speech Recognition. arXiv:1901.10139 (2020)
Joint Conference on Natural Language Processing, pp. 845–850 138. Zhang, X., et al.: Online multi-object tracking with pedestrian
(2015) re-identification and occlusion processing. Vis. Comput. (2020).
112. Peng, Y., et al.: CCL: cross-modal correlation learning with multi- https://doi.org/10.1007/s00371-020-01854-0
grained fusion by hierarchical network. IEEE Trans. Multimed. 139. Abbass, M.Y., et al.: Efficient object tracking using hierarchical
20(2), 405–420 (2017) convolutional features model and correlation filters. Vis. Comput.
113. Palaskar, S., Sanabria, R., Metze, F.: Transfer learning for multi- (2020). https://doi.org/10.1007/s00371-020-01833-5
modal dialog. Comput. Speech Lang. 64, 101093 (2020) 140. Xi, P.: An integrated approach for medical abnormality detection
114. Libovický, J., Helcl, J.: Attention strategies for multi-source using deep patch convolutional neural networks. Vis. Comput. 36,
sequence-to-sequence learning. In: Proceedings of the 55th 1869–1882 (2020)
Annual Meeting of the Association for Computational Linguistics 141. Parida, K., et al.: Coordinated joint multimodal embeddings for
(Vol. 2: Short Papers), pp. 196–202 (2017) generalized audio-visual zero-shot classification and retrieval of
115. He, G., et al.: Classification-aware semi-supervised domain adap- videos. In: CVPR, pp. 3251–3260 (2020)
tation. In: CVPR, pp. 964–965 (2020) 142. Lee, J. A., et al.: Deep step pattern representation for multimodal
116. Rao, R., et al.: Quality and relevance metrics for selection of retinal image registration. In: CVPR, pp. 5077–5086 (2019)
multimodal pretraining data. In: CVPR, pp. 956–957 (2020) 143. Hashemi Hosseinabad, S., Safayani, M., Mirzaei, A.: Multiple
117. Bucci, S., Loghmani, M.R., Caputo, B.: Multimodal Deep Domain answers to a question: a new approach for visual question answer-
Adaptation. arXiv:1807.11697 (2018) ing. Vis. Comput. (2020). https://doi.org/10.1007/s00371-019-
118. Zhang, Y., Tan, H., Bansal, M.: Diagnosing the Environment Bias 01786-4
in Vision-and-Language Navigation. arXiv:2005.03086 (2020) 144. Yan, P., et al.: Adversarial image registration with application for
119. Landi, F., et al.: Perceive, Transform, and Act: Multi- mr and trus image fusion. arXiv:1804.11024 (2018)
Modal Attention Networks for Vision-and-Language Navigation. 145. Horry, Michael. J. et al.: COVID-19 Detection through Transfer
arXiv:1911.12377 (2020) Learning using Multimodal Imaging Data. IEEE Access 1 (2020)
120. Krantz, et al.: Beyond the Nav-Graph: Vision-and-Language Nav- https://doi.org/10.1109/ACCESS.2020.3016780
igation in Continuous Environments. arXiv:2004.02857 (2020) 146. Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G.: A survey of
121. Nguyen, K., et al.: Vision-based Navigation with Language-based deep learning techniques for autonomous driving. J. Field Robot.
Assistance via Imitation Learning with Indirect Intervention. 37, 362–386 (2020)
arXiv:1812.04155 (2019) 147. Metzger, A., Drewing, K.: Memory influences haptic perception
122. Cangea, et al.: VideoNavQA: Bridging the Gap between Visual of softness. Sci. Rep. 9, 14383 (2019)
and Embodied Question Answering. arXiv:1908.04950 (2019) 148. Guclu, O., Can, A.B.: Integrating global and local image features
123. Zarbakhsh, P., Demirel, H.: 4D facial expression recognition using for enhanced loop closure detection in RGB-D SLAM systems.
multimodal time series analysis of geometric landmark-based Vis. Comput. 36, 1271–1290 (2020)
deformations. Vis. Comput. 36, 951–965 (2020) 149. Van Brummelen, J., et al.: Autonomous vehicle perception: the
124. Joze, H.R.V., et al.: MMTM: multimodal transfer module for CNN technology of today and tomorrow. Transp. Res. C Emerg. Tech-
fusion. In: CVPR, pp. 13289–13299 (2020) nol. 89, 384–406 (2018)
125. Cadene, et al.: MUREL: multimodal relational reasoning for 150. He, M., et al.: A review of monocular visual odometry. Vis. Com-
visual question answering. In: CVPR, pp. 1989–1998 (2019) put. 36, 1053–1065 (2020)
151. Liu, S., et al.: Accurate and robust monocular SLAM with omni-
directional cameras. Sensors 19, 4494 (2019)
152. Mur-Artal, R., Tardos, J.D.: ORB-SLAM2: an open-source 181. Wu, X., Sahoo, D. Hoi, S.C.H.: Recent Advances in Deep Learn-
SLAM system for monocular. Stereo RGB-D Cameras (2016). ing for Object Detection. arXiv:1908.03673 (2019)
https://doi.org/10.1109/TRO.2017.2705103 182. Pouyanfar, S., et al.: A survey on deep learning: algorithms, tech-
153. Engel, J., et al.: LSD-SLAM: large-scale direct monocular SLAM. niques, and applications. ACM Comput. Surv. 51, 92:1–92:36
In: Computer Vision—ECCV, pp. 834–849 (2014) (2018)
154. Engel, J., et al.: Direct Sparse Odometry. arXiv:1607.02565 183. Ophoff, T., et al.: Exploring RGB+depth fusion for real-time
(2016) object detection. Sensors 19, 866 (2019)
155. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous 184. Luo, Q., et al.: 3D-SSD: learning hierarchical features from RGB-
driving. In: CVPR, pp. 11621–11631 (2020) D images for amodal 3D object detection. Neurocomputing 378,
156. Gatys, L., et al.: A Neural Algorithm of Artistic Style. 364–374 (2020)
arXiv:1508.06576 (2015) 185. Zhang, S., et al.: Video object detection base on rgb and optical
157. Lian, G., Zhang, K.: Transformation of portraits to Picasso’s flow analysis. In: CCHI, pp. 280–284 (2019). https://doi.org/10.
cubism style. Vis. Comput. 36, 799–807 (2020) 1109/CCHI.2019.8901921
158. Wang, L., et al.: Photographic style transfer. Vis. Comput. 36, 186. Simon, M., et al.: Complexer-YOLO: real-time 3D object detec-
317–331 (2020) tion and tracking on semantic point clouds. In: CVPRW, pp.
159. Zhang, Y. et al.: Multimodal style transfer via graph cuts. In: 1190–1199 (2019). https://doi.org/10.1109/CVPRW.2019.00158
ICCV, pp. 5943–5951 (2019) 187. Tu, S., et al.: Passion fruit detection and counting based on mul-
160. Wang, X., et al.: Multimodal Transfer: A Hierarchical Deep tiple scale faster R-CNN using RGB-D images. Precision Agric.
Convolutional Neural Network for Fast Artistic Style Transfer. 21, 1072–1091 (2020)
arXiv:1612.01895 (2017) 188. Li, J., et al.: Facial expression recognition with faster R-CNN.
161. Jing, Y., et al.: Neural Style Transfer: A Review. Proc. Comput. Sci. 107, 135–140 (2017)
arXiv:1705.04058 (2018) 189. Liu, S.: Enhanced situation awareness through CNN-based deep
162. DeepArts: turn your photos into art. https://deepart.io (2020). multimodal image fusion. OE 59, 053103 (2020)
Accessed 18 Aug 2020 190. Michael, Y.B., Rosenhahn, V.M.: Multimodal Scene Understand-
163. Waymo: Waymo safety report: On the road to fully self-driving. ing, 1st edn. Academic Press, London (2019)
https://waymo.com/safety (2020). Accessed 18 Aug 2020 191. Djuric, N., et al.: MultiXNet: Multiclass Multistage Multimodal
164. Wang, Z., Wu, Y., Niu, Q.: Multi-sensor fusion in automated driv- Motion Prediction. arXiv:2006.02000 (2020)
ing: a survey. IEEE Access 8, 2847–2868 (2020) 192. Asvadi, A., et al.: Multimodal vehicle detection: fusing 3D-
165. Ščupáková, K., et al.: A patch-based super resolution algorithm LIDAR and color camera data. Pattern Recogn. Lett. 115, 20–29
for improving image resolution in clinical mass spectrometry. Sci. (2018)
Rep. 9, 2915 (2019) 193. Mahmud, T., et al.: A novel multi-stage training approach for
166. Bashiri, F.S., et al.: Multi-modal medical image registration with human activity recognition from multimodal wearable sensor data
full or partial data: a manifold learning approach. J. Imag. 5, 5 using deep neural network. IEEE Sens. J. (2020). https://doi.org/
(2019) 10.1109/JSEN.2020.3015781
167. Chen, C., et al. Progressive Feature Alignment for Unsupervised 194. Zhang, W., et al.: Robust Multi-Modality Multi-Object Tracking.
Domain Adaptation. arXiv:1811.08585 (2019) arXiv:1909.03850 (2019)
168. Jin, X., et al.: Feature Alignment and Restoration for Domain 195. Kandylakis, Z., et al.: Fusing multimodal video data for detecting
Generalization and Adaptation. arXiv:2006.12009 (2020) moving objects/targets in challenging indoor and outdoor scenes.
169. Guan, S.-Y., et al.: A review of point feature based medical image Remote Sens. 11, 446 (2019)
registration. Chin. J. Mech. Eng. 31, 76 (2018) 196. Yang, R., et al.: Learning target-oriented dual attention for robust
170. Dapogny, A., et al.: Deep Entwined Learning Head Pose and RGB-T tracking. In: ICIP, pp. 3975–3979 (2019). https://doi.org/
Face Alignment Inside an Attentional Cascade with Doubly- 10.1109/ICIP.2019.8803528
Conditional fusion. arXiv:2004.06558 (2020) 197. Lan, X., et al.: Modality-correlation-aware sparse representation
171. Yue, L., et al.: Attentional alignment network. In: BMVC (2018) for RGB-infrared object tracking. Pattern Recogn. Lett. 130, 12–
172. Liu, Z., et al.: Semantic Alignment: Finding Semanti- 20 (2020)
cally Consistent Ground-truth for Facial Landmark Detection. 198. Bayoudh, K., et al.: Transfer learning based hybrid 2D–3D CNN
arXiv:1903.10661 (2019) for traffic sign recognition and semantic road detection applied in
173. Hao, F., et al.: Collect and select: semantic alignment metric learn- advanced driver assistance systems. Appl. Intell. (2020). https://
ing for few-shot learning. In: CVPR, pp. 8460–8469 (2019) doi.org/10.1007/s10489-020-01801-5
174. Wang, B., et al.: Controllable Video Captioning with 199. Shamwell, E.J., et al.: Unsupervised deep visual-inertial odometry
POS Sequence Guidance Based on Gated Fusion Network. with online error correction for RGB-D imagery. IEEE Trans. Pat-
arXiv:1908.10072 (2019) tern Anal. Mach. Intell. (2019). https://doi.org/10.1109/TPAMI.
175. Wu, M., et al.: Audio caption: listen and tell. In: ICASSP, pp. 2019.2909895
830–834 (2019) https://doi.org/10.1109/ICASSP.2019.8682377 200. Abavisani, M., et al.: Improving the Performance of Unimodal
176. Pan, B., et al. Spatio-temporal graph for video captioning with Dynamic Hand-Gesture Recognition with Multimodal Training.
knowledge distillation. In: CVPR, pp. 10870–10879 (2020) arXiv:1812.06145 (2019)
177. Liu, X., Xu, Q., Wang, N.: A survey on deep neural network-based 201. Yang, X., et al.: A survey on canonical correlation analysis. IEEE
image captioning. Vis. Comput. 35, 445–470 (2019) Trans. Knowl. Data Eng. 1, 2019 (2019)
178. Abbass, M.Y., et al.: A survey on online learning for visual track- 202. Hardoon, D.R., et al.: Canonical correlation analysis: an overview
ing. Vis. Comput. (2020). https://doi.org/10.1007/s00371-020- with application to learning methods. Neural Comput. 16, 2639–
01848-y 2664 (2004)
179. Guo, Y., et al.: Deep learning for visual understanding: a review. 203. Chandar, S., et al.: Correlational neural networks. Neural Comput.
Neurocomputing 187, 27–48 (2016) 28, 257–285 (2016)
180. Hatcher, W.G., Yu, W.: A survey of deep learning: platforms, 204. Engilberge, M., et al.: Finding beans in burgers: deep semantic-
applications and emerging research trends. IEEE Access 6, visual embedding with localization. In: CVPR, pp. 3984–3993
24411–24432 (2018) (2018)
205. Shahroudy, A., et al.: Deep Multimodal Feature Analysis for Person re-identification. Vis. Comput. (2020). https://doi.org/10.
Action Recognition in RGB+D Videos. arXiv:1603.07120 (2016) 1007/s00371-020-02015-z
206. Srivastava, N., et al.: Multimodal learning with deep Boltzmann 229. Xu, T., et al.: AttnGAN: fine-grained text to image generation
machines. J. Mach. Learn. Res. 15, 2949–2980 (2014) with attentional generative adversarial networks. In: IEEE/CVF
207. Bank, D., et al.: Autoencoders. arXiv:2003.05991 (2020) Conference on Computer Vision and Pattern Recognition, pp.
208. Bhatt, G., Jha, P., Raman, B.: Representation learning using step- 1316–1324 (2018)
based deep multi-modal autoencoders. Pattern Recogn. 95, 12–23 230. Huang, X., et al.: Multimodal unsupervised image-to-image trans-
(2019) lation. In: CVPR, pp. 172–189 (2018)
209. Liu, Y., Feng, X., Zhou, Z.: Multimodal video classification with 231. Toriya, H., et al.: SAR2OPT: image alignment between multi-
Khaled Bayoudh received a Bachelor's degree in Computer Science from the Higher Institute of Computer Science and Mathematics of Monastir (ISIMM), University of Monastir, Monastir, Tunisia, in 2014. He then obtained a Master's degree in Highway and Traffic Engineering: Curricular Reform for Mediterranean Area (HiT4Med) from the National Engineering School of Sousse (ENISo), University of Sousse, Sousse, Tunisia, in 2017. In 2018, he received the M1 Master's degree in Software Engineering from ISIMM. He is currently a PhD student at the National School of Engineering of Monastir (ENIM) and a researcher in the Electronics and Micro-electronics Laboratory (EµE) at the Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia. His research focuses on Artificial Intelligence, Machine Learning, Deep Learning, Multimodal and Hybrid Learning, and Intelligent Systems.

Raja Knani obtained a Master's degree in Micro and Nanoelectronics from the FSM, University of Monastir, Monastir, Tunisia, in 2014. She is currently a PhD student and a researcher in the Electronics and Microelectronics Laboratory (EµE) at the FSM, University of Monastir, Monastir, Tunisia. She is particularly interested in Artificial Intelligence, Human-computer interaction, and Gesture recognition and tracking.

Abdellatif Mtibaa is currently a full Professor in Micro-Electronics, Hardware Design, and Embedded Systems in the Electrical Department at the National School of Engineering of Monastir (ENIM) and Head of the Circuits Systems Reconfigurable-ENIM-Group at the Electronics and Microelectronics Laboratory. He received a Diploma in Electrical Engineering in 1985 and a PhD degree in Electrical Engineering in 2000. His current research interests include System on Programmable Chip, high-level synthesis, rapid prototyping, and reconfigurable architectures for real-time multimedia applications. Dr. Mtibaa has authored or co-authored over 200 papers in international journals and conferences, served on the technical program committees of several international conferences, and co-organized several international conferences.