
The Visual Computer (2022) 38:2939–2970

https://doi.org/10.1007/s00371-021-02166-7

SURVEY

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
Khaled Bayoudh1 · Raja Knani2 · Fayçal Hamdaoui3 · Abdellatif Mtibaa1

Accepted: 15 May 2021 / Published online: 10 June 2021


© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021

Abstract
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer
vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing
universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the
multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities,
often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for
researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep
multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration
and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from
the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both
traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and
zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving
problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and
provide insights and directions for future research.

Keywords Applications · Computer vision · Datasets · Deep learning · Sensory modalities · Multimodal learning

Corresponding author: Khaled Bayoudh

1 Electrical Department, National Engineering School of Monastir (ENIM), Laboratory of Electronics and Micro-electronics (LR99ES30), Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia
2 Physics Department, Laboratory of Electronics and Micro-electronics (LR99ES30), Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia
3 Electrical Department, Laboratory of Control, Electrical Systems and Environment (LASEE), National Engineering School of Monastir (ENIM), University of Monastir, Monastir, Tunisia

1 Introduction

In recent years, much progress has been made in the field of artificial intelligence thanks to the implementation of machine learning methods. In general, these methods involve a variety of intelligent algorithms for pattern recognition and data processing. Usually, several sensors with specific characteristics are employed to obtain and analyze global and local patterns in a uniform way. These sensors are generally very versatile in terms of coverage, size, manufacturing cost, and accuracy. Besides, the availability of vast amounts of data (big data), coupled with significant technological advances and substantial improvements in hardware implementation techniques, has led the machine learning community to turn to deep learning to find sustainable solutions to a given problem. Deep learning, also known as representation-based learning [2], is a particular approach to machine learning that is gaining popularity due to its predictive power and portability. The work presented in [3] showed a technical transition from machine learning to deep learning by systematically highlighting the main concepts, algorithms, and trends in deep learning.


In practice, the extraction and synthesis of rich information from a multidimensional data space require the use of an intermediate mechanism to facilitate decision making in intelligent systems. Deep learning has been used in many practices, and it has been shown that its performance can be greatly improved in several disciplines, including computer vision. This line of research is part of the rich field of deep learning, which typically deals with visual information of different types and scales to perform complex tasks. Currently, deep learning algorithms have demonstrated their potential and applicability in other active areas such as natural language processing, machine translation, and speech recognition, performing comparably or even better than humans.

A large number of computer vision researchers focus each year on developing vision systems that enable machines to mimic human behavior. For example, some intelligent machines can use computer vision technology to simultaneously map their behavior, detect potential obstacles, and track their location. By applying computer vision to multimodal applications, complex operational processes can be automated and made more efficient. Here, the key challenge is to extract visual attributes from one or more data streams (also called modalities) with different shapes and dimensions by learning how to fuse the extracted heterogeneous features and project them into a common representation space, which is referred to as deep multimodal learning in this work.

In many cases, a set of heterogeneous cues from multiple modalities and sensors can provide additional knowledge that reflects the contextual nature of a given task. In the arena of multimodality, a given modality depends on how specific media and related features are structured within a conceptual architecture. Such modalities may include textual, visual, and auditory modalities, involving specific ways or mechanisms to encode heterogeneous information harmoniously.

In this study, we mainly focus on visual modalities, such as images, which are sets of discrete signals from a variety of image sensors. The environment in which we live generally includes many modalities in which we can see objects, hear tones, feel textures, smell aromas, and so on. For example, the audiovisual modalities are complementary to each other, where the acoustic and visual attributes come from two different physical entities. However, while combining different modalities or data sources to improve performance is an attractive goal, in practice it is far from trivial to separate noise, shared concepts, and conflicts between data sources. Moreover, the lack of labeled multimodal data in the current literature can lead to reduced flexibility and accuracy, often requiring cooperation between different modalities. In this paper, we review recent deep multimodal learning techniques to put forward typical frameworks and models to advance the field. These networks show the utility of learning hierarchical representations directly from raw data to achieve maximum performance on many heterogeneous datasets. Thus, it will be possible to design intelligent systems that can quickly answer questions, reason, and discuss what is seen in different views in different scenarios. Classically, there are three general approaches to multimodal data fusion: early fusion, late fusion, and hybrid fusion.

In addition to surveying recent advances in deep multimodal learning itself, we also discuss the main methods of multimodal fusion and review the latest advanced applications and multimodal datasets popular in the computer vision community.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the differences between similar previous studies and our work. Section 3 reviews recent advances in deep multimodal algorithms, the motivation behind them, and commonly used fusion techniques, with a focus on deep learning-based algorithms. In Sects. 4 and 5, we present more advanced multimodal applications and benchmark datasets that are very popular in the computer vision community. In Sect. 6, we discuss the limitations and challenges of vision-based deep multimodal learning. The final section then summarizes the whole paper and points out a roadmap for future research.

2 Comparison with previous surveys

In recent years, the computer vision community has paid more attention to deep learning algorithms due to their exceptional capabilities compared to traditional handcrafted methods. A considerable amount of work has been conducted under the general topic of deep learning in a variety of application domains. In particular, these include several excellent surveys of global deep learning models, techniques, trends, and applications [4,180,182], a survey of deep learning algorithms in the computer vision community [179], a survey that focuses directly on the problem of deep object detection and its recent advances [181], and a survey of deep learning models including the generative adversarial network and its related challenges and applications [19]. Nonetheless, the applications discussed in these surveys include only a single modality as a data source for data-driven learning. However, most modern machine learning applications involve more than one modality (e.g., visual and textual modalities), such as embodied question answering, vision-and-language navigation, etc. Therefore, it is of vital importance to learn more complex and cross-modal information from different sources, types, and data distributions. This is where deep multimodal learning comes into play.

From the early works of speech recognition to recent advances in language- and vision-based tasks, deep multimodal learning technologies have demonstrated significant


progress in improving cognitive performance and interoperability of prediction models in a variety of ways. To date, deep multimodal learning has been the most important evolution in the field of multimodal machine learning, drawing on the deep learning paradigm and multimodal big data computing environments. In recent years, many pieces of research based on multimodal machine learning have been proposed [37], but to the best of our knowledge, there is no recent work that directly addresses the latest advances in deep multimodal learning particularly for the computer vision community. A thorough review and synthesis of existing work in this domain, especially for researchers pursuing this topic, is essential for further progress in the field of deep learning. However, there is still relatively little recent work directly addressing this research area [32–37]. Since multimodal learning is not a new topic, there is considerable overlap between this work and the surveys of [32–37], which needs to be highlighted and discussed.

Recently, the valuable works of [32,33] considered several multimodal practices that apply only to specific multimodal use cases and applications, such as emotion recognition [32] and human activity and context recognition [33]. More specifically, they highlighted the impact of multimodal feature representation and multilevel fusion on system performance and the state of the art in each of these application areas. Furthermore, some cutting-edge works [34,36] have been proposed in recent years that address the mechanism of integrating and fusing multimodal representations inside deep learning architectures by showing the reader the possibilities this opens up for the artificial intelligence community. Likewise, Guo et al. [35] provided a comprehensive overview of deep multimodal learning frameworks and models, focusing on one of the main challenges of multimodal learning, namely multimodal representation. They summarized the main issues, advantages, and disadvantages for each framework and typical model. Another excellent survey paper was recently published by Baltrušaitis et al. [37], which reviews recent developments in multimodal machine learning and expresses them in a general taxonomic way. Here, the authors identified five levels of multimodal data combination: representation, translation, alignment, fusion, and co-learning. It is important to note here that, unlike our survey, which focuses primarily on computer vision tasks, the study published by Baltrušaitis et al. [37] was aimed mainly at both the natural language processing and computer vision communities. In this article, we review recent advances in deep multimodal learning and organize them into six topics: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. Beyond the above work, we focus primarily on cutting-edge applications of deep multimodal learning in the field of computer vision and related popular datasets. Moreover, most of the papers we reviewed are recent and have been published in high-quality conferences and journals such as The Visual Computer, ICCV, and CVPR. A comprehensive overview of multimodal technologies—their limitations, perspectives, trends, and challenges—is also provided in this article to deepen and improve the understanding of the main directions for future progress in the field. In summary, our survey is similar to the closest works [35,37], which discuss recent advances in deep multimodal learning, with a special focus on computer vision applications. The surveys we discussed are summarized in Table 1.

3 Deep multimodal learning architectures

In this section, we discuss deep multimodal learning and its main algorithms. To do so, we first briefly review the history of deep learning and then focus on the main motivations behind this research to answer the question of how to reduce heterogeneity biases across different modalities. We then outline the perspective of multimodal representation and what distinguishes it from the unimodal space. We next introduce recent approaches for combining modalities. Next, we highlight the difference between multimodal learning and multitask learning. Finally, we discuss multimodal alignment, multimodal transfer learning, and zero-shot learning in detail in Sects. 3.6, 3.7, and 3.8, respectively.

3.1 Brief history of deep learning

Historically, artificial neural networks date back to the 1950s and the efforts of psychologists to gain a better understanding of how the human brain works, including the work of F. Rosenblatt [8]. In 1960, F. Rosenblatt [8] proposed the perceptron as part of supervised learning algorithms; it is used to compute a set of activations, meaning that for a given neuron and input vector, it performs the sum weighted by a set of weights, adds a bias, and applies an activation function. An activation function (e.g., sigmoid, tanh, etc.), also called a nonlinearity, uses the derived patterns to perform its nonlinear transformation. As a deep variant of the perceptron, the multilayer perceptron, originally designed by [9] in 1986, is a special class of feed-forward neural networks. Structurally, it is a stack of single-layer perceptrons. In other words, this structure gives the meaning of "deep": a network can be defined by its depth (i.e., the number of hidden layers). Typically, a multilayer perceptron with one or two hidden layers does not require much data to learn informative features due to the reduced number of parameters to be trained. A multilayer perceptron can be considered a deep neural network if the number of hidden layers is greater than one, as confirmed by [10,11].
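As a minimal illustration of the computation just described (a sketch, not code from the paper; the sigmoid activation and the toy weights are arbitrary choices), a single perceptron-style neuron can be written as:

```python
import numpy as np

def perceptron_forward(x, w, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Single neuron: weighted sum of the inputs plus a bias,
    passed through a nonlinearity (a sigmoid by default)."""
    z = np.dot(w, x) + b          # weighted sum + bias
    return activation(z)          # nonlinear activation

# Toy usage: a 3-dimensional input vector and arbitrary weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
print(perceptron_forward(x, w, b=0.2))
```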


Table 1 Summary of reviewed deep multimodal learning surveys

Refs. | Year | Publication | Scope | Multimodality?
[4] | 2015 | Nature | A comprehensive overview of deep learning and related applications | ✗
[19] | 2018 | IEEE Signal Processing Magazine | An overview of generative adversarial networks and related challenges in their theory and application | ✗
[179] | 2016 | Neurocomputing | A review of deep learning algorithms in computer vision for image classification, object detection, image retrieval, semantic segmentation and human pose estimation | ✗
[180] | 2018 | IEEE Access | A survey of deep learning: platforms, applications and trends | ✗
[181] | 2019 | arXiv | A survey of deep learning and its recent advances for object detection | ✗
[182] | 2018 | ACM Comput. Surv. | A survey of deep learning: algorithms, techniques, and applications | ✗
[32] | 2019 | Book | A survey on multimodal emotion detection and recognition | ✓
[33] | 2018 | Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies | A survey on multimodal deep learning for activity and context detection | ✓
[34] | 2017 | IEEE Signal Processing Magazine | A survey of recent progress and trends in deep multimodal learning | ✓
[35] | 2019 | IEEE Access | A comprehensive survey of deep multimodal learning and its frameworks | ✓
[36] | 2015 | Proceedings of the IEEE | A comprehensive survey of methods, challenges, and prospects for multimodal data fusion | ✓
[37] | 2017 | IEEE Transactions on Pattern Analysis and Machine Intelligence | A survey and taxonomy on multimodal machine learning algorithms | ✓

In this regard, many more advances in the field are likely to follow, such as the convolutional neural networks of LeCun et al. [21] in 1998 and the spectacular deep network results of Krizhevsky et al. [7] in 2012, opening the door to many real-world domains including computer vision.

3.2 Motivation

Recently, the amount of visual data has exploded due to the widespread use of available low-cost sensors, leading to superior performance in many computer vision tasks (see Fig. 1). Such visual data can include still images, video sequences, etc., which can be used as the basis for constructing multimodal models. Unlike the static image, the video stream provides a large amount of meaningful information that takes into account the spatiotemporal appearance of successive frames, so it can be easily used and analyzed for various real-world use cases, such as video synthesis and description [68], and facial expression recognition [123]. The spatiotemporal concept refers to the temporal and spatial processing of a series of video sequences with variable duration. In multimodal learning analytics, the audio-visual-textual features are extracted from a video sequence to learn joint features covering the three modalities. Efficient learning of large datasets at multiple levels of representation leads to faster content analysis and recognition of the millions of videos produced daily. The main reason for using multimodal data sources is that it is possible to extract complementary and richer information coming from multiple sensors, which can provide much more optimistic results than a single input. Some monomodal learning systems have significantly increased their robustness and accuracy, but in many use cases, there are shortcomings in terms of the universality of different feature levels and inaccuracies due to noise and missing concepts. The success of deep multimodal learning techniques has been driven by many factors that have led many researchers to adopt these methods to improve model performance. These factors include large volumes of widely usable multimodal datasets, more powerful computers with fast GPUs, and high-quality feature representation at multiple scales. Here, a practical challenge for the deep learning community is to strengthen correlation and redundancy between modalities through typical models and powerful mechanisms.

3.3 Multimodal representation

Multi-sensory perception primarily encompasses a wide range of interacting modalities, including audio and video.


Fig. 1 An example of a multimodal pipeline that includes three different modalities

For simplicity, we consider the following temporal multimodal problem, where both audio and video modalities are exploited in a video recognition task (emotion recognition). First, let us consider two input streams of different modalities: X_a = {χ_1^n, ..., χ_T^n} and X_v = {χ_1^m, ..., χ_T^m}, where χ_t^n and χ_t^m refer to the n- and m-dimensional feature vectors of the X_a and X_v modalities occurring at time t, respectively. Next, we combine the two modalities at time t and consider the two unimodal output distributions at different levels of representation. Given ground truth labels Z = {Z_1, ..., Z_T}, we aim here to train a multimodal learning model M that maps both X_a and X_v into the same categorical set of Z. Each parameter of the input audio stream χ_a^T and video stream χ_v^T is synchronized differently in time and space, where χ_a^T ∈ R^i and χ_v^T ∈ R^j, respectively. Here, we can construct two separate unimodal networks from X_a and X_v, denoted, respectively, by N_a and N_v, where N_a: X_a → Y, N_v: X_v → Y, and M = N_a ⊕ N_v. Y denotes the predicted class label of the training samples generated by the output of the constructed networks, and ⊕ indicates the fusion operation. The generated multimodal network M can then recognize the most discriminating patterns in the streaming data by learning a common representation that integrates relevant concepts from both modalities. Figure 2 shows a schematic diagram of the application of the described multimodal problem to the video emotion recognition task.
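To make this setup concrete, the following PyTorch-style sketch (not from the paper; the layer sizes, feature dimensions, and the use of simple concatenation as the fusion operation ⊕ are illustrative assumptions) builds two unimodal networks N_a and N_v and fuses their embeddings into a single classifier over the emotion labels:

```python
import torch
import torch.nn as nn

class UnimodalNet(nn.Module):
    """Maps a per-frame feature vector of one modality to an embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class MultimodalNet(nn.Module):
    """M = N_a (+) N_v: here the fusion (+) is plain concatenation
    followed by a classifier over the emotion classes."""
    def __init__(self, audio_dim, video_dim, num_classes, emb_dim=128):
        super().__init__()
        self.audio_net = UnimodalNet(audio_dim, emb_dim)   # N_a
        self.video_net = UnimodalNet(video_dim, emb_dim)   # N_v
        self.classifier = nn.Linear(2 * emb_dim, num_classes)

    def forward(self, x_a, x_v):
        z = torch.cat([self.audio_net(x_a), self.video_net(x_v)], dim=-1)
        return self.classifier(z)                          # logits over Z

# Toy usage: a batch of 4 synchronized audio/video feature vectors.
model = MultimodalNet(audio_dim=40, video_dim=512, num_classes=7)
logits = model(torch.randn(4, 40), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])
```

Any of the fusion schemes discussed in Sect. 3.4 could replace the concatenation step.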


Fig. 2 A schematic illustration of the method used: the visual modality (video) involves the extraction of facial regions of interest followed by a visual mapping representation scheme. The obtained representations are then temporally fused into a common space. Additionally, the audio descriptions are also generated. The two modalities are then combined using a multimodal fusion operation to predict the target class label (emotion) of the test sample

Therefore, it is necessary to consider the extent to which any such dynamic entity will be able to take advantage of this type of information from several redundant sources. Learning multimodal representation from heterogeneous signals poses a real challenge for the deep learning community. Typically, inter- and intra-modal learning involves the ability to represent an object of interest from different perspectives, in a complementary and semantic context where multimodal information is fed into the network. Another crucial advantage of inter- and intra-modal interaction is the discriminating power of the perceptual model for multisensory stimuli, obtained by exploiting the potential synergies between modalities and their intrinsic representations [112]. Furthermore, multimodal learning involves a significant improvement in perceptual cognition, as many of our senses are involved in processing information from several modalities. Nevertheless, it is essential to learn how to interpret the input signals and summarize their multimodal nature to construct aggregate feature maps across multiple dimensions. In multimodality theory, obtaining a contextual representation from more than one modality has become a vital challenge, which is termed in this study the multimodal representation.

Typically, monomodal representation involves a linear or nonlinear mapping of an individual input stream (e.g., image, video, or sound) into a high-level semantic representation. The multimodal representation leverages the correlation power of each monomodal sensation by aggregating their spatial outputs. Thus, the deep learning model must be adapted to accurately represent the structure and representation space of the source and target modality. For example, a 2D image may be represented by its visual patterns, making it difficult to characterize this data structure using natural modality or other non-visual concepts. As shown in Fig. 3, the textual representation (i.e., a word embedding) is very sparse when compared to the image one, which makes it very challenging to combine these two different representations into a unified model. As another example, when a car is driving autonomously, it typically relies on a LiDAR camera and other embedded sensors (e.g., depth sensors, etc.) [81] to perceive its surroundings. Here, poor weather conditions can affect the visual perception of the environment. Moreover, the high dimensionality of the state space poses a major challenge, since the vehicle can mobilize in both structured and unstructured locations. However, an RGB image is encoded as a discrete space in the form of grid pixels, making it difficult to combine visual and non-visual cues. Therefore, learning a joint embedding is crucial for exploiting the synergies of multimodal data to construct shared representation spaces. This implies the emphasis on multimodal fusion approaches, which will be discussed in the next subsection.

Fig. 3 Difference between visual and textual representation

3.4 Fusion algorithms

The most critical aspect of the combinatorial approach is the flexibility to represent data at different levels of abstraction. By using an intermediate formalism, the learned information can be combined into two or more modalities for a particular hypothesis. In this subsection, we describe common methods for combining multiple modalities, ranging from the conventional to the modern methods.

3.4.1 Conventional methods

3.4.1.1 Typical techniques based

To improve the generalization performance of complex cognitive systems, it is necessary to capture and fuse an appropriate set of informative features from multiple modalities using typical techniques. Traditionally, these range from early to hybrid fusion schemes (see Fig. 4), as sketched in the code example below:

– Early fusion: low-level features that are directly extracted from each modality are fused before being classified.
– Late fusion: also called "decision fusion", which consists of classifying features extracted from separate modalities before fusing them.
– Hybrid fusion: also known as "intermediate fusion", which consists of combining multimodal features of early and late fusion before making a decision.
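As a rough illustration of the first two schemes (a sketch under assumed feature dimensions and off-the-shelf classifiers, not code from the paper), early fusion concatenates the low-level features before a single classifier, while late fusion trains one classifier per modality and aggregates their scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_rgb, X_depth = rng.normal(size=(200, 64)), rng.normal(size=(200, 16))
y = np.arange(200) % 2                     # toy binary labels

# Early fusion: concatenate low-level features, then classify once.
early_clf = LogisticRegression(max_iter=1000).fit(
    np.hstack([X_rgb, X_depth]), y)

# Late fusion: one classifier per modality, then average the scores.
clf_rgb = LogisticRegression(max_iter=1000).fit(X_rgb, y)
clf_depth = LogisticRegression(max_iter=1000).fit(X_depth, y)
late_scores = (clf_rgb.predict_proba(X_rgb) +
               clf_depth.predict_proba(X_depth)) / 2.0
late_pred = late_scores.argmax(axis=1)
```

A hybrid scheme would combine both ideas, merging intermediate representations while still keeping per-modality decisions.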


Feature-level fusion (i.e., early fusion) provides a richness of information from heterogeneous data. The extracted features often lack homogeneity due to the diversity of modalities and disparities in their appearance. Also, this fusion process can generate a single large representation that can lead to prediction errors. In the case of late fusion, techniques such as majority vote [38] and low-rank multimodal fusion [39] may be used to aggregate the final prediction scores of several classifiers. Thus, each modality independently takes the decision, which can reduce the overall performance of the integration process. In the case of intermediate fusion, the spatial combination of intermediate representations of the different data streams is usually produced with varying scales and dimensions, making them more challenging to merge. To overcome this challenge, the authors of [124] designed a simple fusion scheme, called the multimodal transfer module (MMTM), to transfer and hierarchically aggregate shared knowledge from multiple modalities in CNN networks.

Fig. 4 Conventional methods for multimodal data fusion: a Early fusion, b Late fusion, c Hybrid fusion

3.4.1.2 Kernel based

The support vector machine (SVM) [40] classifier has long been used as a learning algorithm for a wide range of classification tasks. Indeed, the SVM is one of the most popular linear classifiers; it is based on learning a single kernel function for handling linear tasks, such as discrimination and regression problems. The main idea of an SVM is to separate the feature space into two classes of data with a hard margin. Kernel-based methods are among the most commonly used techniques for performing fusion due to their proven robustness and reliability. For more details, we invite the reader to consult the work of Gönen et al. [41], which focused on the taxonomy of multi-kernel learning algorithms. These kernels are intended to make use of the similarities and discrepancies across training samples as well as a wide variety of data sources. In other words, these modular learning methods are used for multimodal data analysis. Recently, a growing number of studies have focused, in particular, on the potential of these kernels for multi-source-based learning for improving performance. In this sense, a wide range of kernel-based methods have been proposed to summarize information from multiple sources using a variety of input data. In this regard, Gönen et al. [41] pioneered multiple kernel learning (MKL) algorithms that seek to combine multimodal data that have distinct representations of similarity. MKL is the process of learning a classifier through multiple kernels and data sources. Also, it aims to extract the joint correlation of several kernels in a linear or nonlinear manner. Similarly, Aiolli et al. [42] proposed an MKL-based algorithm, called EasyMKL, which combines a series of kernels to maximize the segregation of representations and extract the strong correlation between feature spaces to improve the performance of the classification task. An alternative model, called convolutional recurrent multiple kernel learning (CRMKL), based on the MKL framework for emotion recognition and sentiment analysis, is reported by Wen et al. [43]. In [43], the MKL algorithm is used to combine multiple features that are extracted from deep networks.

3.4.1.3 Graphical models based

One of the most common probabilistic graphical models (PGMs) is the hidden Markov model (HMM) [44]. It is an unsupervised and generative model. It has a series of potential states and transition probabilities. In the Markov chain, the transition from one state to another leads to the generation of observed sequences in which the observations are part of a state set. A transition formalizes how it is possible to move from one state to another, and each transition has a probability of being taken. The states themselves are hidden, but each state generates a visible observation. The main property of Markov chains is that the probabilities depend only on the previous state of the model. In the HMM, a kind of generalization of the mixture densities defined by each state is involved, as confirmed by Ghahramani et al. [45]. Specifically, Ghahramani et al. [45] introduced the factorial HMM (FHMM), which consists of combining the state transition matrix of HMMs with the distributed representations of vector quantization (VQ) [46]. According to [46], VQ is a conventional technique for quantizing and generalizing dynamic mixture models. FHMM addresses the limited representational power of the latent variables of the HMM by presenting the hidden state under a certain weighted appearance. Likewise, Gael et al. [47] proposed the non-parametric FHMM, called iFHMM, by introducing a new stochastic process for latent feature representation of time series.

In summary, the PGM model can be considered a robust tool for generating missing channels by learning the most representative inter-modal features in an unsupervised manner. One of the drawbacks of graphical models is the high cost of the training and inference process.

3.4.1.4 Canonical correlation analysis based

In general, a fusion scheme can construct a single multimodal feature representation for each processing stage. However, it is also straightforward to place constraints on the extracted unimodal features [37]. Canonical correlation analysis (CCA) [201] is a very popular statistical method that attempts to maximize the semantic relationship between two unimodal representations so that complex nonlinear transformations of the two data perspectives can be effectively learned. Formally, it can be formulated as follows:

(v_1^*, v_2^*) = argmax_{v_1, v_2} corr(v_1^T X_1, v_2^T X_2),   (1)

where X_1 and X_2 stand for unimodal representations, v_1 and v_2 for two vectors of a given length, and corr for the correlation function. A deep variant of CCA can also be used to maximize the correlation between unimodal representations, as suggested by the authors of [202]. Similarly, Chandar et al. [203] proposed a correlational neural network, called CorrNet, which is based on a constrained encoder/decoder structure to maximize the correlation of internal representations when projected onto a common subspace. Engilberge et al. [204] introduced a weaker constraint on the joint embedding space using a cosine similarity measure. Besides, Shahroudy et al. [205] constructed a unimodal representation using a hierarchical factorization scheme that is limited to representing redundant feature parts and other completely orthogonal parts.
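As a brief illustration of Eq. (1) (a sketch, not the authors' implementation), scikit-learn's CCA estimator finds projection vectors that maximize the correlation between two unimodal feature sets:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                    # shared hidden factors
X1 = latent @ rng.normal(size=(2, 40)) + 0.1 * rng.normal(size=(500, 40))
X2 = latent @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(500, 30))

# Fit CCA: finds v1, v2 maximizing corr(X1 @ v1, X2 @ v2) per component.
cca = CCA(n_components=2)
Z1, Z2 = cca.fit_transform(X1, X2)

# Correlation of the first pair of canonical variates (close to 1 here,
# since both views are noisy projections of the same latent factors).
print(np.corrcoef(Z1[:, 0], Z2[:, 0])[0, 1])
```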


3.4.2 Deep learning methods

3.4.2.1 Deep belief networks based

Deep belief networks (DBNs) are part of the family of graphical generative deep models [15]. They form a deeper variant of the restricted Boltzmann machine (RBM), obtained by combining several RBMs together. In other words, a DBN consists of stacking a series of RBMs where the hidden layer of the first RBM is the visible layer of the higher hierarchies. Structurally, a DBN model has a dense structure similar to that of a shallow multilayer perceptron. The first RBM is designed to systematically reconstruct its input signal, and its hidden layer is then handled as the visible layer of the second one. However, all hidden representations are learned globally at each level of the DBN. Note that the DBN is one of the strongest alternatives for overcoming the vanishing gradient problem through a stack of RBM units. Like a single RBM, a DBN involves discovering latent features in the raw data. It can be further trained in a supervised fashion to perform the classification of the detected hidden representations.

Compared to other supervised deep models, a DBN requires only a very small set of labeled data to perform weight training, which leads to a high level of usefulness in many multimodal tasks. For instance, Srivastava et al. [206] proposed a multimodal generative model based on the concept of the deep Boltzmann machine (DBM), which learns a set of multimodal features by filling in the conditional distribution of data over a space of multimodal inputs such as image, text, and audio. Specifically, the purpose of training a multimodal DBN model is to improve the prediction accuracy of both unimodal and multimodal systems by generating a set of multimodal features that are semantically similar to the original input data, so that they can be easily derived even if some modalities are missing. Figure 5 illustrates a multimodal DBN architecture that takes as input two different modalities (image and text) with different statistical distributions to map the original data from a high-dimensional space to a high-level abstract representation space. After extracting the high-level representation from each modality, an RBM network is then used to learn the joint distribution. The image and text modalities are modeled using two DBMs, each consisting of two hidden layers. Formally, the joint representation can be expressed as follows:

P(v_i | θ) = Σ_{h_1, h_2} P(v_i, h_1, h_2 | θ),   (2)

where v_i refers to the input visual and textual modalities, θ to the network parameters, and h to the hidden layers of each modality.

Fig. 5 Structure of a bimodal DBN

In a multimodal context, the advantage of using multimodal DBN models lies in their sensitivity and stability in supervised, semi-supervised, and unsupervised learning protocols. These models allow for better modeling of very complex and discriminating patterns from multiple input modalities. Despite these advantages, these models have a few limitations. For instance, they largely ignore the spatiotemporal cues of multimodal data streams, making the inference process computationally intensive.

3.4.2.2 Deep autoencoders based

Deep autoencoders (DAEs) [207] are a class of unsupervised neural networks that are designed to learn a compressed representation of input signals. Conceptually, they consist of two coupled modules: the encoding module (encoder) and the decoding module (decoder). On the one hand, the encoding module consists of several processing layers to map high-dimensional input data into a low-dimensional space (i.e., latent space vectors). On the other hand, the decoding module takes these latent representations as input and decodes them in order to reconstruct the input data. These models have recently drawn attention from the multimodal learning community due to their great potential for reducing data dimensionality and, thus, increasing the performance of training algorithms. For instance, Bhatt et al. [208] proposed a DAE-based multimodal data reconstruction scheme that uses knowledge from different modalities to obtain robust unimodal representations and projects them onto a common subspace. Similar to the work of Bhatt, Liu et al. [209] proposed the integration of multimodal stacked contractive AEs (SCAEs) to learn cross-modality features across multiple modalities even when one of them is missing, intending to minimize the reconstruction loss function and avoid the overfitting problem. The loss function can be formulated as follows:

Loss_reconst = Σ_{i=1}^{M} (‖x_i − x̂_i‖_2^2 + ‖y_i − ŷ_i‖_2^2).   (3)

Here, (x_i, y_i) denotes a pair of two inputs, and (x̂_i, ŷ_i) represent their reconstructed outputs.

Fig. 6 Structure of a bimodal AE

Several other typical models based on stacked AEs (SAEs) have been proposed to learn coherent joint representations across modalities. For example, the authors of [210–212] designed multimodal systems based on SAEs, where the encoder side of the architecture represents and compresses each unimodal feature separately, and the decoder side constructs the latent (shared) representation of the inputs in an unsupervised manner. Figure 6 shows the coupling mechanism of two separate AEs (bimodal AE) for both modalities (audio and video) into a jointly shared representation hierarchy, where the encoder and decoder components are independent of each other. As a powerful tool for feature extraction and dimensionality reduction, the DAE aims to learn how to efficiently represent manifolds where the training data is unbalanced or lacking. One of the main drawbacks of DAEs is that many hidden parameters have to be trained, and the inference process is time-consuming. Moreover, they also miss some spatiotemporal details in multimodal data.
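The following PyTorch-style sketch shows such a bimodal autoencoder (illustrative only; the dimensions, layer sizes, and the concatenation used to form the shared code are assumptions rather than the architecture of the cited works). Each modality has its own encoder and decoder, and the reconstruction loss follows the form of Eq. (3):

```python
import torch
import torch.nn as nn

class BimodalAE(nn.Module):
    """Two modality-specific encoders feed a shared code; two decoders
    reconstruct each modality from that shared representation."""
    def __init__(self, audio_dim=40, video_dim=512, code_dim=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
        self.shared = nn.Linear(256, code_dim)              # joint code
        self.dec_a = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                   nn.Linear(128, audio_dim))
        self.dec_v = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                   nn.Linear(128, video_dim))

    def forward(self, x_a, x_v):
        h = self.shared(torch.cat([self.enc_a(x_a), self.enc_v(x_v)], dim=-1))
        return self.dec_a(h), self.dec_v(h)

model = BimodalAE()
x_a, x_v = torch.randn(8, 40), torch.randn(8, 512)
rec_a, rec_v = model(x_a, x_v)
# Reconstruction loss in the spirit of Eq. (3): squared errors on both modalities.
loss = ((x_a - rec_a) ** 2).sum(dim=1).mean() + ((x_v - rec_v) ** 2).sum(dim=1).mean()
loss.backward()
```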


3.4.2.3 Convolutional neural networks based

Convolutional neural networks (CNNs or ConvNets) are a class of deep feed-forward neural networks whose main purpose is to extract spatial patterns from visual input signals [20,22]. More specifically, such models tend to model a series of nonlinear transformations by generating very abstract and informative features from highly complex datasets. The main properties that distinguish CNNs from other models include their ability to capture local connectivity between units, to share weights across layers, and to stack a sequence of hidden layers [4]. The architecture is based on hierarchical filtering operations, i.e., using convolution layers followed by activation functions, etc. Once the convolution layers are linearly stacked, the growth of the receptive field size (i.e., kernel size) of the neural layers can be simulated by a max-pooling operation, which implies a reduction in the spatial size of the feature map. After applying a series of convolution and pooling operations, the hidden representation learned from the model must be predicted. For this purpose, at least one fully connected layer (also called a dense layer) is used that concatenates all previous activation maps.

Since its introduction by Krizhevsky et al. [7] in 2012, the CNN model has been successfully applied to a wide range of multimodal applications, such as image dehazing [239,240] and human activity recognition [241]. An adaptive multimodal mapping between two visual modalities (e.g., images and sentences) typically requires strong representations of the individual modalities [213]. In particular, CNNs have demonstrated powerful generalization capabilities to learn how to represent visual appearance features from static data. Recently, with the advent of robust and low-cost RGB-D sensors such as the Kinect, the computer vision community has turned its attention to integrating RGB images and corresponding depth maps (2.5D) into multimodal architectures, as shown in Fig. 7. For instance, Couprie et al. [214] proposed a bimodal CNN architecture for multiscale feature extraction from RGB-D datasets, which are taken as four-channel frames (blue, green, red, and depth). Similarly, Madhuranga et al. [215] used CNN models for video recognition purposes by extracting silhouettes from depth sequences and then fusing the depth information with audio descriptions for activity of daily living (ADL) recognition. Zhang et al. [217] proposed to use multicolumn CNNs to extract visual features from the face and eye images for the gaze point estimation problem. Here, the regression depth of the facial landmarks is estimated from the facial images and the relative depth of facial keypoints is predicted by global optimization. To perform image classification directly, the authors of [217,218] suggested the possibility of using multi-stream CNNs (i.e., two or more stream CNNs) to extract robust features from a final hidden layer and then project them onto a common representation space. However, the most commonly adopted approaches involve concatenating a set of pre-trained features derived from the huge ImageNet dataset to generate a multimodal representation [216].

Formally, let f_i^j be the feature map of modality j at the current spatial location i, where j = {1, 2, ..., N}. As shown in Fig. 7, in our case N = 2, since the feature maps FC2 (RGB) and FC2 (D) are taken separately from the RGB and depth paths. The fused feature map F_i^fusion, which is a weighted sum of the unimodal representations, can be calculated as follows:

F_i^fusion = Σ_{j=1}^{N} w_i^j f_i^j.   (4)

Here, w_i^j denotes the weight vectors that can be computed as follows:

w_i^j = exp(f_i^j) / Σ_{k=1}^{N} exp(f_i^k).   (5)
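A compact sketch of Eqs. (4) and (5) is given below (illustrative shapes, not the authors' code): the fusion weights are a softmax over the per-modality feature values, and the fused map is their weighted sum:

```python
import torch
import torch.nn.functional as F

def weighted_fusion(feature_maps):
    """feature_maps: tensor of shape (N_modalities, batch, dim).
    Returns the fused map of shape (batch, dim) following Eqs. (4)-(5)."""
    weights = F.softmax(feature_maps, dim=0)   # w_i^j = exp(f_i^j) / sum_k exp(f_i^k)
    return (weights * feature_maps).sum(dim=0) # F_i^fusion = sum_j w_i^j * f_i^j

# Toy usage: fuse FC2 features from an RGB stream and a depth stream (N = 2).
fc2_rgb = torch.randn(8, 256)
fc2_depth = torch.randn(8, 256)
fused = weighted_fusion(torch.stack([fc2_rgb, fc2_depth], dim=0))
print(fused.shape)  # torch.Size([8, 256])
```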


In summary, a multimodal CNN serves as a powerful feature extractor that learns local cross-modal features from visual modalities. It is also capable of modeling spatial cues from multimodal data streams with an increased number of parameters. However, it requires a large-scale multimodal dataset to converge optimally during training, and the inference process is time-consuming.

Fig. 7 Structure of a bimodal CNN

3.4.2.4 Recurrent neural networks based

Recurrent neural networks (RNNs) [12] are a popular type of deep neural network architecture for processing sequential data of varying lengths. They learn to map input activations to the next hierarchy level and then transfer hidden states to the outputs using recurrent feedback, which gives them the capacity to learn useful features from previous states, unlike other deep feedforward networks such as CNNs, DBNs, etc. They can also handle time series and dynamic media such as text and video sequences. By using the backpropagation algorithm, the RNN function takes an input vector and a previous hidden state as input to capture the temporal dependence between objects. After training, the RNN function is fixed at a certain level of stability and can then be used over time.

However, the vanilla RNN model is typically incapable of capturing long-term dependencies in sequential data since it has no internal memory. To this end, several popular variants have been developed to efficiently handle this constraint and the vanishing gradient problem with impressive results, including long short-term memory (LSTM) [13] and gated recurrent units (GRU) [14]. In terms of computational efficiency, the GRU is a lightweight variant of the LSTM since it can modulate the information flow without using internal memory units.

In addition to their use for unimodal tasks, RNNs have proved useful in many multimodal problems that require modeling long- and short-range dependencies across the input sequence, such as semantic segmentation [219] and image captioning [220]. For instance, Abdulnabi et al. [219] proposed a multimodal RNN architecture designed for semantic scene segmentation using RGB and depth channels. They integrated two parallel RNNs to efficiently extract robust cross-modal features from each modality. Zhao et al. [220] proposed an RNN-based multimodal fusion scheme to generate captions by analyzing distributional correlations between images and sentences. Recently, several new multimodal approaches based on RNN variants have been proposed and have achieved outstanding results in many vision applications. For example, Li et al. [221] designed a GRU-based embedding framework to describe the content of an image. They used the GRU to generate a description of variable length from a given image. Similarly, Sano et al. [222] proposed a multimodal BiLSTM for ambulatory sleep detection. In this case, the BiLSTM was used to extract features from the wearable device and synthesize temporal information.

Figure 8 illustrates a multimodal m-RNN architecture that incorporates both word embeddings and visual features using a bidirectional recurrent mechanism and a pre-trained CNN. As can be seen, m-RNN consists of three components: a language network component, a vision network component, and a multimodal layer component. The multimodal layer here maps semantic information across sub-networks by temporally learning word embeddings and visual features. Formally, it can be expressed as follows:

m(t) = f(v_w · w(t), v_r · r(t), v_I · I),   (6)

where f(.) denotes the activation function, w and r consist of the word embedding feature and the hidden states in both directions of the recurrent layer, and I represents the visual features.

Fig. 8 A schematic illustration of bidirectional multimodal RNN (m-RNN) [223]

In summary, the multimodal RNN model is a robust tool for analyzing both short- and long-term dependencies of multimodal data sequences using the backpropagation algorithm. However, the model has a slow convergence rate due to the high computational cost in the hidden state transfer function.
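As a rough sketch of the multimodal layer in Eq. (6) (an illustration under assumed dimensions, not the original m-RNN code; summing the three projections before the nonlinearity is one common reading of the equation):

```python
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    """m(t) = f(v_w * w(t), v_r * r(t), v_I * I): projects the word
    embedding w(t), recurrent state r(t), and image feature I into a
    common space, sums them, and applies an elementwise nonlinearity."""
    def __init__(self, word_dim, rnn_dim, img_dim, m_dim=256):
        super().__init__()
        self.V_w = nn.Linear(word_dim, m_dim, bias=False)
        self.V_r = nn.Linear(rnn_dim, m_dim, bias=False)
        self.V_I = nn.Linear(img_dim, m_dim, bias=False)
        self.f = nn.Tanh()

    def forward(self, w_t, r_t, img_feat):
        return self.f(self.V_w(w_t) + self.V_r(r_t) + self.V_I(img_feat))

layer = MultimodalLayer(word_dim=300, rnn_dim=512, img_dim=2048)
m_t = layer(torch.randn(1, 300), torch.randn(1, 512), torch.randn(1, 2048))
print(m_t.shape)  # torch.Size([1, 256])
```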


3.4.2.5 Generative adversarial networks based

Generative adversarial networks (GANs) are part of the family of deep generative architectures, designed to learn the data distribution through adversarial learning. Historically, they were first developed by Goodfellow et al. [16], who demonstrated the ability to generate realistic representations from noisy data domains. Structurally, a GAN is a unified network consisting of two sub-networks, a generator network (G) and a discriminator network (D), which interact continuously during the learning process. The principle of its operation is as follows: the generator network takes as input the latent distribution space (i.e., a random noise z) and generates an artificial sample. The discriminator takes the true sample and those generated by the generator and tries to predict whether the input sample is true or not. Hence, it is a binary classification problem, where the output must be between 0 (generated) and 1 (true). In other words, the generator's main task is to generate a realistic image, while the discriminator's task is to determine whether the generated image is true or false. Subsequently, they use an objective function to represent the distance between the distribution of generated samples (p_z) and the distribution of real ones (p_data). The adversarial training strategy consists of using a minimax objective function V(G, D), which can be expressed as follows:

min_G max_D V(G, D) = E_{x∼p_data}[log(D(x))] + E_{z∼p_z}[log(1 − D(G(z)))]   (7)

Since their development in 2014, generative adversarial training algorithms have been widely used in various unimodal applications such as scene generation [17], image-to-image translation [18], and image super-resolution [224,225]. To obtain the latest advances in super-resolution algorithms for a variety of remote sensing applications, we invite the reader to refer to the excellent survey article by Rohith et al. [226].

In addition to its use in unimodal applications, the generative adversarial learning paradigm has recently been widely adopted in multimodal arenas, where two or more modalities are involved, such as image captioning [227] and image retrieval [228]. In recent years, GAN-based schemes have been receiving a lot of attention and interest in the field of multimodal vision. For example, Xu et al. [229] proposed a fine-grained text-to-image generation framework using an attentional GAN model to create high-quality images from text. Similarly, Huang et al. [230] proposed an unsupervised image-to-image translation architecture that is based on the idea that the image style of one domain can be mapped into the styles of many domains. In [231], Toriya et al. addressed the task of image alignment between a pair of multimodal images by mapping the appearance features of the first modality to the other using GAN models. Here, GANs were used as a means to apply keypoint-mapping techniques to multimodal images. Figure 9 shows a simplified diagram of a multimodal GAN.

Fig. 9 A schematic illustration of multimodal GAN

In summary, the unsupervised GAN is one of the most powerful generative models and can address scenarios where training data is lacking or some hidden concepts are missing. However, it is extremely tricky to train the network when generating discrete distributions, and the process itself is unstable compared to other generative networks. Moreover, the function that this network seeks to optimize is an adversarial loss function without any normalization.
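A minimal single-step sketch of the adversarial game in Eq. (7) is shown below (toy dimensions and vector-valued "samples"; the generator update uses the common non-saturating form, and this is not a full or stable training recipe):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 32)            # stand-in for a batch of real samples
z = torch.randn(32, 16)               # random noise z ~ p_z

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```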


Consider as an example of neural machine translation [1] Target

(see Fig. 10), where an encoder is trained to map a sequence Y t-1 Yt


of input vectors x = (x1 , . . . , x Tx ) into a fixed-length vector
c and a decoderto predict thenext word (yt  ) from previous S t-1 St Decoder
predicted ones y1 , . . . , yt  −1 . Here, c refers to an encoded
vector produced by a sequence of hidden states that can be
expressed as follows: Context vector

c = q(h 1 , . . . , h Tx ), (8) Additive attention Attention layer

where q denotes some activation functions. The hidden state


h t (h t ∈ Rn ) at time step t can be formulated as:
Alignment weights

a t,1 a t,2 a t,3 a t,T


h t = f (xt , . . . , h t−1 ). (9)
h1 h2 h3 hT Encoder
The context vector ci can then becomputed as a weighted
sum of a sequence of annotations h 1 , . . . , h Tx as follows:
X1 X2 X3 XT


Tx Source sequence
ci = σi j h j , (10)
j=1
Fig. 10 A schematic illustration of the attention-based machine trans-
lation model
where the alignment weight σi j of each annotation h j can be
calculated as: Furthermore, the number of parameters to be trained is huge
exp(ei j ) compared to other deep networks such as RNNs, CNNs, etc.
σi j = Tx
, (11)
k=1 exp(eik ) 3.5 Multitask learning
and ei j = a(si−1 , h j ). si−1 is the hidden state at the (i −1)-th
position of the input sequence. More recently, multitask learning (MTL) [108,109] has
Since its introduction, the AM has gained wide adoption in the computer vision community due to its spectral capabilities for many multimodal applications such as video description [233,234], salient object detection [235], etc. For example, Hori et al. [233] proposed a multimodal attention framework for video captioning and sentence generation based on the encoder–decoder structure using RNNs. In particular, the multimodal attention model was used as a way to integrate audio, image, and motion features by selecting the most relevant context vector from each modality. In [236], Yang et al. suggested the use of stacked attention networks to search for image regions that correlate with a query answer and identify representative features of a given question more precisely. More recently, Guo et al. [237] introduced a normalized variant of the self-attention mechanism, called normalized self-attention (NSA), which aims to encode and decode the image and caption features and normalize the distribution of internal activations during training.

In summary, the multimodal AM provides a robust solution for cross-modal data fusion by selecting the local fine-grained salient features in a multidimensional space and filtering out any hidden noise. However, one weakness of the AM is that the training algorithm can be unstable, which may affect the predictive power of the decision-making system. Furthermore, the number of parameters to be trained is huge compared to other deep networks such as RNNs, CNNs, etc.

3.5 Multitask learning

More recently, multitask learning (MTL) [108,109] has become an increasingly popular topic in the deep learning community. Specifically, the MTL paradigm frequently arises in a context close to multimodal concepts. In contrast to single-task learning, the idea behind this paradigm is to learn a shared representation that can be used to respond to several tasks in order to ensure better generalizability. There are, however, some similarities between the fusion methods discussed in Sect. 3.4 and the methods used to perform multiple tasks simultaneously: what they have in common is that the structure shared between all tasks can be learned jointly to improve performance. The conventional typology of the MTL approach consists of two parameter-sharing schemes:

– Hard parameter sharing [110]: It consists of extracting a generic representation for different tasks using the same parameters. It is usually applied to avoid overfitting problems (a minimal code sketch is given after Fig. 11 below).
– Soft parameter sharing [111]: It consists of extracting a set of feature vectors and simultaneously drawing similarity relationships between them.

Figure 11 shows a meta-architecture for the two-task case. As can be seen, there are six intermediate layers in total, one shared input layer (bottom), two task-specific output layers (top), and three hidden layers per task, each divided into two G-subspaces. Typically, MTL contributes to the performance of the target task based on knowledge gained from auxiliary tasks.

Fig. 11 A meta-architecture in the case of two tasks A and B [109]
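As an illustration of the hard parameter sharing scheme, the following PyTorch sketch builds one shared trunk with two task-specific heads and sums the two task losses during a joint training step. The layer sizes, the choice of a classification task A and a regression task B, and the unweighted loss sum are illustrative assumptions, not a prescription from [109,110].

import torch
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one head per task."""
    def __init__(self, in_dim=128, hidden=64, n_classes_a=10, n_targets_b=1):
        super().__init__()
        # Shared layers: the same parameters serve every task
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific output layers
        self.head_a = nn.Linear(hidden, n_classes_a)   # e.g., classification task A
        self.head_b = nn.Linear(hidden, n_targets_b)   # e.g., regression task B

    def forward(self, x):
        z = self.trunk(x)                  # shared representation
        return self.head_a(z), self.head_b(z)

# Joint training step on a synthetic toy batch
model = HardSharedMTL()
x = torch.randn(32, 128)
y_a = torch.randint(0, 10, (32,))
y_b = torch.randn(32, 1)
logits_a, pred_b = model(x)
loss = nn.CrossEntropyLoss()(logits_a, y_a) + nn.MSELoss()(pred_b, y_b)
loss.backward()   # gradients from both tasks update the shared trunk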


3.6 Multimodal alignment

Multimodal alignment consists of linearly linking the features of two or more different modalities. Its areas of application include medical image registration [169], machine translation [1], etc. Specifically, multimodal image alignment provides a spatial mapping capability between images taken by sensors of different modalities, which may be categorized into feature-based [167,168] and patch-based [165,166] methods. Feature-based methods detect and extract a set of matching features that should be structurally consistent to describe their spatial patterns. Patch-based methods first split each image into local patches and then consider the similarity between them by computing their cross-correlation and combination. Generally, the alignment task can be divided into two subtasks: the attentional alignment task [170,171] and the semantic alignment task [172,173]. The attentional alignment task is based on the attentional mapping between the features of the input modality and the target one, while the semantic alignment task takes the form of an alignment method that directly provides alignment capabilities to a predictive model. The most popular use of semantic alignment is to create a dataset with associated labels and then generate a semantically aligned dataset. Both of these tasks have proven effective in multimodal alignment, where attentional alignment features are better able to take into account the long-term dependencies between different concepts.
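A minimal sketch of the attentional alignment idea is given below, assuming that the features of both modalities have already been projected into a common d-dimensional space; the feature dimensions and the cosine-similarity/softmax formulation are illustrative choices rather than the method of any specific reference.

import numpy as np

def soft_alignment(A, B, temperature=1.0):
    """Attentional alignment between two modalities.
    A: (n, d) features of the input modality (e.g., image regions)
    B: (m, d) features of the target modality (e.g., caption tokens)
    Returns an (n, m) matrix whose rows are soft correspondences over B."""
    # Cosine similarity between every cross-modal feature pair
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    S = A @ B.T / temperature
    # Row-wise softmax turns similarities into alignment weights
    S = np.exp(S - S.max(axis=1, keepdims=True))
    return S / S.sum(axis=1, keepdims=True)

regions = np.random.randn(6, 32)    # assumed: 6 image-region descriptors
tokens = np.random.randn(9, 32)     # assumed: 9 word embeddings in the same space
W = soft_alignment(regions, tokens)
print(W.shape, W.sum(axis=1))       # (6, 9), each row sums to 1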
3.7 Multimodal transfer learning

Typically, training a deep model from scratch requires a large amount of labeled data to achieve an acceptable level of performance. A more common solution is to find an efficient method that transfers knowledge already derived from another model trained on a huge dataset (e.g., 1000k-ImageNet) [198]. Transfer learning (TL) [70] is one of the model regularization techniques that have proven their effectiveness for training deep models with a limited amount of available data and avoiding overfitting problems. Transferring knowledge from a pre-trained model associated with a sensory modality to a new task or similar domain facilitates the learning and fine-tuning of a target model using a target dataset.

The technique can accelerate the entire learning process by reducing inference time and computational complexity. Moreover, the learning process can learn the data distribution in a non-parallel manner and ensure its synchronization over time. It can also learn rich and informative representations by using cooperative interactions among modalities. Moreover, it can improve the quality of the information transferred by eliminating any latent noise and conflict [113,115–117]. For example, Palaskar et al. [113] proposed a multimodal integration pipeline that loads the parameters of a model pre-trained on the source dataset (transcript and video) to initialize training on the target dataset (summary, question, and video). They used hierarchical attention [114] as a merging mechanism that can be used to generate a synthesis vector from multimodal sources. An example of a multimodal transfer learning pipeline based on the fine-tuning mechanism is shown in Fig. 12. It can be seen that a deep model is first pre-trained on a source domain, the learned parameters are then shifted to different modalities (i.e., fine-tuned models), and finally blended into the target domain using fusion techniques.

Fig. 12 An illustration of an example of a multimodal transfer learning process
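The following PyTorch sketch mirrors the pipeline of Fig. 12 under simplifying assumptions: two ImageNet-pre-trained ResNet-18 backbones (one per modality, with the second modality rendered as an image-like 3-channel tensor) are fine-tuned and their features are blended by a small feature-concatenation fusion head trained on the target domain. The backbone, the modality encoding, and the fusion head are assumptions made for illustration only.

import torch
import torch.nn as nn
from torchvision import models

def make_backbone():
    # Pre-trained on the source domain (ImageNet); newer torchvision versions
    # use the weights= argument instead of pretrained=True
    net = models.resnet18(pretrained=True)
    net.fc = nn.Identity()          # keep the 512-d feature vector
    return net

class MultimodalFineTune(nn.Module):
    """Two fine-tuned single-modality backbones + late fusion by concatenation."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.rgb_net = make_backbone()      # fine-tuned model 1
        self.depth_net = make_backbone()    # fine-tuned model 2 (depth rendered as 3 channels)
        self.fusion = nn.Sequential(        # target-domain fusion head, trained from scratch
            nn.Linear(512 * 2, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_net(rgb), self.depth_net(depth)], dim=1)
        return self.fusion(f)

model = MultimodalFineTune()
# Optionally freeze early layers and fine-tune only deeper blocks and the fusion head
for p in model.rgb_net.layer1.parameters():
    p.requires_grad = False
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(out.shape)   # torch.Size([2, 10])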


3.8 Zero-shot learning

In practice, the amount of labeled data samples for effective model training is often insufficient to recognize all possible object categories in an image (i.e., seen and unseen classes). This is why zero-shot learning [130] takes place. This supervised learning approach opens up many valuable applications, such as object detection [131] and object classification and retrieval in videos [141], especially when appropriate datasets are missing. In other words, it addresses multi-class learning problems when some classes do not have sufficient training data. However, during the learning process, additional visual and semantic features such as word embeddings [132], visual attributes [133], or descriptions [134] can be assigned to both seen and unseen classes. In the context of multimodality, a multimodal mapping scheme typically combines visual and semantic attributes using only data related to the seen classes. The objective is to project a set of synthesized features in order to make the model more generalizable toward the recognition of the unseen classes in test samples [135]. Such methods tend to use GAN models to synthesize and reconstruct the visual features of the unseen classes, resulting in high-accuracy classification and ensuring a balance between seen and unseen class labels [136,137].
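As a minimal illustration of the visual-to-semantic mapping idea (not of the GAN-based synthesis methods in [135–137]), the sketch below fits a linear projection from visual features to class attribute vectors using seen classes only, and then classifies test samples, including an unseen class, by searching for the closest class attribute vector. The toy data generator, the dimensions, and the ridge-regression solver are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: 5 seen + 2 unseen classes, 10-d attributes, 64-d visual features
n_seen, n_unseen, d_attr, d_vis = 5, 2, 10, 64
class_attrs = rng.normal(size=(n_seen + n_unseen, d_attr))   # one semantic vector per class
W_true = rng.normal(size=(d_vis, d_attr))

def sample(cls, n=40):
    # Visual features loosely consistent with the class attribute vector
    return class_attrs[cls] @ W_true.T + 0.1 * rng.normal(size=(n, d_vis)), np.full(n, cls)

X_tr = np.vstack([sample(c)[0] for c in range(n_seen)])             # seen classes only
A_tr = np.vstack([np.tile(class_attrs[c], (40, 1)) for c in range(n_seen)])

# Learn a linear visual-to-semantic projection on seen data (ridge regression)
W = np.linalg.solve(X_tr.T @ X_tr + 1e-3 * np.eye(d_vis), X_tr.T @ A_tr)

# At test time, project a sample and pick the nearest class attribute vector,
# so unseen classes become reachable through their semantic description alone
x_test, y_test = sample(n_seen)                                     # an unseen class
pred = np.argmax((x_test @ W) @ class_attrs.T, axis=1)
print((pred == y_test).mean())                                      # fraction correctly recognized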

4 Tasks and applications

When modeling multimodal data, several compromises have to be made between system performance, computational burden, and processing speed. Also, many other factors must be regarded when selecting a deep model due to its sensitivity and complexity. In general, multimodality has been employed in many vision tasks and applications, such as face recognition and image retrieval. Table 2 summarizes the reviewed multimodal applications, their technical details, and the best results obtained according to evaluation metrics such as accuracy (ACC) and precision (PREC). In the following, we first describe the core tasks of computer vision, followed by a comprehensive discussion of each application and its intent.

4.1 Generic computer vision tasks

4.1.1 Object detection

Object detection tasks generally consist of identifying rectangular windows (i.e., bounding boxes) in the image (i.e., object localization) and assigning class labels to them (i.e., object classification), through a process of patch extraction and representation (i.e., region of interest (RoI)). The localization process aims at defining the coordinates and position of the patch. In order to classify each object instance, a patch proposal strategy may be applied before the final prediction step. In practice, there are several possible detection methods. The most typical of these is to apply the classifier to an arbitrary region of the image or to a range of different shapes and scales. For detecting patches, the same techniques used in traditional computer vision, such as the sliding window (SW) fashion, can be easily applied; when patches are generated in SW mode, neural networks can be used to predict the target information. However, due to their complexity, this type of solution is not cost-effective, both in terms of training duration and memory consumption. In order to significantly reduce this complexity, the


deep learning community has pioneered a new generation of CNN-based frameworks. Recent literature has focused on this challenging task: in [67], Jiao et al. studied a variety of deep object detectors, ranging from one-stage detectors to two-stage detectors.

4.1.1.1 One-stage detectors

Monomodal based The OverFeat architecture [24] consists of several processing steps, each of which is dedicated to the extraction of multi-scale feature maps by applying the dense SW method to efficiently perform the object detection task. To significantly increase the processing speed of object detection pipelines, Redmon et al. [25] implemented a one-stage lightweight detection strategy called YOLO (You Only Look Once). This approach treats the object detection task as a regression problem, analyzing the entire input image and simultaneously predicting the bounding box coordinates and associated class labels. However, in some vision applications, such as autonomous driving, security, video surveillance, etc., real-time conditions become necessary. In this respect, two-stage detectors are generally slow in terms of real-time processing. In contrast, SSD (single-shot multibox detector) [78] has reduced the need for a patch proposal network and, thus, accelerated the object detection pipeline. It can learn multi-scale feature representations from multi-resolution images. Its capability to detect objects at different scales enables it to enhance the robustness of the entire chain. Like most object detectors, the SSD detector consists of two processing stages: extracting the feature map through the VGG16 model and detecting the object by applying a convolutional filter through the Conv4-3 layer. Similar to the principle of the YOLO and SSD detectors, RetinaNet [79] takes only one stage to detect dense objects by producing multi-scale semantic feature maps using a feature pyramid network (FPN) backbone and the ResNet model. To deal with the class imbalance in the training phase, a novel loss function called "focal loss" is considered by [79]. This function allows training a one-stage detector with high accuracy by reducing the level of artifacts.

Multimodal based High-precision object recognition systems with multiple sensors are aware of external noise and environmental sensitivity (e.g., lighting variations, occlusion, etc.). More recently, the availability of low-cost and robust sensors (e.g., RGB-D sensors, stereo, etc.) has encouraged the computer vision community to focus on combining the RGB modality with other sensing modalities. According to experimental results, it has been shown that the use of depth information [183,184], optical flow information [185], and LiDAR point clouds [186] in addition to conventional RGB data can improve the performance of one-stage based detection systems.

4.1.1.2 Two-stage detectors

Monomodal based The R-CNN detector [74] employs the patch proposal procedure using the selective search [80] strategy and applies the SVM classifier to classify any potential proposals. Fast R-CNN was introduced in [75] to improve the detection efficiency of R-CNN. The principle of Fast R-CNN is as follows: it first feeds the input image into the CNN network, extracts a set of feature vectors, applies a patch proposal mechanism, generates potential candidate regions using the RoI pooling layer, reshapes them to a fixed size, and then performs the final object detection prediction. As an efficient extension of Fast R-CNN, Faster R-CNN [76] serves to use a deep CNN as a proposal generator. It has an internal strategy for proposing patches called the region proposal network (RPN). Simultaneously, the RPN carries out classification and localization regression to generate a set of RoIs. The primary objective is to improve the localization task and the overall performance of the decision system. In other words, the first network uses prior information about being an object, and the second one (at the end of the classifier) deals with this information for each class. The feature pyramid network (FPN) detector [77] consists of a pyramidal structure that allows the learning of hierarchical feature maps extracted at each level of representation. According to [77], learning multi-scale representations is very slow and requires a lot of memory. However, FPN can generate pyramidal representations with a higher semantic resolution than traditional pyramidal designs.

Multimodal based As mentioned before, two-stage detectors are generally based on a combination of a CNN model to perform classification and a patch proposal module to generate candidate regions like RPNs. These techniques have proven effective for the accurate detection of multiple objects under normal and extreme environmental conditions. However, multi-object detection in both indoor and outdoor environments under varying environmental and lighting conditions remains one of the major challenges facing the computer vision community. Furthermore, a better trade-off between accuracy and computational efficiency in two-stage object detection remains an open question [84]. The question may be addressed more effectively by combining two or more sensory modalities simultaneously. However, the most common approach is to concatenate heterogeneous features from different modalities to generate an artificial multimodal representation. The recent literature has shown that it is attractive to learn shared representations from the complementarity and synergies between several modalities for increasing the discriminatory power of models [190].
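A common realization of the feature-concatenation strategy mentioned above is to fuse intermediate feature maps rather than final vectors. The PyTorch sketch below concatenates RGB and depth feature maps channel-wise and mixes them with a 1x1 convolution to form a shared representation that a detection head could consume; the stem architecture and channel sizes are illustrative assumptions, not those of any surveyed detector.

import torch
import torch.nn as nn

class TwoStreamFusionBackbone(nn.Module):
    """Mid-level fusion: concatenate RGB and depth feature maps channel-wise,
    then mix them with a 1x1 convolution into a shared representation."""
    def __init__(self):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.rgb_stem = stem(3)
        self.depth_stem = stem(1)
        self.fuse = nn.Conv2d(128, 128, kernel_size=1)   # learned channel mixing

    def forward(self, rgb, depth):
        f = torch.cat([self.rgb_stem(rgb), self.depth_stem(depth)], dim=1)
        return self.fuse(f)     # shared multimodal feature map for a detection head

backbone = TwoStreamFusionBackbone()
feat = backbone(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 256, 256))
print(feat.shape)   # torch.Size([2, 128, 64, 64])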


Such modalities may include visual RGB-D [187], audio-visual data [188], visible and thermal data [189], etc.

4.1.1.3 Multi-stage detectors

Monomodal based Cascade R-CNN [26] is one of the most effective multi-stage detectors that have proven their robustness over one- and two-stage methods. It is a cascaded version of R-CNN aimed at achieving a better compromise between object localization and classification. This framework has proven its capability in overcoming some of the main challenges of object detection, including overtraining problems [5,6] and the false alarm distribution caused by the patch proposal stage. In other words, the trained model may be over-specialized on the training data and can no longer generalize on the test data. The problem can be solved by stopping the learning process before reaching a poor convergence rate, increasing the data distribution in various ways, etc.

Multimodal based More recently, only a few multimodal-based multi-stage detection frameworks [191–193] have been developed, and they have achieved outstanding detection performance on benchmark datasets.

4.1.2 Visual tracking

For decades, visual tracking has been one of the major challenges for the computer vision community. The objective is to observe the motion of a given object in real time. A tracker can predict the trajectory of a given rigid object from a chronologically ordered sequence of frames. The task has attracted a lot of interest because of its enormous relevance in many real-world applications, including video surveillance [82], autonomous driving [83], etc. Over the last few decades, most deep learning-based object tracking systems have been based on CNN architectures [84,139]. For example, in 1995, Nowlan et al. [85] implemented the first tracking system that tracks hand gestures in a sequence of frames using a CNN model. Multi-object tracking (MOT) has been extensively explored in recent literature for a wide range of applications [86,138]. Indeed, MOT (tracking-by-detection) is another aspect of the generic object tracking task. However, MOT methods are mainly designed to optimize the dynamic matching of the objects of interest detected in each frame. To date, the majority of the existing tracking algorithms have yet to be adapted to various factors, such as illumination and scale variation, occlusions, etc. [178]. Multimodal MOT is a universal aspect of MOT aimed at ensuring the accuracy of autonomous systems by mapping the motion sequence of dynamic objects [194]. To date, several multimodal variants of MOT have been proposed to improve the speed and accuracy of visual tracking by using multiple data sources, e.g., thermal, shortwave infrared, and hyperspectral data [195], RGB and thermal data [196], RGB and infrared data [197], etc.

4.1.3 Semantic segmentation

In image processing, image segmentation is a process of grouping pixels of the image together according to particular criteria. Semantic segmentation consists of assigning a class label to each pixel of a segmented region. Several studies have provided an overview of the different techniques used for semantic segmentation of visual data, including the works of [27,28]. Scene segmentation is a subtask of semantic segmentation that enables intelligent systems to perceive and interact in their surrounding environment [27,66]. The image can be split into non-overlapping regions according to particular criteria, such as pixel and edge detection and points of interest. Some algorithms are then used to define inter-class correlations for these regions.

Monomodal based Over the last few years, the fully convolutional network (FCN) [29] has become one of the robust models for a wide range of image types (multimedia, aerial, medical, etc.). The network consists of replacing the final dense layers with convolution layers, hence the reason for its name "FCN". However, the convolutional side (i.e., the feature extraction side) of the FCN generates low-resolution representations, which lead to fairly fuzzy object boundaries and noisy segmentations. Consequently, this requires the use of a posteriori regularizations to smooth out the segmentation results, such as conditional random field (CRF) networks [69]. As a light variant of semantic segmentation, instance segmentation yields a semantic mask for each object instance in the image. For this purpose, some methods have been developed, including Mask R-CNN [30], Hybrid Task Cascade (HTC) [31], etc. For instance, the Mask R-CNN model offers the possibility of locating instances of objects with class labels and segmenting them with semantic masks. Scene parsing is a visual recognition process that is based on semantic segmentation and deep architectures. A scene can be parsed into a series of regions labeled for each pixel that is mapped to semantic classes. The task is highly useful in several real-time applications, such as self-driving cars, traffic scene analysis, etc. However, fine-grained visual labeling and multi-scale feature distortions pose the main challenges in scene parsing.

Multimodal based More recently, it has been shown in the literature that the accuracy of scene parsing can be improved by combining several detection modalities instead of a single one [91]. Many different methods are available, such as soft correspondences [94] and 3D scene analysis from RGB-D data [95], to ensure dense and accurate scene parsing of indoor and outdoor environments.
123
2956 K. Bayoudh et al.

4.2 Multimodal applications different modalities (face and iris) to establish an individual’s
identity.
4.2.1 Human recognition
4.2.3 Image retrieval
In recent years, a wide range of deep learning techniques has
been developed that focus on human recognition in videos.
Content-based image research (CBIR), commonly known as
Human recognition seeks to identify the same target at dif-
query by image content (QBIC) and content-based visual
ferent points in space-time derived from complex scenes.
information retrieval (CBVIR) [54], is the process of recover-
Some studies have attempted to enhance the quality of per-
ing visual content (e.g., colors, edges, textures, etc) stored in
son recognition from two data sources (audio-visual data)
datasets by learning their visual representations. The retrieval
using DBN and DBM [72] models, which have allowed
procedure leads to the generation of metadata (i.e., key-
several types of representation to be combined and coordi-
words, tags, labels, and so on). The CBIR mechanism can
nated. Some of these works include [48], [73]. According
be simulated in two fundamental phases: the offline database
to Salakhutdinov et al. [72], a DBM is a generative model
indexing phase and the online retrieval step. During the
that includes several layers of hidden variables. In [48], the
indexing stage, image signatures will be generated and stored
structure of deep multimodal Boltzmann machines (DMBM)
in a database. In the retrieval phase, the image to be recovered
[71] is similar to that of DBM, but it can admit more than
will be treated as a query and the matching process will recon-
one modality. Therefore, each modality will be covered
cile this image signature with that stored in the database. Over
individually using adaptive approaches. After joining the
the last few years, several cross-modal image retrieval tasks,
multi-domain features, the high-level classification will be
e.g., text-to-image retrieval [100], sketch-to-image retrieval
performed by an individual classifier. In [73], Koo et al. devel-
[101], cross-view image retrieval [102], composing text and
oped a multimodal human recognition framework based on
image-to-image [103], etc. have been covered in the litera-
face and body information extracted from deep CNNs. They
ture.
employed the late fusion policy to merge the high-level fea-
tures across the different modalities.
4.2.4 Gesture recognition

4.2.2 Face recognition Gesture recognition is one of the most sophisticated tasks
of computer vision. The task has already gained the atten-
Face recognition has long been extremely important, rang- tion of the deep learning community for many reasons. In
ing from conventional approaches that involve the extraction particular, its potential is to facilitate human–computer inter-
and selection of handcrafted features, such as Viola and Jones action and detect motion in real time. As gestures become
detectors [49] to the automatic extraction and training of end- more diversified and enriched, our instinctive intelligence
to-end hierarchical features from raw data. This process has will recognize basic actions and associate them with generic
been widely used in biometric systems for control and mon- behaviors. The challenge of action recognition is mainly
itoring purposes. The most biometric systems rely on three related to the difficulty of extracting body silhouettes from
modes of operation: enrolment, authentication (verification), foreground rigid objects to focus on their emotions [96].
and identification [92]. However, most facial recognition sys- Occlusions that occur between different object parts can lead
tems, including biometric systems, suffer from a restriction to a significant decrease in performance. However, various
in terms of universality and variations in the appearance factors, such as variations in speed, scale, noise, and object
of visual patterns. End-to-end training of multimodal facial position, can significantly affect the recognition process.
representations can effectively help to overcome this limi- Some real-world applications of gesture recognition include
tation. Multimodal facial recognition systems can integrate driver assistance, smart surveillance, human–machine inter-
complex representations derived from multiple modalities action, etc. Regarding the multimodal dimensions of gesture
at different scales and levels (e.g., feature level, decision recognition, the authors of [97] proposed a multi-stream
level, score level, rank level, etc.). Note that face detec- architecture based on the RNN (LSTM) model to capture
tion, face identification, and face reconstruction are subtasks spatial-temporal features from gesture data. In [98], the
of face recognition [50]. Numerous works in the literature authors developed a multimodal gesture recognition system
have demonstrated the benefits of multimodal recognition using the 3D Residual CNN (ResC3D) model [99] trained
systems. In [51], Ding et al. proposed a new late fusion pol- on an RGB-D dataset. The features extracted by the ResC3D
icy using CNNs for multimodal facial feature extraction and model are then combined with a canonical correlation scheme
SAEs for dimensional reduction. The authors of [93] intro- to ensure consistency in the fusion process. Likewise, Abav-
duced a biometric system that combines biometric traits from isani et al. [200] developed a fusion approach to derive

123
A survey on deep multimodal learning for computer vision: advances, trends, applications, and... 2957

knowledge from multiple modalities in individual unimodal respond to a given question. To this end, the agent must first
3D CNN networks. explore its environment, capture visual information, and then
answer the question posed. In [90], the authors proposed the
4.2.5 Image captioning multi-target embodied question answering (MT-EQA) task
as a generalization of EQA. In contrast to EQA, MT-EQA
Recently, image captioning has become an active research considered some questions related to multiple targets, where
topic in the field of multimodal vision, i.e., the automatic an agent has to navigate toward various locations to answer
generation of text captions to describe the content of images. a question asked (Fig. 13a).
In a supervised learning way, training of model parameters
is provided by a set of labeled learning examples in the 4.2.8 Video question answering
form of an image and its related captions. The task has also
been demonstrated its ability for application in a variety of Currently, video question answering (VQA) [125–127,129,
real-world systems, including social media recommendation, 143] is one of the promising lines of research for reason-
image indexing, image annotation, etc. Most recently, Biten ing the correct answer to a particular question, based on the
et al. [52] combined both visual and textual data to gener- spatiotemporal visual content of video sequences. To answer
ate captions across two stages: template caption generation that question, we need to consider the correlation between
stage and entity insertion stage. Similarly, Peri et al. [53] features in the spatial and temporal dimensions (Fig. 13b).
proposed a multimodal framework that encodes both images The VQA task can be conceptually divided into three sub-
and captions using CNN and RNN as an intermediate level tasks. The first task is to identify the endpoints of the problem
representation and then decodes these multimodal represen- in the natural domain, while the second task is to capture the
tations into a new caption that is similar to the input. The correlation of the problem in the spatial domain. The third
authors of [128] presented an unsupervised image caption- task consists of reasoning about how this correlation varies in
ing framework based on a new alignment method that allows space over time. Typically, video sequences contain audio-
the simultaneous integration of visual and textual streams visual information of substantially different structures and
through semantic learning of multimodal embeddings of visual appearance, which requires reasoning schemes that
the language and vision domains. Moreover, a multimodal take into account the spatiotemporal nature of the data. To this
model can also aggregate motion information [174], acous- end, increased attention has been paid to these challenges by
tic information [175], temporal information [176], etc. from developing a wide range of spatiotemporal reasoning mech-
successive frames to assign a caption for each one. We invite anisms. Currently, the most common existing methods use
the reader to read the survey of Liu et al. [177] to learn more attention [125,127,129] and memory [126] mechanisms to
about the methods, techniques, and challenges of image cap- efficiently learn visual artifacts and the semantic correla-
tioning. tions that allow questions to be answered accurately. These
techniques are more effective for spatial-temporal video rep-
4.2.6 Vision-and-language navigation resentation and reasoning as they increase the memorization
and discrimination capacity of models.
Visual-and-language navigation (VLN) [87,88,118–121] is
a multimodal task that has become increasingly popular in 4.2.9 Style transfer
recent years. The idea behind VLN is to combine several
active domains (i.e., natural language, vision, and action) Neural style transfer (NST), also known as style transfer, has
to enable robots (intelligent agents) to navigate easily in recently gained momentum following the publication of the
unstructured environments. A key innovation in this area is works of Gatys et al. [156]. Gatys et al. [156] demonstrated
the synthesis of heterogeneous data into multiple modali- that visual features of models could be combined to represent
ties using natural language commands to navigate through image styles. It arises in a context of strong growth in DNNs
crowded locations and visual cues to perceive the surround- for several applications, including art and painting [157,158].
ings. It seeks to establish an interaction between visual For example, Lian et al. [157] proposed a style transfer-based
patterns and natural language concepts by merging these method that takes any natural portrait of a human and trans-
modalities into a single representation. forms it into Picasso’s cubism style. Informally, style transfer
is an optimization-based technique that renders the content
4.2.7 Embodied question answering of an existing image (content image) in the style of another
image (style image). Figure 14 depicts an example of trans-
Embodied question answering (EQA) [89,90,122] is an ferring the style of a specific painting to a scene image using
emerging multimodal task in which an intelligent agent acts the DeepArts tool [162]. In practice, style transfer involves
intelligently in a three-dimensional environment in order to applying a particular artistic style to a content image. For

123
2958 K. Bayoudh et al.

Fig. 13 Difference in results between EQA and VQA tasks: a EQA [90], b VQA [129]

this, a loss function must be specified and minimized. It is


essentially a weighted sum of the error (loss) between the
input content and the output image and the loss between the
original style and the applied style [161]. Over the last few
years, some research has been published to improve this tool
by considering the mapping of semantic patterns in content
and style images, from which the multimodal style transfer
(MST) emerges [159,160]. The authors of [159] proposed a
universal graph-based style transfer to transform multimodal
features by matching style patterns and semantic content and
appearance in a way that avoids the lack of flexibility in real-
world scenarios. The use of the graph cut technique allows a
better matching between content features and style clusters,
which was formulated as an energy minimization issue. In
[160], Wang et al. introduced a residual CNN architecture
and loss network to transfer the artistic style of the input
picture across multiple scales and dimensions. Specifically,
the residual network receives an image as input and learns
to produce multi-scale representations as output. These rep-
resentations are then separately considered as inputs to the
loss network so that a stylization loss can be computed for
each one. Fig. 14 Example of NST algorithm output to transform the style of a
painting to a given image

4.2.10 Medical data analysis


multiple imaging modalities (e.g., FA, OCT fundus, etc.) and
In recent years, deep learning algorithms have been devel- then implementing learning and optimization processes to
oped to save time and dependability during patient care by achieve greater registration accuracy. In unsupervised mode,
improving clinical accuracy and detecting abnormalities in it is possible to encode complex visual patterns over two
medical images [104,140]. As one application, retinal image input imaging modalities (3D MR and TRUS) without the
registration [142] is an increasingly challenging task of med- need for explicit labels [144]. In clinical practice, the differ-
ical image analysis that is receiving more and more attention ent imaging modalities (e.g., computed tomography (CT),
from the computer vision and healthcare communities. In chest X-rays, etc.) provide rich and informative features that
[142], Lee et al. proposed a new CNN-based retinal image allow for a more accurate diagnosis in the early stages of the
registration method to learn multimodal features simultane- disease [238]. More recently, the scientific community has
ously from several imaging modalities. This method consists already taken an active interest in this topic to fight against the
of combining CNN features and small patches taken from emerging Coronavirus, known as COVID-19 (SARS-CoV-2)

123
A survey on deep multimodal learning for computer vision: advances, trends, applications, and... 2959

[105]. To date, the COVID-19 pandemic has spread rapidly


in most countries of the world, endangering people’s lives.
Deep learning techniques and the availability of medical data
contributed considerably to tackling the pandemic. The lat-
est literature indicates that the combination of multimodal
data can predict and screen for this virus more accurately
[106,145]. However, many studies still need to be undertaken
in the future.

4.2.11 Autonomous systems

Up to now, deep learning has proven to be a powerful tool


for generating multimodal data suitable for robotics and
autonomous systems [146]. These systems involve, for exam- Fig. 15 Waymo self-driving car equipped with several on-board sensors
ple, the interaction of sophisticated perception/vision and [163]
haptic sensors (e.g., monocular cameras, stereo cameras, and
so on) [147], the merging of depth and color information
from RGB-D cameras [148], and so on. Figure 15 shows an result, more powerful feature extractors will require more
autonomous vehicle with several on-board sensors, includ- parameters and, therefore, more learning data. For instance,
ing a camera and several radars and LiDARs. Most existing Caesar et al. [155] demonstrated how generalization per-
approaches combine RGB data with infrared images or 3D formance could be greatly improved when developing a
LiDAR points [164] to improve the sensitivity of perception multimodal dataset, called nuScenes, which is acquired by
systems, which can be suitable to all conditions and scenar- a wide range of remote sensors, including six cameras, five
ios. For instance, RGB-D cameras (e.g., Microsoft Kinect, radars, and one LiDAR. The dataset consists of 1000 scenes
Asus Xtion, and so on) can provide color and pixel-wise depth in total, each about 20 s long and fully labeled with 3D bound-
information, characterizing the distance of visual objects in ing boxes that cover 23 classes and eight attributes.
a complex scene [199]. Among the advantages of these types
of sensors are their low computational cost, their long-range,
their ability to have an internal mechanism to limit the impact 5 Popular visual multimodal datasets
of bad weather, etc. [149]. More recently, some automated
systems, such as mobile robots, have been used in manufac- A growing trend towards deep multimodal learning has
turing environments. However, in a manufacturing context, been fuelled by the availability of high-dimensional mul-
these systems are usually already routinely programmed with tisource datasets obtained from various sensors, including
repetitive actions that lack the capacity for autonomy. They RGB-D cameras (depth sensors). Multimodal data acquisi-
also depend on an unstructured environment for autonomous tion is increasingly used in many research disciplines. The
decision making (e.g., navigation, localization, and environ- deep multimodal analysis relies on a large amount of het-
ment mapping (SLAM)). erogeneous sensor data to achieve high performance and
For decades, visual SLAM (simultaneous localization avoid overfitting problems. Until now, a series of benchmark
and mapping) has been an active area of research in the datasets have been developed for the training and valida-
robotics and computer vision communities [148,150]. The tion of deep multimodal learning algorithms. This opens up
challenge lies both in locating a robot and mapping its sur- the question of which ones should be chosen and how they
rounding environment. Several methods have been reported can be used for benchmarking purposes with state-of-the-art
to improve the mapping accuracy of real-time scenarios in methods. To answer this question, in this section we present
unstructured and large-scale environments. Some of these a selection of multimodal datasets commonly used in vision
methods include descriptor-based monocular cameras with applications, including RGB-D and RGB flow datasets. Typi-
ORB-SLAM [151], stereovision with ORB-SLAM2 [152], cally, optical flow information is used to capture the motion of
and photometric error-based methods such as LSD-SLAM moving objects in a video sequence. It was originally devel-
[153] or DSO [154]. However, there are still many challenges oped by Horn et al. [65], formulated as a two-dimensional
facing these data-driven automated systems, particularly for vector flow that captures spatio-temporal motion variations
intelligent perception and mapping. Some of these challenges in images under fairly controlled conditions in both indoor
are reflected in the fact that large amounts of data are required and outdoor environments. The emphasis on these modalities
to train models. Therefore, large-scale datasets are required (RGB, depth, and flow data) is based on the fact that for many
to ensure that systems produce the desired outcomes. As a vision-based multimodal problems, it has been shown that the

123
2960 K. Bayoudh et al.

fusion of optical flow and depth information with RGB yields combines a wide range of data types such as RGB stereo
the best performance [242,243]. A selection of RGB-D and rendering, optical flow maps, and so on.
RGB flow datasets and their detailed information is given in – MPI-Sintel: The dataset consists of 1040 annotated opti-
Table 3, so that researchers can easily choose the right dataset cal flow and corresponding RGB images from very long
for their needs. Table 3 shows the typical computer vision sequences.
tasks, such as object recognition and semantic segmentation,
along with their respective benchmark datasets.
All datasets listed in Table 3 will be detailed in the fol-
lowing paragraphs: 6 Discussion, limitations, and challenges

Over the last few decades, the deep learning paradigm has
– RGB-D Object: According to the original paper [55], proven its ability to outperform human expertise in many
the larger-scale RGB-D object dataset consists of RGB practices. Deep learning algorithms involve a sequence of
videos and depth sequences of 300 object instances in multiple layers of nonlinear processing units that are used
51 categories from multiple view angles for a total of to extract and transform feature vectors coming from raw
250,000 images. data. Up to now, the deep learning community is still seek-
– BigBIRD: The dataset was originally introduced by [56]. ing a better trade-off between complex model structuring,
It contains 125 objects, 600 RGB-D point clouds, and 600 computational power requirements, and real-time processing
12 megapixel images taken by two sensors: Kinect and capability. Among its assets, computer vision seeks to give
DSLR cameras. machines the visual capabilities of human beings thanks to
– A large dataset of object scans: It includes more than deep learning algorithms that are fed with information from a
10,000 scanned and reconstructed objects in nine cate- wide range of sensors. In recent years, the trend toward its use
gories acquired by PrimeSense Carmine cameras. in a fairly wide range of applications has become increasingly
– RGB-D Semantic Segmentation: The dataset has origi- evident. Therefore, it is necessary to develop applications that
nally been proposed in [58], it was acquired by the Kinect can automatically predict the target information. However,
RGB-D sensor. It contains six categories such as juice most current scene-content analysis methods are still limited
bottles, coffee cans and boxes of salt, etc. On the one in their ability to deal with information that is not usable in
hand, the training set contains three 3D models for each real-life contexts. But this field is very interesting for the sci-
category. On the other hand, the testing set includes 16 entific and industrial communities. This aspect of uncertainty
objects scenes. underlines the need to propose innovative and practical meth-
– RGB-D Scenes v.1: The dataset contains eight scenes in ods under very similar conditions to those used in practice.
which each scene corresponds to a single video sequence In general, capturing multimodal data streams under differ-
of several RGB-D images. ent acquisition conditions and increasing the data volume
– RGBD Scenes v.2: The dataset contains 14 scenes makes it easier to recognize visual content. Deep learning
of video sequences including furniture that have been models are often robust strategies for dealing with the lin-
acquired by the Kinect device. ear and nonlinear combination of multimodal data. Despite
– NYU: There are two versions of the dataset (NYU-v1 and the impressive results of deep multimodal learning, no abso-
NYU-v2) that were recorded by the Kinect sensor. On the lute conclusions can be drawn in this regard. Considering
one hand, NYU-v1 dataset contains 64 different indoor this exponential growth, the main challenges of multimodal
scenes and 108617 unlabelled images. On the other hand, learning methods are the following:
NYU-v2 Dataset includes 464 different indoor scenes and
407024 unlabeled images. – Dimensionality and data conflict: Confusion between
– RGB-D People: This dataset was initially introduced by various data sources is a challenge for future analysis.
[60], it consists of more than 3000 RGB-D images cap- The multimodal data is usually available in various for-
tured from Kinect sensors. mats. This variation makes it difficult to extract valuable
– SceneNet RGB-D: This dataset contains 5M RGB-D information from the data. However, multimodal infor-
images extracted from a total of 16895 configurations. mation generally has a large dimension. In other words,
– Kinetics-400: It consists of a massive dataset of YouTube acquiring and processing a large amount of multimodal
video URLs that includes a diverse set of human actions. data is costly in terms of computation complexity and
The dataset includes more than 300,000 video sequences memory consumption. Moreover, the synchronization of
across 400 classes of human action. temporal data allows maximizing the correlation between
– Scene Flow: The dataset includes over 39,000 high- the features of several levels of representation. However,
resolution frames from synthetic video sequences. It feature-level fusion is more flexible than decision-level

123
Table 2 Summary of the multimodal applications reviewed, their related technical details, and best results achieved
References Year Application Sensing modality/data sources Fusion scheme Dataset/best results

[73] 2018 Person recognition Face and body information Late fusion (Score-level DFB-DB1 (EER = 1.52%)
fusion)
ChokePoint (EER = 0.58%)
[51] 2015 Face recognition Holistic face + Rendered frontal pose data Late fusion LFW (ACC = 98.43%)
CASIA-WebFace (ACC = 99.02%)
[93] 2020 Face recognition Biometric traits (face and iris) Feature concatenation CASIA-ORL (ACC = 99.16%)
CASIA-FERET (ACC = 99.33%)
[100] 2016 Image retrieval Visual + Textual Joint embeddings Flickr30K (mAP = 47.72%; R@10 = 79.9%)
MSCOCO (R@10 = 86.9%)
[101] 2016 Image retrieval Photos + Sketches Joint embeddings Fine-grained SBIR Database (R@5 = 19.8%)
[102] 2015 Image retrieval Cross-view image pairs Alignment A dataset of 78k pairs of Google street-view images
(AP = 41.9%)
[103] 2019 Image retrieval Visual + Textual Feature concatenation Fashion-200k (R@50 = 63.8%)
MIT-State (R@10 = 43.1%)
CS (R@1 = 73.7%)
[97] 2015 Gesture recognition RGB + D Recurrent fusion, Late SKIG (ACC = 97.8%)
fusion, and Early fusion
[98] 2017 Gesture recognition RGB + D A canonical correlation Chalearn LAP IsoGD (ACC = 67.71%)
scheme
[200] 2019 Gesture recognition RGB + D + Opt. flow A spatio-temporal semantic VIVA hand gestures (ACC = 86.08%)
alignment loss (SSA)
EgoGesture (ACC = 93.87%)
A survey on deep multimodal learning for computer vision: advances, trends, applications, and...

NVGestures (ACC = 86.93%)


[52] 2019 Image captioning Visual + Textual RNN + Attention GoodNews (Bleu-1 = 8.92%)
mechanism
[53] 2019 Image captioning Visual + Textual + Acoustic Alignment MSCOCO (R@10 = 91.6%)
Flickr30K (R@10 = 79.0%)
[128] 2019 Image captioning Visual + Textual Alignment MSCOCO (BLUE-1 = 61.7%)
[174] 2019 Image captioning Visual + Textual Gated fusion network MSR-VTT (BLUE-1 = 81.2%)
MSVD (BLUE-4 = 53.9%)
[175] 2019 Image captioning Visual + Acoustic GRU Encoder-Decoder Proposed dataset (BLUE-1 = 36.9%)
[176] 2020 Image captioning Visual + Textual (Spatio-temporal data) Object-aware knowledge MSR-VTT (BLUE-4 = 40.5%)
distillation mechanism
MSVD (BLUE-4 = 52.2%)

123
2961
Table 2 continued
2962

References Year Application Sensing modality/data sources Fusion scheme Dataset/best results

123
[87] 2018 Vision-and-language navigation Visual + Textual (instructions) Attention mechanism + R2R (SPL = 18%)
LSTM
[88] 2019 Vision-and-language navigation Visual + Textual Attention mechanism + R2R (SPL = 38%)
Language Encoder
[118] 2020 Vision-and-language navigation Visual + Textual (instructions) Domain adaptation R2R (Performance gap = 8.6)
R4R (Performance gap = 23.9)
CVDN (Performance gap = 3.55)
[119] 2020 Vision-and-language navigation Visual + Textual (instructions) Early fusion + Late fusion R2R (SPL = 59%)
[120] 2020 Vision-and-language navigation Visual + Textual (instructions) Attention mechanism + VLN-CE (SPL = 35%)
Feature concatenation
[121] 2019 Vision-and-language navigation Visual + Textual (instructions) Encoder-decoder + ASKNAV (Success rate = 52.26%)
Multiplicative attention
mechanism
[89] 2018 Embodied question answering Visual + Textual (questions) Attention mechanism + EQA-v1 (MR = 3.22)
Alignment
[90] 2019 Embodied question answering Visual + Textual (questions) Feature concatenation EQA-v1 (ACC = 61.45%)
[122] 2019 Embodied question answering Visual + Textual (questions) Alignment VideoNavQA (ACC = 64.08%)
[125] 2019 Video question answering Visual + Textual (questions) Bilinear fusion TDIUC (ACC = 88.20%)
VQA-CP (ACC = 39.54%)
VQA-v2 (ACC = 65.14%)
[126] 2019 Video question answering Visual + Textual (questions) Alignment TGIF-QA (ACC = 53.8%)
MSVD-QA (ACC = 33.7%)
MSRVTT-QA (ACC = 33.00%)
Youtube2Text-QA (ACC = 82.5%)
[127] 2020 Video question answering Visual + Textual (questions) Hierarchical Conditional MSRVTT-QA (ACC = 35.6%)
Relation Networks
(HCRN)
MSVD-QA (ACC = 36.1%)
[129] 2019 Video question answering Visual + Textual (questions) Dual-LSTM + Spatial and TGIF-QA (l2 distance = 4.22)
temporal attention
[159] 2019 Style transfer Content + Style Graph based matching A dataset of images from MSCOCO and WikiArt
(PV = 33.45%)
[160] 2017 Style transfer Content + Style Hierarchical feature A dataset of images from MSCOCO (PS = 0.54s)
concatenation
K. Bayoudh et al.
A survey on deep multimodal learning for computer vision: advances, trends, applications, and... 2963

ACC accuracy, MR mean rank, SPL success weighted by path length, mAP mean average precision, AP average precision, R@i recall for setting i, PREC precision, PV percentage of the votes,
fusion due to the homogeneity of data samples. As men-
tioned before, some dimensionality reduction algorithms

A dataset of X-Ray, CT and Ultrasound images


Institutes of Health and Mount Sinai Hospital

KITTI Odometry (tr el =1.78%; rr el = 0.95%)


(e.g., k-NN, PCA, etc) and models already exist that com-

A total of 763 sets of data from the National

nuScenes (mAP = 28.9%; NDS = 44.9%)


press (encode) input signal or extract a reduced set of low
dimensional patterns to facilitate their analysis and fur-
Color fundus-OCT (ACC = 84.59%)

PS processing speed, TRE target registration error, NDS nuScenes detection score, ATE absolute trajectory error, Trel average translational error percentage, Rrel rotational error
AirSim (trel = 4.53%; rrel =8.75%)
Color fundus-FA (ACC = 90.10%)
ther processing.
– Data availability: One of the most significant chal-
lenges of deep multimodal learning is the large amount
of data required to learn discriminative feature maps. The
Dataset/best results

(TRE = 3.48mm)

(PREC = 100%) amount of multimodal data significantly affects the over-


all performance of the vision system. In some cases, the
number of training samples for a given dataset may not be
sufficient to effectively train a deeper or wider network.
However, networks trained with a limited number of
examples can no longer generalize well to a new dataset.
As mentioned earlier, several methods have been used to
increase the size of the dataset by generating additional
Filter-based approaches or

learning samples. One of the most common techniques


nonlinear optimization
Feature concatenation

includes data augmentation, which is a transformation


process that is applied to the input data to increase the
Fusion scheme

size of the data to make it more invariant. Also, AE


approaches
Data fusion

can address missing patterns by generating intermediate


Alignment
Alignment

shared representations from the input data and showing


intra- and inter-pattern correlations.
– Real-time processing and scalability: Multimodal real-
time data processing should be considered to improve
the performance of deep learning architectures. Current
RGB + D + Inertial measurements

trends focus on proposing complex architectures to build


Sensing modality/data sources

new real-time processing systems with a better trade-off


X-Ray + Ultrasound + CT

RGB + LiDAR + Radar

between accuracy and efficiency. However, the need to


reduce computing capacity remains the main challenge,
RGB + FA + OCT

which can lead to a deterioration in the overall accuracy of


MR + TRUS

training algorithms. Vision-based multimodal algorithms


constantly require new technological developments from
year to year (e.g., cloud computing technologies, local
GPU devices, etc.) to enable the growing scalability
needed to handle the next generation of multimodal appli-
cations. For instance, the edge/cloud computing solution
Medical data analysis

Medical data analysis


Medical data analysis

Autonomous systems
Autonomous systems

for mulitmodal analysis provides an effortless way to


create and handle multimodal datasets for training and
deploying models [107]. In practice, autonomous vehi-
Application

cles, healthcare robots, and other real-time embedded


systems consume more hardware resources, storage, and
battery than other emerging technologies, resulting in a
lack of adaptation to future needs.
2019

2018

2020

2019
2019
Year

7 Conclusion
Table 2 continued

This study provided a comprehensive overview of recent


References

multimodal deep learning in the computer vision commu-


nity. The focus of this survey is on the analogy between
[142]

[144]

[145]

[155]
[199]

inter- and intra-modal learning when dealing with heteroge-

123
2964 K. Bayoudh et al.

Table 3 A selection of the frequently used multimodal datasets in the literature


Reference Year Dataset Modality Main tasks Size

[55] 2011 RGB-D Object RGB + D Object recognition Contains 300 object instances under 51
categories from different angles for a total of
250,000 RGB-D images
[56] 2014 BigBIRD RGB + D Object recognition Contains 125 objects, 600 RGB-D point clouds,
and 600 12 megapixel images
[57] 2016 A large dataset of RGB + D Object recognition Contains more than 10,000 scanned and
object scans reconstructed objects in 9 categories
[58] 2011 RGB-D Semantic RGB + D Semantic segmentation Contains 3 3D models for 6 categories and 16
Segmentation test object scenes
[55] 2011 RGB-D Scenes v.1 RGB + D Object recognition Contains 8 video scenes from several RGB-D
images
Semantic segmentation
[55] 2014 RGB-D Scenes v.2 RGB + D Object recognition Contains 14 scenes of video sequences
Semantic segmentation
[59] 2011 NYU v1-v2 RGB + D Semantic segmentation NYU-v1 contains 64 different indoor scenes and
108617 unlabelled images. NYU-v2 contains
464 different indoor scenes and 407024
unlabeled images
[60] 2011 RGB-D People RGB + D Object recognition Contains more than 3000 RGB-D images
[61] 2016 SceneNet RGB-D RGB + D Semantic segmentation Contains 5M RGB-D images
Instance segmentation
Object detection
[62] 2017 Kinetics-400 RGB + Opt. flow Motion recognition Contains more than 300,000 video sequences in
400 classes
[63] 2016 Scene Flow RGB + Opt. flow Object segmentation Contains over 39,000 high resolution images
[64] 2012 MPI-Sintel RGB + Opt. flow Semantic segmentation Contains 1040 annotated optical flow and
matching RGB images
Object recognition

In this context, we provide a brief history of deep learning, summarize typical deep learning concepts and algorithms that have evolved from shallow networks to deeper architectures such as RNNs, DBNs, and DAEs, and show their role in multimodal fusion. We also provide an overview of the multimodal datasets commonly used in the literature (RGB-D and RGB-Opt. flow) and report a methodological analysis of computer vision problems and multimodal applications. Vision-based implementation strategies are also discussed in detail to clarify how multimodal algorithms can make fast and efficient decisions. The survey also presents state-of-the-art and widely used methods for producing uniform multimodal distributions across different modalities. Furthermore, it is important to note that each multimodal problem requires a specific fusion strategy, ranging from traditional methods to deep learning techniques. Nevertheless, choosing the right fusion scheme remains a vital challenge for the computer vision community in terms of accuracy and efficiency.
schemes remains a vital challenge for the computer vision 5. Lawrence, S., Giles, C.L.: Overfitting and neural networks: con-
community in terms of accuracy and efficiency. jugate gradient and backpropagation. In: Proceedings of the
IEEE-INNS-ENNS International Joint Conference on Neural Net-
works. IJCNN 2000. Neural Computing: New Challenges and
Perspectives for the New Millennium, pp. 114–119 (2000)


6. Bilbao, I., Bilbao, J.: Overfitting problem and the over-training 31. Chen, K. et al.: Hybrid task cascade for instance segmentation.
in the era of data: particularly for artificial neural networks. In: arXiv:1901.07518 (2019)
2017 Eighth International Conference on Intelligent Computing 32. Marechal, C. et al.: Survey on AI-based multimodal methods for
and Information Systems (ICICIS), pp. 173–177 (2017) emotion detection. In: High-Performance Modelling and Simu-
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classifica- lation for Big Data Applications: Selected Results of the COST
tion with deep convolutional neural networks. Commun. ACM Action IC1406 cHiPSet, pp. 307–324 (2019)
60, 84–90 (2017) 33. Radu, V., et al.: Multimodal deep learning for activity and con-
8. Rosenblatt, F.: Perceptron simulation experiments. Proc. IRE 48, text recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous
301–309 (1960) Technol. 1, 157:1–157:27 (2018)
9. Van Der Malsburg, C.: Frank Rosenblatt: principles of 34. Ramachandram, D., Taylor, G.W.: Deep multimodal learning: a
neurodynamics–perceptrons and the theory of brain mechanisms. survey on recent advances and trends. IEEE Signal Process. Mag.
Brain Theory, 245–248 (1986) 34, 96–108 (2017)
10. Huang, Y, Sun, S, Duan, X, Chen, Z.: A study on deep neural net- 35. Guo, W., Wang, J., Wang, S.: Deep multimodal representation
works framework. In: IEEE Advanced Information Management, learning: a survey. IEEE Access 7, 63373–63394 (2019)
Communicates, Electronic and Automation Control Conference 36. Lahat, D., Adali, T., Jutten, C.: Multimodal data fusion: an
(IMCEC), pp. 1519–1522 (2016) overview of methods, challenges, and prospects. Proc. IEEE
11. Sheela, K.G. Deepa, S.N.: Review on methods to fix number 103(9), 1449–1477 (2015)
of hidden neurons in neural networks. Math. Problems. Eng. 37. Baltrušaitis, T., Ahuja, C., Morency, L.-P.: Multimodal Machine
2013(25740) (2013) Learning: A Survey and Taxonomy. arXiv:1705.09406 (2017)
12. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning represen- 38. Morvant, E., Habrard, A., Ayache, S.: Majority vote of diverse
tations by back-propagating errors. Nature 323, 533–536 (1986) classifiers for late fusion. In: Structural, Syntactic, and Statistical
13. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Pattern Recognition, pp. 153–162 (2014)
Comput. 9, 1735–1780 (1997) 39. Liu, Z. et al.: Efficient Low-Rank Multimodal Fusion with
14. Li, S., Li, W., Cook, C., Zhu, C., Gao, Y.: Independently recurrent Modality-Specific Factors. arXiv:1806.00064 (2018)
neural network (IndRNN): building a longer and deeper RNN. 40. Zhang, D., Zhai, X.: SVM-based spectrum sensing in cognitive
arXiv:1803.04831 (2018) radio. In: 7th International Conference on Wireless Communica-
15. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm tions, Networking and Mobile Computing, pp. 1–4 (2011)
for deep belief nets. Neural Comput. 18, 1527–1554 (2006) 41. Gönen, M., Alpaydın, E.: Multiple Kernel learning algorithms. J.
16. Goodfellow, I.J., et al.: Generative adversarial networks. Mach. Learn. Res. 12, 2211–2268 (2011)
arXiv:1406.2661 (2014) 42. Aiolli, F., Donini, M.: EasyMKL: a scalable multiple kernel learn-
17. Turkoglu, M.O., Thong, W., Spreeuwers, L., Kicanaoglu, B.: ing algorithm. Neurocomputing 169, 215–224 (2015)
A layer-based sequential framework for scene generation with 43. Wen, H., et al.: Multi-modal multiple kernel learning for accurate
GANs. arXiv:1902.00671 (2019) identification of Tourette syndrome children. Pattern Recognit.
18. Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A.: Image-to-image trans- 63, 601–611 (2017)
lation with conditional adversarial networks. arXiv:1611.07004 44. Rabiner, L.R.: A tutorial on hidden Markov models and selected
(2018) applications in speech recognition. Proc. IEEE 77, 257–286
19. Creswell, A., et al.: Generative adversarial networks: an overview. (1989)
IEEE Signal Process. Mag. 35, 53–65 (2018) 45. Ghahramani, Z., Jordan, M.I.: Factorial hidden Markov models.
20. Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the Mach. Learn. 29, 245–273 (1997)
recent architectures of deep convolutional neural networks. Artif 46. Linde, Y., Buzo, A., Gray, R.: An algorithm for vector quantizer
Intell Rev (2020) design. IEEE Trans. Commun. 28, 84–95 (1980)
21. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based 47. Gael, J.V., Teh, Y.W., Ghahramani, Z.: The infinite factorial hid-
learning applied to document recognition. Proc. IEEE 86, 2278– den Markov model. In: Proceedings of the 21st International
2324 (1998) Conference on Neural Information Processing Systems, pp. 1697–
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks 1704 (2008)
for large-scale image recognition. arXiv:1409.1556 (2015) 48. Alam, M. R., Bennamoun, M., Togneri, R., Sohel, F.: A deep
23. Stone, J.V.: Principal component analysis and factor analysis. neural network for audio-visual person recognition. In: IEEE 7th
In: Independent Component Analysis: A Tutorial Introduction, International Conference on Biometrics Theory, Applications and
MITP, pp. 129–135 (2004) Systems (BTAS), pp. 1–6 (2015)
24. Sermanet, P. et al.: OverFeat: integrated recognition, localiza- 49. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Com-
tion and detection using convolutional networks. arXiv:1312.6229 put. Vis. 57, 137–154 (2004)
(2014) 50. Wang, M., Deng, W.: Deep Face Recognition: A Survey.
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only arXiv:1804.06655 (2019)
look once: unified, real-time object detection. arXiv:1506.02640 51. Ding, C., Tao, D.: Robust face recognition via multimodal deep
(2016) face representation. IEEE Trans. Multimed. 17, 2049–2058 (2015)
26. Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object 52. Biten, A.F., Gomez, L., Rusiñol, M., Karatzas, D.: Good News,
detection and instance segmentation. arXiv:1906.09756 (2019) Everyone! Context driven entity-aware captioning for news
27. Thoma, M.: A survey of semantic segmentation. images. arXiv:1904.01475 (2019)
arXiv:1602.06541 (2016) 53. Peri, D., Sah, S., Ptucha, R.: Show, Translate and Tell.
28. Guo, Y., Liu, Y., Georgiou, T., Lew, M.S.: A review of semantic arXiv:1903.06275 (2019)
segmentation using deep neural networks. Int. J. Multimed. Infom. 54. Duan, G., Yang, J., Yang, Y.: Content-based image retrieval
Retr. 7, 87–93 (2018) research. Phys. Proc. 22, 471–477 (2011)
29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks 55. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-
for semantic segmentation. arXiv:1411.4038 (2015) view RGB-D object dataset. In: IEEE International Conference
30. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. on Robotics and Automation, pp. 1817–1824 (2011)
arXiv:1703.06870 (2018)


56. Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: Big- 80. Uijlings, J.R., Sande, K.E., Gevers, T., Smeulders, A.W.: Selective
BIRD: A large-scale 3D database of object instances. In: IEEE search for object recognition. Int. J. Comput. Vis. 104, 154–171
International Conference on Robotics and Automation (ICRA), (2013)
pp. 509–516 (2014) 81. Xiao, Y., Codevilla, F., Gurram, A., Urfalioglu, O., López, A.M.:
57. Choi, S., Zhou, Q.-Y., Miller, S., Koltun, V.: A Large Dataset of Multimodal end-to-end autonomous driving. arXiv:1906.03199
Object Scans. arXiv:1602.02481 (2016) (2019)
58. Tombari, F., Di Stefano, L., Giardino, S.: Online learning for 82. 1.Mohanapriya, D., Mahesh, K.: Chapter 5—an efficient frame-
automatic segmentation of 3D data. In: IEEE/RSJ International work for object tracking in video surveillance. In: The Cognitive
Conference on Intelligent Robots and Systems, pp. 4857–4864 Approach in Cloud Computing and Internet of Things Technolo-
(2011) gies for Surveillance Tracking Systems, pp. 65–74 (2020)
59. Silberman, N., Fergus, R.: Indoor scene segmentation using a 83. Rangesh, A., Trivedi, M.M.: No blind spots: full-surround multi-
structured light sensor. In: International Conference on Computer object tracking for autonomous vehicles using cameras and
Vision Workshops (2011) LiDARs. IEEE Trans. Intelli. Veh. 4, 588–599 (2019)
60. Spinello, L., Arras, K.O.: People detection in RGB-D data. In: 84. Liu, L., et al.: Deep learning for generic object detection: a survey.
Intelligent and Robotic Systems (2011) Int. J. Comput. Vis. 128, 261–318 (2020)
61. Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., Cipolla, 85. Nowlan, S., Platt, J.: A convolutional neural network hand tracker.
R.: SceneNet: Understanding Real World Indoor Scenes With In: Advances in Neural Information Processing Systems, pp. 901–
Synthetic Data. arXiv:1511.07041 (2015) 908 (1995)
62. Kay, W. et al.: The Kinetics Human Action Video Dataset. 86. Ciaparrone, G., et al.: Deep learning in video multi-object track-
arXiv:1705.06950 (2017) ing: a survey. Neurocomputing 381, 61–88 (2020)
63. Mayer, N. et al.: A large dataset to train convolutional networks for 87. Anderson, P. et al.: Vision-and-language navigation: interpreting
disparity, optical flow, and scene flow estimation. In: IEEE Con- visually-grounded navigation instructions in real environments.
ference on Computer Vision and Pattern Recognition (CVPR), pp. In: IEEE/CVF Conference on Computer Vision and Pattern
4040–4048 (2016) Recognition, pp. 3674–3683 (2018)
64. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalis- 88. Wang, X. et al.: Reinforced Cross-Modal Matching and Self-
tic open source movie for optical flow evaluation. Comput. Vis. Supervised Imitation Learning for Vision-Language Navigation.
ECCV 2012, 611–625 (2012) arXiv:1811.10092 (2019)
65. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. 89. Das, A. et al.: Embodied question answering. In: Proceedings of
Intell. 17, 185–203 (1981) the IEEE Conference on Computer Vision and Pattern Recogni-
66. Wang, W., Fu, Y., Pan, Z., Li, X., Zhuang, Y.: Real-time driv- tion, pp. 1–10 (2018)
ing scene semantic segmentation. IEEE Access 8, 36776–36788 90. Yu, L. et al.: Multi-target embodied question answering. In: Pro-
(2020) ceedings of the IEEE Conference on Computer Vision and Pattern
67. Jiao, L., et al.: A survey of deep learning-based object detection. Recognition, pp. 6309–6318 (2019)
IEEE Access 7, 128837–128868 (2019) 91. Wang, A., Lu, J., Wang, G., Cai, J., Cham, T.-J.: Multi-modal
68. Dilawari, A., Khan, M.U.G.: ASoVS: abstractive summarization unsupervised feature learning for RGB-D scene labeling. In: Com-
of video sequences. IEEE Access 7, 29253–29263 (2019) puter Vision—ECCV, pp. 453–467 (2014)
69. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random 92. Dargan, S., Kumar, M.: A comprehensive survey on the biomet-
fields: probabilistic models for segmenting and labeling sequence ric recognition systems based on physiological and behavioral
data. In: Proceedings of the Eighteenth International Conference modalities. Expert Syst. Appl. 143, 113114 (2020)
on Machine Learning, pp. 282–289 (2001) 93. Ammour, B., Boubchir, L., Bouden, T., Ramdani, M.: Face-Iris
70. Shao, L., Zhu, F., Li, X.: Transfer learning for visual catego- multimodal biometric identification system. Electronics 9, 85
rization: a survey. IEEE Trans. Neural Netw. Learn. Syst. 26, (2020)
1019–1034 (2015) 94. Namin, S.T., Najafi, M., Salzmann, M., Petersson, L.: Cutting
71. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep edge: soft correspondences in multimodal scene parsing. In: IEEE
Boltzmann machines. J. Mach. Learn. Res. 15(1), 2949–2980 International Conference on Computer Vision (ICCV), pp. 1188–
(2014) 1196 (2015)
72. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: 95. Zou, C., Guo, R., Li, Z., Hoiem, D.: Complete 3D scene parsing
Artificial Intelligence and Statistics, pp. 448–455 (2009) from an RGBD image. Int. J. Comput. Vis. 127, 143–162 (2019)
73. Koo, J.H., Cho, S.W., Baek, N.R., Kim, M.C., Park, K.R.: 96. Escalera, S., Athitsos, V., Guyon, I.: Challenges in multimodal
CNN-based multimodal human recognition in surveillance envi- gesture recognition. J. Mach. Learn. Res. 17, 1–54 (2016)
ronments. Sensors 18, 3040 (2018) 97. Nishida, N., Nakayama, H.: Multimodal gesture recognition
74. Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich feature hier- using multi-stream recurrent neural network. In: Revised Selected
archies for accurate object detection and semantic segmentation. Papers of the 7th Pacific-Rim Symposium on Image and Video
arXiv:1311.2524 (2014) Technology, pp. 682–694 (2015)
75. Girshick, R.: Fast R-CNN. arXiv:1504.08083 (2015) 98. Miao, Q. et al.: Multimodal gesture recognition based on the
76. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards ResC3D network. In: IEEE International Conference on Com-
real-time object detection with region proposal networks. puter Vision Workshops (ICCVW), pp. 3047–3055 (2017)
arXiv:1506.01497 (2016) 99. Tran, D., Ray, J., Shou, Z., Chang, S.-F., Paluri, M.: Con-
77. Lin, T.-Y. et al.: Feature pyramid networks for object detection. vNet architecture search for spatiotemporal feature learning.
arXiv:1612.03144 (2017) arXiv:1708.05038 (2017)
78. Liu, W. et al.: SSD: single shot multibox detector, pp. 21–37. 100. Wang, L., Li, Y., Lazebnik, S.: Learning deep structure-preserving
arXiv:1512.02325 (2016) image-text embeddings. In: IEEE Conference on Computer Vision
79. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss and Pattern Recognition (CVPR), pp. 5005–5013 (2016)
for dense object detection. arXiv:1708.02002 (2018) 101. Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database:
learning to retrieve badly drawn bunnies. ACM Trans. Graph. 35,
119:1–119:12 (2016)


102. Lin, T.-Y., Yin Cui, Belongie, S., Hays, J.: Learning deep represen- 126. Fan, C. et al.: Heterogeneous memory enhanced multimodal
tations for ground-to-aerial geolocalization. In: IEEE Conference attention model for video question answering. In: CVPR, pp.
on Computer Vision and Pattern Recognition (CVPR), pp. 5007– 1999–2007 (2019)
5015 (2015) 127. Le, et al.: Hierarchical Conditional Relation Networks for Video
103. Vo, N. et al.: Composing text and image for image retrieval—an Question Answering. arXiv:2002.10698 (2020)
empirical odyssey. In: Proceedings of the IEEE Conference on 128. Laina, I., et al.: Towards unsupervised image captioning with
Computer Vision and Pattern Recognition, pp. 6439–6448 (2019) shared multimodal embeddings. In: ICCV, pp. 7414–7424 (2019)
104. Xu, Y.: Deep learning in multimodal medical image analysis. In: 129. Jang, Y., et al.: Video question answering with spatio-temporal
Health Information Science, pp. 193–200 (2019) reasoning. Int. J. Comput. Vis. 127, 1385–1412 (2019)
105. Shi, F., et al.: Review of artificial intelligence techniques in imag- 130. Wang, W., et al.: A survey of zero-shot learning: settings, methods,
ing data acquisition, segmentation and diagnosis for COVID-19. and applications. ACM Trans. Intell. Syst. Technol. 10, 13:1–
IEEE Rev. Biomed. Eng. 1, 2020 (2020) 13:37 (2019)
106. Santosh, K.C.: AI-driven tools for coronavirus outbreak: need of 131. Wei, L., et al.: A single-shot multi-level feature reused neural
active learning and cross-population train/test models on multitu- network for object detection. Vis. Comput. (2020). https://doi.
dinal/multimodal data. J. Med. Syst. 44, 93 (2020) org/10.1007/s00371-019-01787-3
107. Wang, X., et al.: Convergence of edge computing and deep learn- 132. Hascoet, T., et al.: Semantic embeddings of generic objects for
ing: a comprehensive survey. IEEE Commun. Surv. Tutorials 1, zero-shot learning. J. Image Video Proc. 2019, 13 (2019)
2020 (2020) 133. Liu, Y., et al.: Attribute attention for semantic disambiguation in
108. Ruder, S.: An Overview of Multi-Task Learning in Deep Neural zero-shot learning. In: ICCV, pp. 6697–6706 (2019)
Networks. arXiv:1706.05098 (2017) 134. Li, K., et al.: Rethinking zero-shot learning: a conditional visual
109. Ruder, S., Bingel, J., Augenstein, I., Søgaard, A.: Latent Multi- classification perspective. In: ICCV, pp. 3582–3591 (2019)
task Architecture Learning. arXiv:1705.08142 (2018) 135. Liu, Y., Tuytelaars, T.: A: deep multi-modal explanation model
110. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 for zero-shot learning. IEEE Trans. Image Process. 29, 4788–4803
(1997) (2020)
111. Duong, L., Cohn, T., Bird, S., Cook, P. low resource depen- 136. Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating
dency parsing: cross-lingual parameter sharing in a neural network networks for zero-shot learning. In: CVPR, pp. 5542–5551 (2018)
parser. In: Proceedings of the 53rd Annual Meeting of the Asso- 137. Kumar, Y. et al.: Harnessing GANs for Zero-shot Learning of New
ciation for Computational Linguistics and the 7th International Classes in Visual Speech Recognition. arXiv:1901.10139 (2020)
Joint Conference on Natural Language Processing, pp. 845–850 138. Zhang, X., et al.: Online multi-object tracking with pedestrian
(2015) re-identification and occlusion processing. Vis. Comput. (2020).
112. Peng, Y., et al.: CCL: cross-modal correlation learning with multi- https://doi.org/10.1007/s00371-020-01854-0
grained fusion by hierarchical network. IEEE Trans. Multimed. 139. Abbass, M.Y., et al.: Efficient object tracking using hierarchical
20(2), 405–420 (2017) convolutional features model and correlation filters. Vis. Comput.
113. Palaskar, S., Sanabria, R., Metze, F.: Transfer learning for multi- (2020). https://doi.org/10.1007/s00371-020-01833-5
modal dialog. Comput. Speech Lang. 64, 101093 (2020) 140. Xi, P.: An integrated approach for medical abnormality detection
114. Libovický, J., Helcl, J.: Attention strategies for multi-source using deep patch convolutional neural networks. Vis. Comput. 36,
sequence-to-sequence learning. In: Proceedings of the 55th 1869–1882 (2020)
Annual Meeting of the Association for Computational Linguistics 141. Parida, K., et al.: Coordinated joint multimodal embeddings for
(Vol. 2: Short Papers), pp. 196–202 (2017) generalized audio-visual zero-shot classification and retrieval of
115. He, G., et al.: Classification-aware semi-supervised domain adap- videos. In: CVPR, pp. 3251–3260 (2020)
tation. In: CVPR, pp. 964–965 (2020) 142. Lee, J. A., et al.: Deep step pattern representation for multimodal
116. Rao, R., et al.: Quality and relevance metrics for selection of retinal image registration. In: CVPR, pp. 5077–5086 (2019)
multimodal pretraining data. In: CVPR, pp. 956–957 (2020) 143. Hashemi Hosseinabad, S., Safayani, M., Mirzaei, A.: Multiple
117. Bucci, S., Loghmani, M.R., Caputo, B.: Multimodal Deep Domain answers to a question: a new approach for visual question answer-
Adaptation. arXiv:1807.11697 (2018) ing. Vis. Comput. (2020). https://doi.org/10.1007/s00371-019-
118. Zhang, Y., Tan, H., Bansal, M.: Diagnosing the Environment Bias 01786-4
in Vision-and-Language Navigation. arXiv:2005.03086 (2020) 144. Yan, P., et al.: Adversarial image registration with application for
119. Landi, F., et al.: Perceive, Transform, and Act: Multi- mr and trus image fusion. arXiv:1804.11024 (2018)
Modal Attention Networks for Vision-and-Language Navigation. 145. Horry, Michael. J. et al.: COVID-19 Detection through Transfer
arXiv:1911.12377 (2020) Learning using Multimodal Imaging Data. IEEE Access 1 (2020)
120. Krantz, et al.: Beyond the Nav-Graph: Vision-and-Language Nav- https://doi.org/10.1109/ACCESS.2020.3016780
igation in Continuous Environments. arXiv:2004.02857 (2020) 146. Grigorescu, S., Trasnea, B., Cocias, T., Macesanu, G.: A survey of
121. Nguyen, K., et al.: Vision-based Navigation with Language-based deep learning techniques for autonomous driving. J. Field Robot.
Assistance via Imitation Learning with Indirect Intervention. 37, 362–386 (2020)
arXiv:1812.04155 (2019) 147. Metzger, A., Drewing, K.: Memory influences haptic perception
122. Cangea, et al.: VideoNavQA: Bridging the Gap between Visual of softness. Sci. Rep. 9, 14383 (2019)
and Embodied Question Answering. arXiv:1908.04950 (2019) 148. Guclu, O., Can, A.B.: Integrating global and local image features
123. Zarbakhsh, P., Demirel, H.: 4D facial expression recognition using for enhanced loop closure detection in RGB-D SLAM systems.
multimodal time series analysis of geometric landmark-based Vis. Comput. 36, 1271–1290 (2020)
deformations. Vis. Comput. 36, 951–965 (2020) 149. Van Brummelen, J., et al.: Autonomous vehicle perception: the
124. Joze, H.R.V., et al.: MMTM: multimodal transfer module for CNN technology of today and tomorrow. Transp. Res. C Emerg. Tech-
fusion. In: CVPR, pp. 13289–13299 (2020) nol. 89, 384–406 (2018)
125. Cadene, et al.: MUREL: multimodal relational reasoning for 150. He, M., et al.: A review of monocular visual odometry. Vis. Com-
visual question answering. In: CVPR, pp. 1989–1998 (2019) put. 36, 1053–1065 (2020)
151. Liu, S., et al.: Accurate and robust monocular SLAM with omni-
directional cameras. Sensors 19, 4494 (2019)


152. Mur-Artal, R., Tardos, J.D.: ORB-SLAM2: an open-source 181. Wu, X., Sahoo, D. Hoi, S.C.H.: Recent Advances in Deep Learn-
SLAM system for monocular. Stereo RGB-D Cameras (2016). ing for Object Detection. arXiv:1908.03673 (2019)
https://doi.org/10.1109/TRO.2017.2705103 182. Pouyanfar, S., et al.: A survey on deep learning: algorithms, tech-
153. Engel, J., et al.: LSD-SLAM: large-scale direct monocular SLAM. niques, and applications. ACM Comput. Surv. 51, 92:1–92:36
In: Computer Vision—ECCV, pp. 834–849 (2014) (2018)
154. Engel, J., et al.: Direct Sparse Odometry. arXiv:1607.02565 183. Ophoff, T., et al.: Exploring RGB+depth fusion for real-time
(2016) object detection. Sensors 19, 866 (2019)
155. Caesar, H., et al.: nuScenes: a multimodal dataset for autonomous 184. Luo, Q., et al.: 3D-SSD: learning hierarchical features from RGB-
driving. In: CVPR, pp. 11621–11631 (2020) D images for amodal 3D object detection. Neurocomputing 378,
156. Gatys, L., et al.: A Neural Algorithm of Artistic Style. 364–374 (2020)
arXiv:1508.06576 (2015) 185. Zhang, S., et al.: Video object detection base on rgb and optical
157. Lian, G., Zhang, K.: Transformation of portraits to Picasso’s flow analysis. In: CCHI, pp. 280–284 (2019). https://doi.org/10.
cubism style. Vis. Comput. 36, 799–807 (2020) 1109/CCHI.2019.8901921
158. Wang, L., et al.: Photographic style transfer. Vis. Comput. 36, 186. Simon, M., et al.: Complexer-YOLO: real-time 3D object detec-
317–331 (2020) tion and tracking on semantic point clouds. In: CVPRW, pp.
159. Zhang, Y. et al.: Multimodal style transfer via graph cuts. In: 1190–1199 (2019). https://doi.org/10.1109/CVPRW.2019.00158
ICCV, pp. 5943–5951 (2019) 187. Tu, S., et al.: Passion fruit detection and counting based on mul-
160. Wang, X., et al.: Multimodal Transfer: A Hierarchical Deep tiple scale faster R-CNN using RGB-D images. Precision Agric.
Convolutional Neural Network for Fast Artistic Style Transfer. 21, 1072–1091 (2020)
arXiv:1612.01895 (2017) 188. Li, J., et al.: Facial expression recognition with faster R-CNN.
161. Jing, Y., et al.: Neural Style Transfer: A Review. Proc. Comput. Sci. 107, 135–140 (2017)
arXiv:1705.04058 (2018) 189. Liu, S.: Enhanced situation awareness through CNN-based deep
162. DeepArts: turn your photos into art. https://deepart.io (2020). multimodal image fusion. OE 59, 053103 (2020)
Accessed 18 Aug 2020 190. Michael, Y.B., Rosenhahn, V.M.: Multimodal Scene Understand-
163. Waymo: Waymo safety report: On the road to fully self-driving. ing, 1st edn. Academic Press, London (2019)
https://waymo.com/safety (2020). Accessed 18 Aug 2020 191. Djuric, N., et al.: MultiXNet: Multiclass Multistage Multimodal
164. Wang, Z., Wu, Y., Niu, Q.: Multi-sensor fusion in automated driv- Motion Prediction. arXiv:2006.02000 (2020)
ing: a survey. IEEE Access 8, 2847–2868 (2020) 192. Asvadi, A., et al.: Multimodal vehicle detection: fusing 3D-
165. Ščupáková, K., et al.: A patch-based super resolution algorithm LIDAR and color camera data. Pattern Recogn. Lett. 115, 20–29
for improving image resolution in clinical mass spectrometry. Sci. (2018)
Rep. 9, 2915 (2019) 193. Mahmud, T., et al.: A novel multi-stage training approach for
166. Bashiri, F.S., et al.: Multi-modal medical image registration with human activity recognition from multimodal wearable sensor data
full or partial data: a manifold learning approach. J. Imag. 5, 5 using deep neural network. IEEE Sens. J. (2020). https://doi.org/
(2019) 10.1109/JSEN.2020.3015781
167. Chen, C., et al. Progressive Feature Alignment for Unsupervised 194. Zhang, W., et al.: Robust Multi-Modality Multi-Object Tracking.
Domain Adaptation. arXiv:1811.08585 (2019) arXiv:1909.03850 (2019)
168. Jin, X., et al.: Feature Alignment and Restoration for Domain 195. Kandylakis, Z., et al.: Fusing multimodal video data for detecting
Generalization and Adaptation. arXiv:2006.12009 (2020) moving objects/targets in challenging indoor and outdoor scenes.
169. Guan, S.-Y., et al.: A review of point feature based medical image Remote Sens. 11, 446 (2019)
registration. Chin. J. Mech. Eng. 31, 76 (2018) 196. Yang, R., et al.: Learning target-oriented dual attention for robust
170. Dapogny, A., et al.: Deep Entwined Learning Head Pose and RGB-T tracking. In: ICIP, pp. 3975–3979 (2019). https://doi.org/
Face Alignment Inside an Attentional Cascade with Doubly- 10.1109/ICIP.2019.8803528
Conditional fusion. arXiv:2004.06558 (2020) 197. Lan, X., et al.: Modality-correlation-aware sparse representation
171. Yue, L., et al.: Attentional alignment network. In: BMVC (2018) for RGB-infrared object tracking. Pattern Recogn. Lett. 130, 12–
172. Liu, Z., et al.: Semantic Alignment: Finding Semanti- 20 (2020)
cally Consistent Ground-truth for Facial Landmark Detection. 198. Bayoudh, K., et al.: Transfer learning based hybrid 2D–3D CNN
arXiv:1903.10661 (2019) for traffic sign recognition and semantic road detection applied in
173. Hao, F., et al.: Collect and select: semantic alignment metric learn- advanced driver assistance systems. Appl. Intell. (2020). https://
ing for few-shot learning. In: CVPR, pp. 8460–8469 (2019) doi.org/10.1007/s10489-020-01801-5
174. Wang, B., et al.: Controllable Video Captioning with 199. Shamwell, E.J., et al.: Unsupervised deep visual-inertial odometry
POS Sequence Guidance Based on Gated Fusion Network. with online error correction for RGB-D imagery. IEEE Trans. Pat-
arXiv:1908.10072 (2019) tern Anal. Mach. Intell. (2019). https://doi.org/10.1109/TPAMI.
175. Wu, M., et al.: Audio caption: listen and tell. In: ICASSP, pp. 2019.2909895
830–834 (2019) https://doi.org/10.1109/ICASSP.2019.8682377 200. Abavisani, M., et al.: Improving the Performance of Unimodal
176. Pan, B., et al. Spatio-temporal graph for video captioning with Dynamic Hand-Gesture Recognition with Multimodal Training.
knowledge distillation. In: CVPR, pp. 10870–10879 (2020) arXiv:1812.06145 (2019)
177. Liu, X., Xu, Q., Wang, N.: A survey on deep neural network-based 201. Yang, X., et al.: A survey on canonical correlation analysis. IEEE
image captioning. Vis. Comput. 35, 445–470 (2019) Trans. Knowl. Data Eng. 1, 2019 (2019)
178. Abbass, M.Y., et al.: A survey on online learning for visual track- 202. Hardoon, D.R., et al.: Canonical correlation analysis: an overview
ing. Vis. Comput. (2020). https://doi.org/10.1007/s00371-020- with application to learning methods. Neural Comput. 16, 2639–
01848-y 2664 (2004)
179. Guo, Y., et al.: Deep learning for visual understanding: a review. 203. Chandar, S., et al.: Correlational neural networks. Neural Comput.
Neurocomputing 187, 27–48 (2016) 28, 257–285 (2016)
180. Hatcher, W.G., Yu, W.: A survey of deep learning: platforms, 204. Engilberge, M., et al.: Finding beans in burgers: deep semantic-
applications and emerging research trends. IEEE Access 6, visual embedding with localization. In: CVPR, pp. 3984–3993
24411–24432 (2018) (2018)


205. Shahroudy, A., et al.: Deep Multimodal Feature Analysis for Person re-identification. Vis. Comput. (2020). https://doi.org/10.
Action Recognition in RGB+D Videos. arXiv:1603.07120 (2016) 1007/s00371-020-02015-z
206. Srivastava, N., et al.: Multimodal learning with deep Boltzmann 229. Xu, T., et al.: AttnGAN: fine-grained text to image generation
machines. J. Mach. Learn. Res. 15, 2949–2980 (2014) with attentional generative adversarial networks. In: IEEE/CVF
207. Bank, D., et al.: Autoencoders. arXiv:2003.05991 (2020) Conference on Computer Vision and Pattern Recognition, pp.
208. Bhatt, G., Jha, P., Raman, B.: Representation learning using step- 1316–1324 (2018)
based deep multi-modal autoencoders. Pattern Recogn. 95, 12–23 230. Huang, X., et al.: Multimodal unsupervised image-to-image trans-
(2019) lation. In: CVPR, pp. 172–189 (2018)
209. Liu, Y., Feng, X., Zhou, Z.: Multimodal video classification with 231. Toriya, H., et al.: SAR2OPT: image alignment between multi-
stacked contractive autoencoders. Signal Process. 120, 761–766 modal images using generative adversarial networks. In: IEEE
(2016) International Geoscience and Remote Sensing Symposium, pp.
210. Kim, J., Chung, K.: Multi-modal stacked denoising autoencoder 923–926 (2019)
for handling missing data in healthcare big data. IEEE Access 8, 232. Chaudhari, S., et al.: An Attentive Survey of Attention Models.
104933–104943 (2020) arXiv:1904.02874 (2020)
211. Singh, V., et al.: Feature learning using stacked autoencoder for 233. Hori, C., et al.: Attention-based multimodal fusion for video
shared and multimodal fusion of medical images. In: Computa- description. In: IEEE International Conference on Computer
tional Intelligence: Theories, Applications and Future Directions, Vision (ICCV), pp. 4203–4212 (2017)
pp. 53–66 (2019) 234. Huang, X., Wang, M., Gong, M.: Fine-grained talking face gener-
212. Said, A. B., et al.: Multimodal deep learning approach for joint ation with video reinterpretation. Vis. Comput. 37, 95–105 (2021)
EEG-EMG data compression and classification. In: IEEE Wire- 235. Liu, Z., et al.: Multi-level progressive parallel attention guided
less Communications and Networking Conference (WCNC), pp. salient object detection for RGB-D images. Vis. Comput. (2020).
1–6 (2017) https://doi.org/10.1007/s00371-020-01821-9
213. Ma, L., et al.: Multimodal convolutional neural networks for 236. Yang, Z., et al.: Stacked attention networks for image question
matching image and sentence. In: IEEE International Conference answering. In: IEEE Conference on Computer Vision and Pattern
on Computer Vision (ICCV), pp. 2623–2631 (2015) Recognition (CVPR), pp. 21–29 (2016)
214. Couprie, C., et al.: Toward real-time indoor semantic segmentation 237. Guo, L., et al.: Normalized and geometry-aware self-attention net-
using depth information. J. Mach. Learn. Res. (2014) work for image captioning. In: CVPR, pp. 10327–10336 (2020)
215. Madhuranga, D., et al.: Real-time multimodal ADL recognition 238. Bayoudh, K., et al.: Hybrid-COVID: a novel hybrid 2D/3D
using convolution neural networks. Vis. Comput. (2020) CNN based on cross-domain adaptation approach for COVID-
216. Gao, M., et al.: RGB-D-based object recognition using multi- 19 screening from chest X-ray images. Phys. Eng. Sci. Med. 43,
modal convolutional neural networks: a survey. IEEE Access 7, 1415–1431 (2020)
43110–43136 (2019) 239. Zhang, S., et al.: Joint learning of image detail and transmission
217. Zhang, Z., et al.: RGB-D-based gaze point estimation via multi- map for single image dehazing. Vis. Comput. 36, 305–316 (2020)
column CNNs and facial landmarks global optimization. Vis. 240. Zhang, S., He, F.: DRCDN: learning deep residual convolutional
Comput. (2020) dehazing networks. Vis. Comput. 36, 1797–1808 (2020)
218. Singh, R., et al.: Combining CNN streams of dynamic image and 241. Basly, H., et al.: DTR-HAR: deep temporal residual representation
depth data for action recognition. Multimed. Syst. 26, 313–322 for human activity recognition. Vis. Comput. (2021). https://doi.
(2020) org/10.1007/s00371-021-02064-y
219. Abdulnabi, A.H., et al.: Multimodal recurrent neural networks 242. Zhou, T., et al.: RGB-D salient object detection: a survey. Comp.
with information transfer layers for indoor scene labeling. IEEE Vis. Med. (2021). https://doi.org/10.1007/s41095-020-0199-z
Trans. Multimed. 20, 1656–1671 (2018) 243. Savian, S., et al.: Optical flow estimation with deep learning, a
220. Zhao, D., et al.: A multimodal fusion approach for image caption- survey on recent advances. In: Deep Biometrics, pp. 257–287
ing. Neurocomputing 329, 476–485 (2019) (2020)
221. Li, X., et al.: Multi-modal gated recurrent units for image descrip-
tion. Multimed. Tools Appl. 77, 29847–29869 (2018)
222. Sano, A., et al.: Multimodal ambulatory sleep detection using Publisher’s Note Springer Nature remains neutral with regard to juris-
lstm recurrent neural networks. IEEE J. Biomed. Health Inform. dictional claims in published maps and institutional affiliations.
23, 1607–1617 (2019)
223. Shu, Y., et al.: Bidirectional multimodal recurrent neural networks
with refined visual features for image captioning. In: Internet Mul- Khaled Bayoudh received a Bach-
timedia Computing and Service, pp. 75–84 (2018) elor’s degree in Computer Science
224. Song, H., et al.: S2 RGANS: sonar-image super-resolution based from the Higher Institute of Com-
on generative adversarial network. Vis. Comput. (2020). https:// puter Science and Mathematics
doi.org/10.1007/s00371-020-01986-3 of Monastir (ISIMM), University
225. Ma, T., Tian, W.: Back-projection-based progressive growing gen- of Monastir, Monastir, Tunisia, in
erative adversarial network for single image super-resolution. Vis. 2014. Then, he graduated with a
Comput. (2020). https://doi.org/10.1007/s00371-020-01843-3 Master’s degree in Highway and
226. Rohith, G., Kumar, L.S.: Paradigm shifts in super-resolution tech- Traffic Engineering: Curricular
niques for remote sensing applications. Vis. Comput. (2020). Reform for Mediterranean Area
https://doi.org/10.1007/s00371-020-01957-8 (HiT4Med) from the National Engi-
227. Jia, X., et al.: TICS: text-image-based semantic CAPTCHA syn- neering School of Sousse (ENISo),
thesis via multi-condition adversarial learning. Vis. Comput. University of Sousse, Sousse,
(2021). https://doi.org/10.1007/s00371-021-02061-1 Tunisia, in 2017. In 2018, he
228. Fan, X., et al.: Modality-transfer generative adversarial network received the M1 Master’s degree
and dual-level unified latent representation for visible thermal in Software Engineering from ISIMM. He is currently a PhD student
at the National School of Engineering of Monastir (ENIM), and a


researcher in the Electronics and Micro-electronics Laboratory (EµE) at the Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia. His research focuses on Artificial Intelligence, Machine Learning, Deep Learning, Multimodal and Hybrid Learning, Intelligent Systems, and so on.

Raja Knani obtained a Master's degree in Micro and Nanoelectronics from the FSM, University of Monastir, Monastir, Tunisia, in 2014. She is currently a PhD student and a researcher in the Electronics and Microelectronics Laboratory (EµE) at the FSM, University of Monastir, Monastir, Tunisia. She is interested particularly in Artificial Intelligence, Human-computer interaction, Gesture recognition and tracking, and so on.

Fayçal Hamdaoui received the Electrical Engineering degree from the National School of Engineering of Monastir (ENIM), University of Monastir, Tunisia, in 2010. In July 2011, he graduated with a Master diploma and in 2015 with a PhD degree, both in Electrical Engineering and both from ENIM. He is currently an Associate Professor at ENIM and a researcher in the Laboratory of Control, Electrical Systems and Environment (LASEE) at ENIM. His research interests are the use of Artificial Intelligence (Deep Learning and Machine Learning), soft computing for image and video processing, embedded systems, and SoC and SoPC programming.

Abdellatif Mtibaa is currently full Professor in Micro-Electronics, Hardware Design and Embedded Systems with the Electrical Department at the National School of Engineering of Monastir, and Head of the Circuits Systems Reconfigurable-ENIM-Group at the Electronic and Microelectronic Laboratory. He holds a Diploma in Electrical Engineering (1985) and received his PhD degree in Electrical Engineering in 2000. His current research interests include System on Programmable Chip, high-level synthesis, rapid prototyping, and reconfigurable architectures for real-time multimedia applications. Dr. Abdellatif Mtibaa has authored/co-authored over 200 papers in international journals and conferences. He served on the technical program committees of several international conferences and also served as a co-organizer of several international conferences.

