Multi-Grained Spatio-temporal Modeling for
Lip-reading
Chenhao Wang
[email protected]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
Abstract
Lip-reading aims to recognize speech content from videos via visual analysis of
speakers’ lip movements. This is a challenging task due to the existence of homophemes
– words which involve identical or highly similar lip movements, as well as diverse lip
appearances and motion patterns among the speakers. To address these challenges, we
propose a novel lip-reading model which captures not only the nuance between words
but also styles of different speakers, by a multi-grained spatio-temporal modeling of the
speaking process. Specifically, we first extract both frame-level fine-grained features
and short-term medium-grained features by the visual front-end, which are then com-
bined to obtain discriminative representations for words with similar phonemes. Next, a
bidirectional ConvLSTM augmented with temporal attention aggregates spatio-temporal
information over the entire input sequence, which is expected to capture the coarse-grained
patterns of each word and to be robust to variations in speaker identity, lighting conditions,
and so on. By making full use of the information from different levels in a unified framework,
the model is not only able to distinguish words with similar pronunciations, but also becomes
robust to appearance changes. We evaluate our method on two challenging word-level
lip-reading benchmarks; the results demonstrate the effectiveness of the proposed method
and support the above claims.
1 Introduction
Lip-reading, the ability to understand speech using only visual information, is an attractive
but highly challenging skill. It plays a crucial role in human communication and speech
understanding, as highlighted by the McGurk effect. There are several valuable applications,
such as aids for hearing-impaired or speech-impaired persons, analysis of silent movies, and
liveness verification in video authentication systems. It is also an important complement to
acoustic speech recognition systems, especially in noisy environments. For these reasons,
and with the development of deep learning, which enables efficient feature learning and
extraction, lip-reading has been receiving more and more attention in recent years.
A typical lip-reading framework consists of two steps: analyzing the motion information
in the image sequence, and converting that information into words or sentences. One common
challenge in this process is the variety of imaging conditions, such as poor lighting, strong
shadows, motion blur, low resolution, foreshortening, etc. More importantly, there is a fun-
damental limitation on performance due to homophemes: words or phrases that sound
different but involve the same or very similar movements of the speaker's lips.
For example, the phonemes "p" and "b" in English are visually identical; the words "pack"
and "back" are homophemes that can hardly be distinguished through lip-reading without
additional context information.
Motivated by these problems, we hope to build a model which utilizes both fine-grained
and coarse-grained spatio-temporal features to enhance the model’s discriminative power
and robustness. Specifically, we propose a multi-grained spatio-temporal network for lip-
reading. The front-end network uses a spatio-temporal ConvNet and a spatial-only ConvNet
in parallel, which extract medium-grained short-term and fine-grained, per-time-step features
respectively. In order to fuse these features more effectively, we introduce a spatial attention
mask to learn an adaptive, position-wise feature fusion strategy. A two-layer bidirectional
ConvLSTM augmented with (forward) input attention is used as the back-end to generate
the coarse-grained long-term spatio-temporal features.
In summary, we make three contributions. Firstly, we propose a novel multi-grained
spatio-temporal network to solve the lip-reading problem. Secondly, instead of simple con-
catenation, we fuse the information of different granularity with a learnable spatial attention
mechanism. Finally, we apply ConvLSTM to the lip-reading task for the first time. We report
the word classification results on two challenging lip-reading datasets, LRW and LRW-1000.
2 Related Work
In this section, we briefly summarize previous related work about lip-reading and ConvL-
STMs.
Lip reading. Research on lip-reading has a long history. Most early methods are based on
carefully hand-engineered features. A classical family of methods uses Hidden Markov
Models (HMM) to model the temporal structure within the extracted frame-wise features
[3, 4, 14]. Other well-known features include the Discrete Cosine Transform (DCT) [13],
Active Appearance Model (AAM), Motion History Image (MHI) [8], Local Binary Pattern
(LBP) [26], and vertical optical flow based features [15]. With the rapid development of
deep learning technologies and the appearance of large-scale lip-reading databases [5, 7, 23],
researchers have started to use convolutional neural networks to extract the features of each
frame and also use recurrent units for holistic temporal modeling [1, 11, 20]. In 2016, [5]
proposed the first large-scale word-level lip-reading database together with several end-to-
end lip-reading models. Since then, more and more works have performed end-to-end
recognition with the help of deep neural networks (DNNs).
According to the design of the front-end network, these modern lip-reading methods
can be roughly divided into three categories: (a) fully 2D CNN based, which build on the
success of 2D ConvNets in image representation learning; (b) fully 3D CNN based, which
is inspired by the success of 3D ConvNets in action recognition, among which LipNet[2]
is a representative work that yields good results on the GRID audiovisual corpus; and (c)
mixture of 2D and 3D convolutions, which inherit the merits of both (a) and (b) by capturing
the temporal dynamics in a sequence and extracting discriminative features in the spatial
domain simultaneously. Recently, methods of type (c) have become dominant in lip-reading
due to their excellent performance. For example, in 2018, [12] attained 83.0% word accuracy
on the LRW dataset with a type (c) architecture, achieving a new state-of-the-art result.
However, that method simply stacks 3D and 2D convolutional layers, which may not
fully unleash the power of the two components. We propose a new way to exploit the
respective advantages of 3D and 2D ConvNets, by using them as two separate branches
and fusing their features adaptively, similar to the popular two-stream architecture for action
recognition [17].
[Figure 1 diagram: a fine-grained ResNet-34 branch and a medium-grained Dense3D-52 branch (feature maps 24×24 → 12×12 → 6×6 → 3×3), fused as T*M + S*(1−M) via a learned mask and passed to a 2-layer Bi-ConvLSTM for classification.]
Figure 1: The architecture of the proposed framework, which consists of a spatio-temporal
convolution module followed by a two-branch structure and a two-layer Bi-ConvLSTM with
forward input attention. Finally, a fully-connected layer is used to obtain the prediction
results.
LSTM and ConvLSTM. For general-purpose sequence modeling, LSTM [9] as a special
RNN structure has been proven stable and powerful in modeling long-range dependencies.
LSTMs often lead to better performance where temporal modeling capacity is required, and
are thus widely used in NLP, video prediction, lip-reading, and so on. A common practice
of using LSTMs in video recognition is to employ a fully-connected layer before the LSTM.
Although this FC-LSTM layer has been proven powerful for temporal modeling, it loses too
much information about the spatial correlation in the data. To address this, Shi et al. proposed
ConvLSTM[16], which is based on the LSTM design but considers both temporal and spatial
correlation in a video sequence with additional convolution operations, effectively fusing
temporal and spatial features. It has been successfully applied to action recognition [10,
21], gesture recognition [24, 27] and other fields [19]. Additionally, a new spatio-temporal
LSTM unit [22] was recently designed to memorize both temporal and spatial representations,
obtaining better performance than the conventional LSTM.
In this paper, we introduce ConvLSTM to the lip-reading task for the first time. When
aggregating information from the whole lip sequence, its ability to capture both long and
short term temporal dependencies while considering the spatial relationships in feature maps
makes it ideal for accommodating differences across speakers. We also augment the
ConvLSTM with an attention mechanism on the inputs, which will be described in detail in
Sec. 3.3.1.
3 Multi-Grained Spatio-temporal Modeling For
Lip-reading
Given a sequence of the mouth region corresponding to an utterance, our goal is to capture
both the fine-grained patterns that can distinguish one word from another, and the coarse-
grained patterns describing mouth shapes and motion information that are ideally invariant
to the varied styles of different speakers.
As mentioned earlier, simply cascading 2D and 3D convolutions may not be optimal
for lip-reading, since some movements may be very weak and thereby lost during pooling.
Therefore, we split the learning process into three sub-networks that complement
each other. In this section, we present the proposed multi-grained spatio-temporal frame-
work which learns the latent spatio-temporal patterns of different words from three different
spatio-temporal scales for the lip-reading task. As shown in Fig. 1, the network consists of
a 2D ResNet-34 based fine-grained module, a 52-layer DenseNet-3D medium-grained mod-
ule, and a coarse-grained module that adaptively fuses and aggregates the features from these
two modules. By jointly learning the latent patterns at multiple spatio-temporal scales and
efficiently fusing this information, we achieve much better performance. We now give a
detailed description of the architecture.
3.1 Fine-grained Module
Words with similar mouth movements are fairly ubiquitous. However, when we compare the
sequences side by side and examine each time-step, very often we can still observe slight
differences in appearance. This observation leads to the idea that enhancing spatial repre-
sentations alone to some extent may improve the discriminative power of the model. As
an effective tool to capture the salient features in images, 2D convolutional operations have
been proven successful in several related tasks, such as image recognition, object detection,
segmentation, and so on. We introduce cascaded 2D convolutional operations here to extract
the salient features in each frame. Unlike their traditional role in other methods, the 2D
convolutions introduced here should not merely function as a feature extractor, but should
highlight salient appearance cues in each frame, which will
eventually help enhance the fine-grained patterns for subtle differences among words. In our
model, the 2D ConvNet is a 34-layer ResNet.
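To make this concrete, the following is a minimal PyTorch sketch of how the fine-grained branch can apply a 2D ResNet to every frame by folding the time axis into the batch axis. The input shape (64-channel 24 × 24 maps, as would come from the front-end stem of Sec. 4.2) and the reuse of torchvision's ResNet-34 residual stages are assumptions for illustration; the paper only specifies a 34-layer ResNet.

```python
import torch
import torchvision

class FineGrainedBranch(torch.nn.Module):
    """Hypothetical per-frame 2D feature extractor (fine-grained branch)."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet34()
        # Keep only the residual stages (64 -> 512 channels, 24x24 -> 3x3);
        # the classifier head is dropped so spatial structure is kept for fusion.
        self.stages = torch.nn.Sequential(
            resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)

    def forward(self, x):
        # x: (B, 64, T, 24, 24) -- fold time into the batch axis so the 2D
        # ConvNet processes each frame independently (per-time-step features).
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.stages(x)                                     # (B*T, 512, 3, 3)
        return f.reshape(b, t, *f.shape[1:]).transpose(1, 2)   # (B, 512, T, 3, 3)
```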
3.2 Medium-grained Module
3D convolutions have become widely adopted in video recognition and have proven capable
of capturing short-term spatio-temporal patterns. Because they account for motion
information, they are expected to be more robust than 2D convolutions, which produce
frame-wise features. Moreover, while there are words with subtle differences that require fine-
grained information, most words are still able to be distinguished through the ways they are
pronounced, albeit somewhat speaker-dependent. This requires the model to be capable of
modeling medium-grained, short-term dynamics, which is a job suitable for 3D convolutions.
In our model, the medium-grained 3D ConvNet is a 52-layer 3D-DenseNet [23].
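As a rough illustration of what one layer of this branch computes, here is a minimal sketch of a 3D dense layer of the kind stacked in a 3D-DenseNet. The growth rate and kernel size are assumptions; the actual medium-grained branch follows the 52-layer 3D-DenseNet of [23].

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """Illustrative 3D dense layer: short-term spatio-temporal feature extraction."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        # A 3x3x3 kernel aggregates a 3-frame temporal window together with a
        # local spatial neighbourhood, i.e. medium-grained short-term dynamics.
        self.conv = nn.Conv3d(in_channels, growth_rate, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, x):
        # x: (B, C, T, H, W); new features are concatenated with the input,
        # following the DenseNet pattern of feature reuse.
        y = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, y], dim=1)
```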
3.3 Coarse-grained Module
The coarse-grained module begins by fusing the features from the previous two modules.
Different from most previous methods which directly cascade 2D and 3D convolutions, we
introduce an attention mechanism to combine the fine-grained features and the medium-
grained features into a primary representation. As shown in Fig. 1, the attention mask is
implemented with a 1 × 1 × 1 convolution, which adaptively adjusts the fusion weights at
each spatial location. This spatial attention mask and the final fused features F are obtained
by

S = 2DCNN(X),  T = 3DCNN(X),
mask = σ(WT),                                              (1)
F = T ⊙ mask + S ⊙ (1 − mask),

where X denotes the input feature maps, S and T are the respective outputs of the two branches, W is a learned parameter (the 1 × 1 × 1 convolution), σ is the sigmoid function, and ⊙ denotes element-wise multiplication.

Figure 2: (a) The forward input attention augmented Bi-ConvLSTM. The attention-augmented ConvLSTM layer CF-ATT processes the inputs in the forward direction attentively, while the plain ConvLSTM layer CLSTMb processes the inputs in the reverse order. (b) The ConvLSTM forward input attention unit, where ⊙, as before, denotes element-wise multiplication.
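A minimal sketch of this fusion step, directly following Eq. (1), could look as follows; the channel count of the mask convolution is an assumption for illustration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Position-wise fusion of the 2D (S) and 3D (T) branch features, Eq. (1)."""
    def __init__(self, channels=512):
        super().__init__()
        # 1x1x1 convolution implementing the learned weight W of the mask.
        self.mask_conv = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, s, t):
        # s, t: (B, C, T', H', W') features from the 2D and 3D branches.
        mask = torch.sigmoid(self.mask_conv(t))   # mask = sigma(W T)
        return t * mask + s * (1.0 - mask)        # F = T * mask + S * (1 - mask), element-wise
```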
Every person has his or her own speaking style and habits, such as nodding or turning
his or her head while speaking. Meanwhile, owing to the appearance factors such as lighting
conditions, speaker’s pose, make-up, accent, age and so on, the image sequences of even the
same word can exhibit several different styles. Considering the diversity of these appearance
factors, a robust lip-reading model has to capture the global latent patterns of the sequence
at a high level, highlighting the representative patterns while covering the slight style variations in the
sequence. FC-LSTMs are capable of modeling long-range temporal dependencies and have
a powerful gating mechanism. But the major drawback of FC-LSTM in handling spatio-
temporal data is its usage of full connections in input-to-state and state-to-state transitions
in which no spatial correlation is encoded. To overcome this problem, we use a two-layer
bidirectional ConvLSTM module augmented with forward input attention to model the
global latent patterns in the whole sequence based on the fused initial representations.
It is able to cover the various styles and speech modes in the speaking process, which
will be demonstrated in the experiment section.
3.3.1 Bi-ConvLSTM with Forward Input Attention
Compared with the conventional LSTM, the ConvLSTM proposed in [16], as a convolutional
counterpart of the fully connected LSTM, introduces the convolution operation into
input-to-state and state-to-state transitions. ConvLSTM is capable of modeling 2D spatio-
temporal image sequences by explicitly encoding their 2D spatial structures into the temporal
domain. ConvLSTM models temporal dependency while preserving spatial information.
Thus it has been widely applied to many spatio-temporal tasks. Similar to FC-LSTM, a
ConvLSTM unit consists of a memory cell ct , an input gate it , an output gate ot and a forget
gate ft . The main equations of ConvLSTM are as follows:
it = σ(Wxi ∗ Xt + Whi ∗ Ht−1 + Wci ◦ Ct−1 + bi)
ft = σ(Wxf ∗ Xt + Whf ∗ Ht−1 + Wcf ◦ Ct−1 + bf)
Ct = ft ◦ Ct−1 + it ◦ tanh(Wxc ∗ Xt + Whc ∗ Ht−1 + bc)          (2)
ot = σ(Wxo ∗ Xt + Who ∗ Ht−1 + Wco ◦ Ct + bo)
Ht = ot ◦ tanh(Ct)

where ‘∗’ denotes the convolution operator and ‘◦’ denotes the Hadamard product.
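For reference, the following is a minimal ConvLSTM cell implementing Eq. (2), omitting the peephole terms (Wci, Wcf, Wco) for brevity; the kernel size and hidden width are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Simplified ConvLSTM cell (Eq. (2), without peephole connections)."""
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        # x: (B, C_in, H, W); state = (H_{t-1}, C_{t-1}), each (B, C_hid, H, W).
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.split(z, self.hid_ch, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(g)   # cell state update
        h = o * torch.tanh(c)                # hidden state (output)
        return h, c
```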
However, the structures of existing RNN neurons mainly focus on controlling the contri-
butions of current and historical information but do not explore the difference in importance
among different time-steps [25]. So we introduce an attention mechanism to the forward
direction of the bidirectional ConvLSTM, as shown in Fig. 2. The input attention can deter-
mine the relative importance of different frames and assign a suitable weight to each time-
step. This augmented Bi-ConvLSTM can not only learn spatio-temporal features but also
select important frames. We only use attention on the inputs to Bi-ConvLSTM’s forward
direction:
at = σ(Wxa Xf;t + Wha hf;t−1)                              (3)

where the current forward input Xf;t and the previous hidden state hf;t−1 are used to determine the importance level of each frame. The attention response then modulates the forward input as

X̃f;t = at ◦ Xf;t                                           (4)

The recursive computations of the other units in the RNN block are then based on the attention-weighted input X̃f;t instead of the original input Xf;t.
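A minimal sketch of this forward input attention is given below; implementing Wxa and Wha as 1 × 1 convolutions is an assumption for illustration, and the gated input would then be fed to the forward ConvLSTM cell in place of Xf;t.

```python
import torch
import torch.nn as nn

class ForwardInputAttention(nn.Module):
    """Input gate of Eq. (3)-(4): re-weights each forward input frame."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.w_xa = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.w_ha = nn.Conv2d(hid_ch, in_ch, kernel_size=1)

    def forward(self, x_t, h_prev):
        # x_t: current forward input; h_prev: previous forward hidden state.
        a_t = torch.sigmoid(self.w_xa(x_t) + self.w_ha(h_prev))  # Eq. (3)
        return a_t * x_t                                          # Eq. (4)
```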
4 Experiments
In this section, we present the results of our experiments on the word-level LRW and LRW-
1000 datasets. We give a brief description of the two datasets and our implementation, and
finally a detailed analysis of our experimental results.
4.1 Datasets
Lip Reading in the Wild (LRW) [5]. The LRW database consists of short segments (1.16
seconds) from BBC programs, mainly news and talk shows. It is a very challenging dataset
since it contains more than 1000 speakers and large variations in head pose and illumination.
For each target word, it has a training set of 1000 segments, a validation and an evaluation
set of 50 segments each. The total duration of this corpus is 173 hours. The corpus with 500
words is also much larger than previous lip-reading databases used for word recognition.
LRW-1000 [23]. LRW-1000 is a challenging Mandarin lip-reading dataset due to its large
variations in scale, resolution, background clutter, and speaker attributes. The speakers are
mostly interviewers, broadcasters, program guests, and so on. The dataset consists of 1000
Figure 3: Attention masks for the words 'ABOUT', 'BETTER', 'CAMERON', and 'HEALTH'. The attention mask automatically adjusts the position-specific fusion weights and generates an initial representation. For clarity, only the frames corresponding to the target word are shown.
word classes and has 718,018 samples, totaling 57 hours. The minimum and maximum
lengths of the samples are about 0.01 seconds and 2.25 seconds respectively, with an average
of about 0.3 seconds for each sample.
4.2 Implementation Details
Our models are implemented with PyTorch and trained on servers with three NVIDIA Titan
X GPUs, each with 12GB memory. In our experiments, for the LRW dataset, the mouth
regions of interest (ROIs) are already centered, and a fixed bounding box of 96 by 96 is
used for all videos. All images are converted to grayscale, and then cropped to 88 × 88.
As an effective data augmentation step, we also randomly flip all the frames in the same
sequence horizontally. For the two-branch models, we first train each individual branch to
convergence, and then fine-tune the model end-to-end. We use the Adam optimizer with an
initial learning rate of 0.0001 and a momentum of 0.9. During fine-tuning on the (RGB)
LRW-1000 dataset, the maximum number of frames is set to 30.
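A hedged sketch of the preprocessing and optimizer setup described above is shown below; the exact transform pipeline, crop policy, and flip probability are assumptions, while the grayscale conversion, the 88 × 88 crop, the sequence-level flipping, and Adam with learning rate 1e-4 follow the text.

```python
import torch
import torchvision.transforms as T

# Per-frame preprocessing (applied to PIL frames of the 96x96 mouth ROI).
frame_transform = T.Compose([
    T.Grayscale(num_output_channels=1),
    T.CenterCrop(88),
    T.ToTensor(),
])

def augment_sequence(frames, flip_prob=0.5):
    # frames: (T, 1, 88, 88). Flip every frame of the sequence together so the
    # augmentation is consistent across time.
    if torch.rand(1).item() < flip_prob:
        frames = torch.flip(frames, dims=[-1])
    return frames

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # model: the Sec. 3 network
```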
The first convolutional layer has a kernel of size 64 × 5 × 7 × 7 (channels / time / height / width), while max pooling has a kernel of size 1 × 3 × 3. We then reshape the feature map to 24 × 24. In our model, the two branches are constructed by a 34-layer ResNet and a 52-layer 3D-DenseNet [23], respectively. We use a 1 × 1 × 1 3D convolution to reduce the dimensionality. The resulting 512 × 29 × 3 × 3 fused feature is fed to a two-layer Bi-ConvLSTM with forward input attention. The Bi-ConvLSTM has a kernel size of 3 × 3. The output layer is a fully-connected layer that produces the prediction results. We average the frame-wise predictions to obtain the final results. The two blocks of layers transform the spatial size as 88 × 88 → 22 × 22 → (upsample) 24 × 24 → 12 × 12 → 6 × 6 → 3 × 3.
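The shared front-end stem can be sketched as follows; the strides, padding, and normalization layers are assumptions chosen so that an 88 × 88 input yields the 22 × 22 maps mentioned above (which are then upsampled to 24 × 24 before the two branches).

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # 64-channel 5x7x7 spatio-temporal convolution (channels / time / height / width).
    nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    # 1x3x3 max pooling over the spatial dimensions only.
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

x = torch.randn(2, 1, 29, 88, 88)   # (batch, channel, frames, height, width)
print(stem(x).shape)                 # torch.Size([2, 64, 29, 22, 22])
```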
Table 1: Classification accuracy of the two-branch network on the LRW database and LRW-
1000 database. ‘ResNet-34’ uses the 34-layer ResNet frontend proposed in [12] (the results
are our reproduction). ‘DenseNet-3D’ uses a 52-layer DenseNet-3D front-end proposed in
[23].
Method LRW LRW-1000
DenseNet-3D + Bi-GRU 81.70% 34.76% [23]
ResNet-34 + Bi-GRU 81.70% 38.19% [23]
Two-branch + Bi-GRU 82.98% 36.48%
Two-branch + Bi-ConvLSTM 83.15% 36.12%
Proposed Model 83.34% 36.91%
4.3 Results
Performance is reported in terms of word-level classification accuracy on the LRW and
LRW-1000 datasets, respectively. We set up several control experiments, including the 2D
CNN branch alone, the 3D CNN branch alone, two-branch / Bi-GRU, two-branch / Bi-ConvLSTM,
and our full model. Results on the two datasets are provided in Table 1. On the LRW dataset,
our model shows marginally better results, which we believe is because the model can learn
multi-grained spatio-temporal features.
From Table 1 we find that the ResNet-34 model and the DenseNet-3D model perform
equally well on the LRW dataset, both achieving an accuracy of 81.70%. However, the
recognition results of the two structures differ on LRW-1000, where ResNet-34 / Bi-GRU
is better than 3D-DenseNet / Bi-GRU. A possible reason is that the 2D CNN can better
capture the fine-grained features at each time-step to discriminate words. On top of the
two branches, we introduce the soft-attention-based fusion mechanism to learn adaptive
weights that keep the most discriminative information from both branches, which indeed
leads to more powerful spatio-temporal features. On the LRW dataset, compared with our
ResNet-34 + Bi-GRU baseline, this gives an increase of 1.28%; the two-branch performance
is also higher than the DenseNet-3D / Bi-GRU result. The attention
masks are visualized in Fig. 3. From these examples, we find that the attention mask learns
the weights well: it concentrates on the lip area, so the learning process automatically adjusts
the fusion weights to generate the early-stage representation. Therefore the two-branch /
Bi-GRU architecture obtains more robust results.
On the LRW database, comparing two-branch / Bi-GRU with two-branch / Bi-ConvLSTM,
the results show that the bidirectional ConvLSTM module improves the performance over
two-branch / Bi-GRU. This indicates not only that temporal information has been learned,
but also highlights the importance of spatial information for the lip-reading task.
Clips from the LRW dataset include surrounding context and may therefore introduce
redundant information to the network. From Table 1 we find that the Bi-ConvLSTM with
forward input attention works better, likely because it can weigh the contributions of different
frames according to their importance and identify the most informative ones. This shows the
effectiveness of our forward-input-attention Bi-ConvLSTM, and our model accordingly
outperforms two-branch / Bi-ConvLSTM.
Table 2: Comparison with the state-of-the-art on the test sets of LRW and LRW-1000. '(reproduced)' denotes the result of our reproduction.

(a) State-of-the-art on LRW
Method                          Accuracy
Chung18 [6]                     71.50%
Chung17 [7]                     76.20%
Petridis18 (end-to-end) [12]    82.00%
Petridis18 (reproduced)         81.70%
Stafylakis17 [18]               83.00%
Proposed Model                  83.34%

(b) State-of-the-art on LRW-1000
Method                  Accuracy
LSTM-5 [23]             25.76%
D3D [23]                34.76%
3D+2D [23]              38.19%
3D+2D (reproduced)      33.78%
Proposed Model          36.91%
4.4 Comparison with the state-of-the-art
Table 2 summarizes the performance of state-of-the-art networks on LRW and LRW-1000.
Our network achieves an absolute increase of about 1.6% over our reproduction of the
baseline ResNet-34 model in [12] on the LRW database. From the above results we see
that the mixed 3D-2D architecture still shows very strong performance. However, the results
also show the importance of fine-grained spatio-temporal features in the lip-reading task. The results also
confirm that it is reasonable to use the attention mask to merge the fine-grained and medium-
grained features, and replace FC-LSTM with ConvLSTM. Our model takes full advantage of
the 3D ConvNet, the 2D ConvNet and the ConvLSTM. The proposed attention-augmented
variant of ConvLSTM further enhances its ability for spatio-temporal feature fusion. The
forward input attention in Bi-ConvLSTM not only learns spatial and temporal features but
also explores the different importance levels of different frames. However, our reproduction
of the 3D+2D model achieves lower accuracy on LRW-1000 than reported in [23]. The
reason may be that we do not use the fully-connected layers in the model and do not adopt
the three-stage training. Overall, the best recognition results can be obtained by making full
use of the intrinsic advantages of the different networks.
5 Conclusion
We have proposed a novel two-branch model with forward input attention augmented Bi-
ConvLSTM for lip-reading. The model utilizes both 2D and 3D ConvNets to extract both
frame-wise spatial features and short-term spatio-temporal features, and then fuses the fea-
tures with an adaptive mask to obtain strong, multi-grained features. Finally, we use a Bi-
ConvLSTM augmented with forward input attention to model long-term spatio-temporal
information of the sequence. Using this architecture, we demonstrate state-of-the-art per-
formance on two challenging lip-reading datasets. We believe the model has great potential
beyond visual speech recognition. How to better utilize spatial information in temporal
sequence modeling to obtain more fine-grained spatio-temporal features is also a worthwhile
research direction. In the future, we will continue to simplify the front-end and extract multi-grained
features with a more lightweight structure.
References
[1] Ibrahim Almajai, Stephen Cox, Richard Harvey, and Yuxuan Lan. Improved speaker
independent lipreading using speaker adaptive training and deep neural networks. In
IEEE International Conference on Acoustics, 2016.
[2] Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Lip-
net: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599, 2016.
[3] Chandramouli Chandrasekaran, Andrea Trubanova, Sébastien Stillittano, Alice
Caplier, and Asif A Ghazanfar. The natural statistics of audiovisual speech. PLoS
computational biology, 5(7):e1000436, 2009.
[4] Greg I Chiou and Jenq-Neng Hwang. Lipreading from color video. IEEE Transactions
on Image Processing, 6(8):1192–1195, 1997.
[5] Joon Son Chung and Andrew Zisserman. Lip reading in the wild. In Asian Conference
on Computer Vision, pages 87–103. Springer, 2016.
[6] Joon Son Chung and Andrew Zisserman. Learning to lip read words by watching
videos. Computer Vision and Image Understanding, 2018.
[7] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading
sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 3444–3453. IEEE, 2017.
[8] Paul Duchnowski, Martin Hunke, Dietrich Busching, Uwe Meier, and Alex Waibel.
Toward movement-invariant automatic lip-reading and speech recognition. In Acous-
tics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference
on, volume 1, pages 109–112. IEEE, 1995.
[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997.
[10] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, Mihir Jain, and Cees GM Snoek.
Videolstm convolves, attends and flows for action recognition. Computer Vision and
Image Understanding, 166:41–50, 2018.
[11] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G Okuno, and Tetsuya
Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42
(4):722–737, 2015.
Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tz-
imiropoulos, and Maja Pantic. End-to-end audiovisual speech recognition. In 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 6548–6552. IEEE, 2018.
[13] Gerasimos Potamianos, Hans Peter Graf, and Eric Cosatto. An image transform ap-
proach for hmm based automatic lipreading. In Image Processing, 1998. ICIP 98.
Proceedings. 1998 International Conference on, pages 173–177. IEEE, 1998.
[14] Gerasimos Potamianos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, and An-
drew W Senior. Recent advances in the automatic recognition of audiovisual speech.
Proceedings of the IEEE, 91(9):1306–1326, 2003.
[15] Ayaz A Shaikh, Dinesh K Kumar, Wai C Yau, MZ Che Azemin, and Jayavardhana
Gubbi. Lip reading using optical flow and support vector machines. In Image and
Signal Processing (CISP), 2010 3rd International Congress on, volume 1, pages 327–
330. IEEE, 2010.
Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and
Wang-chun Woo. Convolutional lstm network: A machine learning approach for pre-
cipitation nowcasting. In International Conference on Neural Information Processing
Systems, 2015.
[17] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for ac-
tion recognition in videos. In Advances in neural information processing systems, pages
568–576, 2014.
[18] Themos Stafylakis and Georgios Tzimiropoulos. Combining residual networks with
lstms for lipreading. arXiv preprint arXiv:1703.04105, 2017.
[19] Swathikiran Sudhakaran and Oswald Lanz. Convolutional long short-term memory
networks for recognizing first person interactions. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 2339–2346, 2017.
[20] Kwanchiva Thangthai, Richard W Harvey, Stephen J Cox, and Barry-John Theobald.
Improving lip-reading performance for robust audiovisual speech recognition using
dnns. In AVSP, pages 127–131, 2015.
[21] Lei Wang, Yangyang Xu, Jun Cheng, Haiying Xia, Jianqin Yin, and Jiaji Wu. Hu-
man action recognition by learning spatio-temporal features with deep neural networks.
IEEE Access, 6:17913–17922, 2018.
[22] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and S Yu Philip. Pre-
drnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In
Advances in Neural Information Processing Systems, pages 879–888, 2017.
[23] Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun
Xiao, Keyu Long, Shiguang Shan, and Xilin Chen. Lrw-1000: A naturally-distributed
large-scale benchmark for lip reading in the wild. In 2019 14th IEEE International
Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE,
2019.
[24] Liang Zhang, Guangming Zhu, Peiyi Shen, Juan Song, Syed Afaq Shah, and Mo-
hammed Bennamoun. Learning spatiotemporal features using 3dcnn and convolutional
lstm for gesture recognition. In Proceedings of the IEEE International Conference on
Computer Vision, pages 3120–3128, 2017.
[25] Pengfei Zhang, Jianru Xue, Cuiling Lan, Wenjun Zeng, Zhanning Gao, and Nanning
Zheng. Adding attentiveness to the neurons in recurrent neural networks. In Proceed-
ings of the European Conference on Computer Vision (ECCV), pages 135–151, 2018.
[26] Guoying Zhao, Mark Barnard, and Matti Pietikainen. Lipreading with local spatiotem-
poral descriptors. IEEE Transactions on Multimedia, 11(7):1254–1265, 2009.
[27] Guangming Zhu, Liang Zhang, Peiyi Shen, and Juan Song. Multimodal gesture recog-
nition using 3-d convolution and convolutional lstm. IEEE Access, 5:4517–4524, 2017.