
. Invited Review .

Neural Machine Translation: Challenges, Progress and Future


Jiajun Zhang1,2* & Chengqing Zong1,2,3*
1 National Laboratory of Pattern Recognition, CASIA, Beijing, China;
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China;
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China

arXiv:2004.05809v1 [cs.CL] 13 Apr 2020

Machine translation (MT) is a technique that leverages computers to translate human languages automatically. Nowadays, neural
machine translation (NMT), which models the direct mapping between source and target languages with deep neural networks, has
achieved a big breakthrough in translation performance and become the de facto paradigm of MT. This article reviews the NMT
framework, discusses the challenges in NMT, introduces some exciting recent progress and finally looks forward to some potential
future research trends. In addition, we maintain the state-of-the-art methods for various NMT tasks at
https://github.com/ZNLP/SOTA-MT.

neural machine translation, Transformer, multimodal translation, low-resource translation, document translation

1 Introduction

The concept of machine translation (MT) was formally proposed in 1949 by Warren Weaver [138], who believed it possible to use modern computers to automatically translate human languages. Since then, machine translation has become one of the most challenging tasks in natural language processing and artificial intelligence, and generations of researchers have dedicated themselves to realizing this dream.

From the viewpoint of methodology, approaches to MT mainly fall into two categories: rule-based methods and data-driven approaches. Rule-based methods were dominant and preferred before the 2000s. In these methods, bilingual linguistic experts are responsible for designing specific rules for source-language analysis, source-to-target transformation and target-language generation. Since this process is highly subjective and labor intensive, rule-based systems are difficult to scale and are fragile when the rules cannot cover unseen language phenomena.

In contrast, the data-driven approach aims at teaching computers to learn how to translate from large numbers of human-translated parallel sentence pairs (a parallel corpus). The study of data-driven approaches has experienced three periods. In the middle of the 1980s, [93] proposed example-based MT, which translates a sentence by retrieving similar examples from the human-translated sentence pairs.

From the early 1990s, statistical machine translation (SMT) was developed, in which word- or phrase-level translation rules are automatically learned from parallel corpora using probabilistic models [16, 27, 70]. Thanks to the availability of more and more parallel corpora, sophisticated probabilistic models such as the noisy channel model and the log-linear model achieved better and better translation performance. Many companies (e.g. Google, Microsoft and Baidu) developed online SMT systems which greatly benefit users. However, due to the complicated integration of multiple manually designed components such as the translation model, language model and reordering model, SMT cannot make full use of large-scale parallel corpora and its translation quality is far from satisfactory.

No breakthrough was achieved for more than 10 years until the introduction of deep learning into MT. Since 2014, neural machine translation (NMT) based on deep neural networks has developed quickly [8, 45, 116, 122]. In 2016, through extensive experiments on various language pairs, [65, 143] demonstrated that NMT had made a big breakthrough, obtaining remarkable improvements over SMT and even approaching human-level translation quality [55]. This article gives a review of the NMT framework, discusses some challenging research tasks in NMT, introduces some exciting progress and forecasts several future research topics.

The remainder of this article is organized as follows: Sec. 2 first introduces the background and the state-of-the-art paradigm of NMT. In Sec. 3 we discuss the key challenging research tasks in NMT. From Sec. 4 to Sec. 7, recent progress is presented concerning each challenge. Sec. 8 discusses the current state of NMT compared to expert translators and finally looks forward to some potential research trends in the future.

Figure 1  The encoder-decoder framework for neural machine translation. The encoder encodes the input sequence x0 x1 x2 x3 x4 x5 into distributed semantic representations, based on which the decoder produces the output sequence y0 y1 y2 y3 y4.
 
2 Neural machine translation

2.1 Encoder-Decoder Framework

Neural machine translation is an end-to-end model following an encoder-decoder framework that usually includes two neural networks, for the encoder and the decoder respectively [8, 45, 116, 122]. As shown in Fig. 1, the encoder network first maps each input token of the source-language sentence into a low-dimensional real-valued vector (aka word embedding) and then encodes the sequence of vectors into distributed semantic representations, from which the decoder network generates the target-language sentence token by token 1) from left to right.

From the probabilistic perspective, NMT models the conditional probability of the target-language sentence y = y_0, ..., y_i, ..., y_I given the source-language sentence x = x_0, ..., x_j, ..., x_J as a product of token-level translation probabilities:

P(y|x, \theta) = \prod_{i=0}^{I} P(y_i \,|\, x, y_{<i}, \theta)    (1)

where y_{<i} = y_0, ..., y_{i-1} is the partial translation generated so far. x_0, y_0 and x_J, y_I are often special symbols <s> and </s> indicating the start and end of a sentence respectively.

The token-level translation probability can be defined as follows:

P(y_i \,|\, x, y_{<i}, \theta) = \frac{\exp\big(g(x, y_{<i}, y_i, \theta)\big)}{\sum_{y' \in V} \exp\big(g(x, y_{<i}, y', \theta)\big)}    (2)

in which V denotes the vocabulary of the target language and g(·) is a non-linear function that calculates a real-valued score for the prediction y_i conditioned on the input x, the partial translation y_{<i} and the model parameters θ. The non-linear function g(·) is realized through the encoder and decoder networks. The input sentence x is abstracted into hidden semantic representations h through multiple encoder layers; y_{<i} is summarized into the target-side history context representation z by the decoder network, which further combines h and z using an attention mechanism to predict the score of y_i.

The network parameters θ can be optimized to maximize the log-likelihood over the bilingual training data D = {(x^{(m)}, y^{(m)})}_{m=1}^{M}:

\theta = \operatorname{argmax}_{\theta^*} \sum_{m=1}^{M} \log P(y^{(m)} \,|\, x^{(m)}, \theta^*)    (3)

Recent years have witnessed the fast development of encoder-decoder networks, from recurrent neural networks [8, 116] to convolutional neural networks [45] and then to the self-attention based Transformer [122]. At present, Transformer is the state of the art in terms of both quality and efficiency.
1) Currently, subword is the most popular translation token for NMT [109].
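To make Eqs. (1)-(3) concrete, the following minimal NumPy sketch factorizes a sentence probability into token-level softmax scores. The scoring function g below is a toy random bilinear form standing in for the real encoder-decoder network, and the five-token vocabulary is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<s>", "</s>", "a", "b", "c"]
V, D = len(VOCAB), 8

# Stand-in parameters theta: in a real NMT system g(.) is computed by the
# encoder-decoder network; here it is just a random bilinear score.
E = rng.normal(size=(V, D))          # token embeddings
W = rng.normal(size=(D, D))          # scoring matrix

def g(x_ids, y_prev_ids, y_id):
    """Toy non-linear score g(x, y_<i, y_i, theta)."""
    ctx = E[x_ids].mean(0) + (E[y_prev_ids].mean(0) if y_prev_ids else 0.0)
    return float(np.tanh(ctx) @ W @ E[y_id])

def token_prob(x_ids, y_prev_ids):
    """Eq. (2): softmax over the target vocabulary."""
    scores = np.array([g(x_ids, y_prev_ids, y) for y in range(V)])
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sentence_logprob(x_ids, y_ids):
    """Eq. (1): sum of token-level log-probabilities, left to right."""
    logp, prev = 0.0, [VOCAB.index("<s>")]
    for y in y_ids:
        logp += np.log(token_prob(x_ids, prev)[y])
        prev.append(y)
    return logp

x = [VOCAB.index(t) for t in ["<s>", "a", "b", "</s>"]]
y = [VOCAB.index(t) for t in ["a", "c", "</s>"]]
print("log P(y|x) =", sentence_logprob(x, y))
```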

2.2 Transformer

Figure 2  The Transformer architecture, in which the attention mechanism is the core in both the encoder and decoder networks. "Shifted right" means that the prediction of the previous time-step is shifted right as the input context to predict the next output token.

In Transformer 2), the encoder includes N identical layers and each layer is composed of two sub-layers: the self-attention sub-layer followed by the feed-forward sub-layer, as shown in the left part of Fig. 2. The self-attention sub-layer calculates the output representation of a token by attending to all the neighbors in the same layer, computing the correlation score between this token and all the neighbors, and finally linearly combining all the representations of the neighbors and itself. The output of the N-th encoder layer is the source-side semantic representation h. The decoder, as shown in the right part of Fig. 2, also consists of N identical layers. Each layer has three sub-layers. The first one is the masked self-attention mechanism that summarizes the partial prediction history. The second one is the encoder-decoder attention sub-layer determining the dynamic source-side contexts for the current prediction, and the third one is the feed-forward sub-layer. Residual connection and layer normalization are performed for each sub-layer in both the encoder and the decoder.

It is easy to notice that the attention mechanism is the key component. There are three kinds of attention mechanisms, including encoder self-attention, decoder masked self-attention and encoder-decoder attention. They can be formalized into the same formula:

\mathrm{Attention}(q, K, V) = \mathrm{softmax}\Big(\frac{qK^T}{\sqrt{d_k}}\Big) V    (4)

where q, K and V stand for a query, the key list and the value list respectively, and d_k is the dimension of the key.

For the encoder self-attention, the queries, keys and values are from the same layer. For example, consider calculating the output of the first encoder layer at the j-th position, and let x_j be the sum of the input token embedding and the positional embedding. The query is the vector x_j. The keys and values are the same, namely the embedding matrix x = [x_0 ... x_J]. Multi-head attention is then used to calculate attentions in different subspaces:

\mathrm{MultiHead}(q, K, V) = \mathrm{Concat}_i(\mathrm{head}_i)\, W^O, \quad \mathrm{head}_i = \mathrm{Attention}(qW_i^Q, KW_i^K, VW_i^V)    (5)

in which W_i^Q, W_i^K, W_i^V and W^O denote projection parameter matrices.

Using Equation 5 followed by residual connection, layer normalization and a feed-forward network, we obtain the representation of the second layer. After N layers, we obtain the input contexts C = [h_0, ..., h_J].

The decoder masked self-attention is similar to that of the encoder, except that the query at the i-th position can only attend to positions before i, since the predictions after the i-th position are not available in the auto-regressive left-to-right unidirectional inference:

z_i = \mathrm{Attention}(q_i, K_{\le i}, V_{\le i}) = \mathrm{softmax}\Big(\frac{q_i K_{\le i}^T}{\sqrt{d_k}}\Big) V_{\le i}    (6)

The encoder-decoder attention mechanism calculates the source-side dynamic context which is responsible for predicting the current target-language token. The query is the output of the masked self-attention sub-layer z_i. The keys and values are the same encoder contexts C. The residual connection, layer normalization and feed-forward sub-layer are then applied to yield the output of a whole layer. After N such layers, we obtain the final hidden state z_i. A softmax function is then employed to predict the output y_i, as shown in the upper right part of Fig. 2.
2) Model and codes can be found at https://github.com/tensorflow/tensor2tensor
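The attention formulas (4) and (5) can be sketched directly in NumPy. The shapes, the number of heads and the random parameter matrices below are illustrative assumptions; a real Transformer adds masking, residual connections, layer normalization and feed-forward sub-layers on top of this core.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq. (4): scaled dot-product attention, Q:(n,dk), K:(m,dk), V:(m,dv)."""
    dk = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """Eq. (5): project into h subspaces, attend, concatenate, project back."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
J, d_model, h = 6, 16, 4              # source length, model width, number of heads
d_k = d_model // h
x = rng.normal(size=(J, d_model))     # token + positional embeddings (Sec. 2.2)

WQ, WK, WV = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
WO = rng.normal(size=(h * d_k, d_model))

# Encoder self-attention: queries, keys and values all come from the same layer x.
out = multi_head(x, x, x, WQ, WK, WV, WO)
print(out.shape)  # (6, 16)
```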

3 Key challenging research tasks

Although Transformer has significantly advanced the development of neural machine translation, many challenges remain to be addressed. Obviously, designing a better NMT framework is the most important challenge. However, since the introduction of Transformer, almost no more effective NMT architecture has been proposed. [22] presented an alternative encoder-decoder framework, RNMT+, which combines the merits of RNN-based and Transformer-based models. [129, 150] investigated how to design much deeper Transformer models, and [78] presented a Reformer model enabling rich interaction between encoder and decoder. [141] attempted to replace self-attention with dynamic convolutions. [112] proposed the evolved Transformer using neural architecture search. [83] aimed to improve Transformer from the perspective of multi-particle dynamic systems. Note that these models do not introduce big changes to the NMT architecture; designing novel and more effective NMT frameworks will remain a long-term pursuit. In this section, we analyze and discuss the key challenges 3) facing NMT from its formulation.

From the introduction in Sec. 2.1, NMT is formally defined as a sequence-to-sequence prediction task in which four assumptions are hidden by default. First, the input is a sentence rather than paragraphs or documents. Second, the output sequence is generated in a left-to-right autoregressive manner. Third, the NMT model is optimized over bilingual training data that should include large-scale parallel sentences in order to learn good network parameters. Fourth, the processing objects of NMT are pure texts (tokens, words and sentences) instead of speech and videos. Accordingly, four key challenges can be summarized as follows:

1. Document neural machine translation. In the NMT formulation, the sentence is the basic input for modeling. However, some words in a sentence are ambiguous and their sense can only be disambiguated with the context of surrounding sentences or paragraphs. When translating a document, we also need to guarantee that the same terms in different sentences lead to the same translation, which cannot be achieved by translating sentence by sentence independently. Moreover, many discourse phenomena, such as coreference, omissions and coherence, cannot be handled in the absence of document-level information. Obviously, how to take full advantage of contexts beyond the sentence is a big challenge for neural machine translation.

2. Non-autoregressive decoding and bidirectional inference. Left-to-right decoding token by token follows an autoregressive style which seems natural and is in line with human reading and writing. It is also easy for training and inference. However, it has several drawbacks. On one hand, the decoding efficiency is quite limited, since the i-th translation token can be predicted only after all the previous i − 1 predictions have been generated. On the other hand, predicting the i-th token can only access the previous history predictions and cannot utilize the future context information in the autoregressive manner, leading to inferior translation quality. Thus, how to break the autoregressive inference constraint is a challenge. Non-autoregressive decoding and bidirectional inference are two solutions from the perspectives of efficiency and quality respectively.

3. Low-resource translation. There are thousands of human languages in the world, and abundant bitexts are only available for a handful of language pairs such as English-German, English-French and English-Chinese. Even in resource-rich language pairs, the parallel data are unbalanced, since most of the bitexts exist in only a few domains (e.g. news and patents). That is to say, the lack of parallel training corpora is very common in most languages and domains. It is well known that neural network parameters can be well optimized on highly repeated events (frequent word/phrase translation pairs in the training data for NMT), and the standard NMT model will be poorly learned on low-resource language pairs. As a result, how to make full use of the parallel data in other languages (pivot-based translation and multilingual translation) and how to take full advantage of non-parallel data (semi-supervised translation and unsupervised translation) are two challenges facing NMT.

4. Multimodal neural machine translation. Intuitively, human language is not only about texts, and understanding the meaning of a language may need the help of other modalities such as speech, images and videos. Consider the well-known example of determining the meaning of the word bank when translating the sentence "he went to the bank": it will be correctly translated if we are given an image in which a man is approaching a river. Furthermore, in many scenarios we are required to translate a speech or a video. For example, simultaneous speech translation is more and more in demand at various conferences and international live events. Therefore, how to perform multimodal translation under the encoder-decoder architecture is a big challenge for NMT. How to make full use of different modalities in multimodal translation and how to balance quality and latency in simultaneous speech translation are two specific challenges.

In the following sections, we briefly introduce the recent progress for each challenge.
3) [69, 80, 156] have also discussed various challenges.

Figure 3  Illustration of two docNMT models: (a) cascaded attention; (b) two-pass model. The left part shows the cascaded attention model proposed by [153], in which the previous source sentences are first leveraged to enhance the representation of the current source sentence and then used again in the decoder. The right part illustrates the two-pass docNMT model proposed by [146], in which a sentence-level NMT model first generates a preliminary translation for each sentence, and then the first-pass translations together with the source-side sentences are employed to generate the final translation results.
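As a rough sketch of the two-pass model in Fig. 3(b), the following Python fragment shows only the decoding flow: the functions sentence_nmt and second_pass_nmt are hypothetical placeholders for the first-pass and second-pass models described later in this section.

```python
from typing import Callable, List

def translate_document_two_pass(
    src_sents: List[str],
    sentence_nmt: Callable[[str], str],
    second_pass_nmt: Callable[[str, List[str], List[str]], str],
) -> List[str]:
    """Two-pass flow of Fig. 3(b): a sentence-level model produces draft
    translations, then a second-pass model re-translates each sentence using
    the source-side context together with the first-pass drafts."""
    drafts = [sentence_nmt(s) for s in src_sents]            # first pass
    finals = []
    for i, sent in enumerate(src_sents):                     # second pass
        finals.append(second_pass_nmt(sent, src_sents[:i], drafts))
    return finals

# Toy usage with placeholder "models" that only tag their inputs.
fake_first = lambda s: f"draft({s})"
fake_second = lambda s, ctx_src, ctx_draft: f"final({s}; {len(ctx_src)} prev sents)"
print(translate_document_two_pass(["s1", "s2", "s3"], fake_first, fake_second))
```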

4 Document-level neural machine translation

As discussed in Sec. 3, performing translation sentence by sentence independently introduces several risks. An ambiguous word may not be correctly translated without the necessary information in the surrounding contextual sentences. The same term in different sentences of the same document may result in inconsistent translations. Furthermore, many discourse phenomena, such as coreference, omissions and cross-sentence relations, cannot be well handled. In a word, sentence-level translation will harm the coherence and cohesion of the translated documents if we ignore the discourse connections and relations between sentences.

In general, document-level machine translation (docMT) aims at exploiting useful document-level information (multiple sentences around the current sentence or the whole document) to improve the translation quality of the current sentence as well as the coherence and cohesion of the translated document. docMT was already extensively studied in the era of statistical machine translation (SMT), in which most researchers proposed explicit models to address specific discourse phenomena, such as lexical cohesion and consistency [46, 144, 145], coherence [15] and coreference [104]. Due to the complicated integration of multiple components in SMT, these methods modeling discourse phenomena did not lead to promising improvements.

The NMT model, dealing with semantics and translation in a distributed vector space, facilitates the use of wider and deeper document-level information under the encoder-decoder framework. It does not need to explicitly model specific discourse phenomena as in SMT. According to the type of document information used, document-level neural machine translation (docNMT) roughly falls into three categories: dynamic translation memory [71, 120], surrounding sentences [61, 90, 125, 128, 146, 148, 153] and the whole document [87, 88, 117].

[120] presented a dynamic cache-like memory to maintain the hidden representations of previously translated words. The memory contains a fixed number of cells and each cell is a triple (c_t, s_t, y_t), where y_t is the prediction at the t-th step, c_t is the source-side context representation calculated by the attention model and s_t is the corresponding decoder state. During inference, when making the i-th prediction for a test sentence, c_i is first obtained through the attention model and the probability p(c_t | c_i) is computed based on their similarity. Then the memory context representation m_i is calculated by linearly combining all the values s_t with p(c_t | c_i). This cache-like memory encourages words in similar contexts to share similar translations, so that cohesion can be enhanced to some extent.

The biggest difference between using the whole document and using surrounding sentences lies in the number of sentences employed as context. This article mainly introduces the methods exploiting surrounding sentences for docNMT. Relevant experiments further show that subsequent sentences on the right contribute little to the translation quality of the current sentence.

Thus, most of the recent work aims at fully exploiting the previous sentences to enhance docNMT. These methods can be divided into two categories. One just utilizes the previous source-side sentences [61, 119, 128, 153]. The other uses the previous source sentences as well as their target translations [90, 146].

If only previous source-side sentences are leveraged, the previous sentences can be concatenated with the current sentence as the input to the NMT model [119] or encoded into a summarized source-side context with a hierarchical neural network [128]. [153] presented a cascaded attention model to make full use of the previous source sentences. As shown in the left part of Fig. 3, the previous sentences are first encoded as the document-level context representation. When encoding the current sentence, each word attends to the document-level context and obtains a context-enhanced source representation. During the calculation of cross-language attention in the decoder, the current source sentence and the document-level context are both leveraged to predict the target word. The probability of the translation given the current sentence and the previous context sentences is formulated as follows:

P(y|x, doc_x; \theta) = \prod_{i=0}^{I} P(y_i \,|\, y_{<i}, x, doc_x; \theta)    (7)

where doc_x denotes the source-side document-level context, namely the previous sentences.

If both the previous source sentences and their translations are employed, two-pass decoding is more suitable for the docNMT model [146]. As illustrated in the right part of Fig. 3, a sentence-level NMT model generates preliminary translations for each sentence in the first-pass decoding. Then, the second-pass model produces the final translations with the help of the source sentences and their preliminary translation results. The probability of the target sentence in the second pass can be written as:

P(y|x, doc_x, doc_y; \theta) = \prod_{i=0}^{I} P(y_i \,|\, y_{<i}, x, doc_x, doc_y; \theta)    (8)

in which doc_y denotes the first-pass translations of doc_x.

Since most methods for docNMT are designed to boost the overall translation quality (e.g. BLEU score), it remains an open question whether these methods indeed handle the discourse phenomena well. To address this issue, [10] conducted an empirical investigation of docNMT models on their ability to process various discourse phenomena, such as coreference, cohesion and coherence. Their findings indicate that the multi-encoder model exploiting only the source-side previous sentences performs poorly in handling the discourse phenomena, while exploiting both source sentences and target translations leads to the best performance. Accordingly, [123, 124] recently focused on designing better document-level NMT to improve on specific discourse phenomena such as deixis, ellipsis and lexical cohesion for English-Russian translation.

5 Non-autoregressive decoding and bidirectional inference

Most NMT models follow the autoregressive generation style which produces output word by word from left to right. As discussed in Sec. 3, this paradigm has to wait for i − 1 time steps before starting to predict the i-th target word. Furthermore, left-to-right autoregressive decoding cannot exploit the target-side future context (future predictions after the i-th word). Recently, much research has attempted to break this decoding paradigm. The non-autoregressive Transformer (NAT) [49] was proposed to remarkably lower the latency by emitting all of the target words at the same time, and bidirectional inference [161, 172] was introduced to improve the translation quality by making full use of both history and future contexts.

5.1 Non-autoregressive decoding

The non-autoregressive Transformer (NAT) aims at producing the entire target output in parallel. Different from the autoregressive Transformer model (AT), which terminates decoding when emitting an end-of-sentence token </s>, NAT has to know how many target words should be generated before parallel decoding. Accordingly, NAT calculates the conditional probability of a translation y given the source sentence x as follows:

P_{NAT}(y|x; \theta) = P_L(I|x; \theta) \cdot \prod_{i=0}^{I} P(y_i|x; \theta)    (9)

To determine the output length, [49] proposed to use a fertility model which predicts the number of target words that should be translated for each source word. We can perform word alignment on the bilingual training data to obtain the gold fertilities for each sentence pair. Then, the fertility model can be trained together with the translation model. For each source word x_j, suppose the predicted fertility is Φ(x_j). The output length will be I = \sum_{j=0}^{J} Φ(x_j).

Another issue remains: AT lets the previously generated output y_{i−1} be the input at the next time step to predict the i-th target word, but NAT has no such input in the decoder network. [49] found that translation quality is particularly poor if the decoder input is omitted in NAT.

Figure 4  Illustration of the autoregressive NMT model and various non-autoregressive NMT models. AT denotes the conventional autoregressive NMT paradigm in which the i-th prediction can fully utilize the partial translation of i − 1 words. NAT indicates the non-autoregressive NMT model that generates all the target words simultaneously. SAT is a variant of NAT which produces an n-gram at each step. NAT-EDI denotes the non-autoregressive NMT model with an enhanced decoder input, which is generated by retrieving a phrase table.
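The fertility-based length prediction and decoder-input copying discussed around Fig. 4 can be sketched as follows; the hand-set fertility values are invented for illustration, whereas a real system predicts Φ(x_j) with a trained fertility model.

```python
from typing import Dict, List

def nat_decoder_input(src_tokens: List[str], fertility: Dict[str, int]) -> List[str]:
    """Build the NAT decoder input described for [49]: copy each source token
    as many times as its predicted fertility; the output length I is the sum of
    fertilities, so all target positions can then be predicted in parallel."""
    dec_input: List[str] = []
    for tok in src_tokens:
        dec_input.extend([tok] * fertility.get(tok, 1))
    return dec_input

# Toy example with hand-set fertilities (assumed values for illustration).
src = ["wir", "haben", "es", "geschafft"]
phi = {"wir": 1, "haben": 1, "es": 1, "geschafft": 2}
inp = nat_decoder_input(src, phi)
print(inp, "-> target length I =", len(inp))
```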

To address this, they resort to the fertility model again and copy each source word as many times as its fertility Φ(x_j) into the decoder input. Empirical experiments show that NAT can dramatically boost the decoding efficiency, with a 15× speedup compared to AT. However, NAT severely suffers from accuracy degradation.

The low translation quality may be due to at least two critical issues of NAT. First, there is no dependency between target words, although word dependency is ubiquitous in natural language generation. Second, the decoder inputs are the copied source words, which lie in a different semantic space from the target words. Recently, to address the shortcomings of the original NAT model, several methods have been proposed to improve the translation quality of NAT while maintaining its efficiency [54, 75, 110, 127, 137, 139].

[127] proposed a semi-autoregressive Transformer model (SAT) to combine the merits of both AT and NAT. SAT keeps the autoregressive property globally but performs non-autoregressive generation locally. As shown in Fig. 4, SAT generates K successive target words at each time step in parallel. If K = 1, SAT is exactly AT; it becomes NAT if K = I. By choosing an appropriate K, the dependency relations between fragments are well modeled and the translation quality can be much improved with some loss of efficiency.

To mimic the decoder input of the AT model, [54] introduced a simple but effective method that employs a phrase table, the core component of SMT, to convert source words into target words. Specifically, they first greedily segment the source sentence into phrases with a maximum-match algorithm. Suppose the longest phrase in the phrase table contains K words. x_{0:K−1} is a phrase if it matches an entry in the phrase table; otherwise they iteratively check x_{0:K−2}, x_{0:K−3} and so on. If x_{0:h} is a phrase, they then start to check x_{h+1:h+K}. After segmentation, each source phrase is mapped into its target translation, and the translations are concatenated together as the new decoder input, as shown in Fig. 4. Due to proper modeling of the decoder input with a highly efficient strategy, the translation quality is substantially improved while the decoding speed is even faster than the NAT baseline.

5.2 Bidirectional inference

From the viewpoint of improving translation quality, the autoregressive model can be enhanced by exploring the future context on the right. In addition to predicting and estimating the future contexts with various models [151, 169, 171], researchers find that left-to-right (L2R) and right-to-left (R2L) autoregressive models can generate complementary translations [57, 79, 161, 172]. For example, in Chinese-to-English translation, experiments show that L2R generates better prefixes while R2L is good at producing suffixes. Intuitively, it is a promising direction to combine the merits of bidirectional inference and fully exploit both history and future contexts on the target side.

To this end, many researchers resort to bidirectional decoding to take advantage of both L2R and R2L inference. These methods mainly fall into four categories: (1) enhancing agreement between L2R and R2L predictions [79, 164]; (2) reranking with bidirectional decoding [79, 106, 107]; (3) asynchronous bidirectional decoding [115, 161]; and (4) synchronous bidirectional decoding [154, 172, 173].

Figure 5  Illustration of the synchronous bidirectional inference model. The top demonstrates how the bidirectional contexts can be leveraged during inference. The bottom compares the beam search algorithm of conventional NMT with that of synchronous bidirectional NMT.
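Before turning to the synchronous model of Fig. 5, the simplest bidirectional scheme, reranking an L2R n-best list with an R2L model (category (2) above), can be sketched as below. The interpolation weight alpha and the toy stand-in R2L scorer are assumptions for illustration; the cited methods do not prescribe a specific combination formula here.

```python
from typing import Callable, List, Tuple

def rerank_with_r2l(
    nbest: List[Tuple[str, float]],                 # (translation, L2R log-prob)
    r2l_score: Callable[[str], float],              # force-decode log-prob under an R2L model
    alpha: float = 0.5,                             # assumed interpolation weight
) -> str:
    """Bidirectional reranking sketch: interpolate the L2R model score with the
    score the R2L model assigns to the same translation, then pick the best."""
    scored = [(alpha * l2r + (1 - alpha) * r2l_score(hyp), hyp) for hyp, l2r in nbest]
    return max(scored)[1]

# Toy usage: the "R2L model" here is a stand-in that simply prefers shorter outputs.
nbest = [("the bank of the river", -2.1), ("the river bank", -2.4)]
fake_r2l = lambda hyp: -0.3 * len(hyp.split())
print(rerank_with_r2l(nbest, fake_r2l))
```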

Ideally, L2R decoding should generate the same translation as R2L decoding. Under this reasonable assumption, [79, 164] introduced agreement constraints or regularization between L2R and R2L predictions during training, so that L2R inference can be improved.

The reranking algorithm is widely used in machine translation, and the R2L model can provide an estimation score for the quality of an L2R translation from another parameter space [79, 106, 107]. Specifically, the L2R model first generates an n-best list of translations. The R2L model is then leveraged to force-decode each translation, leading to a new score. Finally, the best translation is selected according to the new scores.

[115, 161] proposed an asynchronous bidirectional decoding model (ASBD) which first obtains the R2L outputs and then optimizes the L2R inference model based on both the source input and the R2L outputs. Specifically, [161] first trained an R2L model with the bilingual training data. Then, the optimized R2L decoder translates the source input of each sentence pair and produces the outputs (hidden states), which serve as additional context for L2R prediction when optimizing the L2R inference model. Due to the explicit use of right-side future contexts, the ASBD model significantly improves the translation quality. But these approaches still suffer from two issues. On one hand, they have to train two separate NMT models for L2R and R2L inference respectively, and the two-pass decoding strategy increases latency considerably. On the other hand, the two models cannot interact with each other during inference, which limits the potential for performance improvement.

[172] proposed a synchronous bidirectional decoding model (SBD) that produces outputs using both L2R and R2L decoding simultaneously and interactively. Specifically, a new synchronous attention model is proposed to conduct interaction between L2R and R2L inference. The top part of Fig. 5 gives a simple illustration of the proposed synchronous bidirectional inference model. The dotted colored arrows on the target side are the core of the SBD model. L2R and R2L inferences interact with each other in an implicit way illustrated by the dotted arrows. All the arrows indicate the information passing flow. Solid arrows show the conventional history context dependence, while dotted arrows introduce the future context dependence on the other inference direction. For example, besides the past predictions (y_0^{l2r}, y_1^{l2r}), the L2R inference can also utilize the future contexts (y_0^{r2l}, y_1^{r2l}) generated by the R2L inference when predicting y_2^{l2r}. The conditional probability of the translation can be written as follows:

P(y|x) = \begin{cases} \prod_{i=0}^{I} P(\overrightarrow{y}_i \,|\, \overrightarrow{y}_0 \cdots \overrightarrow{y}_{i-1}, x, \overleftarrow{y}_0 \cdots \overleftarrow{y}_{i-1}) & \text{if L2R} \\ \prod_{i=0}^{I} P(\overleftarrow{y}_i \,|\, \overleftarrow{y}_0 \cdots \overleftarrow{y}_{i-1}, x, \overrightarrow{y}_0 \cdots \overrightarrow{y}_{i-1}) & \text{if R2L} \end{cases}    (10)

To accommodate L2R and R2L inference at the same time, they introduced a novel beam search algorithm. As shown in the bottom right of Fig. 5, at each timestep during decoding, each half of the beam maintains the hypotheses from L2R and R2L decoding respectively, and each hypothesis is generated by leveraging the already predicted outputs from both directions. At last, the final translation is chosen from the L2R and R2L results according to their translation probabilities.

Thanks to the appropriate rich interaction, the SBD model substantially boosts the translation quality while the decoding speed is only 10% slower.

[173] further noticed that L2R and R2L do not need to produce the entire translation. They let L2R generate the left half of the translation and R2L produce the right half, and the two halves are then concatenated to form the final translation. Using proper training algorithms, they demonstrated through extensive experiments that both translation quality and decoding efficiency can be significantly improved compared to the baseline Transformer model.

6 Low-resource Translation

Most NMT models assume that enough bilingual training data is available, which is rarely the case in real life. For a low-resource language pair, a natural question is what kind of knowledge can be transferred to build a relatively good NMT system. This section discusses three kinds of methods. The first attempts to share translation knowledge from other resource-rich language pairs, in which pivot translation and multilingual translation are the two key techniques. Pivot translation assumes that for the low-resource pair A and B, there is a language C that has rich bitexts with A and B respectively [26, 142]. This section mainly discusses the technique of multilingual translation in this first category. The second kind of method resorts to a semi-supervised approach which takes full advantage of limited bilingual training data and abundant monolingual data. The last leverages unsupervised algorithms that require monolingual data only.

6.1 Multilingual neural machine translation

Let us first have a quick recap of the NMT model based on the encoder-decoder framework. The encoder is responsible for mapping the source-language sentence into distributed semantic representations. The decoder converts the source-side distributed semantic representations into the target-language sentence. Apparently, the encoder and the decoder (excluding the cross-language attention component) each depend on a single language. Intuitively, the same source language in different translation systems (e.g. Chinese-to-English, Chinese-to-Hindi) can share the same encoder, and the same target language can share the same decoder (e.g. Chinese-to-English and Hindi-to-English). Multilingual neural machine translation is a framework that aims at building a unified NMT model capable of translating multiple languages through parameter sharing and knowledge transfer.

[36] was the first to design a multi-task learning method which shares the same encoder for one-to-many translation (one source language to multiple target languages). [176] proposed to share the decoder for many-to-one translation (many source languages to one target language). [43, 44] proposed to share the attention mechanism for many-to-many translation (many source languages to many target languages). Despite improved performance for low-resource languages, these methods require a specific encoder or decoder for each language, which hinders their scalability in dealing with many languages.

[64] went a step further and let all source languages share the same encoder and all target languages share the same decoder, successfully training a single encoder-decoder NMT model for multilingual translation. The biggest issue is that the decoder is unaware of which target language it should translate into at test time. To this end, [64] introduced a simple strategy and added a special token indicating the target language (e.g. 2en and 2zh) at the beginning of the source sentence. By doing this, low-resource languages have the biggest chance to share translation knowledge from other resource-rich languages. It also enables zero-shot translation as long as the two languages are employed as source and target in the multilingual NMT model. In addition, this unified multilingual NMT is very scalable and could ideally translate all languages in one model. However, experiments find that the output is sometimes a mix of multiple languages even when a translation direction indicator is used. Furthermore, this paradigm forces different source/target languages to share the same semantic space, without considering the structural divergence among different languages. The consequence is that single-model multilingual NMT yields inferior translation performance compared to individually trained bilingual counterparts. Most recent research work focuses on designing better models to balance language-independent parameter sharing and language-sensitive module design.

[13] augmented the attention mechanism in the decoder with language-specific signals. [134] proposed to use language-sensitive positions and language-dependent hidden representations for one-to-many translation. [98] designed an algorithm to generate language-specific parameters. [118] designed a language clustering method and forced languages in the same cluster to share the parameters of the same semantic space. [135] attempted to generate two languages simultaneously and interactively by sharing encoder parameters. [136] proposed a compact and language-sensitive multilingual translation model which attempts to share most of the parameters while maintaining language discrimination.

As shown in Fig. 6, [136] designed four novel modules in the Transformer framework compared to single-model multilingual NMT.

First, they introduced a representor to replace both the encoder and the decoder by sharing the weight parameters of the self-attention, feed-forward and normalization blocks (middle part in Fig. 6). This makes the multilingual NMT model as compact as possible and maximizes the knowledge sharing among different languages. The objective function over L language pairs becomes:

\mathcal{L}(\theta) = \sum_{l=1}^{L} \sum_{m=1}^{M_l} \log P(y_l^{(m)} \,|\, x_l^{(m)}; \theta_{rep}, \theta_{att})    (11)

where θ_rep and θ_att denote the parameters of the representor and the attention mechanism respectively.

Figure 6  Illustration of a compact and language-sensitive multilingual NMT model. The compactness is ensured by sharing parameters between encoder and decoder, denoted as the representor. Language-sensitive capacity is realized by three components: language-sensitive embedding (bottom), language-sensitive cross-attention (middle) and language discriminator (top).

However, the representor further reduces the ability to discriminate between different languages. To address this, they introduced three language-sensitive modules.

1. Language-sensitive embedding (bottom part in Fig. 6): they compared four categories of embedding sharing patterns, namely the language-based pattern (different languages have separate input embeddings), the direction-based pattern (languages on the source side and the target side have different input embeddings), the representor-based pattern (shared input embeddings for all languages) and the three-way weight tying pattern proposed by [101], in which the output embedding of the target side is also shared in addition to representor-based sharing.

2. Language-sensitive attention (middle part in Fig. 6): this mechanism allows the model to select the cross-lingual attention parameters dynamically according to the specific translation task.

3. Language-sensitive discriminator (top part in Fig. 6): for this module, they employed a neural model f_dis on the top layer of the representor h_{top}^{rep}, and this model outputs a language judgment score P_lang:

P_{lang} = \mathrm{softmax}(W_{dis} \times f_{dis}(h_{top}^{rep}) + b_{dis})    (12)

Combining the above four ideas, they showed through extensive experiments that the new method significantly improves multilingual NMT in one-to-many, many-to-many and zero-shot scenarios, outperforming bilingual counterparts in most cases. This indicates that low-resource language translation can greatly benefit from this kind of multilingual NMT, as can zero-resource language translation.

6.2 Semi-supervised neural machine translation

Semi-supervised neural machine translation is a paradigm which aims at building a good NMT system with limited bilingual training data D = {(x^{(m)}, y^{(m)})}_{m=1}^{M} plus massive source monolingual data D_x = {x^{(l_x)}}_{l_x=1}^{L_x}, or target monolingual data D_y = {y^{(l_y)}}_{l_y=1}^{L_y}, or both.

Monolingual data plays a very important role in SMT, where the target-side monolingual corpus is leveraged to train a language model (LM) to measure the fluency of the translation candidates during decoding [27, 68, 70]. Using monolingual data as a language model in NMT is not trivial since it requires modifying the architecture of the NMT model. [52, 53] integrated NMT with an LM by combining the hidden states of both models, making the model much more complicated.

As for leveraging target-side monolingual data, back-translation (BT) proposed by [108] may be one of the best solutions up to now. BT is easy and simple to use since it is model agnostic to the NMT framework [38, 58]. It only requires training a target-to-source translation system to translate the target-side monolingual sentences back into the source language. The source translation and its corresponding target sentence are paired as pseudo bitexts, which are combined with the original bilingual training data to train the source-to-target NMT system. It has proved to be particularly useful in low-resource translation [66]. [38] conducted a deep analysis to understand BT and investigated various methods for synthetic source sentence generation. [131] proposed to measure the confidence level of synthetic bilingual sentences so as to filter the noise.

In order to utilize source-side monolingual data, [157] proposed two methods: forward translation and multi-task learning. Forward translation is similar to BT, and the multi-task learning method performs a source-to-target translation task and a source sentence reordering task by sharing the same encoder.

Figure 7  Illustration of two methods exploring monolingual data. (a) The auto-encoder based method, which maximizes the reconstruction probability P(x'|x; θ_{s2t}, θ_{t2s}) = Σ_y P(y|x; θ_{s2t}) P(x'|y; θ_{t2s}). (b) The dual learning method, which uses the reward R(x', x, θ_{s2t}, θ_{t2s}) = BLEU(x, x') + LM(y) + LM(x'), where y = s2t(x) and x' = t2s(y). If the parameters are trained to maximize the objective function in (a), it is the auto-encoder based method; if the reward in (b) is used, it is the dual learning method. Note that this figure only demonstrates the usage of source-side monolingual data for simplicity; the use of target-side monolingual data is symmetric.
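A minimal sketch of the dual-learning reward in Fig. 7(b) is given below; all model and scoring functions are hypothetical placeholders, and a real implementation would use sampled translations and policy-gradient updates rather than a single deterministic round trip.

```python
from typing import Callable

def dual_learning_reward(
    x: str,
    s2t: Callable[[str], str],        # source-to-target model (primal task)
    t2s: Callable[[str], str],        # target-to-source model (dual task)
    lm_tgt: Callable[[str], float],   # target-side language model score
    lm_src: Callable[[str], float],   # source-side language model score
    sim: Callable[[str, str], float], # similarity, e.g. sentence-level BLEU
) -> float:
    """Reward of Fig. 7(b): translate a monolingual source sentence, translate it
    back, and reward the fluency of both outputs plus reconstruction similarity."""
    y = s2t(x)                 # agent A translates the monolingual sentence
    x_rec = t2s(y)             # agent B reconstructs the source
    return sim(x, x_rec) + lm_tgt(y) + lm_src(x_rec)

# Toy usage with placeholder models and scores.
reward = dual_learning_reward(
    "machine translation is hard",
    s2t=lambda s: f"<target translation of: {s}>",
    t2s=lambda t: "machine translation is hard",
    lm_tgt=lambda t: -1.0,
    lm_src=lambda s: -0.8,
    sim=lambda a, b: 1.0 if a == b else 0.0,
)
print(reward)
```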

Many researchers resort to using monolingual data on both sides in NMT at the same time [25, 56, 163, 170]. We summarize two methods in Fig. 7: the auto-encoder based semi-supervised learning method [25] and the dual learning method [56]. For a source-side monolingual sentence x, [25] employed source-to-target translation as an encoder to generate the latent variable y and leveraged target-to-source translation as a decoder to reconstruct the input, leading to x′. They optimize the parameters by maximizing the reconstruction probability, as shown in Fig. 7(a). The target-side monolingual data is used in a symmetric way. Fig. 7(b) shows the objective function for the dual learning method. [56] treated source-to-target translation as the primal task and target-to-source translation as the dual task. Through the primal task, agent A sends a translation of the source monolingual sentence to agent B. B is responsible for estimating the quality of the translation with a language model and the dual task. The rewards, including the similarity between the input x and the reconstruction x′, and the two language model scores LM(y) and LM(x′), are employed to optimize the network parameters of both the source-to-target and target-to-source NMT models. Similarly, the target-side monolingual data is used in a symmetric way in dual learning.

[163] introduced an iterative back-translation algorithm to exploit both source and target monolingual data with an EM optimization method. [170] proposed a mirror-generative NMT model that explores the monolingual data by unifying the source-to-target NMT model, the target-to-source NMT model and two language models. They showed that better performance can be achieved compared to back-translation, iterative back-translation and dual learning.

6.3 Unsupervised neural machine translation

Unsupervised neural machine translation addresses a very challenging scenario in which we are required to build an NMT model using only massive source-side monolingual data D_x = {x^{(l_x)}}_{l_x=1}^{L_x} and target-side monolingual data D_y = {y^{(l_y)}}_{l_y=1}^{L_y}.

Unsupervised machine translation dates back to the era of SMT, in which decipherment approaches are employed to learn word translations from monolingual data [37, 96, 102], or bilingual phrase pairs are extracted and their probabilities estimated from monolingual data [67, 155].

Since [91] found that word embeddings from two languages can be mapped using some seed translation pairs, bilingual word embedding learning, or bilingual lexicon induction, has attracted more and more attention [4, 21, 31, 41, 158, 159]. [4] and [31] applied linear embedding mapping and adversarial training to learn word pair matching at the distribution level and achieve promising accuracy for similar languages.

Bilingual lexicon induction greatly motivates the study of unsupervised NMT at the sentence level, and two techniques, denoising auto-encoding and back-translation, make unsupervised NMT possible. The key idea is to find a common latent space between the two languages. [5] and [72] both optimized the dual tasks of source-to-target and target-to-source translation. [5] employed a shared encoder to force the two languages into the same semantic space, together with two language-dependent decoders. In contrast, [72] made the two languages share the same encoder and decoder, relying on an identifier to indicate the specific language, similar to single-model multilingual NMT [64]. The architecture and training objective functions are illustrated in Fig. 8.

The top of Fig. 8 shows the use of the denoising auto-encoder. The encoder encodes a noisy version of the input x into the hidden representation z_src, which is used to reconstruct the input with the decoder. The distance (auto-encoder loss L_auto) between the reconstruction x′ and the input x should be as small as possible. To guarantee that the source and target languages share the same semantic space, an adversarial loss L_adv is introduced to fool the language identifier.

The bottom of Fig. 8 illustrates the use of back-translation. A target sentence y is first back-translated into x∗ using an old target-to-source NMT model (the one optimized in the previous iteration; the initial model is the word-by-word translation model based on bilingual word induction).

Figure 8  Architecture of the unsupervised NMT model. The top shows the denoising auto-encoder, which aims at reconstructing the input in the same language. The bottom demonstrates back-translation, which attempts to reconstruct the input in the other language using back-translation (target-to-source) and forward translation (source-to-target). The auto-encoder loss L_auto, the translation loss L_trans and the language adversarial loss L_adv are used together to optimize the dual NMT models.
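The training signal of Fig. 8 can be summarized in a schematic step that combines the denoising auto-encoder term and the back-translation term (the adversarial term is omitted for brevity); every function argument is a placeholder, since real systems operate on token tensors rather than strings.

```python
from typing import Callable

def unsupervised_nmt_step(
    x_src: str, y_tgt: str,
    add_noise: Callable[[str], str],
    enc_dec_src: Callable[[str], str],      # encode input, decode into the source language
    enc_dec_tgt: Callable[[str], str],      # encode input, decode into the target language
    recon_loss: Callable[[str, str], float],
) -> float:
    """Sketchy training step in the spirit of Fig. 8: a denoising auto-encoding
    term per language plus a back-translation term per direction."""
    # Denoising auto-encoding: reconstruct each sentence from its noisy version.
    l_auto = (recon_loss(enc_dec_src(add_noise(x_src)), x_src)
              + recon_loss(enc_dec_tgt(add_noise(y_tgt)), y_tgt))
    # Back-translation: translate with the current model, then reconstruct the original.
    x_bt = enc_dec_src(y_tgt)               # pseudo source for the target sentence
    y_bt = enc_dec_tgt(x_src)               # pseudo target for the source sentence
    l_trans = (recon_loss(enc_dec_tgt(add_noise(x_bt)), y_tgt)
               + recon_loss(enc_dec_src(add_noise(y_bt)), x_src))
    return l_auto + l_trans

# Toy usage with string-level stand-ins for the real encoder-decoder.
loss = unsupervised_nmt_step(
    "ein haus", "a house",
    add_noise=lambda s: s.replace(" ", "  "),
    enc_dec_src=lambda s: s.lower().strip(),
    enc_dec_tgt=lambda s: s.lower().strip(),
    recon_loss=lambda a, b: float(a != b),
)
print(loss)
```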

Then, the noisy version of the translation x∗ is encoded into z_src, which is then translated back into the target sentence y′. The new NMT model (encoder and decoder) is optimized to minimize the translation loss L_trans, which is the distance between the translation y′ and the original target input y. Similarly, an adversarial loss is employed on the encoder module. This process iterates until convergence. Finally, the encoder and decoder can be applied to perform the dual translation tasks.

[147] argued that sharing some layers of the encoder and decoder while making others language-specific could improve the performance of unsupervised NMT. [6] further combined NMT and SMT to improve the unsupervised translation quality. Most recently, [30, 103, 113] resorted to pre-training techniques to enhance the unsupervised NMT model. For example, [30] proposed a cross-lingual language model pre-training method under the BERT framework [33]. Two pre-trained cross-lingual language models are then employed as the encoder and decoder respectively to perform translation.

7 Multimodal neural machine translation

We know that humans communicate with each other in a multimodal environment in which we see, hear, smell and so on. Naturally, it would be ideal to perform machine translation with the help of texts, speech and images. Unfortunately, video corpora containing parallel texts, speech and images for machine translation are not publicly available currently. Recently, IWSLT-2020 4) organized the first evaluation on video translation, in which annotated video data is only available for the validation and test sets.

Translation for paired image-text, offline speech-to-text translation and simultaneous translation have become increasingly popular in recent years.

7.1 Image-Text Translation

Given an image and its text description in the source language, the task of image-text translation aims at translating the description in the source language into the target language, where the translation process can be supported by information from the paired image. It is a task requiring the integration of natural language processing and computer vision. WMT 5) organized the first evaluation task on image-text translation (which they call multimodal translation) in 2016 and also released the widely-used Multi30K dataset 6), consisting of about 30K images, each of which has an English description and translations in German, French and Czech. Several effective models have been proposed since then. These methods mainly differ in their usage of the image information, and we discuss four of them in this section.

[59] proposed to encode the image into one distributed vector representation or a sequence of vector representations using convolutional neural networks. Then, they padded the vector representations together with the sentence as the final input for the NMT model, which does not need to be modified for adaptation. The core idea is that they did not distinguish images from texts in the model design.
4) http://iwslt.org/doku.php?id=evaluation
5) https://www.statmt.org/wmt16/multimodal-task.html
6) https://github.com/multi30k/dataset
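A minimal sketch of the idea attributed to [59], feeding image features to an unmodified NMT encoder as if they were extra tokens, is given below. The pooled 2048-dimensional CNN feature, the projection matrix and the choice to prepend a single image "token" are illustrative assumptions rather than details from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def build_multimodal_input(token_emb: np.ndarray, img_feat: np.ndarray,
                           W_img: np.ndarray) -> np.ndarray:
    """Project the CNN image feature into the word-embedding space and prepend
    it to the token sequence, so a standard (unmodified) NMT encoder consumes
    image and text uniformly."""
    img_token = img_feat @ W_img              # (d_img,) -> (d_model,)
    return np.vstack([img_token, token_emb])  # image acts as one extra "token"

tokens = rng.normal(size=(5, d_model))        # embeddings of a 5-token sentence
image = rng.normal(size=(2048,))              # e.g. a pooled CNN feature (assumed size)
W_img = rng.normal(size=(2048, d_model)) * 0.01
enc_input = build_multimodal_input(tokens, image, W_img)
print(enc_input.shape)  # (6, 16)
```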

Figure 9  Comparison between the doubly-attentive model and the image imagination model for image-text translation: (a) doubly attentive model; (b) image imagination model. In the doubly attentive model, the image is encoded as an additional input feature, while in the image imagination model, the image is a decoded output from the source sentence.
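A schematic version of the imagination objective in Eq. (14) (defined later in this section) is sketched below; the loss callbacks are placeholders, and the weighting factor lam is an assumption, since Eq. (14) itself is an unweighted sum.

```python
from typing import Callable, List, Tuple

def imagination_multitask_loss(
    batch: List[Tuple[str, str, object]],          # (source, target, image)
    trans_nll: Callable[[str, str], float],        # -log P(y|x)
    img_nll: Callable[[object, str], float],       # -log P(IM|x)
    lam: float = 1.0,                              # assumed weighting knob
) -> float:
    """Multi-task objective in the spirit of Eq. (14): a translation term plus an
    image 'imagination' term, averaged over the batch."""
    total = 0.0
    for x, y, im in batch:
        total += trans_nll(y, x) + lam * img_nll(im, x)
    return total / len(batch)

# Toy usage with placeholder losses.
data = [("ein hund laeuft", "a dog runs", "img_001")]
print(imagination_multitask_loss(
    data,
    trans_nll=lambda y, x: 1.2,
    img_nll=lambda im, x: 0.7,
))
```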

[19] presented a doubly-attentive decoder for the image-text translation task. The major difference from [59] is that they design a textual encoder and a visual encoder respectively, and employ two separate attention models to balance the contribution of text and image during prediction at each time-step.

[39] introduced a multi-task learning framework to perform image-text translation. They believe that one can imagine the image given the source language sentence. Based on this assumption, they use one encoder and two decoders in a multi-task learning framework. The encoder first encodes the source sentence into distributed semantic representations. One decoder generates the target language sentence from the source-side representations. The other decoder is required to reconstruct the given image. It is easy to see that the images are only employed in the training stage but are not required during testing. From this perspective, the multi-task learning framework can be applicable in more practical scenarios.

[20] further proposed a latent variable model for image-text translation. Different from previous methods, they designed a generative model in which a latent variable is in charge of generating the source and target language sentences, and the image as well.

Fig. 9 illustrates the comparison between the doubly-attentive model and the image imagination model. Suppose the paired training data is D = \{(x^{(m)}, y^{(m)}, IM^{(m)})\}_{m=1}^{M}, where IM denotes the image. The objective function of the doubly-attentive model can be formulated as follows:

L(\theta) = \sum_{m=1}^{M} \log P(y^{(m)} \mid x^{(m)}, IM^{(m)}; \theta)    (13)

In contrast, the image imagination model has the following objective function, which includes two parts, one for text translation and the other for image imagination:

L(\theta) = \sum_{m=1}^{M} \left[ \log P(y^{(m)} \mid x^{(m)}; \theta) + \log P(IM^{(m)} \mid x^{(m)}; \theta) \right]    (14)
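The difference between Eq. (13) and Eq. (14) is easiest to see when the two training objectives are written out as code. The snippet below is a toy Python sketch with hand-picked per-token probabilities standing in for real decoder outputs; the variable names and numbers are purely illustrative.

import numpy as np

def nll(logp_seq):
    """Negative log-likelihood of one sequence given its per-token log-probabilities."""
    return -float(np.sum(logp_seq))

# Hand-picked per-token probabilities standing in for real decoder outputs.
logp_y_given_x_img = np.log([0.4, 0.6, 0.5])   # P(y_i | x, IM): doubly-attentive decoder
logp_y_given_x     = np.log([0.3, 0.5, 0.4])   # P(y_i | x):     text decoder only
logp_img_given_x   = np.log([0.2, 0.7])        # P(IM | x):      imagination decoder

# Eq. (13): the image is an additional input of a single decoder.
loss_doubly_attentive = nll(logp_y_given_x_img)

# Eq. (14): the image is an additional prediction target in a multi-task setup.
loss_imagination = nll(logp_y_given_x) + nll(logp_img_given_x)

print(loss_doubly_attentive, loss_imagination)

In the first case the image must also be available at test time, while in the second it is only needed to compute the auxiliary reconstruction loss during training.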
All the above methods are proved to significantly improve the translation quality. But it remains a natural question when and how the image helps the text translation. [18] conducted a detailed error analysis when translating both visual and non-visual terms. They find that almost all kinds of translation errors (not only the terms having strong visual connections) have decreased after using the image as the additional context.

Alternatively, [17] attempted to answer when the visual information is needed in image-text translation. They designed an input degradation method to mask crucial information in the source sentence (e.g. masking color words or entities) in order to see whether the paired image would make up the missing information during translation. They find that the visual information of the image can be helpful when it is complementary rather than redundant to the text modality.

7.2 Offline Speech-to-Text Translation

Speech-to-Text translation, abbreviated as speech translation (ST), is a task that automatically converts the speech in the source language (e.g. English) into the text in the target language (e.g. Chinese). Offline speech translation indicates that the complete speech (e.g. a sentence or a fragment in a time interval) is given before we begin translating. Typically, ST is accomplished with two cascaded components. Source language speech is first transcribed into the source language text using an automatic speech recognition (ASR) system. Then, the transcription is translated into the target language text with a neural machine translation system. It is still the mainstream approach to ST in real applications.
In this kind of paradigm, ASR and NMT are not coupled and can be optimized independently.
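The contrast between the cascaded paradigm and the end-to-end formulation discussed next can be summarized by the following schematic Python sketch, in which asr(), nmt() and the end-to-end model are placeholders rather than real systems.

def asr(speech):
    """Placeholder ASR system: speech -> source-language text."""
    return "wie geht es dir"

def nmt(source_text):
    """Placeholder NMT system: source text -> target text."""
    return "how are you"

def cascaded_st(speech):
    """Pipeline ST: recognition errors made by asr() cannot be repaired by nmt()."""
    return nmt(asr(speech))

def end_to_end_st(speech):
    """Direct ST: a single encoder-decoder maps speech to target text and is
    trained on (source speech, target text) instances."""
    return "how are you"   # stands in for one jointly optimized model

speech_signal = [0.01, -0.02, 0.03]   # dummy waveform samples
print(cascaded_st(speech_signal), "|", end_to_end_st(speech_signal))

The cascade composes two independently trained systems, so any recognition error is passed on to translation, whereas the end-to-end model maps speech directly to the target text.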
Nevertheless, the pipeline method has two disadvantages. On one hand, the errors propagate through the pipeline and the ASR errors are difficult to make up during translation. On the other hand, the efficiency is limited due to the two-phase process. [175] believed in the early years that end-to-end speech translation is possible with the development of memory, computational capacity and representation models. Deep learning based on distributed representations facilitates the end-to-end modeling for the speech translation task. [12] presented an end-to-end model without using any source language transcriptions under an encoder-decoder framework. Different from the pipeline paradigm, the end-to-end model should be optimized on the training data consisting of instances (source speech, target text). We list some of the recently used datasets in Table 1, including IWSLT7), Augmented Librispeech8), Fisher and Callhome9), MuST-C10) and TED-Corpus11). It is easy to find from the table that the training data for end-to-end ST is much less than that for ASR and MT. Accordingly, most of the recent studies focus on fully utilizing the data or models of ASR and NMT to boost the performance of ST. Multi-task learning [2, 11, 140], knowledge distillation [63, 81] and pre-training [9, 126] are three main research directions.

Corpus Name                  Source Language   Target Language                   Hours     Sentences
IWSLT [60]                   En                De                                273       171,121
Augmented Librispeech        En                Fr                                236       131,395
Fisher and Callhome [100]    En                Es                                160       153,899
MuST-C [34]                  En                De, Es, Fr, It, Nl, Pt, Ro, Ru    385-504   4.0M-5.3M
TED-Corpus [82]              En                De, Fr, Zh, Ja                    520       235K-299K

Table 1 Some datasets used in end-to-end ST. En, De, Fr, Es, It, Nl, Pt, Ro, Ru, Zh and Ja denote English, German, French, Spanish, Italian, Dutch, Portuguese, Romanian, Russian, Chinese and Japanese respectively.

In the multi-task learning framework, the ST model is jointly trained with the ASR and MT models. Since the ASR and MT models are optimized on massive training data, the ST model can be substantially improved through sharing the encoder with the ASR model and the decoder with the MT model. [140] showed that great improvements can be achieved under multi-task learning, while [11] demonstrated that multi-task learning could also accelerate the convergence in addition to yielding better translation quality.

In contrast to the multi-task learning framework, the pre-training method first pre-trains an ASR model or an MT model, and then the encoder of the ASR model or the decoder of the MT model can be utilized to directly initialize the components of the ST model. [9] attempted to pre-train the ST model with ASR data to promote the acoustic model and showed that pre-training a speech encoder on one language can boost the translation quality of ST on a different source language. To further bridge the gap between pre-training and fine-tuning, [126] only pre-trained the ASR encoder to maximize the Connectionist Temporal Classification (CTC) objective function [47]. Then, they share the projection matrix between the CTC classification layer for ASR and the word embeddings. The text sentence in MT is converted into the same length as the CTC output sequence of the ASR model. By doing this, the ASR encoder and the MT encoder will be consistent in length and semantic representations. Therefore, the pre-trained encoder and attention in the MT model can be used in ST in addition to the ASR encoder and the MT decoder.

Figure 10 The illustration of the knowledge distillation model for ST. The left part is the ST model, which is the student. The right part is the MT model, regarded as a teacher model, whose input is the transcription of the speech. The distillation loss in the top part makes the student model learn not only from the correct texts, but also from the output probabilities of the teacher model.

Different from the multi-task learning framework and the pre-training methods, which attempt to share network parameters among ASR, ST and MT, the knowledge distillation methods consider the ST model as a student and make it learn from a teacher (e.g. the MT model). [81] proposed a knowledge distillation model as shown in Fig. 10. Given the training data of ST D = \{(s^{(m)}, x^{(m)}, y^{(m)})\}_{m=1}^{M}, where s denotes the speech segment, x is the transcription in the source language and y is the translation text in the target language. The objective function for ST is similar to that of MT, and the only difference is that the input is a speech segment rather than a text sentence.

7) http://i13pc106.ira.uka.de/~mmueller/iwslt-corpus.zip
8) https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset
9) https://github.com/joshua-decoder/fisher-callhome-corpus
10) mustc.fbk.eu
11) https://drive.google.com/drive/folders/1sFe6Qht4vGD49vs7gbrNEOsLPOX9VIn?usp=sharing
L_{ST}(\theta) = -\sum_{(s,y)\in D} \log P(y \mid s; \theta)    (15)

\log P(y \mid s; \theta) = \sum_{i=0}^{I}\sum_{v=1}^{|V|} I(y_i = v)\,\log P(y_i \mid s, y_{<i}; \theta)    (16)

where |V| denotes the vocabulary size of the target language and I(y_i = v) is the indicator function which indicates whether the i-th output token y_i happens to be the ground truth.

Given the MT teacher model pre-trained on large-scale data, it can be used to force decode the pair (x, y) from the triple (s, x, y) and will obtain a probability distribution for each target word y_i: Q(y_i \mid x, y_{<i}; \theta_{MT}). Then, the knowledge distillation loss can be written as follows:

L_{KD}(\theta) = -\sum_{(s,x,y)\in D}\sum_{i=0}^{I}\sum_{v=1}^{|V|} Q(y_i = v \mid x, y_{<i}; \theta_{MT})\,\log P(y_i = v \mid s, y_{<i}; \theta)    (17)

The final ST model can be trained by optimizing both the log-likelihood loss L_{ST}(\theta) and the knowledge distillation loss L_{KD}(\theta).
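A minimal numpy sketch of this combined training objective is given below, assuming toy logits in place of real teacher (MT) and student (ST) decoder outputs; it simply instantiates Eqs. (15)-(17) for one short target sequence, and every name and number in it is illustrative.

import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
target_len, vocab_size = 4, 6

# Toy logits standing in for real decoder outputs: Q comes from the pre-trained
# MT teacher (input: transcription x), P from the ST student (input: speech s).
teacher_logits = rng.normal(size=(target_len, vocab_size))
student_logits = rng.normal(size=(target_len, vocab_size))
y = np.array([1, 3, 2, 5])                                      # reference target ids

log_P = np.stack([log_softmax(z) for z in student_logits])      # student log-probs
Q = np.stack([np.exp(log_softmax(z)) for z in teacher_logits])  # teacher distribution

loss_st = -log_P[np.arange(target_len), y].sum()   # Eqs. (15)-(16): NLL on references
loss_kd = -(Q * log_P).sum()                       # Eq. (17): match the teacher

total_loss = loss_st + loss_kd                     # objective used to train the student
print(loss_st, loss_kd, total_loss)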
In order to fully explore the integration of ASR and ST, [82] further proposed an interactive model in which ASR and MT perform synchronous decoding. As shown in Fig. 11, the dynamic outputs of each model can be used as the context to improve the prediction of the other model. Through interaction, the quality of both models can be significantly improved while keeping the efficiency as much as possible.

Figure 11 The demonstration of the interactive model for both ASR and ST. Taking T = 2 as an example, the transcription "everything" of the ASR model can be helpful to predict the Chinese translation at T = 2. Likewise, the translation at time step T = 1 is also beneficial to generate the transcriptions of the ASR model in the future time steps.

7.3 Simultaneous Machine Translation

Simultaneous machine translation (SimMT) aims at translating concurrently with the source-language speaker speaking. It addresses the problem where we need to incrementally produce the translation while the source-language speech is being received. This technology is very helpful for live events and real-time video-call translation. Recently, Baidu and Facebook organized the first evaluation tasks on SimMT at ACL-2020 12) and IWSLT-2020 13), respectively.

Obviously, the methods of offline speech translation introduced in the previous section are not applicable in these scenarios, since the latency would be intolerable if translation begins only after speakers complete their utterances. Thus, balancing between latency and quality becomes the key challenge for a SimMT system. If it translates before the necessary information arrives, the quality will decrease. However, the delay will be unnecessary if it waits for too much source-language content.

[94, 95] proposed to directly perform simultaneous speech-to-text translation, in which the model is required to generate the target-language translation from the incrementally incoming foreign speech. In contrast, more research work focuses on simultaneous text-to-text translation where they assume that the transcriptions are correct [1, 3, 7, 32, 48, 51, 86, 105,
12) https://autosimtrans.github.io/shared
13) http://iwslt.org/doku.php?id=simultaneous_translation
Figure 12 The illustration of three policies for simultaneous machine translation. The top part is the conventional sequence-to-sequence MT model which begins translation after seeing the whole source sentence. The middle one demonstrates the wait-k policy which waits for k words before translation. The bottom part shows an example of the adaptive policy that predicts an output token at each time step. If the output is a special token ⟨ε⟩, it indicates reading one more source word.

167, 168]. This article mainly introduces the latter methods. All of these methods address the same strategy (also known as a policy) of when to read an input word from the source language and when to write an output word in the target language, namely when to wait and when to translate.

In general, the policies can be categorized into two groups. One is fixed-latency policies [32, 86], such as the wait-k policy. The other is adaptive policies [3, 7, 51, 105, 167, 168].

The wait-k policy proposed by [86] has proved simple but effective. Just as shown in the middle part of Fig. 12, the wait-k policy starts to translate after reading the first k source words. Then, it alternates between generating a target-language word and reading a new source-language word, until it meets the end of the source sentence. Accordingly, the probability of a target word y_i is conditioned on the history predictions y_{<i} and the prefix of the source sentence x_{<i+k}: P(y_i \mid y_{<i}, x_{<i+k}; \theta). The probability of the whole target sentence becomes:

P(y \mid x; \theta) = \prod_{i=0}^{I} P(y_i \mid y_{<i}, x_{<i+k}; \theta)    (18)
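The read/write schedule implied by Eq. (18) can be simulated with a few lines of Python. The sketch below uses a dummy translate_step in place of a real decoder; only the scheduling logic reflects the wait-k policy, and all names are illustrative.

def wait_k_schedule(source_tokens, k, translate_step):
    """Simulate the wait-k policy: read the first k source words, then alternate
    between writing one target word and reading one more source word until the
    source is exhausted, after which the remaining target words are written."""
    target = []
    read = min(k, len(source_tokens))
    while True:
        target.append(translate_step(source_tokens[:read], target))
        if target[-1] == "</s>":
            return target[:-1]
        if read < len(source_tokens):
            read += 1   # one more source word becomes visible after each write

def dummy_step(visible_source, target_so_far):
    """Stand-in for a real decoder: copy the next visible source word."""
    if len(target_so_far) < len(visible_source):
        return visible_source[len(target_so_far)]
    return "</s>"

print(wait_k_schedule(["x1", "x2", "x3", "x4", "x5"], k=2, translate_step=dummy_step))

With k = 2 the first target word is emitted after only two source words have been read, and the lag between source and target stays constant until the source sentence ends.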
In contrast to the previous sequence-to-sequence NMT training paradigm, [86] designed a prefix-to-prefix training style to best explore the wait-k policy. If Transformer is employed as the basic architecture, the prefix-to-prefix training algorithm only needs a slight modification. The key difference from Transformer is that the prefix-to-prefix model conditions on the first i + k rather than all source words at each time-step i. It can be easily accomplished by applying masked self-attention when encoding the source sentence. In that case, each source word is constrained to only attend to its predecessors and the hidden semantic representation of the (i + k)-th position will summarize the semantics of the prefix x_{<i+k}.

However, the wait-k policy is a fixed-latency model and it is difficult to decide k for different sentences, domains and languages. Thus, adaptive policies are more appealing. Early attempts at adaptive policies are based on the reinforcement learning (RL) method. For example, [51] presented a two-stage model that employs a pre-trained sentence-based NMT model as the base model. On top of the base model, read or translate actions determine whether to receive a new source word or output a target word. These actions are trained using the RL method while fixing the base NMT model.

Differently, [167] proposed an end-to-end SimMT model for adaptive policies. They first add a special delay token ⟨ε⟩ into the target-language vocabulary. As shown in the bottom part of Fig. 12, if the model predicts ⟨ε⟩, it needs to receive a new source word. To train the adaptive policy model, they design dynamic action oracles with aggressive and conservative bounds as the expert policy for imitation learning. Suppose the prefix pair is (s, t). Then, the dynamic action oracle can be defined as follows:

\Pi_{x,y,\alpha,\beta}(s, t) =
\begin{cases}
\{\langle\varepsilon\rangle\} & \text{if } s \neq x \text{ and } |s| - |t| \leq \alpha \\
\{y_{|t|+1}\} & \text{if } t \neq y \text{ and } |s| - |t| \geq \beta \\
\{\langle\varepsilon\rangle,\, y_{|t|+1}\} & \text{otherwise}
\end{cases}

where \alpha and \beta are hyper-parameters, denoting the aggressive and conservative bounds respectively. |s| - |t| calculates the distance between the two prefixes. That is to say, if the current target
prefix t is no more than α words behind the source prefix s, we can read a new source word. If t is shorter than s by more than β words, we generate the next target prediction.
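The oracle above translates directly into code. The following Python sketch mirrors the three cases of the definition; the token name '<eps>' stands for the delay token ⟨ε⟩, and the example values of α and β are arbitrary.

def action_oracle(s, t, x, y, alpha, beta):
    """Dynamic action oracle for the prefix pair (s, t) of the full pair (x, y).
    '<eps>' stands for the delay token: read one more source word; otherwise the
    next reference target word y[len(t)] may be written."""
    gap = len(s) - len(t)
    if s != x and gap <= alpha:
        return {"<eps>"}                 # too little lookahead: must read
    if t != y and gap >= beta:
        return {y[len(t)]}               # far enough ahead: must write
    return {"<eps>", y[len(t)]}          # in between: both actions are acceptable

x = ["x1", "x2", "x3", "x4"]
y = ["y1", "y2", "y3", "y4"]
print(action_oracle(x[:2], y[:1], x, y, alpha=1, beta=3))   # {'<eps>'}
print(action_oracle(x[:4], y[:1], x, y, alpha=1, beta=3))   # {'y2'}
print(action_oracle(x[:3], y[:1], x, y, alpha=1, beta=3))   # both actions allowed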
8 Discussion and Future Research Tasks

8.1 NMT vs. Human

We can see from Sec. 4-7 that great progress has been achieved in neural machine translation. Naturally, we may wonder whether current strong NMT systems could perform on par with or better than human translators. Exciting news was reported in 2018 by [55]: they achieved human-machine parity on Chinese-to-English news translation and found no significant difference in human ratings between their MT outputs and professional human translations. Moreover, the best English-to-Czech system submitted to WMT 2018 by [99] was also found to perform significantly better than the human-generated reference translations [14]. It is encouraging that NMT can produce very good translations in some specific scenarios, and it seems that NMT has achieved human-level translation quality.

However, we cannot be too optimistic since the MT technology is far from satisfactory. On one hand, the comparisons were conducted only on the news domain in specific language pairs where massive parallel corpora are available. In practice, NMT performs quite poorly in many domains and language pairs, especially in low-resource scenarios such as Chinese-Hindi translation. On the other hand, the evaluation methods used in the assessment of human-machine parity conducted by [55] should be much improved, as pointed out by [73]. According to the comprehensive investigations conducted by [73], human translations are much preferred over MT outputs if better rating techniques are used, such as choosing professional translators as raters, evaluating documents rather than individual sentences, and utilizing original source texts instead of source texts translated from the target language. Current NMT systems still suffer from serious translation errors such as mistranslated words or named entities, omissions and wrong word order. Obviously, there is much room for NMT to improve, and we suggest some potential research directions in the next section.

8.2 Future Research Tasks

In this section, we discuss some potential research directions for neural machine translation.

1. Effective document-level translation and evaluation

It is well known that document translation is important, but the current research results are not so good. It remains unclear what is the best scope of the document context needed to translate a sentence. It is still a question whether it is reasonable to accomplish document translation by translating the sentences from first to last. Maybe translation based on sentence groups is a worthy research topic which models many-to-many translation. In addition, document-level evaluation is as important as document-level MT methods, and it serves as a booster of MT technology. [55] argued that MT can achieve human parity in Chinese-to-English translation on specific news tests if evaluating sentence by sentence. However, as discussed in the previous section, [73, 74] demonstrated a stronger preference for human translations over MT when evaluating at the document level rather than the sentence level. Therefore, how to automatically evaluate the quality of document translation besides BLEU [97] is an open question, although some researchers have introduced test sets to investigate specific discourse phenomena [92].

2. Efficient NMT inference

People prefer models with both high accuracy and efficiency. Despite the remarkable speedup, the quality degradation caused by non-autoregressive NMT is intolerable in most cases. Improving the fertility model, the word ordering of the decoder input and the dependency of the output will be worthy of a good study to make NAT close to AT in translation quality. Synchronous bidirectional decoding deserves deeper investigation due to its good modeling of history and future contexts. Moreover, several researchers have started to design decoding algorithms with free order [40, 50, 114], and it may be a good way to study the nature of human language generation.

3. Making full use of multilinguality and monolingual data

Low-resource translation is always a hot research topic since most natural languages lack abundant annotated bilingual data. The potential of multilingual NMT is not fully explored and some questions remain open. For example, how to deal with the data imbalance problem which is very common in multilingual NMT? How to build a good incremental multilingual NMT model for incoming new languages? Semi-supervised NMT is more practical in real applications but the effective back-translation algorithm is very time consuming. It deserves to design a more efficient semi-supervised NMT model for easy deployment. Deeply integrating pre-trained methods with NMT may lead to promising improvements in the semi-supervised or unsupervised learning framework, and [62, 174] have already shown some good improvements. The achievements of unsupervised MT on similar language pairs (e.g. English-German and English-French) make us very optimistic. However, [76] showed that unsupervised MT performs poorly on distant language pairs, obtaining no more than 10 BLEU points in most cases. Obviously, it is challenging to design better unsupervised MT
models on distant language pairs.

4. Better exploiting multimodality in NMT

In multimodal neural machine translation, it remains an open problem when and how to make full use of different modalities. Image-text translation only translates image captions and is hard to use widely in practice. It is a good research topic to find an appropriate scenario where images are indispensable during translation. In speech translation, despite big improvements, the end-to-end framework currently cannot perform on par with the cascaded method in many cases, especially when the training data is limited [82]. In addition to enlarging the training data, closing the gap between the different semantic spaces of ST, ASR and MT is worthy of further exploration. As for simultaneous translation, it is still at an early stage of research and many practical issues such as repetition and correction in speech are unexplored. Moreover, combining summarization and translation may be a good research direction that provides audiences with the gist of the speaker's speech at low latency.

5. NMT with background modeling

In many cases, machine translation is not only about texts, speeches and images, but is highly related to culture, environment, history, etc. Accordingly, this kind of background information should be well captured by a novel model which guides NMT to generate translations in line with the background.

6. Incorporating prior knowledge into NMT

Note that some research topics are not mentioned in this article due to the space limit. For example, it is a difficult and practical issue how to integrate prior knowledge (e.g. alignments, bilingual lexicons, phrase tables and knowledge graphs) into the NMT framework. Since it is still unclear how to bridge discrete symbol based knowledge and distributed representation based NMT, it remains an interesting and important direction to explore, although some progress has been achieved [42, 84, 89, 121, 132, 133, 152, 165, 166].

7. Better domain adaptation models

Domain adaptation has always been a hot research topic and attracts attention from many researchers [28, 29, 77, 85, 130, 149, 162, 177]. Different from the methods used in SMT, domain adaptation in NMT is usually highly related to parameter fine-tuning. It remains a challenge how to address the problems of unknown test domains and out-of-domain term translations.

8. Bridging the gap between training and inference

The inconsistency between training and inference (or evaluation) is a critical problem in most sequence generation tasks in addition to neural machine translation. This problem is well addressed in the community of machine translation [111, 160], but it is still worthy of exploring, especially regarding the efficiency of the methods.

9. Designing explainable and robust NMT

So far, the NMT model is still a black box and it is very risky to use it in many scenarios in which we have to know how and why the translation result is obtained. [35] attempted to visualize the contribution of each input to the output translation. Nevertheless, it will be great to deeply investigate the explanation of NMT models or design an explainable MT architecture. Furthermore, current NMT systems are easy to attack by perturbing the input. [23, 24] presented novel robust NMT models to handle noisy inputs. However, real input noise is too difficult to anticipate and it still remains a big challenge to design robust NMT models which are immune to real noise.

10. New NMT architectures

Finally, designing better NMT architectures beyond Transformer must be very exciting to explore despite the difficulty.

Reference

1 Ashkan Alinejad, Maryam Siahbani, and Anoop Sarkar. Prediction improves simultaneous neural machine translation. In Proceedings of EMNLP, pages 3022-3027, 2018.
2 Antonios Anastasopoulos and David Chiang. Tied multitask learning for neural speech translation. In Proceedings of NAACL, 2018.
3 Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of ACL, 2019.
4 Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of ACL, pages 451-462, 2017.
5 Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Unsupervised statistical machine translation. In Proceedings of EMNLP, 2018.
6 Mikel Artetxe, Gorka Labaka, and Eneko Agirre. An effective approach to unsupervised machine translation. In Proceedings of ACL, 2019.
7 Philip Arthur, Trevor Cohn, and Gholamreza Haffari. Learning coupled policies for simultaneous machine translation. arXiv preprint arXiv:2002.04306, 2020.
8 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
9 Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of NAACL, 2019.
10 Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In Proceedings of NAACL, 2018.
11 Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin. End-to-end automatic speech translation of audiobooks. In Proceedings of ICASSP, pages 6224-6228. IEEE, 2018.
12 Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. Listen and translate: A proof of concept for end-to-end speech-to-text translation. In Proceedings of NIPS, 2016.
13 Graeme Blackwood, Miguel Ballesteros, and Todd Ward. Multilingual neural machine translation with task-specific attention. In
Proceedings of COLING, 2018. and Marco Turchi. Must-c: a multilingual speech translation corpus.
14 Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, In Procedings of NAACL, 2019.
Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 35 Yanzhuo Ding, Yang Liu, Huanbo Luan, and Maosong Sun. Visual-
2018 conference on machine translation (WMT18). In Proceedings izing and understanding neural machine translation. In Proceedings
of WMT, 2018. of ACL, pages 1150–1159, 2017.
15 Leo Born, Mohsen Mesgar, and Michael Strube. Using a graph- 36 Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang.
based coherence model in document-level machine translation. In Multi-task learning for multiple language translation. In Proceedings
Proceedings of the Third Workshop on Discourse in Machine Trans- of ACL-IJCNLP, pages 1723–1732, 2015.
lation, pages 26–35, 2017. 37 Qing Dou and Kevin Knight. Large scale decipherment for out-
16 Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and of-domain machine translation. In Proceedings of EMNLP-CoNLL,
Robert L Mercer. The mathematics of statistical machine transla- pages 266–275. Association for Computational Linguistics, 2012.
tion: Parameter estimation. Computational linguistics, 19(2):263– 38 Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Under-
311, 1993. standing back-translation at scale. In Proceedings of EMNLP, 2018.
17 Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loı̈c Bar- 39 Desmond Elliott and Akos Kádár. Imagination improves multimodal
rault. Probing the need for visual context in multimodal machine translation. In Proceedings of IJCNLP, 2017.
translation. In Proceedings of NAACL, 2019. 40 Dmitrii Emelianenko, Elena Voita, and Pavel Serdyukov. Sequence
18 Iacer Calixto and Qun Liu. An error analysis for image-based modeling with unconstrained generation order. In Advances in Neu-
multi-modal neural machine translation. Machine Translation, 33(1- ral Information Processing Systems, pages 7698–7709, 2019.
2):155–177, 2019. 41 Manaal Faruqui and Chris Dyer. Improving vector space word rep-
19 Iacer Calixto, Qun Liu, and Nick Campbell. Doubly-attentive de- resentations using multilingual correlation. In Proceedings of EACL,
coder for multi-modal neural machine translation. In Proceedings of pages 462–471, 2014.
ACL, 2017. 42 Yang Feng, Shiyue Zhang, Andi Zhang, Dong Wang, and Andrew
20 Iacer Calixto, Miguel Rios, and Wilker Aziz. Latent variable model Abel. Memory-augmented neural machine translation. In Proceed-
for multi-modal translation. In Proceedings of ACL, 2019. ings of ACL, 2019.
21 Hailong Cao and Tiejun Zhao. Point set registration for unsupervised 43 Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multi-
bilingual lexicon induction. In IJCAI, pages 3991–3997, 2018. lingual neural machine translation with a shared attention mechanism.
22 Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang In Proceedings of NAACL, 2016.
Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, 44 Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T Yarman
Zhifeng Chen, et al. The best of both worlds: Combining recent ad- Vural, and Yoshua Bengio. Multi-way, multilingual neural machine
vances in neural machine translation. In Proceedings of ACL, pages translation. Computer Speech & Language, 45:236–252, 2017.
76–86, 2018. 45 Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and
23 Yong Cheng, Lu Jiang, and Wolfgang Macherey. Robust neural ma- Yann N Dauphin. Convolutional sequence to sequence learning. In
chine translation with doubly adversarial inputs. In Proceedings of Proceedings of ICML, 2017.
ACL, 2019. 46 Zhengxian Gong, Min Zhang, and Guodong Zhou. Cache-based
24 Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang document-level statistical machine translation. In Proceedings of
Liu. Towards robust neural machine translation. In Proceedings of EMNLP, pages 909–919. Association for Computational Linguistics,
ACL, 2018. 2011.
25 Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, 47 Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen
and Yang Liu. Semi-supervised learning for neural machine transla- Schmidhuber. Connectionist temporal classification: labelling unseg-
tion. In Proceedings of ACL, 2016. mented sequence data with recurrent neural networks. In Proceedings
26 Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. Joint of ICML, pages 369–376, 2006.
training for pivot-based neural machine translation. In Proceedings 48 Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal
of IJCAI, 2017. Daumé III. Dont until the final verb wait: Reinforcement learning for
27 David Chiang. A hierarchical phrase-based model for statistical ma- simultaneous machine translation. In Proceedings of EMNLP, pages
chine translation. In Proceedings of ACL, pages 263–270. Associa- 1342–1352, 2014.
tion for Computational Linguistics, 2005. 49 Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and
28 Chenhui Chu, Raj Dabre, and Sadao Kurohashi. An empirical com- Richard Socher. Non-autoregressive neural machine translation. In
parison of simple domain adaptation methods for neural machine Proceedings of ICLR, 2018.
translation. In Proceedings of ACL, 2017. 50 Jiatao Gu, Qi Liu, and Kyunghyun Cho. Insertion-based decoding
29 Chenhui Chu and Rui Wang. A survey of domain adaptation for with automatically inferred generation order. Transactions of the As-
neural machine translation. In Proceedings of COLING, 2018. sociation for Computational Linguistics, 7:661–676, 2019.
30 Alexis Conneau and Guillaume Lample. Cross-lingual language 51 Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor OK Li.
model pretraining. In Advances in Neural Information Processing Learning to translate in real-time with neural machine translation. In
Systems, pages 7057–7067, 2019. Proceedings of EACL, 2017.
31 Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic 52 Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Bar-
Denoyer, and Hervé Jégou. Word translation without parallel data. rault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua
In Proceedings of ICLR, 2018. Bengio. On using monolingual corpora in neural machine transla-
32 Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. In- tion. arXiv preprint arXiv:1503.03535, 2015.
cremental decoding and training methods for simultaneous translation 53 Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and
in neural machine translation. In Proceedings of NAACL, 2018. Yoshua Bengio. On integrating a language model into neural ma-
33 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. chine translation. Computer Speech & Language, 45:137–148, 2017.
Bert: Pre-training of deep bidirectional transformers for language un- 54 Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan
derstanding. In Proceedings of NAACL, 2019. Liu. Non-autoregressive neural machine translation with enhanced
34 Mattia A Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, decoder input. In Proceedings of the AAAI, volume 33, pages 3723–
3730, 2019. monolingual corpora only. In Proceedings of ICLR, 2018.


55 Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, 73 Samuel Läubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qin-
Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin lan Shen, and Antonio Toral. A set of recommendations for assessing
Junczys-Dowmunt, William Lewis, Mu Li, et al. Achieving hu- human–machine parity in language translation. Journal of Artificial
man parity on automatic chinese to english news translation. arXiv Intelligence Research, 67:653–672, 2020.
preprint arXiv:1803.05567, 2018. 74 Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine trans-
56 Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan lation achieved human parity? a case for document-level evaluation.
Liu, and Wei-Ying Ma. Dual learning for machine translation. In In Proceedings of EMNLP, 2018.
Advances in neural information processing systems, pages 820–828, 75 Jason Lee, Elman Mansimov, and Kyunghyun Cho. Deterministic
2016. non-autoregressive neural sequence modeling by iterative refinement.
57 Cong Duy Vu Hoang, Gholamreza Haffari, and Trevor Cohn. Decod- In Proceedings of EMNLP, 2018.
ing as continuous optimization in neural machine translation. arXiv 76 Yichong Leng, Xu Tan, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu.
preprint arXiv:1701.02854, 2017. Unsupervised pivot translation for distant languages. In Proceedings
58 Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor of ACL, 2019.
Cohn. Iterative back-translation for neural machine translation. In 77 Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. One sentence one
Proceedings of the 2nd Workshop on Neural Machine Translation and model for neural machine translation. In Proceedings of LREC, 2018.
Generation, pages 18–24, 2018. 78 Yanyang Li, Qiang Wang, Tong Xiao, Tongran Liu, and Jingbo Zhu.
59 Po-Yao Huang, Frederick Liu, Sz-Rung Shiang, Jean Oh, and Chris Neural machine translation with joint representation. In Proceedings
Dyer. Attention-based multimodal neural machine translation. In of AAAI, 2020.
Proceedings of WMT, pages 639–645, 2016. 79 Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita.
60 Niehues Jan, Roldano Cattoni, Stüker Sebastian, Mauro Cettolo, Agreement on target-bidirectional neural machine translation. In Pro-
Marco Turchi, and Marcello Federico. The iwslt 2018 evaluation ceedings of NAACL, pages 411–416, 2016.
campaign. In Proceedings of IWSLT, 2018. 80 Yang Liu and Jiajun Zhang. Deep learning in machine translation.
61 Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. In Deep Learning in Natural Language Processing, pages 147–183.
Does neural machine translation benefit from larger context? arXiv Springer, 2018.
preprint arXiv:1704.05135, 2017. 81 Yuchen Liu, Hao Xiong, Zhongjun He, Jiajun Zhang, Hua Wu,
62 Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, Haifeng Wang, and Chengqing Zong. End-to-end speech translation
and Weihua Luo. Cross-lingual pre-training based transfer for zero- with knowledge distillation. In Proceedings of Interspeech, 2019.
shot neural machine translation. In Proceedings of AAAI, 2020. 82 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He,
63 Ye Jia, Melvin Johnson, Wolfgang Macherey, Ron J Weiss, Yuan Cao, Hua Wu, Haifeng Wang, and Chengqing Zong. Synchronous speech
Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu. recognition and speech-to-text translation with interactive decoding.
Leveraging weakly supervised data to improve end-to-end speech-to- In Proceedings of AAAI, 2020.
text translation. In Proceedings of ICASSP, pages 7180–7184. IEEE, 83 Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin,
2019. Liwei Wang, and Tie-Yan Liu. Understanding and improving trans-
64 Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui former from a multi-particle dynamic system point of view. arXiv
Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Watten- preprint arXiv:1906.02762, 2019.
berg, Greg Corrado, et al. Googles multilingual neural machine trans- 84 Yu Lu, Jiajun Zhang, and Chengqing Zong. Exploiting knowledge
lation system: Enabling zero-shot translation. Transactions of the graph in neural machine translation. In China Workshop on Machine
Association for Computational Linguistics, 5:339–351, 2017. Translation, pages 27–38. Springer, 2018.
65 Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. Is 85 Minh-Thang Luong and Christopher D. Manning. Stanford neural
neural machine translation ready for deployment? a case study on 30 machine translation systems for spoken language domain. In Inter-
translation directions. arXiv preprint arXiv:1610.01108, 2016. national Workshop on Spoken Language Translation, 2015.
66 Alina Karakanta, Jon Dehdari, and Josef van Genabith. Neural ma- 86 Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu,
chine translation for low-resource languages without parallel corpora. Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu,
Machine Translation, 32(1-2):167–189, 2018. Xing Li, et al. Stacl: Simultaneous translation with implicit antic-
67 Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David ipation and controllable latency using prefix-to-prefix framework. In
Yarowsky. Toward statistical machine translation without parallel Proceedings of ACL, 2019.
corpora. In Proceedings of EACL, pages 130–140. Association for 87 Sameen Maruf and Gholamreza Haffari. Document context neural
Computational Linguistics, 2012. machine translation with memory networks. In Proceedings of ACL,
68 Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, 2018.
Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, 88 Sameen Maruf, André FT Martins, and Gholamreza Haffari. Se-
Christine Moran, Richard Zens, et al. Moses: Open source toolkit lective attention for context-aware neural machine translation. In
for statistical machine translation. In Proceedings of ACL, pages Proceedings of ACL, 2019.
177–180, 2007. 89 Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah.
69 Philipp Koehn and Rebecca Knowles. Six challenges for neural ma- Coverage embedding models for neural machine translation. In Pro-
chine translation. arXiv preprint arXiv:1706.03872, 2017. ceedings of EMNLP, 2016.
70 Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical 90 Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Hen-
phrase-based translation. In Proceedings of NAACL, pages 48–54. derson. Document-level neural machine translation with hierarchical
Association for Computational Linguistics, 2003. attention networks. In Proceedings of EMNLP, 2018.
71 Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. Mod- 91 Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting sim-
eling coherence for neural machine translation with dynamic and ilarities among languages for machine translation. arXiv preprint
topic caches. In Proceedings of COLING, 2018. arXiv:1309.4168, 2013.
72 Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and 92 Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A
Marc’Aurelio Ranzato. Unsupervised machine translation using large-scale test set for the evaluation of context-aware pronoun trans-
lation in neural machine translation. In Proceedings of WMT, 2018. In Proceedings of ICML, 2019.
93 Makoto Nagao. A framework of a mechanical translation between 114 Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. In-
japanese and english by analogy principle. Artificial and human in- sertion transformer: Flexible sequence generation via insertion oper-
telligence, pages 351–354, 1984. ations. arXiv preprint arXiv:1902.03249, 2019.
94 Jan Niehues, Thai Son Nguyen, Eunah Cho, Thanh-Le Ha, Kevin Kil- 115 Jinsong Su, Xiangwen Zhang, Qian Lin, Yue Qin, Junfeng Yao, and
gour, Markus Müller, Matthias Sperber, Sebastian Stüker, and Alex Yang Liu. Exploiting reverse target-side contexts for neural machine
Waibel. Dynamic transcription for low-latency speech translation. In translation via asynchronous bidirectional decoding. Artificial Intel-
Interspeech, pages 2513–2517, 2016. ligence, 277:103168, 2019.
95 Jan Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and 116 Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence
Alex Waibel. Low-latency neural speech translation. In Proceedings learning with neural networks. In Proceedings of NIPS, 2014.
of Interspeech, 2018. 117 Xin Tan, Longyin Zhang, Deyi Xiong, and Guodong Zhou. Hierar-
96 Malte Nuhn, Arne Mauser, and Hermann Ney. Deciphering foreign chical modeling of global context for document-level neural machine
language by combining language models and context vectors. In translation. In Proceedings of EMNLP-IJCNLP, pages 1576–1585,
Proceedings of ACL, pages 156–164. Association for Computational 2019.
Linguistics, 2012. 118 Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie-Yan Liu.
97 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Multilingual neural machine translation with language clustering. In
Bleu: a method for automatic evaluation of machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
Proceedings of ACL, pages 311–318, 2002. 119 Jörg Tiedemann and Yves Scherrer. Neural machine translation with
98 Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, extended context. In Proceedings of the Third Workshop on Dis-
and Tom Mitchell. Contextual parameter generation for universal course in Machine Translation, 2017.
neural machine translation. In Proceedings of EMNLP, 2018. 120 Zhaopeng Tu, Yang Liu, Shuming Shi, and Tong Zhang. Learning to
99 Martin Popel. Cuni transformer neural mt system for wmt18. In remember translation history with a continuous cache. Transactions
Proceedings of WMT, pages 482–487, 2018. of the Association for Computational Linguistics, 6:407–420, 2018.
100 Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris 121 Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li.
Callison-Burch, and Sanjeev Khudanpur. Improved speech-to-text Modeling coverage for neural machine translation. In Proceedings of
translation with the Fisher and Callhome Spanish–English speech ACL, 2016.
translation corpus. In Proceedings of IWSLT, 2013. 122 Ashish Vawani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
101 Ofir Press and Lior Wolf. Using the output embedding to improve Jones, Aidan N.Gomez, Lukasz Kaiser, and Illia Polosukhin. Atten-
language models. In Proceedings of EACL, 2017. tion is all you need. arXiv preprint arXiv:1706.03762, 2017.
102 Sujith Ravi and Kevin Knight. Deciphering foreign language. In 123 Elena Voita, Rico Sennrich, and Ivan Titov. Context-aware mono-
Proceedings of ACL, pages 12–21, 2011. lingual repair for neural machine translation. In Proceedings of
103 Shuo Ren, Yu Wu, Shujie Liu, Ming Zhou, and Shuai Ma. Explicit EMNLP-IJCNLP, 2019.
cross-lingual pre-training for unsupervised machine translation. In 124 Elena Voita, Rico Sennrich, and Ivan Titov. When a good translation
Proceedings of EMNLP-IJCNLP, 2019. is wrong in context: Context-aware machine translation improves on
104 Annette Rios and Don Tuggener. Co-reference resolution of elided deixis, ellipsis, and lexical cohesion. In Proceedings of ACL, 2019.
subjects and possessive pronouns in spanish-english statistical ma- 125 Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov.
chine translation. In Proceedings of EACL, 2017. Context-aware neural machine translation learns anaphora resolution.
105 Harsh Satija and Joelle Pineau. Simultaneous machine translation In Proceedings of ACL, 2018.
using deep reinforcement learning. In ICML 2016 Workshop on Ab- 126 Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, and Ming Zhou.
straction in Reinforcement Learning, 2016. Bridging the gap between pre-training and fine-tuning for end-to-end
106 Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, speech translation. In Proceedings of AAAI, 2020.
Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, 127 Chunqi Wang, Ji Zhang, and Haiqing Chen. Semi-autoregressive
and Philip Williams. The university of edinburgh’s neural mt sys- neural machine translation. In Proceedings of EMNLP, 2018.
tems for wmt17. In Proceedings of WMT, 2017. 128 Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting
107 Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neu- cross-sentence context for neural machine translation. In Proceedings
ral machine translation systems for wmt 16. In Proceedings of WMT, of EMNLP, 2017.
pages 371–376, 2016. 129 Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F
108 Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neu- Wong, and Lidia S Chao. Learning deep transformer models for ma-
ral machine translation models with monolingual data. In Proceed- chine translation. In Proceedings of ACL, 2019.
ings of ACL, 2016. 130 Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro
109 Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine Sumita. Instance weighting for neural machine translation domain
translation of rare words with subword units. In Proceedings of ACL, adaptation. In Proceedings of EMNLP, pages 1482–1488, 2017.
2016. 131 Shuo Wang, Yang Liu, Chao Wang, Huanbo Luan, and Maosong Sun.
110 Chenze Shao, Yang Feng, Jinchao Zhang, Fandong Meng, Xilin Improving back-translation with uncertainty-based confidence esti-
Chen, and Jie Zhou. Retrieving sequential information for non- mation. In Proceedings of EMNLP, 2019.
autoregressive neural machine translation. In Proceedings of ACL, 132 Xing Wang, Zhaopeng Tu, Deyi Xiong, and Min Zhang. Translating
2019. phrases in neural machine translation. In Proceedings of EMNLP,
111 Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong 2017.
Sun, and Yang Liu. Minimum risk training for neural machine trans- 133 Xing Wang, Zhaopeng Tu, and Min Zhang. Incorporating statistical
lation. In Proceedings of ACL, 2016. machine translation word knowledge into neural machine translation.
112 David R So, Chen Liang, and Quoc V Le. The evolved transformer. IEEE/ACM Transactions on Audio, Speech, and Language Process-
In Proceedings of ICML, 2019. ing, 26(12):2255–2266, 2018.
113 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mass: 134 Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing
Masked sequence to sequence pre-training for language generation. Zong. Three strategies to improve one-to-many multilingual transla-
134 Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. Three strategies to improve one-to-many multilingual translation. In Proceedings of EMNLP, pages 2955–2960, 2018.
135 Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu, and Chengqing Zong. Synchronously generating two languages with interactive decoding. In Proceedings of EMNLP-IJCNLP, 2019.
136 Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, Chengqing Zong, et al. A compact and language-sensitive multilingual translation method. In Proceedings of ACL, 2018.
137 Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. Non-autoregressive machine translation with auxiliary regularization. In Proceedings of AAAI, 2019.
138 Warren Weaver. Translation. Machine translation of languages, 14:15–23, 1955.
139 Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. In Proceedings of ACL, 2019.
140 Ron J Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. Sequence-to-sequence models can directly translate foreign speech. In Proceedings of Interspeech, 2017.
141 Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. In Proceedings of ICLR, 2019.
142 Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical machine translation. Machine Translation, 21(3):165–181, 2007.
143 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
144 Tong Xiao, Jingbo Zhu, Shujie Yao, and Hao Zhang. Document-level consistency verification in machine translation. In Proceedings of the 13th Machine Translation Summit, pages 131–138, 2011.
145 Deyi Xiong, Yang Ding, Min Zhang, and Chew Lim Tan. Lexical chain based cohesion models for document-level statistical machine translation. In Proceedings of EMNLP, pages 1563–1573, 2013.
146 Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. Modeling coherence for discourse neural machine translation. In Proceedings of AAAI, volume 33, pages 7338–7345, 2019.
147 Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine translation with weight sharing. In Proceedings of ACL, 2018.
148 Zhengxin Yang, Jinchao Zhang, Fandong Meng, Shuhao Gu, Yang Feng, and Jie Zhou. Enhancing context modeling with a query-guided capsule network for document-level translation. In Proceedings of EMNLP-IJCNLP, 2019.
149 Jiali Zeng, Yang Liu, Jinsong Su, Yubing Ge, Yaojie Lu, Yongjing Yin, and Jiebo Luo. Iterative dual domain adaptation for neural machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
150 Biao Zhang, Ivan Titov, and Rico Sennrich. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of EMNLP-IJCNLP, 2019.
151 Biao Zhang, Deyi Xiong, Jinsong Su, and Jiebo Luo. Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12):2278–2287, 2019.
152 Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of ACL, 2017.
153 Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. Improving the transformer translation model with document-level context. In Proceedings of EMNLP, 2018.
154 Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence, page 103234, 2020.
155 Jiajun Zhang, Chengqing Zong, et al. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL, 2013.
156 Jiajun Zhang, Chengqing Zong, et al. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, 2015.
157 Jiajun Zhang, Chengqing Zong, et al. Exploiting source-side monolingual data in neural machine translation. In Proceedings of EMNLP, 2016.
158 Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of ACL, pages 1959–1970, 2017.
159 Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of EMNLP, pages 1934–1945, 2017.
160 Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. Bridging the gap between training and inference for neural machine translation. In Proceedings of ACL, pages 4334–4343, 2019.
161 Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of AAAI, 2018.
162 Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of NAACL, 2019.
163 Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. Joint training for neural machine translation models with monolingual data. In Proceedings of AAAI, 2018.
164 Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Tong Xu. Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of AAAI, 2019.
165 Yang Zhao, Yining Wang, Jiajun Zhang, and Chengqing Zong. Phrase table as recommendation memory for neural machine translation. In Proceedings of EMNLP, 2018.
166 Yang Zhao, Jiajun Zhang, Zhongjun He, Chengqing Zong, and Hua Wu. Addressing troublesome words in neural machine translation. In Proceedings of EMNLP, pages 391–400, 2018.
167 Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of EMNLP, 2019.
168 Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of ACL, 2019.
169 Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. Dynamic past and future for neural machine translation. In Proceedings of EMNLP-IJCNLP, 2019.
170 Zaixiang Zheng, Hao Zhou, Shujian Huang, Lei Li, Xin-Yu Dai, and Jiajun Chen. Mirror-generative neural machine translation. In Proceedings of ICLR, 2020.
171 Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. Modeling past and future for neural machine translation. TACL, 6:145–157, 2018.
172 Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous bidirectional neural machine translation. TACL, 2019.
173 Long Zhou, Jiajun Zhang, Chengqing Zong, and Heng Yu. Sequence generation: From both sides to the middle. In Proceedings of IJCAI, 2019.
174 Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Incorporating BERT into neural machine translation. In Proceedings of ICLR, 2020.
175 Chengqing Zong, Hua Wu, Taiyi Huang, and Bo Xu. Analysis on characteristics of Chinese spoken language. In Proceedings of the 5th Natural Language Processing Pacific Rim Symposium, pages 358–362, 1999.
176 Barret Zoph and Kevin Knight. Multi-source neural translation. In Proceedings of NAACL, 2016.
177 Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Proceedings of EMNLP, 2016.